Re: [Tsvwg] [SCTP checksum problems]

Jonathan,

Thanks for your comments. We are aware that we really don't know what the error model for the end-to-end transport is, and we took a conservative approach: we do an end-to-end data check above what TCP offers, and it is not aligned with the TCP packets. We assume that storage boxes will be better built than other middle boxes and that hardware accelerators in the endpoints will not cause trouble.

For a completely garbled link, we think that a combination of a good CRC and format checks will keep us from passing around corrupted data, and that solid recovery mechanisms will keep us from failing the QoS expected.

The one question that we could not get any decent answer to is how mechanisms other than CRC would perform, mainly how a connection protected by cryptographic authenticators will perform on connections with errors.

Regards,
Julo

Jonathan Stone <jonathan@dsg.stanford.edu> on 18/04/2001 17:50:24

Please respond to Jonathan Stone <jonathan@dsg.stanford.edu>

To: Julian Satran/Haifa/IBM@IBMIL
cc: Randall Stewart <rrs@cisco.com>, "WENDT,JIM (HP-Roseville, ex1)" <jim_wendt@hp.com>, ips@ece.cmu.edu, tsvwg@ietf.org, "'Craig Partridge'" <craig@aland.bbn.com>, Jonathan Wood <Jonathan.Wood@sun.com>, xieqb@cig.mot.com, Jonathan Stone <jonathan@dsg.stanford.edu>
Subject: Re: [Tsvwg] [SCTP checksum problems]

Julian,

I skimmed your i-d late last night. I have not gone through the analysis of different CRCs. I'd like to compare it to Raj Jain's analysis of the IEEE 802 CRC-32 (http://www.cis.ohio-state.edu/~jain/papers/xie1.html), which I think speaks to the single-bit-error point Craig has already raised.

The question I'd raise is a more fundamental one: whether link-level bit and burst error rates are the appropriate model for an Internet transport-level sum in the first place. Craig Partridge and I examined that in our SIGCOMM 2000 paper.
We monitored packets at a number of points in the Internet and looked for packets with checksum mismatches. Let's call a packet a "mismatch" when recomputing its TCP (or UDP) checksum does not match the contents of the checksum field. We also looked for transport-level (TCP) retransmissions of the damaged packets.

We observed mismatch rates of roughly 1 in 4,000 (average); best-case around 1 in 30,000. That's 5 or 6 orders of magnitude higher than the link-level error rates you cite. By comparing each checksum mismatch against its TCP-level retransmission, we were able to estimate how much damage occurred to the mismatches.

Keep in mind that these errors were caught at the TCP layer: they have already passed a link-level CRC check, usually the 802.3 CRC-32. The very high observed error rate suggests that these errors occur outside the protection of the MAC-layer CRC.

For the iSCSI analysis, a fair synopsis is that half the packets were so thoroughly curdled we couldn't even guess at what caused the damage. There are more details in the SIGCOMM paper. (There, I focused more on analyzing how the standard TCP sum would fare. I am in the midst of recomputing total burst length and Hamming distances, for a polynomial-xor description of the errors rather than the `minimum edit distance'.) That characterization of errors is very different from the independent bit-error and correlated-burst-error models used in the ID.

I think our data supports three conclusions relevant to this discussion. (You may of course disagree.)
First, the Internet contains a variety of error sources above the MAC level: between two MAC-layer interface cards inside a router; or inside an end-host, between its MAC-layer card and its TCP (or SCTP, or UDP, or other transport protocol).
Second, errors from these sources occur at rates several orders of magnitude higher than current link-level errors.
Third, the damage done by these error sources just does not match the individual-bit/single-burst model common in coding theory, and often used to characterize link errors. While we did observe a (very) few single-bit errors and short bursts, we also observed a lot of much longer bursts. Approximately half the damaged packets were so thoroughly curdled that more than half the bytes were incorrect. (We also found similar rates and patterns in packet traces from Vern Paxson; those are included in our SIGCOMM 2000 paper.)

It may be helpful to think of *some* of these errors as due to (for example) a single-bit error affecting a DMA pointer: flipping an address bit can cause a large change in the data stream going to or from the network interface. If the bit position flipped is high enough, it could even skip to another packet altogether.

It's not clear how the analysis and conclusions in your draft stand up if, instead of a link-level single-bit/burst error model, we substitute the error characteristics and rates we observed in `in the wild' Internet traffic: that is, error rates some 5 or 6 orders of magnitude higher, and where the errors cause either multiple bursts per packet, or (if modeled as a single polynomial) vary from a few dozen bits up to a substantial fraction of the packet length. The order-of-magnitude changes in error rate will obviously have an impact. I haven't thought in detail about whether the conclusions about specific CRC polynomial choices hold up.

One final point is the computational cost of software CRCs. If you buy our conclusion that the Internet contains very significant error sources outside of "network interface cards", then outboard acceleration of either checksums or CRCs is somewhat suspect: error checks done inside the network card simply don't cover those error sources.
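For readers who want to see the "mismatch" test from earlier in this message in concrete form, here is a minimal sketch in Python. It is an illustration only, not the tooling used for the SIGCOMM measurements. It assumes a toy segment layout (a 2-byte checksum field followed by the payload) and leaves out the pseudo-header that real TCP and UDP checksums also cover; the names `seal` and `is_mismatch` are hypothetical.

```python
def ones_complement_sum16(data: bytes) -> int:
    """Ones-complement sum of big-endian 16-bit words, carries folded back in."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:           # end-around carry
        total = (total & 0xFFFF) + (total >> 16)
    return total


def seal(payload: bytes) -> bytes:
    """Hypothetical toy segment: 2-byte checksum field, then the payload."""
    checksum = ~ones_complement_sum16(b"\x00\x00" + payload) & 0xFFFF
    return checksum.to_bytes(2, "big") + payload


def is_mismatch(segment: bytes) -> bool:
    """True when recomputing the sum disagrees with the stored checksum field."""
    # A segment whose checksum field is correct sums to 0xFFFF.
    return ones_complement_sum16(segment) != 0xFFFF


segment = seal(b"sixteen byte blk")
flipped = bytearray(segment)
flipped[5] ^= 0x10               # single-bit error in the payload
# Exchange the first two 16-bit payload words; the checksum field is untouched.
swapped = segment[:2] + segment[4:6] + segment[2:4] + segment[6:]

print(is_mismatch(segment))          # False
print(is_mismatch(bytes(flipped)))   # True
print(is_mismatch(swapped))          # False: word reordering is invisible to the sum
```

The last case is one reason the kind of damage matters as much as the rate: a sum that catches every single-bit flip can still pass data that has been rearranged on 16-bit boundaries.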
Software CRC calculations are typically much slower than ones-complement, Fletcher, or Adler sums: Dave Feldmeier's paper suggests roughly four times slower, for a total 32-bit check, even for generator polynomials selected to minimize nonzero coefficients (i.e., few taps). For the IEEE 802 CRC, it's faster to do a table lookup, which is slower again. I don't know whether the iSCSI community has considered that issue, or where they/you stand on it.
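As a rough illustration of the two software approaches mentioned above, the sketch below computes the IEEE 802 CRC-32 both bit by bit and with a 256-entry lookup table. It assumes the usual bit-reflected form of the generator (0xEDB88320) with an all-ones initial value and final inversion, and it is not taken from Feldmeier's paper; the function names are hypothetical. No timing figures are claimed here, since relative cost depends heavily on the machine and the language.

```python
import zlib

POLY = 0xEDB88320  # IEEE 802.3 CRC-32 generator, bit-reflected form


def crc32_bitwise(data: bytes) -> int:
    """One shift and conditional xor per input bit."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def make_table() -> list:
    """Precompute the CRC contribution of every possible byte value."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ POLY if c & 1 else c >> 1
        table.append(c)
    return table


TABLE = make_table()


def crc32_table(data: bytes) -> int:
    """One table lookup, one shift, and one xor per input byte."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF


message = b"123456789"
print(hex(crc32_bitwise(message)))   # 0xcbf43926
print(hex(crc32_table(message)))     # 0xcbf43926
print(hex(zlib.crc32(message)))      # 0xcbf43926
```

Both routines agree with `zlib.crc32` from the Python standard library, which is a convenient cross-check. The table version trades 1 KB of memory (256 32-bit entries) for eight fewer loop iterations per byte, but it still does a dependent memory lookup per byte, which a plain ones-complement sum does not.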