TCP checksum escapes and iSCSI error recovery design

All,

In designing the iSCSI error recovery mechanisms there has been considerable
focus on the general deficiency of the TCP checksum. It seems that what iSCSI
error recovery should really be based upon is the overall profile of TCP
checksum escapes for the networks that will carry iSCSI traffic ("checksum
escapes" being defined as those cases where corrupted data is passed through
TCP and delivered to the upper layer protocol as good).

The design of iSCSI error recovery needs to start with a clear and agreed-upon
set of assumptions regarding the profile of TCP checksum escapes across space
and time. Thus, the question is not simply "How often is corrupted data passed
through TCP and delivered to iSCSI?" but rather "What is the distribution of
TCP checksum escapes across network paths and through time?" iSCSI error
recovery should be neither over-designed nor under-designed, and it will be
fundamentally different depending on whether checksum escapes (or bursts of
them) are assumed to occur every 5 minutes, every 5 hours, or every 5 days,
and on whether a few paths are assumed to be bad or all paths are assumed to
be bad some of the time.

It would make sense for iSCSI to have minimal error recovery if the spatial
profile (discussed below) of TCP errors were bi-modal (i.e. there are good
paths and bad paths). IOW, if checksum escapes happen deterministically --
always or not at all -- iSCSI can simply drop the connection on a digest
failure, because the operation has switched into error mode. OTOH, we cannot
adopt this simplistic approach if TCP can be in the shades of gray between the
two modes. The attached email discussion indicates that load levels and memory
usage models along the path play a key role in determining this aspect.

Employing an iSCSI-level digest/CRC appears to be the wise approach regardless
of the TCP operational model (bimodal or otherwise), to detect escapes when
they happen. IOW, the principle of "trust but verify" is apt here. The
question is what the appropriate action is for iSCSI to take if the
verification detects data corruption.

To start this discussion toward a set of assumptions regarding TCP checksum
escape profiles and an appropriate iSCSI error recovery design, this email
includes responses from several individuals working in this area (Vern Paxson,
Craig Partridge, Jonathan Stone), along with links to their original papers.

Regards,

Jim Wendt
Networked Storage Architecture
Network Storage Solutions Organization
Hewlett-Packard Company / Roseville, CA
Tel: +1 916 785-5198
Fax: +1 916 785-0391
Email: jim_wendt@hp.com <mailto:jim_wendt@hp.com>

----------------------------------------------------------

Here is a copy of my original email to Vern Paxson, Craig Partridge, Jonathan
Stone, and Jeff Chase:

Hi All,

I'm e-mailing this to Vern, Craig, Jonathan, and Jeff in hopes of gathering
information about TCP checksum failure profiles and Internet data failure
characteristics in general. I'm Jim Wendt and I've recently started working on
both the iSCSI error recovery and RDMA/TCP-Framing activities. I'm emailing
you directly because of your specific research and experience regarding TCP
checksum failures and behavior. My immediate interest is in TCP checksum
failure rates (undetected errors) and how iSCSI should handle these undetected
errors.
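As a minimal sketch of the PDU-level digest check mentioned in the note above
(assuming Python; zlib.crc32 stands in for the still-undecided iSCSI CRC
polynomial, and the helper names are illustrative only):

    import struct
    import zlib

    # Attach a 32-bit digest to each PDU and re-check it at the iSCSI layer,
    # since the TCP checksum alone may let corrupted data through.
    # zlib.crc32 is only a placeholder for whatever polynomial iSCSI settles on.

    def attach_digest(pdu_payload: bytes) -> bytes:
        """Append a 4-byte CRC so the receiver can detect checksum escapes."""
        return pdu_payload + struct.pack("!I", zlib.crc32(pdu_payload) & 0xFFFFFFFF)

    def verify_digest(wire_bytes: bytes) -> bool:
        """Recompute the CRC over the payload and compare with the trailer."""
        payload, trailer = wire_bytes[:-4], wire_bytes[-4:]
        (expected,) = struct.unpack("!I", trailer)
        return (zlib.crc32(payload) & 0xFFFFFFFF) == expected

    wire = attach_digest(b"example SCSI data segment")
    assert verify_digest(wire)                    # clean delivery passes
    corrupted = wire[:5] + b"\x00" + wire[6:]     # a single-byte escape past TCP
    assert not verify_digest(corrupted)           # caught by the PDU-level CRC

The open question in this thread is what iSCSI should do when a check like
verify_digest() fails: drop the connection, or attempt finer-grained recovery.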
As you probably already know, the iSCSI protocol is a mapping of SCSI device
protocols onto TCP (or SCTP eventually). I have read the three papers (When
the CRC and TCP Checksum Disagree; Performance of Checksums and CRCs over Real
Data -- both versions; End-to-End Internet Packet Dynamics). The iSCSI WG has
come to the conclusion that the TCP checksum provides an insufficient level of
data protection, and that a CRC-32 will be used for data integrity
verification on each iSCSI PDU (the specific CRC polynomial to employ is being
studied now). Thus, the assumption is that if corrupted payload data does pass
through TCP undetected by the TCP checksum, then the corruption will be
detected at the iSCSI PDU level via the CRC-32 check.

What we are considering now is what level and philosophy of error handling and
recovery to put into iSCSI. It seems to me that fundamentally different error
handling behaviors would be put into iSCSI based on a known rate or profile of
occurrence of bad PDUs (those PDUs that are passed up through TCP as being OK,
but for which the iSCSI-level CRC-32 indicates a data error). Thus, iSCSI
error handling might be defined differently if the expected PDU error rate is
one every 5 minutes as opposed to one every 5 hours (or 5 days). Also, the
error handling might be different if errors are bursty (i.e. a sequence of bad
PDUs) rather than evenly spread through time.

I would appreciate hearing your thoughts (and any supporting data references I
could employ) regarding the nature of TCP checksum failure profiles across
time and space (i.e. network paths). More specifically:

* How does the Internet work in the face of TCP error escapes? If there truly
  is a relatively high level of application-level data corruption (perhaps 1
  in 16M packets), then how do the Internet and associated applications
  manage to function?

* What is the distribution of errors across network paths? Are most network
  paths very good (low TCP error escapes) with only a few paths that are
  really bad, or are TCP error escapes more evenly distributed across all
  network paths? If the spatial profile is that there are good and bad paths,
  then do the error escape rates on these two classes of paths correspond to
  the high/low bounds for TCP error escapes (1 in 16 million / 1 in 10
  billion)?

* What is the distribution of errors through time? Do TCP error escapes occur
  individually and randomly through time, or are TCP error escapes more
  bi-modal, where most of the time there are no errors and occasionally there
  is a clump or burst of TCP error escapes? If the temporal profile for TCP
  error escapes is that there are good periods and infrequent but severe bad
  periods, then what is the duty cycle for these periods (how bad/good for
  how long), and what are the error escape rates during these periods?

* What about a stronger TCP checksum? I don't believe that anyone ever
  actually employed RFC 1146 (TCP Alternate Checksum). Has there been any
  recent thinking about actually improving TCP's end-to-end data integrity
  checking? I suppose that the existing middle-box infrastructure won't allow
  for this. However, I'm considering submitting a draft and starting to push
  for a stronger TCP checksum or CRC, but I would like to get feedback from
  all of you on the technical feasibility and possible acceptance of this
  proposal before taking it to the public forums.

* What about a stronger SCTP checksum? SCTP currently uses the Adler-32
  checksum. Perhaps an optional stronger CRC-32 should be defined for SCTP.
  Has there been any thinking in this direction? Again, I am considering
  pursuing this with the SCTP folks but would appreciate any feedback you
  have to offer first.

Thanks, Jim

Jim Wendt
Networked Storage Architecture
Network Storage Solutions Organization
Hewlett-Packard Company / Roseville, CA
Tel: +1 916 785-5198
Fax: +1 916 785-0391
Email: jim_wendt@hp.com <mailto:jim_wendt@hp.com>

----------------------------------------------------------

Craig Partridge writes:

Hi Jim:

Here's my quick set of answers, which Vern, Jonathan and others can refine.

> * How does the Internet work in the face of TCP error escapes? If there
>   truly is a relatively high level of application-level data corruption
>   (perhaps 1 in 16M packets), then how do the Internet and associated
>   applications manage to function?

The answer is that applications appear to function because people don't
notice the errors, or resort to backups, or whatever. It isn't clear exactly
how we're managing to survive. (Back in the old days when NFS had these
problems, people used to assume a disk failure had occurred when in fact the
network had trashed their data.)

> * What is the distribution of errors across network paths? Are most network
>   paths very good (low TCP error escapes) with only a few paths that are
>   really bad, or are TCP error escapes more evenly distributed across all
>   network paths? If the spatial profile is that there are good and bad
>   paths, then do the error escape rates on these two classes of paths
>   correspond to the high/low bounds for TCP error escapes (1 in 16 million
>   / 1 in 10 billion)?

The answer is that there are good and bad paths. On good paths you'll
probably see fewer escapes than 1 in 10 billion -- you can probably treat
them as essentially error free. If you start seeing a modest number of
errors, then either (a) there's a broken router in your path or (b) there's a
broken end system. If we had a better way of identifying the source of
errors, I'd say that if you ever see an error from a host, declare it broken.

> * What is the distribution of errors through time? Do TCP error escapes
>   occur individually and randomly through time, or are TCP error escapes
>   more bi-modal, where most of the time there are no errors and
>   occasionally there is a clump or burst of TCP error escapes? If the
>   temporal profile for TCP error escapes is that there are good periods and
>   infrequent but severe bad periods, then what is the duty cycle for these
>   periods (how bad/good for how long), and what are the error escape rates
>   during these periods?

They're not random, but I don't think we know enough about the timing to say.
My hunch is that a lot are based on end-system load (high loads on network
cards tend to encourage certain classes of bus errors).

> * What about a stronger TCP checksum? I don't believe that anyone ever
>   actually employed RFC 1146 (TCP Alternate Checksum). Has there been any
>   recent thinking about actually improving TCP's end-to-end data integrity
>   checking? I suppose that the existing middle-box infrastructure won't
>   allow for this. However, I'm considering submitting a draft and starting
>   to push for a stronger TCP checksum or CRC, but I would like to get
>   feedback from all of you on the technical feasibility and possible
>   acceptance of this proposal before taking it to the public forums.
That's a tricky question and not one I've thought enough about to have a
strong view -- except to say that it isn't all the checksum's fault --
Jonathan's done work showing that putting the checksum at the end
dramatically improves its efficacy in certain situations.

> * What about a stronger SCTP checksum? SCTP currently uses the Adler-32
>   checksum. Perhaps an optional stronger CRC-32 should be defined for SCTP.
>   Has there been any thinking in this direction? Again, I am considering
>   pursuing this with the SCTP folks but would appreciate any feedback you
>   have to offer first.

There's considerable reason to believe that Adler-32 is a lousy checksum --
it uses more bits (32 vs. 16) and will probably detect fewer errors than
either Fletcher-16 or the TCP checksum. If one is going to invest energy in
fixing a checksum, fixing SCTP's checksum is probably the first priority.

Craig

----------------------------------------------------------

Vern Paxson writes:

Just a couple of follow-on points:

- I suspect the Internet survives because the bulk of the traffic isn't all
  that critical (Web pages, particularly large items like images and perhaps
  video clips), so when they're corrupted, nothing breaks in a major way.

- One point to consider is that if you use IPSEC, then you get very strong
  protection from its integrity guarantees.

- Vern

----------------------------------------------------------

Jonathan Stone writes:

In message <200104092135.f39LZVS21092@daffy.ee.lbl.gov>, Vern Paxson writes:

Jim,

Craig's answer is an excellent summary. Fixing the SCTP checksum is a
priority; I have a chapter in my thesis addressing some of the issues there,
and I believe Craig and I plan to write a paper. Stronger TCP checksums are
an interesting idea, but the lag on deploying new TCP features seems to be 5
years or more.

If I re-did the study, I would try to log headers of ``good'' packets (those
where the checksum matches) as well as the entire packets with checksum
errors. With data only on the `bad' packets (which is partly to meet privacy
concerns, which were a big hurdle for us), it's hard to give good answers to
your questions about the time or path dependency of bad checksums.

One thing we can say is that there appear to be certain `bad' hosts which
emit very high rates of packets with incorrect checksums. One host we noticed
in the Stanford trace sent 2 packets every second for around three hours,
totalling about 20% of the total packets in that trace. We can't say whether
that's a host problem or a path problem; but either way that error rate would
worry me if I were using iSCSI on that host (or path). That said, we did find
what looks to be a router with a bad memory bit, which is clearly
path-dependent (though hitting that particular memory word may be time- and
load-dependent as well). Further, some of the bad checksums appear to be due
to software bugs in specific OS releases. As the Internet evolves (old OS
revs replaced by newer ones), the rate of those specific errors will evolve
over time.

Last, the DORM trace in our SIGCOMM 2000 paper is the only trace where our
monitoring point was directly adjacent to end-hosts, with no intervening IP
routers. On that Ethernet segment, I saw about 1 in 80,000 packets with a
valid frame CRC but a bad IP checksum -- often a deletion of a 16-bit word
from the IP address. I tried to `repair' some IP headers (e.g., by inserting
what seemed to be a missing 16 bits of an IP source address). After the
repair, the IP checksum seemed to be correct.
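As a rough illustration of that kind of repair check (a sketch assuming
Python; the header values below are made up, and this is a plain RFC 1071
ones'-complement sum, not a full IP implementation):

    import struct

    def inet_checksum(data: bytes) -> int:
        """RFC 1071 ones'-complement sum over 16-bit words."""
        if len(data) % 2:
            data += b"\x00"
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]
            total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
        return (~total) & 0xFFFF

    # An illustrative 20-byte IPv4-style header (not a real captured packet).
    hdr = bytearray(struct.pack("!BBHHHBBH4s4s",
                                0x45, 0, 40, 0x1234, 0, 64, 6, 0,
                                bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])))
    struct.pack_into("!H", hdr, 10, inet_checksum(bytes(hdr)))  # checksum at offset 10
    assert inet_checksum(bytes(hdr)) == 0          # a correct header verifies to zero

    # Simulate the observed damage: a 16-bit word of the source address vanishes.
    damaged = hdr[:12] + hdr[14:]
    assert inet_checksum(bytes(damaged)) != 0      # receiver sees a bad IP checksum

    repaired = damaged[:12] + hdr[12:14] + damaged[12:]  # re-insert the missing word
    assert inet_checksum(bytes(repaired)) == 0     # ...and the checksum verifies again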
That points to a problem occurring after the sending IP layer computes its IP
header checksum, but before the frame-level CRC is computed. That suggests an
error either in the NIC device driver, in its DMA engine, or somewhere on the
NIC itself (dropped from a FIFO?). That led Craig and me to suggest that,
where possible, checksums be computed early and verified late.

That's a caution about a potential downside of outboard checksumming: it
covers less of the path to an application buffer than is covered by software
checksums. Software checksums can catch errors which occur between the driver
and the NIC (bus errors, DMA, FIFO overruns, what-have-you); outboard
hardware checksums do not. (IEN-45 mentions this issue, but the two-pass
scheme suggested there doubles bus utilization for a given I/O rate.) Putting
middle boxes in the path just exacerbates this. Whether or not to use
outboard checksumming is entirely up to NIC designers. We merely raise the
issue that it checks less of the datapath than is covered by software
checksums.

> Just a couple of follow-on points:
>
> - I suspect the Internet survives because the bulk of the traffic isn't all
>   that critical (Web pages, particularly large items like images and
>   perhaps video clips), so when they're corrupted, nothing breaks in a
>   major way.
>
> - One point to consider is that if you use IPSEC, then you get very strong
>   protection from its integrity guarantees.

For a software IPSEC implementation. A hardware implementation with outboard
crypto hardware could potentially fall foul of the same kinds of
local-to-source-host errors (DMA errors or whatever the ultimate cause is)
which our data indicates some NICs suffer from. If the data has already been
curdled by the time the encrypting accelerator sees it, I don't see how IPsec
actually enhances integrity.

----------------------------------------------------------

Vern Paxson writes:

> > - One point to consider is that if you use IPSEC, then you get very
> >   strong protection from its integrity guarantees.
>
> For a software IPSEC implementation.

Good point!

Vern

----------------------------------------------------------

Craig Partridge writes:

I'd like to clarify my statement to Jim about Adler-32, to be a bit more
clear about what the issues are. I've also added Mark Adler to the cc list,
as I'd promised Mark to get him results when we had them, and while the
results aren't quite cooked, if Jim is going to circulate a discussion, Mark
should see it first.

If you're designing a checksum, there are certain features you'd like it to
have. Here's a starting point of a formal definition. Given the set V of all
possible bit vectors and a checksum function C(), what we'd like is:

    prob(C(v1) == C(v2)) is 1/2**(sizeof C())

that is, given any two different elements v1, v2 of V, the chance that their
checksums will collide is the best possible, namely 1 over 2 raised to the
power of the bitwidth of the result of C().

Three sub points:

 1. This is not quite the same as what the cryptographic checksum folks want.
    They actually want it to be very hard [for some computational
    approximation of hard], given C(v1), to find a v2 such that
    C(v1) == C(v2). For network checksums, we don't care, as we're protecting
    from errors, not attacks.

 2. If we do not pick v1 and v2 at random, but according to some distribution
    rule of likely packet sizes, packet contents, etc., we'd still like the
    equation to be true. We don't want to be vulnerable to certain traffic
    patterns, etc.

 3.
    You can compare the effectiveness of checksums by how close they come to
    this ideal -- that is, how effectively do they use their range of values?

OK, so let's talk about Adler-32. Adler-32 is a neat idea -- it seeks to
improve the performance of the Fletcher checksum by summing modulo a prime
number, rather than 255 or 256.

However, it sums bytes (8-bit quantities) into 16-bit fields. As a result,
the high bits of the 16-bit fields take some time to fill (they only get
filled by propagating carries from lower bits), and until the packet is quite
big (thousands of bytes) you don't get enough mixing in the high bits. Two
problems: (a) we're not fully using the 16-bit width, so for smaller packets
the chance of collision is much greater than 1/2**sizeof(C), simply because
some bits in the checksum are always (or with very high probability) set to
0; and (b) it looks (and Jonathan's still working on this) as though the law
of large numbers will cause the values to cluster still further. [I think of
this behavior as the result of looking at all the bits instead of just the
low-order bits mod a prime; we're vulnerable to the fact that the sums are
not evenly distributed, by the law of large numbers.]

We're still working on whether the core idea behind Adler-32 (namely working
modulo a prime) is as powerful as it seems, but it is clear that to have a
hope of making it comparable to the TCP checksum or Fletcher, you have to sum
16-bit quantities into 16-bit fields.

Craig

----------------------------------------------------------

Jonathan writes:

In message <200104101505.f3AF5BZ23145@aland.bbn.com>, Craig Partridge writes:

> I'd like to clarify my statement to Jim about Adler-32, to be a bit more
> clear about what the issues are. I've also added Mark Adler to the cc list,
> as I'd promised Mark to get him results when we had them, and while the
> results aren't quite cooked, if Jim is going to circulate a discussion,
> Mark should see it first.
>
> If you're designing a checksum, there are certain features you'd like it to
> have. Here's a starting point of a formal definition. Given the set V of
> all possible bit vectors and a checksum function C(), what we'd like is:
>
>     prob(C(v1) == C(v2)) is 1/2**(sizeof C())
>
> that is, given any two different elements v1, v2 of V, the chance that
> their checksums will collide is the best possible, namely 1 over 2 raised
> to the power of the bitwidth of the result of C().

Jim, Craig:

Just to be picky, as I'm working right now on definitions of some of these
issues for my thesis: one can list other desirable properties, like wanting
each bit of the checksum field to have informational entropy 1/2. Craig's
aggregate definition falls out from that, with a few extra assumptions. Also,
less formally, one can desire each bit of the input data to contribute
equally to flipping the final state of each output bit. That is where
Adler-32 runs into trouble when given short inputs.

> Three sub points:
>
>  1. This is not quite the same as what the cryptographic checksum folks
>     want. They actually want it to be very hard [for some computational
>     approximation of hard], given C(v1), to find a v2 such that
>     C(v1) == C(v2). For network checksums, we don't care, as we're
>     protecting from errors, not attacks.

There are two versions of formal cryptographic invertibility; the other
criterion is that it be computationally intractable to find *any* v1 and v2
such that C(v1) = C(v2). Crypto folks would generally like both.

>  2.
>     If we do not pick v1 and v2 at random, but according to some
>     distribution rule of likely packet sizes, packet contents, etc., we'd
>     still like the equation to be true. We don't want to be vulnerable to
>     certain traffic patterns, etc.
>
>  3. You can compare the effectiveness of checksums by how close they come
>     to this ideal -- that is, how effectively do they use their range of
>     values?
>
> OK, so let's talk about Adler-32. Adler-32 is a neat idea -- it seeks to
> improve the performance of the Fletcher checksum by summing modulo a prime
> number, rather than 255 or 256.
>
> However, it sums bytes (8-bit quantities) into 16-bit fields. As a result,
> the high bits of the 16-bit fields take some time to fill (they only get
> filled by propagating carries from lower bits), and until the packet is
> quite big (thousands of bytes) you don't get enough mixing in the high
> bits. Two problems: (a) we're not fully using the 16-bit width, so for
> smaller packets the chance of collision is much greater than
> 1/2**sizeof(C), simply because some bits in the checksum are always (or
> with very high probability) set to 0; and (b) it looks (and Jonathan's
> still working on this) as though the law of large numbers will cause the
> values to cluster still further.

To be fair, and to give the whole picture, there are a couple of points here
that should be expanded.

The first is that for large input data lengths, we can show that both 16-bit
halves of the Adler-32 sum should actually be well distributed. That holds
true for addition mod M of any repeated independent observations of a random
variable. (A proof appears in the appendix of the 1998 ToN paper you cited
already, although that version of the proof may not formally state all the
necessary assumptions about independent observations.)

However, for networking in general and SCTP in particular, there are fairly
modest hard upper bounds on the maximum input length. SCTP forbids
fragmentation. For the increasingly pervasive Ethernet frame length of 1500
bytes, that means SCTP checksums have no more than 1480 input bytes. The data
we have -- and I am still working on it -- say that's too short to get good
coverage of either of the two sums in SCTP: the purely additive commutative
sum, or the higher-order running sum of the commutative sum (which is
position-dependent).

Craig's description gives a good intuition for the first, commutative sum.
For the position-dependent sum (the running sum of the first sum), another
good intuition for the computational modelling I've done is to imagine a hash
table where we hash, independently, all possible values at all possible
offsets in a 1480-byte SCTP packet. The intuition isn't so much "law of large
numbers" as that SCTP draws its per-byte values from one particular corner of
a two-dimensional space (the hash table vs. all possible bytes), so it ends
up with an uneven coverage of the space of all hash values.

> We're still working on whether the core idea behind Adler-32 (namely
> working modulo a prime) is as powerful as it seems, but it is clear that to
> have a hope of making it comparable to the TCP checksum or Fletcher, you
> have to sum 16-bit quantities into 16-bit fields.

Just so the reasoning is clear, another alternative is to sum 8-bit inputs
into 8-bit accumulators, modulo 251.
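A quick way to see the short-input effect being described is to check how
much of the 16-bit range the low ("A") half of Adler-32 actually visits for
random payloads (a simulation sketch assuming Python's zlib.adler32; the
packet sizes and bucket count are arbitrary illustrative choices):

    import os
    import zlib

    def low_half(payload: bytes) -> int:
        return zlib.adler32(payload) & 0xFFFF      # the "A" byte sum, mod 65521

    def buckets_hit(values, nbuckets=64, space=65536):
        """How many of `nbuckets` equal slices of the 16-bit range ever occur."""
        return len({v * nbuckets // space for v in values})

    for n in (40, 128, 1480):                      # tiny, small, max Ethernet payload
        samples = [low_half(os.urandom(n)) for _ in range(20000)]
        print(f"n={n:4d}  min={min(samples):5d}  max={max(samples):5d}  "
              f"buckets hit: {buckets_hit(samples)}/64")

For an n-byte input the "A" half can never exceed 1 + 255*n, and for random
bytes it clusters near n*127.5, so short packets land in only a handful of
the 64 buckets; even 1480 bytes (the SCTP-over-Ethernet case above) leaves
much of the range untouched.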
Given 32 bits of available checksum field, the 16-bit sums are preferable.
Again, this is for short inputs. For large inputs of several tens of Kbytes,
the Adler-32 sum should do much better (at 64 Kbytes it should give very
uniform coverage). At those input sizes, the comparison comes down to
Adler-32 having 15 pairs of values which are congruent mod 65521, whereas a
Fletcher sum with 16-bit inputs would have only one, versus better `stirring'
from a prime modulus.

I don't know what the distribution of sizes of zlib-compressed files is. If
they are generally large, then our work may not be applicable to Adler-32 in
its originally designed purpose.

I also don't know the history of why SCTP chose the Adler-32 sum rather than,
say, CRC-32. The gossip I hear from IETF-going friends in the Bay Area is
that there was concern about the performance of CRC-32; a direct bit-by-bit
shift-and-add was seen as too slow. I hear there was also a supposition that
an efficient table-lookup version would require very large tables (128
kbits?) and that tables that large were prohibitive for PDAs and small
handheld devices. I am *not* suggesting that is an accurate report; it
probably isn't. But if there's any grain of truth in it, both four-bit and
eight-bit table-lookup algorithms for CRC-32 exist. Table lookup size need
not be an issue.

Perhaps we should draw the SCTP authors -- C. Sharp? -- into this discussion
as well.

----------------------------------------------------------

Jonathan writes:

A small terminological correction here:

In message <200104101505.f3AF5BZ23145@aland.bbn.com>, Craig Partridge writes:

> [...] so for smaller packets the chance of collision is much greater than
> 1/2**sizeof(C), simply because some bits in the checksum are always (or
> with very high probability) set to 0; and (b) it looks (and Jonathan's
> still working on this) as though the law of large numbers will cause the
> values to cluster still further. [I think of this behavior as the result
> of looking at all the bits instead of just the low-order bits mod a prime;
> we're vulnerable to the fact that the sums are not evenly distributed, by
> the law of large numbers.]

I think it's actually a central limit theorem, not the law of large numbers.
For addition modulo M, the "central limit theorem" says that summation of
many independent identically-distributed random variables will tend to a
uniform distribution, not a normal distribution (as does the weighted sum).

----------------------------------------------------------

Jonathan writes:

Jim,

I forwarded my previous message to Chip Sharp. (Is the rfc2960 address,
chsharp@cisco.com, still valid?) He may wish to comment on the SCTP issues as
well.

----------------------------------------------------------

Links to papers:

End-to-End Internet Packet Dynamics, Vern Paxson
<http://citeseer.nj.nec.com/cache/papers2/cs/11598/ftp:zSzzSzftp.ee.lbl.govzSzpaperszSzvp-pkt-dyn-ton99.pdf/paxson97endtoend.pdf>

When the CRC and TCP Checksum Disagree, Jonathan Stone, Craig Partridge
<http://www.acm.org/sigcomm/sigcomm2000/conf/paper/sigcomm2000-9-1.pdf>

Performance of Checksums and CRCs over Real Data, Jonathan Stone, Michael
Greenwald, Craig Partridge, Jim Hughes
<http://citeseer.nj.nec.com/cache/papers2/cs/1909/ftp:zSzzSzftp.dsg.stanford.eduzSzpubzSzpaperszSzsplice-paper.pdf/performance-of-checksums-and.pdf>

----------------------------------------------------------
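On the table-lookup CRC-32 point above: a byte-at-a-time table is only 256
entries of 32 bits each (1 KiB), nowhere near 128 kbits. A minimal sketch
(assuming Python and the standard reflected IEEE 802.3 polynomial 0xEDB88320,
which is the polynomial zlib uses and not necessarily the one iSCSI or SCTP
would pick):

    import zlib

    def make_table():
        table = []
        for byte in range(256):
            crc = byte
            for _ in range(8):
                crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
            table.append(crc)
        return table                        # 256 * 4 bytes = 1 KiB of state

    TABLE = make_table()

    def crc32_table(data: bytes, crc: int = 0) -> int:
        crc ^= 0xFFFFFFFF
        for b in data:
            crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]
        return crc ^ 0xFFFFFFFF

    msg = b"123456789"
    assert crc32_table(msg) == zlib.crc32(msg) == 0xCBF43926   # standard check value

A four-bit variant needs only a 16-entry table at the cost of two lookups per
byte, so even very small devices can avoid bit-by-bit shift-and-add.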