TCP checksum escapes and iSCSI error recovery design
All,
In designing the iSCSI error recovery mechanisms there has been considerable
focus on the general deficiency of the TCP checksum. It seems that what iSCSI
error recovery should really be based upon is the overall profile of TCP
checksum escapes for the networks that will carry iSCSI traffic ("checksum
escapes" being defined as those cases where corrupted data is passed through
TCP and delivered to the upper layer protocol as good). The design of iSCSI
error recovery needs to start with a clear and agreed upon set of
assumptions regarding the profile for TCP checksum escapes across space and
time. Thus, the question is not simply "How often is corrupted data passed
through TCP and delivered to iSCSI", but more along the lines of "What is
the distribution of TCP checksum escapes across network paths and through
time?"
iSCSI error recovery should be neither over-designed nor under-designed, and
its shape will be fundamentally different depending on whether checksum
escapes (or bursts of them) are assumed to occur every 5 minutes, every 5
hours, or every 5 days, and on whether a few paths are assumed to be bad or
all paths are assumed to be bad some of the time.
It would make sense for iSCSI to have minimal error recovery if the spatial
profile (discussed below) of TCP errors were bimodal (i.e. there are good
paths and bad paths). IOW, if checksum escapes happen deterministically,
either always or not at all, iSCSI can simply drop the connection on a
digest failure, because a single failure indicates that the path has
switched into error mode. OTOH, we cannot adopt this simplistic approach if
TCP can operate in the shades of gray between the two modes. The attached
email discussion indicates that load levels and memory usage models along
the path play a key role in determining this aspect. Employing an
iSCSI-level digest/CRC appears to be the wise approach regardless of the TCP
operational model (bimodal or otherwise), to detect escapes when they
happen. IOW, the principle of "trust but verify" applies here. The question
is what action iSCSI should take when that verification detects data
corruption.
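To make the bimodal versus shades-of-gray distinction concrete, here is a
small illustrative sketch (Python; not part of any iSCSI draft) of how the
digest-failure policy might differ under each assumption. The class names,
thresholds, window length, and connection methods are all hypothetical.

    # Hypothetical sketch of two digest-failure policies. The connection
    # methods (drop, request_retransmit), thresholds, and window length are
    # illustrative assumptions, not taken from any iSCSI draft.
    import time
    from collections import deque

    class BimodalPolicy:
        """Assume paths are either clean or broken: any escape means a bad path."""
        def on_digest_failure(self, connection, pdu):
            connection.drop()   # reconnect, presumably over a different path

    class ShadesOfGrayPolicy:
        """Assume occasional escapes on otherwise good paths: retry first,
        and drop the connection only if failures cluster in time."""
        def __init__(self, max_failures=3, window_seconds=300):
            self.failures = deque()
            self.max_failures = max_failures
            self.window = window_seconds

        def on_digest_failure(self, connection, pdu):
            now = time.monotonic()
            self.failures.append(now)
            while self.failures and now - self.failures[0] > self.window:
                self.failures.popleft()
            if len(self.failures) >= self.max_failures:
                connection.drop()                    # burst of escapes: treat the path as bad
            else:
                connection.request_retransmit(pdu)   # isolated escape: recover in place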
To start this discussion toward a set of assumptions regarding TCP checksum
escape profiles and an appropriate iSCSI error recovery design, this email
includes responses from several individuals working in this area (Vern
Paxson, Craig Partridge, Jonathan Stone), along with links to their original
papers.
Regards,
Jim Wendt
Networked Storage Architecture
Network Storage Solutions Organization
Hewlett-Packard Company / Roseville, CA
Tel: +1 916 785-5198
Fax: +1 916 785-0391
Email: jim_wendt@hp.com
----------------------------------------------------------
Here is a copy of my original email to Vern Paxson, Craig Partridge,
Jonathan Stone, and Jeff Chase:
Hi All,
I'm e-mailing this to Vern, Craig, Jonathan, and Jeff in hopes of gathering
information around TCP checksum failure profiles and Internet data failure
characteristics in general. I'm Jim Wendt and I've recently started working
on both the iSCSI error recovery and RDMA/TCP-Framing activities. I'm
emailing you directly because of your specific research and experience
regarding TCP checksum failures and behavior.
My immediate interest is in TCP checksum failure rates (undetected errors)
and how iSCSI should handle these undetected errors. As
you probably already know, the iSCSI protocol is a mapping of SCSI device
protocols onto TCP (or SCTP eventually). I have read the three papers (When
the CRC and TCP Checksum Disagree, Performance of Checksums and CRCs over
Real Data - both versions, End-to-End Internet Packet Dynamics).
The iSCSI WG has come to the conclusion that the TCP checksum provides an
insufficient level of data protection, and that a CRC-32 will be used for
data integrity verification on each iSCSI PDU (the specific CRC polynomial
to employ is being studied now). Thus, the assumption is that if corrupted
payload data does pass through TCP and is undetected by the TCP checksum,
then the corruption will be detected at the iSCSI PDU level via the CRC-32
check.
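For reference, the check itself is straightforward. Here is a minimal sketch
in Python; zlib's CRC-32 (the IEEE 802.3 polynomial) is used only as a
stand-in, since the actual polynomial for iSCSI is still being selected.

    # Minimal sketch of a PDU-level data-digest check. zlib.crc32 is only a
    # stand-in for whichever CRC-32 polynomial the WG eventually adopts.
    import zlib

    def append_digest(pdu_payload: bytes) -> bytes:
        digest = zlib.crc32(pdu_payload) & 0xFFFFFFFF
        return pdu_payload + digest.to_bytes(4, "big")

    def verify_digest(pdu_with_digest: bytes) -> bool:
        payload, received = pdu_with_digest[:-4], pdu_with_digest[-4:]
        expected = (zlib.crc32(payload) & 0xFFFFFFFF).to_bytes(4, "big")
        return received == expected   # False => corruption that TCP passed as good

    # Payload corruption is caught at the iSCSI level even if TCP delivered
    # the bytes as good (i.e. a checksum escape).
    pdu = append_digest(b"example SCSI data segment")
    corrupted = bytearray(pdu)
    corrupted[5] ^= 0x01
    assert verify_digest(pdu) and not verify_digest(bytes(corrupted))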
What we are considering now is what level and philosophy of error handling
and recovery should be put into iSCSI. It seems to me that fundamentally different
error handling behaviors would be put into iSCSI based on a known rate or
profile of occurrence of bad PDUs (those PDUs that are passed up through TCP
as being OK, but for which the iSCSI level CRC-32 indicates a data error).
Thus, iSCSI error handling might be defined differently if the expected PDU
error rate is one every 5 minutes as opposed to one every 5 hours (or 5
days). Also, the error handling might be different if errors are bursty
(i.e. a sequence of bad PDUs) rather than evenly spread through time.
I would appreciate hearing your thoughts (and any supporting data references
I could employ) regarding the nature of TCP checksum failure profiles across
time and space (i.e. network paths). More specifically:
* How does the Internet work in the face of TCP error escapes? If there
truly is a relatively high level of application level data corruption
(perhaps 1 in 16M packets), then how does the Internet and associated
applications manage to function?
* What is the distribution of errors across network paths? Are most network
paths very good (low TCP error escapes) with only a few paths that are
really bad, or are TCP error escapes more evenly distributed across all
network paths? If the spatial profile is that there are good and bad paths,
then do the error escape rates on these two classes of paths correspond to
the high/low bounds for TCP error escapes (1 in 16 million / 1 in 10
billion)?
* What is the distribution of errors through time? Do TCP error escapes
occur individually and randomly through time, or are TCP error escapes more
bi-modal where most of the time there are no errors and occasionally there
is a clump or burst of TCP error escapes? If the temporal profile for TCP
error escapes is that there are good periods and infrequent but severe bad
periods, then what is the duty cycle for these periods (how bad/good for how
long), and what are the error escape rates during these periods?
* What about a stronger TCP checksum? I don't believe that anyone ever
actually employed RFC 1146 (TCP Alternate Checksum Options). Has there been any
recent thinking about actually improving TCP's end-to-end data integrity
checking? I suppose that the existing middle-box infrastructure won't allow
for this. However, I'm considering submitting a draft and starting to push
for a stronger TCP checksum or CRC, but I would like to get feedback from
all of you on the technical feasibility and possible acceptance of this
proposal before taking it to the public forums.
* What about a stronger SCTP checksum? SCTP currently uses the Adler-32
checksum. Perhaps an optional stronger CRC-32 should be defined for SCTP.
Has there been any thinking in this direction? Again, I am considering
pursuing this with the SCTP folks but would appreciate any feedback you have
to offer first.
Thanks,
Jim
Jim Wendt
Networked Storage Architecture
Network Storage Solutions Organization
Hewlett-Packard Company / Roseville, CA
Tel: +1 916 785-5198
Fax: +1 916 785-0391
Email: jim_wendt@hp.com
----------------------------------------------------------
Craig Partridge writes:
Hi Jim:
Here's my quick set of answers, which Vern, Jonathan and others can
refine.
>* How does the Internet work in the face of TCP error escapes? If there
>truly is a relatively high level of application level data corruption
>(perhaps 1 in 16M packets), then how does the Internet and associated
>applications manage to function?
The answer is that applications appear to function because people don't
notice the errors or resort to backups, or whatever. It isn't clear
exactly how we're managing to survive. (Back in the old days when NFS
had these problems, people used to assume a disk failure had occurred when
in fact the network had trashed their data).
>* What is the distribution of errors across network paths? Are most network
>paths very good (low TCP error escapes) with only a few paths that are
>really bad, or are TCP error escapes more evenly distributed across all
>network paths? If the spatial profile is that there are good and bad paths,
>then do the error escape rates on these two classes of paths correspond to
>the high/low bounds for TCP error escapes (1 in 16 million / 1 in 10
>billion)?
The answer is that there are good and bad paths. On good paths you'll
probably see fewer escapes than 1 in 10 billion -- you can probably treat
them as essentially error free. If you start seeing a modest number of
errors, then either (a) there's a broken router in your path or (b)
there's a broken end system. If we had a better way of identifying the
source of errors, I'd say that if you ever see an error from a host,
declare it broken.
>* What is the distribution of errors through time? Do TCP error escapes
>occur individually and randomly through time, or are TCP error escapes more
>bi-modal where most of the time there are no errors and occasionally there
>is a clump or burst of TCP error escapes? If the temporal profile for TCP
>error escapes is that there are good periods and infrequent but severe bad
>periods, then what is the duty cycle for these periods (how bad/good for how
>long), and what are the error escape rates during these periods?
They're not random but I don't think we know enough about the timing to say.
My hunch is that a lot are based on end-system load (high loads
on network cards tend to encourage certain classes of bus errors).
>* What about a stronger TCP checksum? I don't believe that anyone ever
>actually employed RFC 1146 (TCP Alternate Checksum Options). Has there been any
>recent thinking about actually improving TCP's end-to-end data integrity
>checking? I suppose that the existing middle-box infrastructure won't allow
>for this. However, I'm considering submitting a draft and starting to push
>for a stronger TCP checksum or CRC, but I would like to get feedback from
>all of you on the technical feasibility and possible acceptance of this
>proposal before taking it to the public forums.
That's a tricky question and not one I've thought enough about to have
a strong view -- except to say that it isn't all the checksum's fault --
Jonathan's done work showing that putting the checksum at the end
dramatically improves its efficacy in certain situations.
>* What about a stronger SCTP checksum? SCTP currently uses the Adler-32
>checksum. Perhaps an optional stronger CRC-32 should be defined for SCTP.
>Has there been any thinking in this direction? Again, I am considering
>pursuing this with the SCTP folks but would appreciate any feedback you have
>to offer first.
There's considerable reason to believe that Adler-32 is a lousy checksum --
it uses more bits (32 vs. 16) and will probably detect fewer errors than
either Fletcher-16 or the TCP checksum. If one is going to invest energy
in fixing a checksum, fixing SCTP's checksum is probably the first priority.
Craig
----------------------------------------------------------
Vern Paxson writes:
Just a couple of follow-on points:
- I suspect the Internet survives because the bulk of the
traffic isn't all that critical (Web pages, particularly
large items like images and perhaps video clips), so when
they're corrupted, nothing breaks in a major way.
- One point to consider is that if you use IPSEC, then you
get very strong protection from its integrity guarantees
- Vern
----------------------------------------------------------
Jonathan Stone writes:
In message <200104092135.f39LZVS21092@daffy.ee.lbl.gov>, Vern Paxson writes:
Jim,
Craig's answer is an excellent summary. Fixing the SCTP checksum is a
priority; I have a chapter in my thesis addressing some of the issues
there, and I believe Craig and I plan to write a paper.
Stronger TCP checksums are an interesting idea, but the lag on
deploying new TCP features seems to be 5 years or more.
If I re-did the study, I would try to log headers of ``good'' packets
(those where the checksum matches) as well as the entire packets with
checksum errors. With data only on the `bad' packets (a restriction that
was partly to meet privacy concerns, which were a big hurdle for us),
it's hard to give good answers to your questions about the time or path
dependency of bad checksums.
One thing we can say is that there appear to be certain `bad' hosts
which emit very high rates of packets with incorrect checksums. One
host we noticed in the Stanford trace sent 2 packets every second for
around three hours, totaling about 20% of the total packets in that
trace. We can't say whether that's a host problem or a path problem;
but either way, that error rate would worry me if I were using iSCSI
on that host (or path). That said, we did find what looks to be a
router with a bad memory bit, which is clearly path-dependent
(though hitting that particular memory word may be time- and load-dependent
as well).
Further, some of the bad checksums appear to be due to software
bugs in specific OS releases. As the Internet evolves (old OS revs
replaced by newer ones), the rate of those specific errors will
evolve over time.
Last, the DORM trace in our SIGCOMM 2000 paper is the only trace where
our monitoring point was directly adjacent to end-hosts, with no
intervening IP routers. On that Ethernet segment, I saw about 1 in
80,000 packets with a valid frame CRC but a bad IP checksum --
often a deletion of a 16-bit word from the IP address. I tried
to `repair' some IP headers (e.g., by inserting what seemed
to be a missing 16 bits of an IP source address).
After the repair, the IP checksum seemed to be correct.
That points to a problem occurring after the sending IP layer computes
its IP header checksum, but before the frame-level CRC is computed:
an error either in the NIC device driver, in its DMA
engine, or somewhere on the NIC itself (dropped from a FIFO?).
That led Craig and me to suggest that where possible, checksums be
computed early and verified late. That's a caution about a potential
downside of outboard checksumming: it covers less of the path to an
application buffer than is covered by software checksums. Software
checksums can catch errors which occur between the driver and the NIC
(bus errors, DMA, FIFO overruns, what-have-you); but outboard hardware
checksums do not. (IEN-45 mentions this issue, but the two-pass
scheme suggested there doubles bus utilization for a given I/O rate.)
Putting middle boxes in the path just exacerbates this.
Whether or not to use outboard checksumming is entirely up to
NIC designers. We merely raise the issue that it checks less of
the datapath than is covered by software checksums.
>Just a couple of follow-on points:
>
> - I suspect the Internet survives because the bulk of the
> traffic isn't all that critical (Web pages, particularly
> large items like images and perhaps video clips), so when
> they're corrupted, nothing breaks in a major way.
>
> - One point to consider is that if you use IPSEC, then you
> get very strong protection from its integrity guarantees
For a software IPSEC implementation.
A hardware implementation with outboard crypto hardware could
potentially fall foul of the same kinds of local-to-source-host
errors (DMA errors or whatever the ultimate cause is) which our data
indicates some NICs suffer from. If the data has already been
curdled by the time the encrypting accelerator sees it,
I don't see how IPsec actually enhances integrity.
----------------------------------------------------------
Vern Paxson writes:
> > - One point to consider is that if you use IPSEC, then you
> > get very strong protection from its integrity guarantees
>
> For a software IPSEC implementation.
Good point!
Vern
----------------------------------------------------------
Craig Partridge writes:
I'd like to clarify my statement to Jim about Adler-32 and be a bit clearer
about what the issues are. I've also added Mark Adler to the cc list, as
I'd promised Mark to get him results when we had them, and while the results
aren't quite cooked, if Jim is going to circulate a discussion, Mark
should see it first.
If you're designing a checksum, there are certain features you'd like it
to have. Here's a starting point for a formal definition. Given the set
V of all possible bit vectors and a checksum function C(), what we'd like is:
prob(C(v1) == C(v2)) is 1/2**(sizeof C())
that is, given any two distinct elements v1, v2 of V, the chance that
their checksums collide is the best possible, namely 1 over 2 raised to
the power of the bit width of the result of C().
Three sub points:
1. This is not quite the same as what the cryptographic checksum folks
want. They actually want it to be very hard [for some computational
approximation of hard], given C(v1), to find a v2 such that C(v1) == C(v2).
For network checksums, we don't care, as we're protecting against errors,
not attacks.
2. If we do not pick v1 and v2 at random, but according to some distribution
rule of likely packet sizes, packet contents, etc., we'd still like the
equation to be true. We don't want to be vulnerable to certain
traffic patterns, etc.
3. You can compare the effectiveness of checksums by how close they
come to this ideal -- that is, how effectively do they use their
range of values?
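One way to see how close a particular checksum comes to this ideal (point 3)
is to estimate the collision probability empirically. Below is a rough
sketch (not from Craig's message) comparing the 16-bit Internet checksum
against the ideal 1/2**16 for random inputs; note that random inputs
deliberately hide the structured-data weaknesses the cited papers measure.

    # Rough Monte Carlo estimate of prob(C(v1) == C(v2)) for random, distinct
    # inputs, compared with the ideal 1/2**16. Illustrative only.
    import os

    def internet_checksum(data: bytes) -> int:
        if len(data) % 2:
            data += b"\x00"
        s = sum(int.from_bytes(data[i:i+2], "big") for i in range(0, len(data), 2))
        while s >> 16:
            s = (s & 0xFFFF) + (s >> 16)   # fold carries (ones'-complement addition)
        return ~s & 0xFFFF

    trials, collisions = 200_000, 0
    for _ in range(trials):
        v1, v2 = os.urandom(64), os.urandom(64)
        if v1 != v2 and internet_checksum(v1) == internet_checksum(v2):
            collisions += 1
    print("observed:", collisions / trials, "ideal:", 1 / 2**16)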
OK, so let's talk about Adler-32. Adler-32 is a neat idea -- it seeks to
improve the performance of the Fletcher checksum by summing modulo a prime
number, rather than 255 or 256.
However, it sums bytes (8-bit quantities) into 16-bit fields. As a result,
the high bits of the 16-bit fields take some time to fill (they only get
filled by propagating carries from lower bits), and until the packet
is quite big (thousands of bytes) you don't get enough mixing in the high
bits. Two problems: (a) we're not fully using the 16-bit width, so for
smaller packets the chance of collision is much greater than 1/2**sizeof(C)
simply because some bits in the checksum are always (or with very
high probability) set to 0; and (b) it looks like (and Jonathan's still
working on this) the law of large numbers will cause the values to
cluster still further [I think of this behavior as the result of looking
at all the bits instead of just the low-order bits mod a prime: we're
vulnerable to the fact that the sums are not evenly distributed, by
the law of large numbers].
We're still working on whether the core idea behind Adler-32 (namely working
modulo a prime) is as powerful as it seems, but it is clear that to have
a hope of making it comparable to the TCP checksum or Fletcher, you have to
sum 16-bit quantities into 16-bit fields.
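The effect of the unused high bits can be seen directly with zlib's Adler-32
implementation. A small illustrative sketch (not from Craig's message),
using random payloads:

    # For an N-byte packet the low half of Adler-32 (s1 = 1 + byte sum, mod
    # 65521) can never exceed 1 + 255*N, and for random payloads it also
    # concentrates near its mean, so the high bits fill only slowly with N.
    # Illustrative only.
    import os, zlib

    def s1_half(data: bytes) -> int:
        return zlib.adler32(data) & 0xFFFF   # low 16-bit half of Adler-32 (s1)

    for length in (40, 128, 576):
        observed = [s1_half(os.urandom(length)) for _ in range(5000)]
        bound = min(1 + 255 * length, 65520)
        print(f"{length:4d} bytes: hard bound on s1 = {bound:5d}, "
              f"observed s1 range = {min(observed)}..{max(observed)}")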
Craig
----------------------------------------------------------
Jonathan writes:
In message <200104101505.f3AF5BZ23145@aland.bbn.com>, Craig Partridge writes:
>I'd like to clarify my statement to Jim about Adler-32 and be a bit clearer
>about what the issues are. I've also added Mark Adler to the cc list, as
>I'd promised Mark to get him results when we had them, and while the results
>aren't quite cooked, if Jim is going to circulate a discussion, Mark
>should see it first.
>
>If you're designing a checksum, there are certain features you'd like it
>to have. Here's a starting point for a formal definition. Given the set
>V of all possible bit vectors and a checksum function C(), what we'd like is:
>
> prob(C(v1) == C(v2)) is 1/2**(sizeof C())
>
>that is, given any two distinct elements v1, v2 of V, the chance that
>their checksums collide is the best possible, namely 1 over 2 raised to
>the power of the bit width of the result of C().
Jim, Craig:
Just to be picky, as I'm working right now on definitions of some of
these issues for my thesis:
One can list other desirable properties, like wanting each bit of the
checksum field to have informational entropy 1/2. Craig's aggregate
definition falls out from that, with a few extra assumptions.
Also, less formally, one desires each bit of the input data to contribute
equally to flipping the final state of each output bit.
That is where Adler-32 runs into trouble when given short inputs.
>Three sub points:
>
> 1. This is not quite the same as what the cryptographic checksum folks
> want. They actually want it to be very hard [for some computational
> approximation of hard], given C(v1), to find a v2 such that C(v1) == C(v2).
> For network checksums, we don't care, as we're protecting against errors,
> not attacks.
There are two versions of formal cryptographic invertibility; the
other criterion is that it be computationally intractable to find *any*
v1 and v2 such that C(v1) = C(v2). Crypto folks would generally like
both.
> 2. If we do not pick v1 and v2 at random, but according to some distribution
> rule of likely packet sizes, packet contents, etc., we'd still like the
> equation to be true. We don't want to be vulnerable to certain
> traffic patterns, etc.
>
> 3. You can compare the effectiveness of checksums by how close they
> come to this ideal -- that is, how effectively do they use their
> range of values?
>OK, so let's talk about Adler-32. Adler-32 is a neat idea -- it seeks to
>improve the performance of the Fletcher checksum by summing modulo a prime
>number, rather than 255 or 256.
>
>However, it sums bytes (8-bit quantities) into 16-bit fields. As a result,
>the high bits of the 16-bit fields take some time to fill (they only get
>filled by propagating carries from lower bits), and until the packet
>is quite big (thousands of bytes) you don't get enough mixing in the high
>bits. Two problems: (a) we're not fully using the 16-bit width, so for
>smaller packets the chance of collision is much greater than 1/2**sizeof(C)
>simply because some bits in the checksum are always (or with very
>high probability) set to 0; and (b) it looks like (and Jonathan's still
>working on this) the law of large numbers will cause the values to
>cluster still further [I think of this behavior as the result of looking
>at all the bits instead of just the low-order bits mod a prime: we're
>vulnerable to the fact that the sums are not evenly distributed, by
>the law of large numbers].
To be fair, and to give the whole picture, there are a couple of
points here that should be expanded.
The first is that for large input data lengths, we can show that the
distribution of both 16-bit halves of the Adler-32 sum should actually
be well distributed. That holds true for addition mod M of any
repeated independent observations of a random variable. (A proof
appears in the appendix of the 1998 ToN paper you cited already,
although that version of the proof may not formally state
all the necessary assumptions about independent observations.)
However, for networking in general and SCTP in particular, there are
fairly modest hard upper bounds on the maximum input length.
SCTP forbids fragmentation. For the increasingly pervasive Ethernet
MTU of 1500 bytes, that means SCTP checksums have no more
than 1480 input bytes.
The data we have -- and I am still working on it -- say that's too
short to get good coverage of either of the two sums in SCTP's Adler-32: the
purely additive commutative sum, or the higher-order running sum of that
commutative sum (which is position-dependent).
Craig's description gives a good intuition for the first, commutative
sum. For the position-dependent sum (the running sum of the first sum),
another good intuition for the computational modeling I've done is to
imagine a hash-table where we hash, independently, all possible values
at all possible offsets in a 1480-byte SCTP packet. The intuition
isn't so much "law of large numbers" as that SCTP draws its per-byte
values from one particular corner of a two-dimensional space (the
hash-table vs. all possible bytes); so it ends up with an uneven
coverage of the space of all hash values.
>We're still working on whether the core idea behind Adler-32 (namely working
>modulo a prime) is as powerful as it seems, but it is clear that to have
>a hope of making it comparable to the TCP checksum or Fletcher, you have to
>sum 16-bit quantities into 16-bit fields.
Just so the reasoning is clear, another alternative is to sum 8-bit
inputs into 8-bit accumulators, modulo 251. Given 32 bits of
available checksum field, the 16-bit sums are preferable.
Again, this is for short inputs. For large inputs of several tens of
Kbytes, the Adler32 sum should do much better (at 64Kbytes it should
give very uniform coverage). At those input sizes, the comparison comes
down to Adler-32 having 15 pairs of values which are congruent mod
65521, whereas a Fletcher sum with 16-bit inputs would have only one,
versus better `stirring' from a prime modulus.
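The pair counts above fall out of simple arithmetic. A tiny sketch, assuming
16-bit input words for both sums (an assumption made here for illustration,
since Adler-32 itself takes 8-bit inputs):

    # Count input words that alias under each modulus: values v and v - M
    # contribute identically to a sum mod M, so every value >= M collides
    # with a smaller one. Assumes 16-bit input words for both sums.
    def congruent_pairs(modulus: int, input_bits: int = 16) -> int:
        return max(0, 2**input_bits - modulus)

    print("mod 65521 (prime, Adler-style):", congruent_pairs(65521))   # 15
    print("mod 65535 (Fletcher-16 style): ", congruent_pairs(65535))   # 1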
I don't know what the distribution of sizes of zlib-compressed files
is. If they are generally large, then our work may not be applicable
to Adler-32 in its originally designed purpose.
I also don't know the history of why SCTP chose the Adler-32 sum
rather than, say, CRC-32. The gossip I hear from IETF-going friends in
the Bay Area is that there was concern about the performance of
CRC-32; a direct bit-by-bit shift-and-add was seen as too slow. I
hear there was also a supposition that an efficient table-lookup
version would require very large tables (128 kbits?) and that
tables that large were prohibitive for PDAs and small handheld devices.
I am *not* suggesting that is an accurate report; it probably isn't.
But if there's any grain of truth in it, both four-bit and eight-bit
table-lookup algorithms for CRC32 exist. Table lookup size need not
be an issue. Perhaps we should draw the SCTP authors -- C. Sharp? --
into this discussion as well.
----------------------------------------------------------
Jonathan writes:
A small terminological correction here:
In message <200104101505.f3AF5BZ23145@aland.bbn.com>, Craig Partridge writes:
>
[...]
>
>so for
>smaller packets the chance of collision is much greater than 1/2**sizeof(C)
>simply because some bits in the checksum are always (or with very
>high probability) set to 0; and (b) it looks like (and Jonathan's still
>working on this) the law of large numbers will cause the values to
>cluster still further [I think of this behavior as the result of looking
>at all the bits instead of just the low-order bits mod a prime: we're
>vulnerable to the fact that the sums are not evenly distributed, by
>the law of large numbers].
I think it's actually a central limit theorem, not the law of large
numbers. For addition modulo M, the "central limit theorem" says that
the sum of many independent, identically distributed random variables
will tend to a uniform distribution, not a normal distribution
(as does the weighted sum).
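One way to see this effect (not from Jonathan's message) is to compute the
exact distribution of a modular sum by convolution; as the number of terms
grows, the residues flatten toward uniform:

    # Exact distribution of the sum of n i.i.d. uniform bytes reduced mod M,
    # computed by repeated convolution: as n grows, the max/min probability
    # ratio across residues approaches 1 (uniform). Illustrative only.
    M = 251

    def dist_of_sum_mod(n_terms: int):
        dist = [0.0] * M
        dist[0] = 1.0                    # the empty sum is 0 with probability 1
        for _ in range(n_terms):
            new = [0.0] * M
            for r, p in enumerate(dist):
                if p:
                    for b in range(256):
                        new[(r + b) % M] += p / 256
            dist = new
        return dist

    for n in (1, 2, 8):
        d = dist_of_sum_mod(n)
        print(f"{n} term(s): max/min residue probability = {max(d)/min(d):.3f}")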
----------------------------------------------------------
Jonathan writes:
Jim,
I forwarded my previous message to Chip Sharp. (Is the RFC 2960
address, chsharp@cisco.com, still valid?)
He may wish to comment on the SCTP issues as well.
----------------------------------------------------------
Links to papers:
End-to-End Internet Packet Dynamics, Vern Paxson
<http://citeseer.nj.nec.com/cache/papers2/cs/11598/ftp:zSzzSzftp.ee.lbl.govzSzpaperszSzvp-pkt-dyn-ton99.pdf/paxson97endtoend.pdf>
When the CRC and TCP Checksum Disagree, Jonathan Stone, Craig Partridge
<http://www.acm.org/sigcomm/sigcomm2000/conf/paper/sigcomm2000-9-1.pdf>
Performance of Checksums and CRCs over Real Data, Jonathan Stone, Michael
Greenwald, Craig Partridge, Jim Hughes
<http://citeseer.nj.nec.com/cache/papers2/cs/1909/ftp:zSzzSzftp.dsg.stanford.eduzSzpubzSzpaperszSzsplice-paper.pdf/performance-of-checksums-and.pdf>
----------------------------------------------------------