RE: SNACK and recovery

> My objection to the complexity inherent in StatSN/SNACK/SACK
> is in part motivated by the experience of those that run SCSI
> commands over FC on high speed links. In this FC context
> retained status on the target is simply not supported. But you
> (and many on the list) know this -- I'm not sure what to make
> of your points above, are we just agreeing on this fact?

And at the moment SNACK is not required by the iSCSI specification. Such a target can choose to continue not to retain status and hence reject all SNACKs (although the result may well be that the TCP connection closes). Whether we have too many error recovery options and mechanisms is a separate issue - as long as the SNACK mechanism is optional, targets that find it burdensome don't have to implement it.

> >- Does a 16-bit TCP checksum catch enough of
> >the corruption events to make it acceptable to
> >take drastic measures like aborting a backup
> >when a 32 bit CRC fails on a response that
> >made it through the 16 bit checksum?
>
> Is it correct to ignore link-level error correction?

No, if the link corrects the error, it's not a corruption event visible to an end system because it cannot cause a TCP retransmit or iSCSI error recovery of any sort.

> I tried to be clear in my opinion and its basis, but I don't
> claim specific tape experience. You have the paper by Stone
> and Partridge, could we agree to a number within the range
> that they set out. What do you like, say 1 in 5 billion
> packets have a TCP cksum failure?

This begins an interesting math adventure. Let me play devil's advocate here, and accept the 1 in 5 billion number (of failures undetected by the TCP checksum) to start, and assume that the CRC catches essentially 100% of those failures.
FWIW, the 1 in 5 billion rate translates to about twice a day for 1k packets in one direction of a saturated gigabit link, so if the 1 in 5 billion number is correct, the case for data CRCs is crystal clear (this is not related to Jon's point because there's a lot more data flowing than status).

> Now, what to say about tapes. Just naive conjecture on my
> part but here goes. Assume a 20 gig disk being backed up to
> tape over iSCSI; what xfer size do we like for the write CDBs?
> Would one Meg be OK for one write command? Would that then be
> 20480 responses covered by StatSNs to backup the 20 gig?

1 Meg seems way too large. Let's try 32K - this is 1/32nd of 1 Meg and makes everything happen 32 times as often.

> Assuming that each response is a distinct TCP segment and
> ignoring the fact that the corrupt data may not actually be
> in the iSCSI header part of the TCP segment. Then one backup
> would fail for every 244,140 attempts. Assuming that we do
> the backup every day, that means we must redo the backup (for
> this specific error case) once every 668 years (ignoring leap
> year days).

Divide by 32 and we get once every 21 years. Now, let's try to back up a terabyte - that's 50 times 20 Gig and the failure occurs once every 5 months - that's not good, but if 150 sites try to do this every night, on average there will be one failure a night. That's often enough to be a real problem.

If a terabyte a day seems excessive (it's not, but ...) let's try it once a week. Across 1000 sites, the average is again around a failure a day.

I'm not claiming that my numbers are any more realistic than Jon's. Does anyone on the list want to paint the "tape expert" target on their back and tell us where on this range the numbers that correspond to reality lie?

> Maybe, and are there are other rare errors to consider? Where
> is the line drawn (or how many pages of error recovery state
> diagrams are enough? :-)

Believe it or not, that's an open issue.
There are a number of folks toiling away off-line on figuring out just what it takes to fully describe error recovery based on the current state of things (or some modifications that allow the task to be completed in a reasonable amount of time). With luck, we'll be able to expose the result of their hard work to the group in the near future so that between the list and the Nashua meeting we can have an informed discussion about what is necessary in the way of error recovery.

Thanks,
--David

---------------------------------------------------
David L. Black, Senior Technologist
EMC Corporation, 42 South St., Hopkinton, MA 01748
+1 (508) 435-1000 x75140     FAX: +1 (508) 497-8500
black_david@emc.com       Mobile: +1 (978) 394-7754
---------------------------------------------------
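
For anyone who wants to rerun the arithmetic in this message, here is a minimal sketch. It takes the 1 in 5 billion undetected-error figure as given and adds assumptions that are not stated in the thread: a decimal gigabit (1e9 bits/s), 1k meaning 1024 bytes, binary gig/Meg/K units (which is what makes 20 gig / 1 Meg come out to 20480), one response per write command, and every packet being equally likely to be hit. The function names are made up for illustration.

```python
# Back-of-envelope check of the failure-rate numbers discussed in the message.
# All units and the uniform-error model are assumptions; see the lead-in.

UNDETECTED_RATE = 1 / 5e9  # errors that slip past the TCP checksum, per packet

KIB, MIB, GIB = 2**10, 2**20, 2**30


def link_failures_per_day(link_bits_per_s=1e9, packet_bytes=1024):
    """Expected undetected errors per day, one direction of a saturated link."""
    packets_per_day = link_bits_per_s / 8 / packet_bytes * 86400
    return packets_per_day * UNDETECTED_RATE


def days_between_failures(backup_bytes, write_bytes, backups_per_day=1.0, sites=1):
    """Mean days between corrupted responses, one response per write command."""
    responses_per_backup = backup_bytes / write_bytes
    failures_per_day = responses_per_backup * UNDETECTED_RATE * backups_per_day * sites
    return 1 / failures_per_day


terabyte = 50 * 20 * GIB  # "50 times 20 Gig", as in the message

print("gigabit link, failures/day:", link_failures_per_day())
print("20 gig, 1 Meg writes, years:", days_between_failures(20 * GIB, MIB) / 365)
print("20 gig, 32K writes, years:", days_between_failures(20 * GIB, 32 * KIB) / 365)
print("1 TB nightly, days:", days_between_failures(terabyte, 32 * KIB))
print("1 TB nightly, 150 sites, days:",
      days_between_failures(terabyte, 32 * KIB, sites=150))
print("1 TB weekly, 1000 sites, days:",
      days_between_failures(terabyte, 32 * KIB, backups_per_day=1 / 7, sites=1000))
```

Under these assumptions the results land close to the figures quoted above (about two a day on the link, roughly 668 and 21 years for the 20 gig cases, about five months for a nightly terabyte, and around one failure a day across the multi-site cases). Changing the per-packet rate or the write size scales every result linearly, which is why the choice of those two inputs dominates the argument.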
Last updated: Tue Sep 04 01:05:09 2001