Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"

SNACK is here for two reasons - Status retry (which is cheap) and Data
retry as a side benefit.

CRC errors are not that rare (although we don't have real data, the
simulations with file systems seem to indicate that numbers could be as
high as 0.0002%). A restart of the link is expensive (slow start), and
even if the numbers are far lower, for many applications a slow start is
a painful event.

Removing them from the spec is not a path we should take lightly.

Julo

"Jon Hall" <jhall@emc.com> on 02/04/2001 16:13:35

Please respond to "Jon Hall" <jhall@emc.com>

To:   ips@ece.cmu.edu
cc:
Subject:  Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"

I agree with Somesh. And would go farther -- the complexity that results
from retaining enough target-side state to respond to a SACK/SNACK
request is non-trivial and needs clear justification.

Intuitively, a CRC that discovers an error in an iSCSI PDU header (that
the TCP cksum missed) seems like it should be a rare event. What is the
frequency of this event? IMO the answer to this question should be
written into the protocol spec -- assuming that it substantiates the
benefit of SACK/SNACK. Otherwise, the SACK/SNACK PDU should be removed.

-Jon

julian_satran@il.ibm.com writes:
>
>Somesh,
>
>As I stated earlier - the DataSN was created to detect missing data PDUs.
>SNACK is needed to recover missing StatusSN and missing dataSN is only a
>bonus if the target wants to support it. It is a trivial mechanism and I
>think it should stay.
>
>Julo
>
>"Somesh Gupta" <someshg@yahoo.com> on 31/03/2001 02:25:52
>
>Please respond to someshg@yahoo.com
>
>To:   Julian Satran/Haifa/IBM@IBMIL, ips@ece.cmu.edu
>cc:
>Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>
>Sorry to have been missing for a while. Hope you will
>appreciate my being back in action :-).
>It was a fairly clear consensus in Orlando that applications broke up
>their transfers into reasonably small chunks i.e. they
>did not have very long running transfers.
>
>Therefore the consensus was that a command level recovery
>mechanism was sufficient instead of an ack/sack for each
>data PDU.
>
>The SACK mechanism was a post Orlando invention. Without
>an ack mechanism (for every data PDU), the SACK mechanism
>just imposes additional burden on either end of the session,
>without really much benefit.
>
>The benefit of having SACK is of saving bandwidth in case
>the data part of the data PDU failed an integrity check
>(but passed TCP checksum). This is a rare enough case that
>as a percentage, the bandwidth loss from retransmitting
>all the data associated with a read or write command is
>very very small.
>
>In addition, it avoids the complexity of restarting
>something from the middle, as compared to from the beginning.
>
>To me it seems that there is significant simplicity (from
>implementation, reliability and recovery process) from
>having smaller data transfer per command.
>
>I would really like to get rid of the SACK command.
>
>Somesh
>
>> -----Original Message-----
>> From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
>> julian_satran@il.ibm.com
>> Sent: Wednesday, March 28, 2001 6:57 AM
>> To: ips@ece.cmu.edu
>> Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>>
>> Mallikarjun,
>>
>> Last summer I thought that recovery within a connection should be left to
>> TCP. It is simple and could be made available through IPsec (if no new
>> option of any form can be added).
>>
>> Two things killed this:
>>
>>    The requirement to have a data encapsulation that can pass through
>>    application proxies (like a storage router)
>>    The "NO WAY" message we got from IESG-Security on a CRC only IPSec
>>    header
>>
>> As for the ACK - I am very much in favor of it (it is a no brainer) and
>> implementations are in fact allowed to drop even unacked data.
>>
>> I am bound by the Orlando meeting decision to drop it. Except the regular
>> "oppose everything" crowd the two vocal opponents were Somesh Gupta and
>> Matt Wakeley.
>>
>> David may want or not to re-open the issue - I am not going to ask for it.
>>
>> Regards,
>> Julo
>>
>> "Mallikarjun C." <cbm@rose.hp.com> on 28/03/2001 00:45:02
>>
>> Please respond to cbm@rose.hp.com
>>
>> To:   Black_David@emc.com
>> cc:   Julian Satran/Haifa/IBM@IBMIL, cbm@rose.hp.com, someshg@yahoo.com,
>>       steph@cs.uchicago.edu, John Hufferd/San Jose/IBM@IBMUS,
>>       ldalleore@snapserver.com, venkat@rhapsodynetworks.com
>> Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>>
>> David and Julian,
>>
>> I appreciate both your views, and should I say that they're
>> along predicted lines :-)
>>
>> - David's right in saying that the situation is akin to FC's.
>>   However, I would like to point out that FC is an unreliable
>>   transport, and hence is forced to pick up a lot of the transport
>>   baggage (at least in FCP-2, as I understand), in addition
>>   to being a SCSI encapsulation layer. Unfortunately, even with
>>   TCP being the "reliable" transport, iSCSI is going along the
>>   same lines - ie. transport baggage + SCSI encapsulation. My
>>   point is - if this is indeed a necessary evil, why don't we
>>   complete iSCSI's transport functionality by data-ACKs?
>>
>> - If data SACK is introduced mostly to make up for TCP's shortcomings,
>>   we're making its usage (and implementation) drastically less appealing
>>   since the only way error recovery algorithms can *rely* on data SACK
>>   is when replay is supported (or, "ReplaySupport=yes" in my proposal),
>>   which is extremely expensive. IOW, we're defining data SACK in the
>>   draft and not providing any incentives to implement and use it!
>>
>> - I submit that since iSCSI is being hailed as the ideal SCSI Transport
>>   protocol in its definition so far (and I believe, rightly so - mandating
>>   command ordering, bi-di support, SCSI CRN support to name a few
>>   examples), the perfectly SCSI-legal R/W interactions that break in
>>   other transports *do not* have to break in iSCSI.
>>
>> - A last idea (may seem radical at this point) in regards to iSCSI
>>   being a "full transport". This provides us an opportunity to "cast
>>   off" the transport baggage in future when we truly move to a "reliable"
>>   transport (perhaps TCP with CRCs/SCTP ?) - if we do a good job of
>>   keeping the encapsulation stuff separate from the transport stuff.
>>   (Julian, I heard from Randy that ideas similar to this were explored
>>   in your Haifa meeting. And yes, he recalls they were given up since
>>   TCP was supposed to be reliable and granularity of recovery was deemed
>>   one I/O.)
>>
>> With that said, may I request David (with his co-chair hat on, :-))
>> to add some binding comments/observations on this discussion?
>>
>> If we decide to leave data SACKs as unattractive to implement, the draft
>> should in the least add a statement like - "Note that satisfying all
>> possible data SACK requests for a task with an unacknowledged status
>> implies implementing the I/O replay buffer on the part of targets."
>> --
>> Mallikarjun
>>
>> Mallikarjun Chadalapaka
>> Networked Storage Architecture
>> Network Storage Solutions Organization
>> MS 5668 Hewlett-Packard, Roseville.
>> cbm@rose.hp.com
>>
>> >I think Julian's basically right -- I would point
>> >out that any case of write after read that breaks
>> >over iSCSI will also break over Fibre Channel.
>> >On FC, the scenario starts with a frame CRC failure
>> >on read data at the Initiator, so applications
>> >have to cope and typically do so by enforcing
>> >ordering at the app rather than using SCSI task
>> >ordering.
>> >
>> >While SCSI has clever tools like ACA and task
>> >ordering that appear to allow dependent operations
>> >to be sent to the target concurrently, in practice
>> >they don't work and/or aren't used (funny thing,
>> >those two reinforce each other ;-) ). Hence
>> >a minimal approach to them is in order:
>> >- Make sure the result will interoperate.
>> >- Make sure T10 doesn't ding us for leaving something
>> >  completely out.
>> >- Don't specify anything not needed for the above.
>> >
>> >My 0.02,
>> >--David
>> >
>> >> -----Original Message-----
>> >> From: julian_satran@il.ibm.com [SMTP:julian_satran@il.ibm.com]
>> >> Sent: Tuesday, March 27, 2001 9:23 AM
>> >> To:   cbm@rose.hp.com
>> >> Cc:   someshg@yahoo.com; steph@cs.uchicago.edu; hufferd@us.ibm.com;
>> >>       cbm@rose.hp.com; ldalleore@snapserver.com; Venkat Rangan;
>> >>       Black_David@emc.com
>> >> Subject:  Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>> >>
>> >> Mallikarjun,
>> >>
>> >> I commiserate with you at the lack of ack for data but the Orlando
>> >> meeting stated - no. Recall that I kept the number only as a
>> >> mechanism to detect missing packets.
>> >>
>> >> You can achieve the effect you want by keeping around data for a while
>> >> (you determine how long and then discard).
>> >>
>> >> If a SACK comes and you can recover - fine. If not you either reaccess
>> >> the media (if you know how) or reject
>> >> and let the initiator retry.
>> >>
>> >> You should not worry about R/W conflicts as programs bound to have such
>> >> conflicts either:
>> >>
>> >> 1) can live with them or
>> >> 2) protect themselves through some locks and rely on
>> >>    "operation-end-status" to keep results deterministic.
>> >>
>> >> Regards,
>> >> Julo
>> >>
>> >> "Mallikarjun C." <cbm@rose.hp.com> on 27/03/2001 03:34:16
>> >>
>> >> Please respond to cbm@rose.hp.com
>> >>
>> >> To:   cbm@rose.hp.com, someshg@yahoo.com, steph@cs.uchicago.edu,
>> >>       Julian Satran/Haifa/IBM@IBMIL, John Hufferd/San Jose/IBM@IBMUS
>> >> cc:   Black_David@emc.com
>> >> Subject:  iSCSI ERT: data SACK/replay buffer/"semi-transport"
>> >>
>> >> Hi Error Recovery Team,
>> >>
>> >> iSCSI can discard PDUs because of digest errors and request
>> >> retransmissions using the iSCSI data SACK. To deal with such
>> >> an eventuality, targets that want to support data SACK have
>> >> the following options:
>> >>
>> >> (A) maintain a complete "replay" buffer for the entire I/O since
>> >>     a SACK could come anytime before the status is ack'ed by the
>> >>     initiator. [simple, but extremely expensive in memory resources]
>> >>
>> >> (B) (re-introduce data-ACKs into the draft, and) implement data-ACKs.
>> >>     This enables keeping only those I/O buffers that haven't been
>> >>     ack'ed by the initiator. IOW, become a real full transport!
>> >>     [everyone disliked it earlier...]
>> >>
>> >> (C) re-access the medium for data retransmission requests. Now there
>> >>     are 3 sub-cases in this to handle the changed data on the medium
>> >>     in a write-after-read scenario. (SEE NOTE.1 at the bottom on how
>> >>     it is legal.)
>> >>       (1) On seeing any write, stall till status is ack'ed for all the
>> >>           previous reads (basically drain the pipe). [simple, but
>> >>           incurs an additional roundtrip delay for all writes].
>> >>       (2) A variation of the above, keep an eye only on the prior
>> >>           overlapping reads.
>> >>           [more BW efficient, but complicated to
>> >>           resolve the block dependencies in a stream of reads
>> >>           followed by writes]
>> >>       (3) Document the caveat and leave it up to the applications
>> >>           to avoid this case since this leads to data integrity
>> >>           issues. [pushing to apps since the transport can't get it
>> >>           right!]
>> >>
>> >> My first preference is (B), followed by (A), and I suggest we not go
>> >> to (C) at all with its inherent dangers.
>> >>
>> >> Doing (B) naturally completes the transport job that iSCSI has taken
>> >> on itself in view of TCP's claimed unreliable checksum. That is the
>> >> right thing to do architecturally instead of being a "semi-transport"!
>> >>
>> >> Comments?
>> >> --
>> >> Mallikarjun
>> >>
>> >> Mallikarjun Chadalapaka
>> >> Networked Storage Architecture
>> >> Network Storage Solutions Organization
>> >> MS 5668 Hewlett-Packard, Roseville.
>> >> cbm@rose.hp.com
>> >>
>> >> ______________________________________________________________________
>> >> Note.1: A Read followed by a Write (to the same blocks) is perfectly
>> >>         legal if SCSI sets the ORDERED task attribute on both the
>> >>         commands AND sets the NACA bit to one to indicate that Write
>> >>         shall be executed only if the Read did not fail (result in a
>> >>         Check Condition).
>> >>
>> >>         In the current case, since Read completed just fine from
>> >>         SCSI's point of view, SCSI is moving on to execute Write.
>> >>         Those read buffers had been freed up since iSCSI received an
>> >>         ACK at the TCP level, and since iSCSI has no other way to
>> >>         have the data ack'ed!
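
Editorial sketch: the memory trade-off between options (A) and (B) in
the original message can be illustrated with a small model of a
target-side replay buffer. This is only a sketch under stated
assumptions, not anything taken from the draft: the class ReplayBuffer,
its method names, and the cumulative "ack everything below DataSN N"
semantics are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayBuffer:
    """Hypothetical per-task store of sent data PDUs, keyed by DataSN."""
    pdus: dict = field(default_factory=dict)  # DataSN -> payload bytes

    def sent(self, data_sn: int, payload: bytes) -> None:
        # Keep a copy of every data PDU the target sends.
        self.pdus[data_sn] = payload

    def data_ack(self, up_to: int) -> None:
        # Option (B), assumed cumulative: free every PDU with DataSN < up_to.
        for sn in [s for s in self.pdus if s < up_to]:
            del self.pdus[sn]

    def status_acked(self) -> None:
        # Once status is acknowledged, nothing can be requested again.
        self.pdus.clear()

    def replay(self, first: int, count: int):
        # Serve a retransmission request, or return None if any PDU is gone.
        wanted = range(first, first + count)
        if any(sn not in self.pdus for sn in wanted):
            return None
        return [self.pdus[sn] for sn in wanted]

    def held_bytes(self) -> int:
        return sum(len(p) for p in self.pdus.values())


# Same four 8 KB read PDUs sent under both options.
buf_a = ReplayBuffer()  # option (A): hold everything until status is acked
buf_b = ReplayBuffer()  # option (B): trim on data-ACKs
for sn in range(4):
    payload = bytes(8192)
    buf_a.sent(sn, payload)
    buf_b.sent(sn, payload)

buf_b.data_ack(3)  # initiator has acknowledged DataSN 0..2

print(buf_a.held_bytes())  # 32768
print(buf_b.held_bytes())  # 8192
```

In this model the option (A) buffer can always satisfy a retransmission
request but grows with the whole I/O, while the option (B) buffer holds
only unacknowledged PDUs and returns None for anything already
acknowledged. That None case corresponds to where a target would have to
fall back to re-reading the medium (option (C)) or rejecting the request.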