Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"

To: ips@ece.cmu.edu
Subject: Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
From: "Jon Hall" <jhall@emc.com>
Date: Thu, 05 Apr 2001 09:55:11 -0400
Sender: owner-ips@ece.cmu.edu

Julian,

I don't understand.  Are you saying that an "expensive" target
will implement specific error recovery mechanisms for very rare
events?  Or are you saying that this case is not a rare event?

If the former, there is a problem of completeness (e.g., should
there be recovery procedures for when the sun goes nova :-).
If the latter, this would be very interesting and useful to
know about...

-Jon
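
For scale, here is a rough sketch of how far apart the two rates quoted further down this thread are. It treats both the 0.0002% simulation figure and the "one in tens of millions" figure as per-PDU fractions, which is an assumption, and the PDU rate is a made-up load rather than a measurement.

```python
# Rough comparison of the error rates mentioned in this thread.
# ASSUMED_PDUS_PER_SECOND is a hypothetical load, not measured data.

def mean_seconds_between_errors(error_fraction, pdus_per_second):
    """Mean time between affected PDUs, assuming independent errors."""
    return 1.0 / (error_fraction * pdus_per_second)

SIMULATED_FRACTION = 0.0002 / 100        # the 0.0002% simulation figure
ONE_IN_TEN_MILLION = 1.0 / 10_000_000    # low end of "tens of millions"
ASSUMED_PDUS_PER_SECOND = 10_000         # hypothetical

for label, fraction in (
    ("0.0002% of PDUs", SIMULATED_FRACTION),
    ("1 in 10 million", ONE_IN_TEN_MILLION),
):
    gap = mean_seconds_between_errors(fraction, ASSUMED_PDUS_PER_SECOND)
    print(f"{label}: one error every {gap:.0f} s on average")
```

Under these assumptions the two figures differ by a factor of 20 in how often recovery would be exercised, which is why the actual frequency matters to the argument.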

julian_satran@il.ibm.com writes:
>
>Jon,
>
>Inexpensive implementations are always free to do away with recovery. That
>is true for targets too.
>But by not specifying the mechanism for the more expensive ones, we make
>them non-interoperable.
>
>Julo
>
>"Jon Hall" <jhall@emc.com> on 04/04/2001 22:55:40
>
>Please respond to "Jon Hall" <jhall@emc.com>
>
>To:   ips@ece.cmu.edu
>cc:
>Subject:  Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>
>
>But CRC errors are not really the issue.  It is the
>singular case of a TCP cksum failing to detect what a
>CRC succeeds in detecting, and this occurring to a TCP
>segment containing an iSCSI hdr with a StatSN.
>
>Is there a reason to believe that iSCSI StatSNs will be
>lost at a higher rate than is currently documented for TCP
>cksum failure?  Or is the problem a loss of one TCP segment
>in tens (possibly hundreds) of millions of segments, where
>the bad segment may contain a StatSN but probably doesn't
>because it is a data pdu?  If the latter, why does a SCSI-level
>timeout and retry (on the initiator) not suffice?  [Note,
>an initiator timeout/retry does not require a connection
>to be closed.]
>
>I realize that I am being annoyingly repetitious, but it is
>not an idle question.  For some targets, retained rsp status
>is not cheap (and retained rsp data is not tractable at all).
>
>IMO there appears to be no real need for SNACK.  And, more
>radically, there appears to be no need for StatSNs.
>
>Maybe, as Somesh said, this is a dead horse but why include
>something in the spec which suggests a need for target-side
>complexity, while not solving a clear and compelling
>requirement?
>
>-Jon
>
>julian_satran@il.ibm.com writes:
>>
>>SNACK is here for two reasons - Status retry (which is cheap) and Data
>>retry as a side benefit.
>>CRC errors are not that rare (although we don't have real data, the
>>simulations with file systems seem to indicate that numbers could be as
>>high as 0.0002%). A restart of the link is expensive (slow start), and even
>>if the numbers are far lower, for many applications a slow start is a
>>painful event.
>>
>>Removing them from the spec is not a path we should take lightly.
>>
>>Julo
>>
>>"Jon Hall" <jhall@emc.com> on 02/04/2001 16:13:35
>>
>>Please respond to "Jon Hall" <jhall@emc.com>
>>
>>To:   ips@ece.cmu.edu
>>cc:
>>Subject:  Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>>
>>
>>
>>
>>
>>I agree with Somesh.  And would go farther -- the complexity
>>that results from retaining enough target-side state to respond
>>to a SACK/SNACK request is non-trivial and needs clear justification.
>>Intuitively, a CRC that discovers an error in an iSCSI pdu header
>>(that the TCP cksum missed) seems like it should be a rare event.
>>
>>What is the frequency of this event?  IMO the answer to this
>>question should be written into the protocol spec -- assuming
>>that it substantiates the benefit of SACK/SNACK.  Otherwise, the
>>SACK/SNACK pdu should be removed.
>>
>>-Jon
>>
>>julian_satran@il.ibm.com writes:
>>>
>>>Somesh,
>>>
>>>As I stated earlier - the DataSN was created to detect missing data PDUs.
>>>SNACK is needed to recover missing StatusSN and missing dataSN is only a
>>>bonus if the target wants to support it.  It is a trivial mechanism and I
>>>think it should stay.
>>>
>>>Julo
>>>
>>>"Somesh Gupta" <someshg@yahoo.com> on 31/03/2001 02:25:52
>>>
>>>Please respond to someshg@yahoo.com
>>>
>>>To:   Julian Satran/Haifa/IBM@IBMIL, ips@ece.cmu.edu
>>>cc:
>>>Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>>>
>>>
>>>
>>>
>>>Sorry to have been missing for a while. Hope you will
>>>appreciate my being back in action :-). It was a fairly
>>>clear consensus in Orlando that applications broke up
>>>their transfers into reasonably small chunks i.e. they
>>>did not have very long running transfers.
>>>
>>>Therefore the consensus was that a command level recovery
>>>mechanism was sufficient instead of an ack/sack for each
>>>data PDU.
>>>
>>>The SACK mechanism was a post Orlando invention. Without
>>>an ack mechanism (for every data PDU), the SACK mechanism
>>>just imposes additional burden on either end of the session,
>>>without really much benefit.
>>>
>>>The benefit of having SACK is of saving bandwidth in case
>>>the data part of the data PDU failed an integrity check
>>>(but passed TCP checksum). This is a rare enough case that
>>>as a percentage, the bandwidth loss from retransmitting
>>>all the data associated with a read or write command is
>>>very very small.
>>>
>>>In addition, it avoids the complexity of restarting
>>>something from the middle, as compared to from the beginning.
>>>
>>>To me it seems that there is significant simplicity (from
>>>implementation, reliability and recovery process) from
>>>having smaller data transfer per command.
>>>
>>>I would really like to get rid of the SACK command.
>>>
>>>Somesh
>>>
>>>> -----Original Message-----
>>>> From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
>>>> julian_satran@il.ibm.com
>>>> Sent: Wednesday, March 28, 2001 6:57 AM
>>>> To: ips@ece.cmu.edu
>>>> Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>>>>
>>>>
>>>>
>>>>
>>>> Mallikarjun,
>>>>
>>>> Last summer I thought that recovery within a connection should be left
>>>> to TCP. It is simple and could be made available through IPsec (if no
>>>> new option of any form can be added).
>>>>
>>>> Two things killed this:
>>>>
>>>>    - The requirement to have a data encapsulation that can pass through
>>>>      application proxies (like a storage router)
>>>>    - The "NO WAY" message we got from IESG-Security on a CRC-only IPsec
>>>>      header
>>>>
>>>>
>>>> As for the ACK - I am very much in favor of it (it is a no brainer) and
>>>> implementations are in fact allowed to drop even unacked data.
>>>>
>>>> I am bound by the Orlando meeting decision to drop it. Except for the
>>>> regular "oppose everything" crowd, the two vocal opponents were Somesh
>>>> Gupta and Matt Wakeley.
>>>>
>>>> David may or may not want to re-open the issue - I am not going to ask
>>>> for it.
>>>>
>>>> Regards,
>>>> Julo
>>>>
>>>> "Mallikarjun C." <cbm@rose.hp.com> on 28/03/2001 00:45:02
>>>>
>>>> Please respond to cbm@rose.hp.com
>>>>
>>>> To:   Black_David@emc.com
>>>> cc:   Julian Satran/Haifa/IBM@IBMIL, cbm@rose.hp.com, someshg@yahoo.com,
>>>>       steph@cs.uchicago.edu, John Hufferd/San Jose/IBM@IBMUS,
>>>>       ldalleore@snapserver.com, venkat@rhapsodynetworks.com
>>>> Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>>>>
>>>>
>>>>
>>>>
>>>> David and Julian,
>>>>
>>>> I appreciate both your views, and should I say that they're
>>>> along predicted lines :-)
>>>>
>>>> - David's right in saying that the situation is akin to FC's.
>>>>   However, I would like to point out that FC is an unreliable
>>>>   transport, and hence is forced to pick up a lot of the transport
>>>>   baggage (at least in FCP-2, as I understand), in addition
>>>>   to being a SCSI encapsulation layer.  Unfortunately, even with
>>>>   TCP being the "reliable" transport, iSCSI is going along the
>>>>   same lines - ie. transport baggage + SCSI encapsulation.  My
>>>>   point is - if this is indeed a necessary evil, why don't we
>>>>   complete iSCSI's transport functionality by data-ACKs?
>>>>
>>>> - If data SACK is introduced mostly to make up for TCP's shortcomings,
>>>>   we're making its usage (and implementation) drastically less
>>>>   appealing, since the only way error recovery algorithms can *rely* on
>>>>   data SACK is when replay is supported (or, "ReplaySupport=yes" in my
>>>>   proposal), which is extremely expensive.  IOW, we're defining data
>>>>   SACK in the draft and not providing any incentives to implement and
>>>>   use it!
>>>>
>>>> - I submit that since iSCSI is being hailed as the ideal SCSI Transport
>>>>   protocol in its definition so far (and I believe, rightly so -
>>>>   mandating command ordering, bi-di support, SCSI CRN support to name a
>>>>   few examples), the perfectly SCSI-legal R/W interactions that break in
>>>>   other transports *do not* have to break in iSCSI.
>>>>
>>>> - A last idea (may seem radical at this point) in regards to iSCSI
>>>>   being a "full transport". This provides us an opportunity to "cast
>>>>   off" the transport baggage in future when we truly move to a
>>>>   "reliable" transport (perhaps TCP with CRCs/SCTP ?) - if we do a good
>>>>   job of keeping the encapsulation stuff separate from the transport
>>>>   stuff.  (Julian, I heard from Randy that ideas similar to this were
>>>>   explored in your Haifa meeting.  And yes, he recalls they were given
>>>>   up since TCP was supposed to be reliable and granularity of recovery
>>>>   was deemed one I/O.)
>>>>
>>>> With that said, may I request David (with his co-chair hat on, :-))
>>>> to add some binding comments/observations on this discussion?
>>>>
>>>> If we decide to leave data SACKs as unattractive to implement, the
>>>> draft should at the least add a statement like - "Note that satisfying
>>>> all possible data SACK requests for a task with an unacknowledged status
>>>> implies implementing the I/O replay buffer on the part of targets."
>>>> --
>>>> Mallikarjun
>>>>
>>>>
>>>> Mallikarjun Chadalapaka
>>>> Networked Storage Architecture
>>>> Network Storage Solutions Organization
>>>> MS 5668   Hewlett-Packard, Roseville.
>>>> cbm@rose.hp.com
>>>>
>>>>
>>>>
>>>>
>>>> >I think Julian's basically right -- I would point
>>>> >out that any case of write after read that breaks
>>>> >over iSCSI will also break over Fibre Channel.
>>>> >On FC, the scenario starts with a frame CRC failure
>>>> >on read data at the Initiator, so applications
>>>> >have to cope and typically do so by enforcing
>>>> >ordering at the app rather than using SCSI task
>>>> >ordering.
>>>> >
>>>> >While SCSI has clever tools like ACA and task
>>>> >ordering that appear to allow dependent operations
>>>> >to be sent to the target concurrently, in practice
>>>> >they don't work and/or aren't used (funny thing,
>>>> >those two reinforce each other ;-) ).  Hence
>>>> >a minimal approach to them is in order:
>>>> >- Make sure the result will interoperate.
>>>> >- Make sure T10 doesn't ding us for leaving something
>>>> >    completely out.
>>>> >- Don't specify anything not needed for the above.
>>>> >
>>>> >My 0.02,
>>>> >--David
>>>> >
>>>> >> -----Original Message-----
>>>> >> From:  julian_satran@il.ibm.com [SMTP:julian_satran@il.ibm.com]
>>>> >> Sent:  Tuesday, March 27, 2001 9:23 AM
>>>> >> To:    cbm@rose.hp.com
>>>> >> Cc:    someshg@yahoo.com; steph@cs.uchicago.edu; hufferd@us.ibm.com;
>>>> >> cbm@rose.hp.com; ldalleore@snapserver.com; Venkat Rangan;
>>>> >> Black_David@emc.com
>>>> >> Subject:    Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>>>> >>
>>>> >>
>>>> >>
>>>> >> Mallikarjun,
>>>> >>
>>>> >> I commiserate with you at the lack of ack for data but the Orlando
>>>> >> meeting stated - no.  Recall that I kept the number only as a
>>>> >> mechanism to detect missing packets.
>>>> >>
>>>> >> You can achieve the effect you want by keeping around data for a
>>>> >> while (you determine how long and then discard).
>>>> >>
>>>> >> If a SACK comes and you can recover - fine. If not you either
>>>> >> reaccess the media (if you know how) or reject and let the initiator
>>>> >> retry.
>>>> >>
>>>> >> You should not worry about R/W conflicts, as programs bound to have
>>>> >> such conflicts either:
>>>> >>
>>>> >> 1) can live with them or
>>>> >> 2) protect themselves through some locks and rely on
>>>> >>    "operation-end-status" to keep results deterministic.
>>>> >>
>>>> >> Regards,
>>>> >> Julo
>>>> >>
>>>> >>
>>>> >>
>>>> >> "Mallikarjun C." <cbm@rose.hp.com> on 27/03/2001 03:34:16
>>>> >>
>>>> >> Please respond to cbm@rose.hp.com
>>>> >>
>>>> >> To:   cbm@rose.hp.com, someshg@yahoo.com, steph@cs.uchicago.edu,
>>>> >>       Julian Satran/Haifa/IBM@IBMIL, John Hufferd/San Jose/IBM@IBMUS
>>>> >> cc:   Black_David@emc.com
>>>> >> Subject:  iSCSI ERT: data SACK/replay buffer/"semi-transport"
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> Hi Error Recovery Team,
>>>> >>
>>>> >> iSCSI can discard PDUs because of digest errors and request
>>>> >> retransmissions using the iSCSI data SACK.  To deal with such
>>>> >> an eventuality, targets that want to support data SACK have
>>>> >> the following options:
>>>> >>
>>>> >> (A) maintain a complete "replay" buffer for the entire I/O since
>>>> >>   a SACK could come anytime before the status is ack'ed by the
>>>> >>   initiator. [ simple, but extremely expensive in memory resources]
>>>> >>
>>>> >> (B) (re-introduce data-ACKs into the draft, and) implement data-ACKs.
>>>> >>   This enables keeping only those I/O buffers that haven't been
>>>> >>   ack'ed by the initiator. IOW, become a real full transport!
>>>> >>   [everyone disliked it earlier...]
>>>> >>
>>>> >> (C) re-access the medium for data retransmission requests.  Now there
>>>> >>   are 3 sub-cases in this to handle the changed data on the medium in
>>>> >>   a write-after-read scenario.  (SEE NOTE.1 at the bottom on how it
>>>> >>   is legal.)
>>>> >>      (1) On seeing any write, stall till status is ack'ed for all the
>>>> >>          previous reads (basically drain the pipe). [simple, but
>>>> >>          incurs an additional roundtrip delay for all writes]
>>>> >>      (2) A variation of the above, keep an eye only on the prior
>>>> >>          overlapping reads. [more BW efficient, but complicated to
>>>> >>          resolve the block dependencies in a stream of reads followed
>>>> >>          by writes]
>>>> >>      (3) Document the caveat and leave it up to the applications
>>>> >>          to avoid this case since this leads to data integrity
>>>> >>          issues. [pushing to apps since the transport can't get it
>>>> >>          right!]
>>>> >>
>>>> >> My first preference is (B), followed by (A), and I suggest we not go
>>>> >> to (C) at all with its inherent dangers.
>>>> >>
>>>> >> Doing (B) naturally completes the transport job that iSCSI has taken
>>>> >> on itself in view of TCP's claimed unreliable checksum.  That is the
>>>> >> right thing to do architecturally instead of being a
>>>> >> "semi-transport"!
>>>> >>
>>>> >> Comments?
>>>> >> --
>>>> >> Mallikarjun
>>>> >>
>>>> >>
>>>> >> Mallikarjun Chadalapaka
>>>> >> Networked Storage Architecture
>>>> >> Network Storage Solutions Organization
>>>> >> MS 5668   Hewlett-Packard, Roseville.
>>>> >> cbm@rose.hp.com
>>>> >>
>>>> >>
>>>> >>
>>>> >> __________________________________________________________________
>>>> >>
>>>> >> Note.1: A Read followed by a Write (to the same blocks) is perfectly
>>>> >>         legal if SCSI sets the ORDERED task attribute on both the
>>>> >>         commands AND sets the NACA bit to one to indicate that Write
>>>> >>         shall be executed only if the Read did not fail (result in a
>>>> >>         Check Condition).
>>>> >>
>>>> >>         In the current case, since Read completed just fine from
>>>> >>         SCSI's point of view, SCSI is moving on to execute Write.
>>>> >>         Those read buffers had been freed up since iSCSI received an
>>>> >>         ACK at the TCP level, and since iSCSI has no other way to
>>>> >>         have the data ack'ed!
>>
>
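
Options (A) and (B) in the original ERT message quoted above differ mainly in when a target may release a data PDU it has already sent. The sketch below is only an illustration of that difference: the class and method names are hypothetical, it is not the iSCSI wire protocol, and sequence-number wraparound is ignored.

```python
# Hypothetical per-task replay buffer on a target; illustration only.

class ReplayBuffer:
    """Holds data PDUs that were sent and might be requested again."""

    def __init__(self):
        self._pdus = {}  # DataSN -> payload bytes

    def record_sent(self, data_sn, payload):
        """Remember a data PDU at the time it is sent."""
        self._pdus[data_sn] = payload

    def data_ack(self, acked_through):
        """Option (B): release every PDU up to and including acked_through."""
        for sn in [sn for sn in self._pdus if sn <= acked_through]:
            del self._pdus[sn]

    def status_acked(self):
        """Option (A): nothing is released until the status is ack'ed."""
        self._pdus.clear()

    def retransmit(self, data_sn):
        """Payload for a data SACK request, or None if already discarded."""
        return self._pdus.get(data_sn)

    def bytes_held(self):
        """Memory currently tied up by retained payloads."""
        return sum(len(p) for p in self._pdus.values())


buf = ReplayBuffer()
for sn in range(4):
    buf.record_sent(sn, b"x" * 8192)   # four 8 KB data PDUs

held_without_acks = buf.bytes_held()   # option (A): the whole I/O is retained
buf.data_ack(2)                        # option (B): initiator acks DataSN 0..2
held_with_acks = buf.bytes_held()
```

A `None` from `retransmit` corresponds to the case where the target can no longer replay the data and must either re-access the medium or reject the request and let the initiator retry.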