Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"

To: ips@ece.cmu.edu
Subject: Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
From: julian_satran@il.ibm.com
Date: Wed, 4 Apr 2001 16:31:30 +0200
Sender: owner-ips@ece.cmu.edu


SNACK is here for two reasons - Status retry (which is cheap) and Data
retry as a side benefit.
CRC errors are not that rare (although we don't have real data, the
simulations with file systems seem to indicate that the rate could be as
high as 0.0002%). A restart of the link is expensive (slow start), and even
if the error rates are far lower, for many applications a slow start is a
painful event.

Removing them from the spec is not a path we should take lightly.

Julo

"Jon Hall" <jhall@emc.com> on 02/04/2001 16:13:35

Please respond to "Jon Hall" <jhall@emc.com>

To:   ips@ece.cmu.edu
cc:
Subject:  Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"





I agree with Somesh.  And would go farther -- the complexity
that results from retaining enough target-side state to respond
to a SACK/SNACK request is non-trivial and needs clear justification.
Intuitively, a CRC that discovers an error in an iSCSI pdu header
(that the TCP cksum missed) seems like it should be a rare event.

What is the frequency of this event?  IMO the answer to this
question should be written into the protocol spec -- assuming
that it substantiates the benefit of SACK/SNACK.  Otherwise, the
SACK/SNACK pdu should be removed.

-Jon

julian_satran@il.ibm.com writes:
>
>Somesh,
>
>As I stated earlier - the DataSN was created to detect missing data PDUs.
>SNACK is needed to recover missing StatusSN and missing dataSN is only a
>bonus if the target wants to support it.  It is a trivial mechanism and I
>think it should stay.
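
The DataSN point above can be illustrated with a minimal Python sketch. It assumes that the data PDUs of one command are numbered 0 .. N-1 and that the receiver knows N; the function name and the (first, last) run format are hypothetical and are not taken from the draft.

```python
def missing_data_sns(received, expected_count):
    """Return inclusive (first, last) runs of DataSN values never seen.

    `received` is any iterable of DataSN values that arrived intact.
    `expected_count` is the number of data PDUs the command should carry,
    numbered 0 .. expected_count - 1 (an assumption of this sketch).
    """
    seen = set(received)
    runs = []
    start = None
    for sn in range(expected_count):
        if sn not in seen:
            if start is None:
                start = sn                    # open a new run of missing numbers
        elif start is not None:
            runs.append((start, sn - 1))      # close the run just before sn
            start = None
    if start is not None:
        runs.append((start, expected_count - 1))
    return runs
```

Each returned run is what a receiver could ask for in one retransmission request; whether the target can honor it is a separate question, as the rest of the thread discusses.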
>
>Julo
>
>"Somesh Gupta" <someshg@yahoo.com> on 31/03/2001 02:25:52
>
>Please respond to someshg@yahoo.com
>
>To:   Julian Satran/Haifa/IBM@IBMIL, ips@ece.cmu.edu
>cc:
>Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>
>
>
>
>Sorry to have been missing for a while. Hope you will
>appreciate my being back in action :-). It was a fairly
>clear consensus in Orlando that applications broke up
>their transfers into reasonably small chunks i.e. they
>did not have very long running transfers.
>
>Therefore the consensus was that a command level recovery
>mechanism was sufficient instead of an ack/sack for each
>data PDU.
>
>The SACK mechanism was a post Orlando invention. Without
>an ack mechanism (for every data PDU), the SACK mechanism
>just imposes additional burden on either end of the session,
>without really much benefit.
>
>The benefit of having SACK is of saving bandwidth in case
>the data part of the data PDU failed an integrity check
>(but passed TCP checksum). This is a rare enough case that
>as a percentage, the bandwidth loss from retransmitting
>all the data associated with a read or write command is
>very very small.
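
The bandwidth argument above can be put into rough numbers. The sketch below is a first-order estimate only: it assumes independent per-PDU digest failures, equal-sized PDUs, and retries that succeed. The function name and parameters are hypothetical, and any error rate fed into it is a placeholder rather than a measured value.

```python
def retransmit_overhead(pdu_error_rate, pdus_per_command):
    """Expected extra traffic as a fraction of the useful traffic.

    Returns (command_level, pdu_level).
    Assumes independent per-PDU digest failures, equal-sized PDUs,
    and that a retry is itself error-free (first-order estimate).
    """
    # Probability that at least one PDU of the command fails its digest.
    p_command_fails = 1.0 - (1.0 - pdu_error_rate) ** pdus_per_command
    # Command-level recovery resends every PDU of a failed command, so the
    # expected extra fraction equals the command failure probability.
    command_level = p_command_fails
    # PDU-level recovery resends only the PDUs that actually failed.
    pdu_level = pdu_error_rate
    return command_level, pdu_level
```

Reading the 0.0002% figure mentioned elsewhere in this thread as a per-PDU probability (an assumption), `retransmit_overhead(2e-6, 64)` gives a command-level overhead of roughly 0.013% against 0.0002% for PDU-level retry: about 64 times larger, yet still a small fraction of the link.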
>
>In addition, it avoids the complexity of restarting
>something from the middle, as compared to from the beginning.
>
>To me it seems that there is significant simplicity (from
>implementation, reliability and recovery process) from
>having smaller data transfer per command.
>
>I would really like to get rid of the SACK command.
>
>Somesh
>
>> -----Original Message-----
>> From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
>> julian_satran@il.ibm.com
>> Sent: Wednesday, March 28, 2001 6:57 AM
>> To: ips@ece.cmu.edu
>> Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>>
>>
>>
>>
>> Mallikarjun,
>>
>> Last summer I thought that recovery within a connection should be left to
>> TCP. It is simple and could be made available through IPsec (if no new
>> option of any form can be added).
>>
>> Two things killed this:
>>
>>    The requirement to have a data encapsulation that can pass through
>>    application proxies (like a storage router)
>>    The "NO WAY" message we got from IESG-Security on a CRC only IPSec
>>    header
>>
>>
>> As for the ACK - I am very much in favor of it (it is a no brainer) and
>> implementations are in fact allowed to drop even unacked data.
>>
>> I am bound by the Orlando meeting decision to drop it. Apart from the
>> regular "oppose everything" crowd, the two vocal opponents were Somesh
>> Gupta and Matt Wakeley.
>>
>> David may or may not want to re-open the issue - I am not going to ask
>> for it.
>>
>> Regards,
>> Julo
>>
>> "Mallikarjun C." <cbm@rose.hp.com> on 28/03/2001 00:45:02
>>
>> Please respond to cbm@rose.hp.com
>>
>> To:   Black_David@emc.com
>> cc:   Julian Satran/Haifa/IBM@IBMIL, cbm@rose.hp.com, someshg@yahoo.com,
>>       steph@cs.uchicago.edu, John Hufferd/San Jose/IBM@IBMUS,
>>       ldalleore@snapserver.com, venkat@rhapsodynetworks.com
>> Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>>
>>
>>
>>
>> David and Julian,
>>
>> I appreciate both your views, and should I say that they're
>> along predicted lines :-)
>>
>> - David's right in saying that the situation is akin to FC's.
>>   However, I would like to point out that FC is an unreliable
>>   transport, and hence is forced to pick up a lot of the transport
>>   baggage (at least in FCP-2, as I understand), in addition
>>   to being a SCSI encapsulation layer.  Unfortunately, even with
>>   TCP being the "reliable" transport, iSCSI is going along the
>>   same lines - ie. transport baggage + SCSI encapsulation.  My
>>   point is - if this is indeed a necessary evil, why don't we
>>   complete iSCSI's transport functionality by data-ACKs?
>>
>> - If data SACK is introduced mostly to make up for TCP's shortcomings,
>>   we're making its usage (and implementation) drastically less appealing
>>   since the only way error recovery algorithms can *rely* on data SACK
>>   is when replay is supported (or, "ReplaySupport=yes"  in my proposal),
>>   which is extremely expensive.  IOW, we're defining data SACK in the
>>   draft and not providing any incentives to implement and use it!
>>
>> - I submit that since iSCSI is being hailed as the ideal SCSI Transport
>>   protocol in its definition so far (and I believe, rightly so -
>>   mandating command ordering, bi-di support, SCSI CRN support to name a
>>   few examples), the perfectly SCSI-legal R/W interactions that break in
>>   other transports *do not* have to break in iSCSI.
>>
>> - A last idea (may seem radical at this point) in regards to iSCSI
>>   being a "full transport". This provides us an opportunity to "cast
>>   off" the transport baggage in future when we truly move to a
>>   "reliable" transport (perhaps TCP with CRCs/SCTP?) - if we do a good
>>   job of keeping the encapsulation stuff separate from the transport
>>   stuff.
>>   (Julian, I heard from Randy that ideas similar to this were explored
>>   in your Haifa meeting.  And yes, he recalls they were given up since
>>   TCP was supposed to be reliable and granularity of recovery was deemed
>>   one I/O.)
>>
>> With that said, may I request David (with his co-chair hat on, :-))
>> to add some binding comments/observations on this discussion?
>>
>> If we decide to leave data SACKs as unattractive to implement, the draft
>> should at the least add a statement like - "Note that satisfying all
>> possible data SACK requests for a task with an unacknowledged status
>> implies implementing the I/O replay buffer on the part of targets."
>> --
>> Mallikarjun
>>
>>
>> Mallikarjun Chadalapaka
>> Networked Storage Architecture
>> Network Storage Solutions Organization
>> MS 5668   Hewlett-Packard, Roseville.
>> cbm@rose.hp.com
>>
>>
>>
>>
>> >I think Julian's basically right -- I would point
>> >out that any case of write after read that breaks
>> >over iSCSI will also break over Fibre Channel.
>> >On FC, the scenario starts with a frame CRC failure
>> >on read data at the Initiator, so applications
>> >have to cope and typically do so by enforcing
>> >ordering at the app rather than using SCSI task
>> >ordering.
>> >
>> >While SCSI has clever tools like ACA and task
>> >ordering that appear to allow dependent operations
>> >to be sent to the target concurrently, in practice
>> >they don't work and/or aren't used (funny thing,
>> >those two reinforce each other ;-) ).  Hence
>> >a minimal approach to them is in order:
>> >- Make sure the result will interoperate.
>> >- Make sure T10 doesn't ding us for leaving something
>> >    completely out.
>> >- Don't specify anything not needed for the above.
>> >
>> >My 0.02,
>> >--David
>> >
>> >> -----Original Message-----
>> >> From:  julian_satran@il.ibm.com [SMTP:julian_satran@il.ibm.com]
>> >> Sent:  Tuesday, March 27, 2001 9:23 AM
>> >> To:    cbm@rose.hp.com
>> >> Cc:    someshg@yahoo.com; steph@cs.uchicago.edu; hufferd@us.ibm.com;
>> >> cbm@rose.hp.com; ldalleore@snapserver.com; Venkat Rangan;
>> >> Black_David@emc.com
>> >> Subject:    Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>> >>
>> >>
>> >>
>> >> Mallikarjun,
>> >>
>> >> I commiserate with you at the lack of ack for data but the Orlando
>> >> meeting stated - no.  Recall that I kept the number only as a mechanism
>> >> to detect missing packets.
>> >>
>> >> You can achieve the effect you want by keeping around data for a while
>> >> (you determine how long and then discard).
>> >>
>> >> If a SACK comes and you can recover - fine. If not you either reaccess
>> >> the media (if you know how) or reject and let the initiator retry.
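
The "keep data around for a while" approach described above might look like the following Python sketch. All names (`ReplayCache`, `hold_seconds`, `reread`, the task tag key) are hypothetical; the hold time is left to the implementation, as the message suggests, and the optional `reread` callback stands in for re-accessing the media.

```python
import time


class ReplayCache:
    """Keeps sent data PDUs for a limited time so a SACK can be served."""

    def __init__(self, hold_seconds, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self.clock = clock
        self._sent = {}   # (task_tag, data_sn) -> (payload, time_sent)

    def remember(self, task_tag, data_sn, payload):
        """Record a data PDU at the moment it is sent."""
        self._sent[(task_tag, data_sn)] = (payload, self.clock())

    def expire(self):
        """Discard every PDU held longer than hold_seconds."""
        now = self.clock()
        stale = [key for key, (_, sent_at) in self._sent.items()
                 if now - sent_at > self.hold_seconds]
        for key in stale:
            del self._sent[key]

    def handle_sack(self, task_tag, data_sn, reread=None):
        """Return ("resend", payload) or ("reject", None)."""
        self.expire()
        entry = self._sent.get((task_tag, data_sn))
        if entry is not None:
            return ("resend", entry[0])        # still buffered: recover
        if reread is not None:
            return ("resend", reread(task_tag, data_sn))   # re-access media
        return ("reject", None)                # let the initiator retry
```

The sketch does not address the write-after-read hazard raised elsewhere in the thread: a `reread` after an intervening write may return different data than the original read.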
>> >>
>> >> You should not worry about R/W conflicts as programs bound to have such
>> >> conflicts either:
>> >>
>> >> 1) can live with them or
>> >> 2) protect themselves through some locks and rely on
>> >>    "operation-end-status" to keep results deterministic.
>> >>
>> >> Regards,
>> >> Julo
>> >>
>> >>
>> >>
>> >> "Mallikarjun C." <cbm@rose.hp.com> on 27/03/2001 03:34:16
>> >>
>> >> Please respond to cbm@rose.hp.com
>> >>
>> >> To:   cbm@rose.hp.com, someshg@yahoo.com, steph@cs.uchicago.edu,
>> >>       Julian Satran/Haifa/IBM@IBMIL, John Hufferd/San Jose/IBM@IBMUS
>> >> cc:   Black_David@emc.com
>> >> Subject:  iSCSI ERT: data SACK/replay buffer/"semi-transport"
>> >>
>> >>
>> >>
>> >>
>> >> Hi Error Recovery Team,
>> >>
>> >> iSCSI can discard PDUs because of digest errors and request
>> >> retransmissions using the iSCSI data SACK.  To deal with such
>> >> an eventuality, targets that want to support data SACK have
>> >> the following options:
>> >>
>> >> (A) maintain a complete "replay" buffer for the entire I/O since
>> >>   a SACK could come anytime before the status is ack'ed by the
>> >>   initiator. [ simple, but extremely expensive in memory resources]
>> >>
>> >> (B) (re-introduce data-ACKs into the draft, and) implement data-ACKs.
>> >>   This enables keeping only those I/O buffers that haven't been ack'ed
>> >>   by the initiator. IOW, become a real full transport! [ everyone
>> >>   disliked it earlier...]
>> >>
>> >> (C) re-access the medium for data retransmission requests.  Now there
>> >>   are 3 sub-cases in this to handle the changed data on the medium in a
>> >>   write-after-read scenario.  (SEE NOTE.1 at the bottom on how it is
>> >>   legal.)
>> >>      (1) On seeing any write, stall till status is ack'ed for all the
>> >>          previous reads (basically drain the pipe). [simple, but incurs
>> >>          an additional roundtrip delay for all writes].
>> >>      (2) A variation of the above, keep an eye only on the prior
>> >>          overlapping reads. [more BW efficient, but complicated to
>> >>          resolve the block dependencies in a stream of reads followed
>> >>          by writes]
>> >>      (3) Document the caveat and leave it up to the applications
>> >>          to avoid this case since this leads to data integrity issues.
>> >>          [pushing to apps since the transport can't get it right!]
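
Sub-case (C)(2) above amounts to an interval-overlap check against reads whose status has not yet been acknowledged. A minimal Python sketch follows, assuming each command covers one contiguous block range; the class and method names are hypothetical.

```python
class ReadWriteGate:
    """Stalls a write only while an overlapping read is still unacknowledged."""

    def __init__(self):
        self._pending_reads = {}   # task_tag -> (first_block, block_count)

    def read_sent(self, task_tag, first_block, block_count):
        """Track a read whose status the initiator has not yet ack'ed."""
        self._pending_reads[task_tag] = (first_block, block_count)

    def status_acked(self, task_tag):
        """The initiator ack'ed the status; the read can no longer be replayed."""
        self._pending_reads.pop(task_tag, None)

    def write_must_wait(self, first_block, block_count):
        """True if the write touches any block of a still-pending read."""
        write_end = first_block + block_count          # exclusive end
        for read_first, read_count in self._pending_reads.values():
            read_end = read_first + read_count         # exclusive end
            if first_block < read_end and read_first < write_end:
                return True
        return False
```

Sub-case (C)(1) is the degenerate form of the same gate: a write waits whenever any read is pending, regardless of overlap.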
>> >>
>> >> My first preference is (B), followed by (A), and I suggest we not go
>> >> to (C) at all with its inherent dangers.
>> >>
>> >> Doing (B) naturally completes the transport job that iSCSI has taken
>> >> on itself in view of TCP's claimed unreliable checksum.  That is the
>> >> right thing to do architecturally instead of being a "semi-transport"!
>> >>
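
To make the memory argument for option (B) concrete, here is a minimal Python sketch of a target-side buffer that releases data as soon as it is acknowledged. It assumes a cumulative data-ACK that carries the next DataSN the initiator expects; that ACK format, and every name in the code, is an assumption of this sketch and not something defined in the thread.

```python
class AckedReplayBuffer:
    """Holds only the data PDUs that the initiator has not acknowledged."""

    def __init__(self):
        self._held = {}        # data_sn -> payload, unacknowledged PDUs only
        self._acked_upto = 0   # every DataSN below this has been acknowledged

    def sent(self, data_sn, payload):
        """Record a data PDU when it is transmitted."""
        self._held[data_sn] = payload

    def data_ack(self, next_expected_sn):
        """Cumulative ACK: the initiator has everything below next_expected_sn."""
        for sn in [s for s in self._held if s < next_expected_sn]:
            del self._held[sn]                 # buffer can be freed now
        self._acked_upto = max(self._acked_upto, next_expected_sn)

    def replay(self, data_sn):
        """Return the payload for a retransmission request."""
        if data_sn < self._acked_upto:
            raise ValueError("DataSN already acknowledged")
        return self._held[data_sn]             # KeyError if it was never sent

    def held_bytes(self):
        """Memory currently pinned for possible replay."""
        return sum(len(p) for p in self._held.values())
```

Compared with option (A), the memory held is bounded by the unacknowledged window rather than by the size of the whole I/O.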
>> >> Comments?
>> >> --
>> >> Mallikarjun
>> >>
>> >>
>> >> Mallikarjun Chadalapaka
>> >> Networked Storage Architecture
>> >> Network Storage Solutions Organization
>> >> MS 5668   Hewlett-Packard, Roseville.
>> >> cbm@rose.hp.com
>> >>
>> >>
>> >> ______________________________________________________________________
>> >> Note.1: A Read followed by a Write (to the same blocks) is perfectly
>> >>         legal if SCSI sets the ORDERED task attribute on both the
>> >>         commands AND sets the NACA bit to one to indicate that Write
>> >>         shall be executed only if the Read did not fail (result in a
>> >>         Check Condition).
>> >>
>> >>         In the current case, since Read completed just fine from SCSI's
>> >>         point of view, SCSI is moving on to execute Write.  Those read
>> >>         buffers had been freed up since iSCSI received an ACK at the TCP
>> >>         level, and since iSCSI has no other way to have the data ack'ed!