RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"

To: <julian_satran@il.ibm.com>, <ips@ece.cmu.edu>
Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
From: "Douglas Otis" <dotis@sanlight.net>
Date: Mon, 2 Apr 2001 21:51:57 -0700
In-Reply-To: <C1256A23.00077341.00@d12mta02.de.ibm.com>
Sender: owner-ips@ece.cmu.edu
Julian,

Acknowledging every PDU does not require an acknowledgement exchange
per PDU.  I would expect some algorithm to be used to limit these exchanges.
I agree every PDU should have a sequence number to allow acknowledgement and,
in some cases, purging.
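
As a rough illustration of that point, here is a minimal sketch of
cumulative acknowledgement. It is not taken from the draft: the names
ReplayBuffer, Receiver, ack_every and run are hypothetical, and the
"acknowledge every Nth PDU" rule is only one assumed way to limit the
exchanges. Every PDU still carries a sequence number, and one ack purges
everything up to and including the number it carries.

```python
from collections import OrderedDict


class ReplayBuffer:
    """Sender side (hypothetical): holds each PDU until an ack covers it."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = OrderedDict()  # seq -> payload, in send order

    def send(self, payload):
        """Assign the next sequence number and keep the PDU for replay."""
        seq = self.next_seq
        self.unacked[seq] = payload
        self.next_seq += 1
        return seq

    def on_ack(self, acked_through):
        """Purge every PDU with seq <= acked_through; return how many."""
        purged = [s for s in self.unacked if s <= acked_through]
        for s in purged:
            del self.unacked[s]
        return len(purged)


class Receiver:
    """Receiver side (hypothetical): one ack per `ack_every` in-order PDUs."""

    def __init__(self, ack_every=4):
        self.ack_every = ack_every  # assumed policy, not from the draft
        self.expected = 0

    def on_pdu(self, seq):
        """Return a cumulative ack number, or None if no ack is due."""
        if seq != self.expected:
            return None  # a hole; recovery is out of scope for this sketch
        self.expected += 1
        if self.expected % self.ack_every == 0:
            return seq
        return None


def run(num_pdus=10, ack_every=4):
    """Send num_pdus PDUs in order; return (acks sent, seqs still buffered)."""
    sender, receiver = ReplayBuffer(), Receiver(ack_every)
    acks_sent = 0
    for i in range(num_pdus):
        seq = sender.send(f"pdu-{i}")
        ack = receiver.on_pdu(seq)
        if ack is not None:
            acks_sent += 1
            sender.on_ack(ack)
    return acks_sent, sorted(sender.unacked)
```

With ten PDUs and an ack every fourth PDU, run() reports two ack
exchanges in total, and only the last two PDUs remain in the sender's
buffer waiting to be covered by a later ack.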

Doug

> Somesh,
>
> That will certainly result in poor performance for important applications
> even with hardware implementations of iSCSI - mainly due to the large SCSI
> command traffic and associated interrupts.
>
> Julo
>
> "Somesh Gupta" <someshg@yahoo.com> on 02/04/2001 22:23:25
>
> Please respond to someshg@yahoo.com
>
> To:   cbm@rose.hp.com, ips@ece.cmu.edu
> cc:
> Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>
>
>
>
> To beat a dead horse ..
>
> One has to really decide fundamentally whether
>
> 1. Commands are used to transfer very large amounts of
>    data (multiple data PDUs are needed)
> 2. Commands are used to transfer relatively small amounts
>    of data (few/about one data PDU) and multiple commands
>    are then used to do long transfers
>
> (Orlando consensus was #2)
>
> If we assume the first model, then we really should have
> a sequence # and acknowledgement of every PDU - not just
> data PDUs. In this case, it is important to fill holes
> in the iSCSI stream. We can have a "super-transport" as
> Mallikarjun suggested between the iSCSI protocol layer
> and the TCP layer that provides the various "transport"
> like features we seem to want.
>
> If we assume the second model, we assume that recovery at
> the command level is sufficient. In this case it is important
> to have whatever mechanisms (including data seq #s) are needed
> to detect that a command will not succeed without recovery
> at the command level. However, recovery is needed only
> at the command level.
>
> I would let the current application model decide the features
> in "version 1" of the iSCSI protocol.
>
> Somesh
>
> > -----Original Message-----
> > From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
> > Mallikarjun C.
> > Sent: Monday, April 02, 2001 10:34 AM
> > To: ips@ece.cmu.edu
> > Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
> >
> >
> > >Sorry to have been missing for a while. Hope you will
> > >appreciate my being back in action :-). It was a fairly
> > >clear consensus in Orlando that applications broke up
> > >their transfers into reasonably small chunks i.e. they
> > >did not have very long running transfers.
> > >
> > >Therefore the consensus was that a command level recovery
> > >mechanism was sufficient instead of an ack/sack for each
> > >data PDU.
> > >
> > >The SACK mechanism was a post Orlando invention. Without
> > >an ack mechanism (for every data PDU), the SACK mechanism
> > >just imposes additional burden on either end of the session,
> > >without really much benefit.
> >
> > To be fair to data SACK, one could think of an upper bound
> > on the unack'ed data - agreed on at the login time.  While not
> > requiring acks on every PDU, it gives targets the deterministic
> > maximum on the buffer size they have to keep around if they
> > choose to "reliably" support data SACK.  The current answer of
> > "replay buffer size/IO size", IMHO, is simply not attractive.
> > Also to be fair to data SACK, I believe FCP-2 allows sequence-level
> > error recovery in an I/O.
> >
> > However, I think that it's extremely useful to include a discussion
> > in the draft of  the TCP checksum "escape" statistics and the
> > device types for which this was considered an absolute requirement
> > > to make forward progress at these error rates (like huge tape
> > backups?) - essentially the reasons that convinced Julian to define
> > this mechanism in. That gives credibility and acceptance to this,
> > or alternately may lead to the consensus that data SACK is not required.
> > --
> > Mallikarjun
> >
> >
> > Mallikarjun Chadalapaka
> > Networked Storage Architecture
> > Network Storage Solutions Organization
> > MS 5668 Hewlett-Packard, Roseville.
> > cbm@rose.hp.com
> >
> > >
> > > >The benefit of having SACK is saving bandwidth in case
> > > >the data part of the data PDU failed an integrity check
> > > >(but passed TCP checksum). This is a rare enough case that
> > > >as a percentage, the bandwidth loss from retransmitting
> > > >all the data associated with a read or write command is
> > > >very very small.
> > > >
> > > >In addition, it avoids the complexity of restarting
> > > >something from the middle, as compared to from the beginning.
> > >
> > >To me it seems that there is significant simplicity (from
> > >implementation, reliability and recovery process) from
> > >having smaller data transfer per command.
> > >
> > >I would really like to get rid of the SACK command.
> > >
> > >Somesh
> > >
> > >> -----Original Message-----
> > >> From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
> > >> julian_satran@il.ibm.com
> > >> Sent: Wednesday, March 28, 2001 6:57 AM
> > >> To: ips@ece.cmu.edu
> > >> Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
> > >>
> > >>
> > >>
> > >>
> > >> Mallikarjun,
> > >>
> > >> Last summer I thought that recovery within a connection should be left to
> > >> TCP. It is simple and could be made available through IPsec (if no new
> > >> option of any form can be added).
> > >>
> > >> Two things killed this:
> > >>
> > >>    The requirement to have a data encapsulation that can pass through
> > >>    application proxies (like a storage router)
> > >>    The "NO WAY" message we got from IESG-Security on a CRC only IPSec
> > >>    header
> > >>
> > >>
> > >> As for the ACK - I am very much in favor of it (it is a no brainer) and
> > >> implementations are in fact allowed to drop even unacked data.
> > >>
> > >> I am bound by the Orlando meeting decision to drop it. Except for the regular
> > >> "oppose everything" crowd, the two vocal opponents were Somesh Gupta and
> > >> Matt Wakeley.
> > >>
> > >> David may or may not want to re-open the issue - I am not going to ask for it.
> > >>
> > >> Regards,
> > >> Julo
> > >>
> > >> "Mallikarjun C." <cbm@rose.hp.com> on 28/03/2001 00:45:02
> > >>
> > >> Please respond to cbm@rose.hp.com
> > >>
> > >> To:   Black_David@emc.com
> > >> cc:   Julian Satran/Haifa/IBM@IBMIL, cbm@rose.hp.com, someshg@yahoo.com,
> > >>       steph@cs.uchicago.edu, John Hufferd/San Jose/IBM@IBMUS,
> > >>       ldalleore@snapserver.com, venkat@rhapsodynetworks.com
> > >> Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
> > >>
> > >>
> > >>
> > >>
> > >> David and Julian,
> > >>
> > >> I appreciate both your views, and should I say that they're
> > >> along predicted lines :-)
> > >>
> > >> - David's right in saying that the situation is akin to FC's.
> > >>   However, I would like to point out that FC is an unreliable
> > >>   transport, and hence is forced to pick up a lot of the transport
> > >>   baggage (at least in FCP-2, as I understand), in addition
> > >>   to being a SCSI encapsulation layer.  Unfortunately, even with
> > >>   TCP being the "reliable" transport, iSCSI is going along the
> > >>   same lines - ie. transport baggage + SCSI encapsulation.  My
> > >>   point is - if this is indeed a necessary evil, why don't we
> > >>   complete iSCSI's transport functionality by data-ACKs?
> > >>
> > >> - If data SACK is introduced mostly to make up for TCP's shortcomings,
> > >>   we're making its usage (and implementation) drastically less appealing
> > >>   since the only way error recovery algorithms can *rely* on data SACK
> > >>   is when replay is supported (or, "ReplaySupport=yes" in my proposal),
> > >>   which is extremely expensive.  IOW, we're defining data SACK in the
> > >>   draft and not providing any incentives to implement and use it!
> > >>
> > >> - I submit that since iSCSI is being hailed as the ideal SCSI Transport
> > >>   protocol in its definition so far (and I believe, rightly so - mandating
> > >>   command ordering, bi-di support, SCSI CRN support to name a few examples),
> > >>   the perfectly SCSI-legal R/W interactions that break in other transports
> > >>   *do not* have to break in iSCSI.
> > >>
> > >> - A last idea (may seem radical at this point) in regards to iSCSI
> > >>   being a "full transport". This provides us an opportunity to "cast
> > >>   off" the transport baggage in future when we truly move to a "reliable"
> > >>   transport (perhaps TCP with CRCs/SCTP ?) - if we do a good job of
> > >>   keeping the encapsulation stuff separate from the transport stuff.
> > >>   (Julian, I heard from Randy that ideas similar to this were explored
> > >>   in your Haifa meeting.  And yes, he recalls they were given up since
> > >>   TCP was supposed to be reliable and granularity of recovery was deemed
> > >>   one I/O.)
> > >>
> > >> With that said, may I request David (with his co-chair hat on, :-))
> > >> to add some binding comments/observations on this discussion?
> > >>
> > >> If we decide to leave data SACKs as unattractive to implement, the draft
> > >> should at the least add a statement like - "Note that satisfying all
> > >> possible data SACK requests for a task with an unacknowledged status
> > >> implies implementing the I/O replay buffer on the part of targets."
> > >> --
> > >> Mallikarjun
> > >>
> > >>
> > >> Mallikarjun Chadalapaka
> > >> Networked Storage Architecture
> > >> Network Storage Solutions Organization
> > >> MS 5668   Hewlett-Packard, Roseville.
> > >> cbm@rose.hp.com
> > >>
> > >>
> > >>
> > >>
> > >> >I think Julian's basically right -- I would point
> > >> >out that any case of write after read that breaks
> > >> >over iSCSI will also break over Fibre Channel.
> > >> >On FC, the scenario starts with a frame CRC failure
> > >> >on read data at the Initiator, so applications
> > >> >have to cope and typically do so by enforcing
> > >> >ordering at the app rather than using SCSI task
> > >> >ordering.
> > >> >
> > >> >While SCSI has clever tools like ACA and task
> > >> >ordering that appear to allow dependent operations
> > >> >to be sent to the target concurrently, in practice
> > >> >they don't work and/or aren't used (funny thing,
> > >> >those two reinforce each other ;-) ).  Hence
> > >> >a minimal approach to them is in order:
> > >> >- Make sure the result will interoperate.
> > >> >- Make sure T10 doesn't ding us for leaving something
> > >> >    completely out.
> > >> >- Don't specify anything not needed for the above.
> > >> >
> > >> >My 0.02,
> > >> >--David
> > >> >
> > >> >> -----Original Message-----
> > >> >> From:  julian_satran@il.ibm.com [SMTP:julian_satran@il.ibm.com]
> > >> >> Sent:  Tuesday, March 27, 2001 9:23 AM
> > >> >> To:    cbm@rose.hp.com
> > >> >> Cc:    someshg@yahoo.com; steph@cs.uchicago.edu; hufferd@us.ibm.com;
> > >> >> cbm@rose.hp.com; ldalleore@snapserver.com; Venkat Rangan;
> > >> >> Black_David@emc.com
> > >> >> Subject:    Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
> > >> >>
> > >> >>
> > >> >>
> > >> >> Mallikarjun,
> > >> >>
> > >> >> I commiserate with you at the lack of ack for data but the Orlando meeting
> > >> >> stated - no.  Recall that I kept the number only as a mechanism to detect
> > >> >> missing packets.
> > >> >>
> > >> >> You can achieve the effect you want by keeping around data for a while
> > >> >> (you determine how long and then discard).
> > >> >>
> > >> >> If a SACK comes and you can recover - fine. If not you either reaccess the
> > >> >> media (if you know how) or reject and let the initiator retry.
> > >> >>
> > >> >> You should not worry about R/W conflicts as programs bound to have such
> > >> >> conflicts either:
> > >> >>
> > >> >> 1) can live with them or
> > >> >> 2) protect themselves through some locks and rely on "operation-end-status"
> > >> >>    to keep results deterministic.
> > >> >>
> > >> >> Regards,
> > >> >> Julo
> > >> >>
> > >> >>
> > >> >>
> > >> >> "Mallikarjun C." <cbm@rose.hp.com> on 27/03/2001 03:34:16
> > >> >>
> > >> >> Please respond to cbm@rose.hp.com
> > >> >>
> > >> >> To:   cbm@rose.hp.com, someshg@yahoo.com, steph@cs.uchicago.edu,
> > >> >>       Julian Satran/Haifa/IBM@IBMIL, John Hufferd/San Jose/IBM@IBMUS
> > >> >> cc:   Black_David@emc.com
> > >> >> Subject:  iSCSI ERT: data SACK/replay buffer/"semi-transport"
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >> Hi Error Recovery Team,
> > >> >>
> > >> >> iSCSI can discard PDUs because of digest errors and request
> > >> >> retransmissions using the iSCSI data SACK.  To deal with such
> > >> >> an eventuality, targets that want to support data SACK have
> > >> >> the following options:
> > >> >>
> > >> >> (A) maintain a complete "replay" buffer for the entire I/O since
> > >> >>   a SACK could come anytime before the status is ack'ed by the
> > >> >>   initiator. [simple, but extremely expensive in memory resources]
> > >> >>
> > >> >> (B) (re-introduce data-ACKs into the draft, and) implement data-ACKs.
> > >> >>   This enables keeping only those I/O buffers that haven't been ack'ed
> > >> >>   by the initiator. IOW, become a real full transport! [everyone disliked
> > >> >>   it earlier...]
> > >> >>
> > >> >> (C) re-access the medium for data retransmission requests.  Now there
> > >> >>   are 3 sub-cases in this to handle the changed data on the medium in a
> > >> >>   write-after-read scenario.  (SEE NOTE.1 at the bottom on how it is
> > >> >>   legal.)
> > >> >>      (1) On seeing any write, stall till status is ack'ed for all the
> > >> >>          previous reads (basically drain the pipe). [simple, but incurs
> > >> >>          an additional roundtrip delay for all writes]
> > >> >>      (2) A variation of the above, keep an eye only on the prior
> > >> >>          overlapping reads. [more BW efficient, but complicated to
> > >> >>          resolve the block dependencies in a stream of reads followed
> > >> >>          by writes]
> > >> >>      (3) Document the caveat and leave it up to the applications
> > >> >>          to avoid this case since this leads to data integrity issues.
> > >> >>          [pushing to apps since the transport can't get it right!]
> > >> >>
> > >> >> My first preference is (B), followed by (A), and I suggest we not go
> > >> >> to (C) at all with its inherent dangers.
> > >> >>
> > >> >> Doing (B) naturally completes the transport job that iSCSI has taken
> > >> >> on itself in view of TCP's claimed unreliable checksum.  That is the
> > >> >> right thing to do architecturally instead of being a "semi-transport"!
> > >> >>
> > >> >> Comments?
> > >> >> --
> > >> >> Mallikarjun
> > >> >>
> > >> >>
> > >> >> Mallikarjun Chadalapaka
> > >> >> Networked Storage Architecture
> > >> >> Network Storage Solutions Organization
> > >> >> MS 5668   Hewlett-Packard, Roseville.
> > >> >> cbm@rose.hp.com
> > >> >>
> > >> >>
> > >> >> __________________________________________________________________________
> > >> >> Note.1: A Read followed by a Write (to the same blocks) is perfectly legal
> > >> >>         if SCSI sets the ORDERED task attribute on both the commands AND
> > >> >>         sets the NACA bit to one to indicate that Write shall be executed
> > >> >>         only if the Read did not fail (result in a Check Condition).
> > >> >>
> > >> >>         In the current case, since Read completed just fine from SCSI's
> > >> >>         point of view, SCSI is moving on to execute Write.  Those read
> > >> >>         buffers had been freed up since iSCSI received an ACK at the TCP
> > >> >>         level, and since iSCSI has no other way to have the data ack'ed!