
    RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"



    Julian,
    
    Which part of my note were you raising a concern about?
    
    Somesh
    
    > -----Original Message-----
    > From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
    > julian_satran@il.ibm.com
    > Sent: Monday, April 02, 2001 6:25 PM
    > To: ips@ece.cmu.edu
    > Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    >
    >
    >
    >
    > Somesh,
    >
    > That will certainly result in poor performance for important applications
    > even with hardware implementations of iSCSI - mainly due to the large SCSI
    > command traffic and associated interrupts.
    >
    > Julo
    >
    > "Somesh Gupta" <someshg@yahoo.com> on 02/04/2001 22:23:25
    >
    > Please respond to someshg@yahoo.com
    >
    > To:   cbm@rose.hp.com, ips@ece.cmu.edu
    > cc:
    > Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    >
    >
    >
    >
    > To beat a dead horse ..
    >
    > One has to really decide fundamentally whether
    >
    > 1. Commands are used to transfer very large amounts of
    >    data (multiple data PDUs are needed)
    > 2. Commands are used to transfer relatively small amounts
    >    of data (few/about one data PDU) and multiple commands
    >    are then used to do long transfers
    >
    > (Orlando consensus was #2)
    >
    > If we assume the first model, then we really should have
    > a sequence # and acknowledgement of every PDU - not just
    > data PDUs. In this case, it is important to fill holes
    > in the iSCSI stream. We can have a "super-transport" as
    > Mallikarjun suggested between the iSCSI protocol layer
    > and the TCP layer that provides the various "transport"
    > like features we seem to want.
    >
    > If we assume the second model, we assume that recovery at
    > the command level is sufficient. In this case it is important
    > to have whatever mechanisms are (including data seq #s) needed
    > to detect that a command will not succeed without recovery
    > at the command level. However, recovery is needed only
    > at the command level.
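As a rough illustration of the second model described above, the sketch below uses data sequence numbers only to detect that a command cannot succeed, and then falls back to retrying the whole command. The function names are hypothetical and are not taken from the iSCSI draft; it assumes data PDUs for one command are numbered from 0.

```python
def missing_data_sns(received_sns, expected_count):
    """Return the data sequence numbers in 0..expected_count-1 that never arrived."""
    seen = set(received_sns)
    return [sn for sn in range(expected_count) if sn not in seen]


def command_needs_retry(received_sns, expected_count):
    """Command-level recovery: any hole means the whole command is retried."""
    return bool(missing_data_sns(received_sns, expected_count))
```

Nothing here tries to fill the hole itself; detecting it is enough to trigger recovery at the command level.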
    >
    > I would let the current application model decide the features
    > in "version 1" of the iSCSI protocol.
    >
    > Somesh
    >
    > > -----Original Message-----
    > > From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
    > > Mallikarjun C.
    > > Sent: Monday, April 02, 2001 10:34 AM
    > > To: ips@ece.cmu.edu
    > > Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > >
    > >
    > > >Sorry to have been missing for a while. Hope you will
    > > >appreciate my being back in action :-). It was a fairly
    > > >clear consensus in Orlando that applications broke up
    > > >their transfers into reasonably small chunks i.e. they
    > > >did not have very long running transfers.
    > > >
    > > >Therefore the consensus was that a command level recovery
    > > >mechanism was sufficient instead of an ack/sack for each
    > > >data PDU.
    > > >
    > > >The SACK mechanism was a post Orlando invention. Without
    > > >an ack mechanism (for every data PDU), the SACK mechanism
    > > >just imposes additional burden on either end of the session,
    > > >without really much benefit.
    > >
    > > To be fair to data SACK, one could think of an upper bound
    > > on the unack'ed data - agreed on at the login time.  While not
    > > requiring acks on every PDU, it gives targets the deterministic
    > > maximum on the buffer size they have to keep around if they
    > > choose to "reliably" support data SACK.  The current answer of
    > > "replay buffer size/IO size", IMHO, is simply not attractive.
    > > Also to be fair to data SACK, I believe FCP-2 allows sequence-level
    > > error recovery in an I/O.
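The upper bound on unacknowledged data suggested above can be sketched as a target-side buffer that refuses to send past a fixed byte limit. This is only a sketch under assumptions: `max_unacked_bytes` stands in for whatever value would be agreed at login, and the class and method names are hypothetical rather than anything defined in the draft.

```python
from collections import OrderedDict


class BoundedReplayBuffer:
    """Target-side buffer of sent-but-unacknowledged data PDUs.

    max_unacked_bytes is a hypothetical stand-in for a limit agreed at login.
    """

    def __init__(self, max_unacked_bytes):
        self.max_unacked_bytes = max_unacked_bytes
        self.unacked_bytes = 0
        self._pdus = OrderedDict()  # data sequence number -> payload

    def can_send(self, payload):
        """True if sending this payload keeps us within the agreed bound."""
        return self.unacked_bytes + len(payload) <= self.max_unacked_bytes

    def record_sent(self, data_sn, payload):
        """Remember a PDU that was just sent so it can be replayed later."""
        if not self.can_send(payload):
            raise BufferError("unacked-data limit reached; wait for an ack")
        self._pdus[data_sn] = payload
        self.unacked_bytes += len(payload)

    def ack_through(self, data_sn):
        """Free every buffered PDU with sequence number <= data_sn."""
        for sn in [s for s in self._pdus if s <= data_sn]:
            self.unacked_bytes -= len(self._pdus.pop(sn))

    def replay(self, data_sn):
        """Return the payload for a SACK request, or None if already freed."""
        return self._pdus.get(data_sn)
```

The buffer never grows beyond the bound, so the memory a target must reserve is known in advance instead of scaling with the I/O size.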
    > >
    > > However, I think that it's extremely useful to include a discussion
    > > in the draft of  the TCP checksum "escape" statistics and the
    > > device types for which this was considered an absolute requirement
    > > to make forward progress at these error rates (like huge tape
    > > backups?) - essentially the reasons that convinced Julian to define
    > > this mechanism in. That gives credibility and acceptance to this,
    > > or alternately may lead to the consensus that data SACK is not required.
    > > --
    > > Mallikarjun
    > >
    > >
    > > Mallikarjun Chadalapaka
    > > Networked Storage Architecture
    > > Network Storage Solutions Organization
    > > MS 5668 Hewlett-Packard, Roseville.
    > > cbm@rose.hp.com
    > >
    > > >
    > > >The benefit of having SACK is of saving bandwidth in case
    > > >the data part of the data PDU failed an integrity check
    > > >(but passed TCP checksum). This is a rare enough case that
    > > >as a percentage, the bandwidth loss from retransmitting
    > > >all the data associated with a read or write command is
    > > >very very small.
    > > >
    > > >In addition, it avoids the complexity of restarting
    > > >something from the middle, as compared to from the beginning.
    > > >
    > > >To me it seems that there is significant simplicity (from
    > > >implementation, reliability and recovery process) from
    > > >having smaller data transfer per command.
    > > >
    > > >I would really like to get rid of the SACK command.
    > > >
    > > >Somesh
    > > >
    > > >> -----Original Message-----
    > > >> From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
    > > >> julian_satran@il.ibm.com
    > > >> Sent: Wednesday, March 28, 2001 6:57 AM
    > > >> To: ips@ece.cmu.edu
    > > >> Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > >>
    > > >>
    > > >>
    > > >>
    > > >> Mallikarjun,
    > > >>
    > > >> Last summer I thought that recovery within a connection should be
    > > >> left to TCP. It is simple and could be made available through IPsec
    > > >> (if no new option of any form can be added).
    > > >>
    > > >> Two things killed this:
    > > >>
    > > >>    The requirement to have a data encapsulation that can pass through
    > > >>    application proxies (like a storage router)
    > > >>    The "NO WAY" message we got from IESG-Security on a CRC only IPSec
    > > >>    header
    > > >>
    > > >>
    > > >> As for the ACK - I am very much in favor of it (it is a no brainer)
    > > >> and implementations are in fact allowed to drop even unacked data.
    > > >>
    > > >> I am bound by the Orlando meeting decision to drop it. Except for the
    > > >> regular "oppose everything" crowd, the two vocal opponents were Somesh
    > > >> Gupta and Matt Wakeley.
    > > >>
    > > >> David may or may not want to re-open the issue - I am not going to
    > > >> ask for it.
    > > >>
    > > >> Regards,
    > > >> Julo
    > > >>
    > > >> "Mallikarjun C." <cbm@rose.hp.com> on 28/03/2001 00:45:02
    > > >>
    > > >> Please respond to cbm@rose.hp.com
    > > >>
    > > >> To:   Black_David@emc.com
    > > >> cc:   Julian Satran/Haifa/IBM@IBMIL, cbm@rose.hp.com,
    > > someshg@yahoo.com,
    > > >>       steph@cs.uchicago.edu, John Hufferd/San Jose/IBM@IBMUS,
    > > >>       ldalleore@snapserver.com, venkat@rhapsodynetworks.com
    > > >> Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > >>
    > > >>
    > > >>
    > > >>
    > > >> David and Julian,
    > > >>
    > > >> I appreciate both your views, and should I say that they're
    > > >> along predicted lines :-)
    > > >>
    > > >> - David's right in saying that the situation is akin to FC's.
    > > >>   However, I would like to point out that FC is an unreliable
    > > >>   transport, and hence is forced to pick up a lot of the transport
    > > >>   baggage (at least in FCP-2, as I understand), in addition
    > > >>   to being a SCSI encapsulation layer.  Unfortunately, even with
    > > >>   TCP being the "reliable" transport, iSCSI is going along the
    > > >>   same lines - ie. transport baggage + SCSI encapsulation.  My
    > > >>   point is - if this is indeed a necessary evil, why don't we
    > > >>   complete iSCSI's transport functionality by data-ACKs?
    > > >>
    > > >> - If data SACK is introduced mostly to make up for TCP's
    > > >>   shortcomings, we're making its usage (and implementation)
    > > >>   drastically less appealing since the only way error recovery
    > > >>   algorithms can *rely* on data SACK is when replay is supported
    > > >>   (or, "ReplaySupport=yes" in my proposal), which is extremely
    > > >>   expensive.  IOW, we're defining data SACK in the draft and not
    > > >>   providing any incentives to implement and use it!
    > > >>
    > > >> - I submit that since iSCSI is being hailed as the ideal SCSI
    > > >>   Transport protocol in its definition so far (and I believe,
    > > >>   rightly so - mandating command ordering, bi-di support, SCSI CRN
    > > >>   support to name a few examples), the perfectly SCSI-legal R/W
    > > >>   interactions that break in other transports *do not* have to
    > > >>   break in iSCSI.
    > > >>
    > > >> - A last idea (may seem radical at this point) in regards to iSCSI
    > > >>   being a "full transport". This provides us an opportunity to "cast
    > > >>   off" the transport baggage in future when we truly move to a
    > > >>   "reliable" transport (perhaps TCP with CRCs/SCTP ?) - if we do a
    > > >>   good job of keeping the encapsulation stuff separate from the
    > > >>   transport stuff.  (Julian, I heard from Randy that ideas similar
    > > >>   to this were explored in your Haifa meeting.  And yes, he recalls
    > > >>   they were given up since TCP was supposed to be reliable and
    > > >>   granularity of recovery was deemed one I/O.)
    > > >>
    > > >> With that said, may I request David (with his co-chair hat on, :-))
    > > >> to add some binding comments/observations on this discussion?
    > > >>
    > > >> If we decide to leave data SACKs as unattractive to implement, the
    > > >> draft should at the least add a statement like - "Note that
    > > >> satisfying all possible data SACK requests for a task with an
    > > >> unacknowledged status implies implementing the I/O replay buffer on
    > > >> the part of targets."
    > > >> --
    > > >> Mallikarjun
    > > >>
    > > >>
    > > >> Mallikarjun Chadalapaka
    > > >> Networked Storage Architecture
    > > >> Network Storage Solutions Organization
    > > >> MS 5668   Hewlett-Packard, Roseville.
    > > >> cbm@rose.hp.com
    > > >>
    > > >>
    > > >>
    > > >>
    > > >> >I think Julian's basically right -- I would point
    > > >> >out that any case of write after read that breaks
    > > >> >over iSCSI will also break over Fibre Channel.
    > > >> >On FC, the scenario starts with a frame CRC failure
    > > >> >on read data at the Initiator, so applications
    > > >> >have to cope and typically do so by enforcing
    > > >> >ordering at the app rather than using SCSI task
    > > >> >ordering.
    > > >> >
    > > >> >While SCSI has clever tools like ACA and task
    > > >> >ordering that appear to allow dependent operations
    > > >> >to be sent to the target concurrently, in practice
    > > >> >they don't work and/or aren't used (funny thing,
    > > >> >those two reinforce each other ;-) ).  Hence
    > > >> >a minimal approach to them is in order:
    > > >> >- Make sure the result will interoperate.
    > > >> >- Make sure T10 doesn't ding us for leaving something
    > > >> >    completely out.
    > > >> >- Don't specify anything not needed for the above.
    > > >> >
    > > >> >My 0.02,
    > > >> >--David
    > > >> >
    > > >> >> -----Original Message-----
    > > >> >> From:  julian_satran@il.ibm.com [SMTP:julian_satran@il.ibm.com]
    > > >> >> Sent:  Tuesday, March 27, 2001 9:23 AM
    > > >> >> To:    cbm@rose.hp.com
    > > >> >> Cc:    someshg@yahoo.com; steph@cs.uchicago.edu; hufferd@us.ibm.com;
    > > >> >>        cbm@rose.hp.com; ldalleore@snapserver.com; Venkat Rangan;
    > > >> >>        Black_David@emc.com
    > > >> >> Subject:    Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > >> >>
    > > >> >>
    > > >> >>
    > > >> >> Mallikarjun,
    > > >> >>
    > > >> >> I commiserate with you at the lack of ack for data but the
    > > >> >> Orlando meeting stated - no.  Recall that I kept the number only
    > > >> >> as a mechanism to detect missing packets.
    > > >> >>
    > > >> >> You can achieve the effect you want by keeping around data for a
    > > >> >> while (you determine how long and then discard).
    > > >> >>
    > > >> >> If a SACK comes and you can recover - fine. If not you either
    > > >> >> reaccess the media (if you know how) or reject and let the
    > > >> >> initiator retry.
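The "keep data around for a while, then discard" approach and its three outcomes on a SACK (replay from the buffer, re-read the media, or reject) might look roughly like the sketch below. All names are hypothetical, the retention period is whatever the target chooses, and the clock value is passed in explicitly to keep the example deterministic.

```python
class TimedRetentionBuffer:
    """Keeps sent data for a target-chosen period, then discards it."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._entries = {}  # data sequence number -> (sent_at, payload)

    def record_sent(self, data_sn, payload, now):
        self._entries[data_sn] = (now, payload)

    def discard_expired(self, now):
        """Drop every entry older than the retention period."""
        expired = [sn for sn, (sent_at, _) in self._entries.items()
                   if now - sent_at > self.retention_seconds]
        for sn in expired:
            del self._entries[sn]

    def handle_sack(self, data_sn, now, reread_media=None):
        """Replay from the buffer, else re-read the media, else reject.

        reread_media is an optional callable (hypothetical) for targets that
        know how to fetch the data again; a rejection leaves the retry to
        the initiator.
        """
        self.discard_expired(now)
        entry = self._entries.get(data_sn)
        if entry is not None:
            return ("replayed", entry[1])
        if reread_media is not None:
            return ("reread", reread_media(data_sn))
        return ("rejected", None)
```

Unlike a complete replay buffer, this makes no promise that a SACK can be satisfied; it only improves the odds for requests that arrive soon after the data was sent.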
    > > >> >>
    > > >> >> You should not worry about R/W conflicts as programs bound to
    > > >> >> have such conflicts either:
    > > >> >>
    > > >> >> 1) can live with them or
    > > >> >> 2) protect themselves through some locks and rely on
    > > >> >>    "operation-end-status" to keep results deterministic.
    > > >> >>
    > > >> >> Regards,
    > > >> >> Julo
    > > >> >>
    > > >> >>
    > > >> >>
    > > >> >> "Mallikarjun C." <cbm@rose.hp.com> on 27/03/2001 03:34:16
    > > >> >>
    > > >> >> Please respond to cbm@rose.hp.com
    > > >> >>
    > > >> >> To:   cbm@rose.hp.com, someshg@yahoo.com, steph@cs.uchicago.edu,
    > > >> >>       Julian Satran/Haifa/IBM@IBMIL, John Hufferd/San Jose/IBM@IBMUS
    > > >> >> cc:   Black_David@emc.com
    > > >> >> Subject:  iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > >> >>
    > > >> >>
    > > >> >>
    > > >> >>
    > > >> >> Hi Error Recovery Team,
    > > >> >>
    > > >> >> iSCSI can discard PDUs because of digest errors and request
    > > >> >> retransmissions using the iSCSI data SACK.  To deal with such
    > > >> >> an eventuality, targets that want to support data SACK have
    > > >> >> the following options:
    > > >> >>
    > > >> >> (A) maintain a complete "replay" buffer for the entire I/O since
    > > >> >>   a SACK could come anytime before the status is ack'ed by the
    > > >> >>   initiator. [simple, but extremely expensive in memory resources]
    > > >> >>
    > > >> >> (B) (re-introduce data-ACKs into the draft, and) implement
    > > >> >>   data-ACKs.  This enables keeping only those I/O buffers that
    > > >> >>   haven't been ack'ed by the initiator. IOW, become a real full
    > > >> >>   transport! [everyone disliked it earlier...]
    > > >> >>
    > > >> >> (C) re-access the medium for data retransmission requests.  Now
    > > >> >>   there are 3 sub-cases in this to handle the changed data on the
    > > >> >>   medium in a write-after-read scenario.  (SEE NOTE.1 at the
    > > >> >>   bottom on how it is legal.)
    > > >> >>      (1) On seeing any write, stall till status is ack'ed for all
    > > >> >>          the previous reads (basically drain the pipe). [simple,
    > > >> >>          but incurs an additional roundtrip delay for all writes]
    > > >> >>      (2) A variation of the above, keep an eye only on the prior
    > > >> >>          overlapping reads. [more BW efficient, but complicated
    > > >> >>          to resolve the block dependencies in a stream of reads
    > > >> >>          followed by writes]
    > > >> >>      (3) Document the caveat and leave it up to the applications
    > > >> >>          to avoid this case since this leads to data integrity
    > > >> >>          issues. [pushing to apps since the transport can't get
    > > >> >>          it right!]
    > > >> >>
    > > >> >> My first preference is (B), followed by (A), and I suggest we
    > > >> >> not go to (C) at all with its inherent dangers.
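To make sub-case (C)(2) concrete, the sketch below checks whether an incoming write touches blocks covered by any earlier read whose status has not yet been acknowledged. It is a minimal illustration under assumptions: reads are represented as hypothetical `(start_lba, block_count)` pairs, and no particular target implementation is implied.

```python
def ranges_overlap(start_a, count_a, start_b, count_b):
    """True if block ranges [start, start + count) share at least one block."""
    return start_a < start_b + count_b and start_b < start_a + count_a


def write_must_stall(write_lba, write_blocks, unacked_reads):
    """Stall a write only if it overlaps a read whose status is not yet ack'ed.

    unacked_reads is a list of (start_lba, block_count) pairs (hypothetical
    representation of reads still awaiting status acknowledgement).
    """
    return any(ranges_overlap(write_lba, write_blocks, lba, count)
               for lba, count in unacked_reads)
```

Sub-case (C)(1) corresponds to stalling whenever `unacked_reads` is non-empty, which is simpler but delays every write by a round trip.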
    > > >> >>
    > > >> >> Doing (B) naturally completes the transport job that iSCSI has
    > > >> >> taken on itself in view of TCP's claimed unreliable checksum.
    > > >> >> That is the right thing to do architecturally instead of being a
    > > >> >> "semi-transport"!
    > > >> >>
    > > >> >> Comments?
    > > >> >> --
    > > >> >> Mallikarjun
    > > >> >>
    > > >> >>
    > > >> >> Mallikarjun Chadalapaka
    > > >> >> Networked Storage Architecture
    > > >> >> Network Storage Solutions Organization
    > > >> >> MS 5668   Hewlett-Packard, Roseville.
    > > >> >> cbm@rose.hp.com
    > > >> >>
    > > >> >>
    > > >> >> ________________________________________________________________
    > > >> >> Note.1: A Read followed by a Write (to the same blocks) is
    > > >> >>         perfectly legal if SCSI sets the ORDERED task attribute
    > > >> >>         on both the commands AND sets the NACA bit to one to
    > > >> >>         indicate that Write shall be executed only if the Read
    > > >> >>         did not fail (result in a Check Condition).
    > > >> >>
    > > >> >>         In the current case, since Read completed just fine from
    > > >> >>         SCSI's point of view, SCSI is moving on to execute Write.
    > > >> >>         Those read buffers had been freed up since iSCSI received
    > > >> >>         an ACK at the TCP level, and since iSCSI has no other way
    > > >> >>         to have the data ack'ed!
    > > >> >>
    > > >> >>
    > > >> >>
    > > >> >>
    > > >> >
    > > >>
    > > >>
    > > >>
    > > >>
    > > >
    > > >
    > > >
    > > >
    > >
    >
    >
    >
    >
    >
    
    
    
    


Last updated: Tue Sep 04 01:05:11 2001