    RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"



    Julian,
    
    There was an earlier discussion on this very thread, with
    postings from Pierre and Bob. If you list all the postings on the
    thread, you can pick out the ones from Pierre and Bob. And please
    read the caveat in the last sentence of my posting.
    
    To beat a dead horse again and again ...
    
    The requirements for detecting errors are more stringent,
    even though Steph makes very valid points in his last
    message.
    
    The requirements for recovery are different, as it is better
    to let market forces and maintenance crews take care of the
    bad middle boxes.
    
    Somesh
    
    > -----Original Message-----
    > From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
    > julian_satran@il.ibm.com
    > Sent: Wednesday, April 04, 2001 2:23 PM
    > To: someshg@yahoo.com
    > Cc: someshg@yahoo.com; ips@ece.cmu.edu
    > Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    >
    >
    >
    >
    > Somesh,
    >
    > Can you give us a reference for those rates?  Where do they come from?
    >
    > Regards,
    > Julo
    >
    > "Somesh Gupta" <someshg@yahoo.com> on 04/04/2001 23:02:06
    >
    > Please respond to someshg@yahoo.com
    >
    > To:   Julian Satran/Haifa/IBM@IBMIL, someshg@yahoo.com
    > cc:   ips@ece.cmu.edu
    > Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    >
    >
    >
    >
    > Assuming that the packet corruption escape rate is 1 in 10 billion,
    > we have (roughly assuming 1 KByte per packet) 1 escaped packet every
    > 10 trillion bytes of data transfer. It seems to me that if I
    > had to retransfer 1 MByte because recovery happens at the
    > command level rather than at a more granular level, that does
    > not pose much of an additional burden (1 MByte out of 10 trillion
    > bytes). Also, assuming each I/O is 1 MByte in size, you would
    > have to do recovery for 1 in every 10 million transactions.
    >
    > I don't know how realistic the 1 in 10 billion packet corruption
    > escape rate is but I am using the number from past discussions.
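The arithmetic in the message above can be checked with a few lines of Python. This is only a sketch: the three constants are the assumed figures quoted in the message (decimal units, 1 KByte = 1,000 bytes), and the function name `escape_estimates` is made up for illustration.

```python
# Assumed inputs, taken from the figures quoted in the message above.
PACKETS_PER_ESCAPE = 10_000_000_000  # 1 undetected corrupt packet in 10 billion
PACKET_BYTES = 1_000                 # roughly 1 KByte per packet
IO_BYTES = 1_000_000                 # each I/O assumed to be 1 MByte


def escape_estimates(packets_per_escape, packet_bytes, io_bytes):
    """Return (bytes per escaped packet, I/Os per escaped packet,
    fraction of traffic resent when one whole I/O is retried)."""
    bytes_per_escape = packets_per_escape * packet_bytes
    ios_per_escape = bytes_per_escape // io_bytes
    resend_fraction = io_bytes / bytes_per_escape
    return bytes_per_escape, ios_per_escape, resend_fraction


bytes_per_escape, ios_per_escape, resend_fraction = escape_estimates(
    PACKETS_PER_ESCAPE, PACKET_BYTES, IO_BYTES
)
print(bytes_per_escape)   # bytes transferred per escaped packet
print(ios_per_escape)     # 1 MByte commands per escaped packet
print(resend_fraction)    # share of bandwidth spent on command-level retry
```

Under these assumptions, retrying the whole command costs about one ten-millionth of the transferred bytes. The result scales linearly with the escape rate, so a different rate changes the numbers but not the shape of the argument.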
    >
    > Somesh
    >
    > > -----Original Message-----
    > > From: julian_satran@il.ibm.com [mailto:julian_satran@il.ibm.com]
    > > Sent: Wednesday, April 04, 2001 11:56 AM
    > > To: someshg@yahoo.com
    > > Cc: ips@ece.cmu.edu
    > > Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > >
    > >
    > >
    > >
    > > What are the numbers you are looking at:
    > >
    > > 1 per 10 sec, 1/10h or 1 /10y?
    > >
    > > Julo
    > >
    > > "Somesh Gupta" <someshg@yahoo.com> on 04/04/2001 20:15:53
    > >
    > > Please respond to someshg@yahoo.com
    > >
    > > To:   Julian Satran/Haifa/IBM@IBMIL, ips@ece.cmu.edu
    > > cc:
    > > Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > >
    > >
    > >
    > >
    > >
    > >
    > > > -----Original Message-----
    > > > From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
    > > > julian_satran@il.ibm.com
    > > > Sent: Wednesday, April 04, 2001 7:32 AM
    > > > To: ips@ece.cmu.edu
    > > > Subject: Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > >
    > > >
    > > >
    > > >
    > > > SNACK is here for two reasons - Status retry (which is cheap) and Data
    > > > retry as a side benefit.
    > >
    > >   Unless there is clear benefit (i.e. the event is frequent enough
    > >   to justify recovery at this level), the entire mechanism should be
    > >   dropped - it is neither cheap nor free. If it is relatively
    > >   infrequent, the recovery at the command level should be a sufficient
    > >   mechanism
    > >
    > > > CRC errors are not that rare (although we don't have real data, the
    > > > simulation with file systems seems to indicate that numbers could
    > > > be as high as 0.0002%). A restart of the link is expensive (slow
    > > > start), and even if they are far lower, for many applications a
    > > > slow start is a painful event.
    > >
    > >   Intuitively, it seems that the combination of link level CRC, TCP
    > >   checksum, and good hardware (ECC, parity etc) should lead to a
    > >   much lower level of errors caught by the iSCSI CRC algorithm. We have
    > >   to separate error detection (i.e. what if I have bad hardware or
    > >   some vendor makes a bad/buggy intermediate system) from recovery
    > >   mechanisms (not based on hardware being bad or buggy - market forces
    > >   will weed out the vendor) which should not be based on assumptions
    > >   of bugs in hardware/software of specific implementations.
    > >
    > > >
    > > > Removing them from the spec is not a path we should take lightly.
    > >
    > >   I would phrase it the other way. We should not keep adding things
    > >   unless there is very clear proof that the additional feature is
    > >   beneficial and does not have negative side effects (and there is
    > >   some consensus on adding it)
    > > >
    > > > Julo
    > > >
    > > > "Jon Hall" <jhall@emc.com> on 02/04/2001 16:13:35
    > > >
    > > > Please respond to "Jon Hall" <jhall@emc.com>
    > > >
    > > > To:   ips@ece.cmu.edu
    > > > cc:
    > > > Subject:  Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > >
    > > >
    > > >
    > > >
    > > >
    > > > I agree with Somesh.  And would go farther -- the complexity
    > > > that results from retaining enough target-side state to respond
    > > > to a SACK/SNACK request is non-trivial and needs clear justification.
    > > > Intuitively, a CRC that discovers an error in an iSCSI pdu header
    > > > (that the TCP cksum missed) seems like it should be a rare event.
    > > >
    > > > What is the frequency of this event?  IMO the answer to this
    > > > question should be written into the protocol spec -- assuming
    > > > that it substantiates the benefit of SACK/SNACK.  Otherwise, the
    > > > SACK/SNACK pdu should be removed.
    > > >
    > > > -Jon
    > > >
    > > > julian_satran@il.ibm.com writes:
    > > > >
    > > > >Somesh,
    > > > >
    > > > >As I stated earlier - the DataSN was created to detect missing data
    > > > >PDUs. SNACK is needed to recover missing StatusSN and missing DataSN
    > > > >is only a bonus if the target wants to support it.  It is a trivial
    > > > >mechanism and I think it should stay.
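The gap-detection role described for DataSN can be sketched as follows. This is an illustration only, not the draft's wire format: it assumes DataSN starts at 0 for each command, ignores sequence-number wraparound, and the (first, run length) request shape and both function names are hypothetical.

```python
def missing_data_sns(received, expected_count):
    """Return the sorted DataSN values in range(expected_count) that never arrived."""
    seen = set(received)
    return [sn for sn in range(expected_count) if sn not in seen]


def to_runs(sorted_sns):
    """Collapse sorted sequence numbers into (first, run_length) pairs,
    one pair per contiguous gap, as a retransmission request might carry."""
    runs = []
    for sn in sorted_sns:
        if runs and sn == runs[-1][0] + runs[-1][1]:
            first, length = runs[-1]
            runs[-1] = (first, length + 1)   # extend the current gap
        else:
            runs.append((sn, 1))             # start a new gap
    return runs


# Example: 8 data PDUs expected, DataSN 2, 3 and 6 were dropped.
gaps = to_runs(missing_data_sns([0, 1, 4, 5, 7], 8))
print(gaps)
```

Detection is cheap for the initiator. The cost debated in this thread sits on the target side, which must still hold the data to answer such a request.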
    > > > >
    > > > >Julo
    > > > >
    > > > >"Somesh Gupta" <someshg@yahoo.com> on 31/03/2001 02:25:52
    > > > >
    > > > >Please respond to someshg@yahoo.com
    > > > >
    > > > >To:   Julian Satran/Haifa/IBM@IBMIL, ips@ece.cmu.edu
    > > > >cc:
    > > > >Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > > >
    > > > >
    > > > >
    > > > >
    > > > >Sorry to have been missing for a while. Hope you will
    > > > >appreciate my being back in action :-). It was a fairly
    > > > >clear consensus in Orlando that applications broke up
    > > > >their transfers into reasonably small chunks i.e. they
    > > > >did not have very long running transfers.
    > > > >
    > > > >Therefore the consensus was that a command level recovery
    > > > >mechanism was sufficient instead of an ack/sack for each
    > > > >data PDU.
    > > > >
    > > > >The SACK mechanism was a post Orlando invention. Without
    > > > >an ack mechanism (for every data PDU), the SACK mechanism
    > > > >just imposes additional burden on either end of the session,
    > > > >without really much benefit.
    > > > >
    > > > >The benefit of having SACK is of saving bandwidth in case
    > > > >the data part of the data PDU failed an integrity check
    > > > >(but passed TCP checksum). This is a rare enough case that
    > > > >as a percentage, the bandwidth loss from retransmitting
    > > > >all the data associated with a read or write command is
    > > > >very very small.
    > > > >
    > > > >In addition, it avoids the complexity of restarting
    > > > >something from the middle, as compared to from the beginning.
    > > > >
    > > > >To me it seems that there is significant simplicity (from
    > > > >implementation, reliability and recovery process) from
    > > > >having smaller data transfer per command.
    > > > >
    > > > >I would really like to get rid of the SACK command.
    > > > >
    > > > >Somesh
    > > > >
    > > > >> -----Original Message-----
    > > > >> From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
    > > > >> julian_satran@il.ibm.com
    > > > >> Sent: Wednesday, March 28, 2001 6:57 AM
    > > > >> To: ips@ece.cmu.edu
    > > > >> Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > > >>
    > > > >>
    > > > >>
    > > > >>
    > > > >> Mallikarjun,
    > > > >>
    > > > >> Last summer I thought that recovery within a connection should be
    > > > >> left to TCP. It is simple and could be made available through IPsec
    > > > >> (if no new option of any form can be added).
    > > > >>
    > > > >> Two things killed this:
    > > > >>
    > > > >>    The requirement to have a data encapsulation that can pass through
    > > > >>    application proxies (like a storage router)
    > > > >>    The "NO WAY" message we got from IESG-Security on a CRC only IPSec
    > > > >>    header
    > > > >>
    > > > >>
    > > > >> As for the ACK - I am very much in favor of it (it is a no brainer)
    > > > >> and implementations are in fact allowed to drop even unacked data.
    > > > >>
    > > > >> I am bound by the Orlando meeting decision to drop it. Except the
    > > > >> regular "oppose everything" crowd the two vocal opponents were
    > > > >> Somesh Gupta and Matt Wakeley.
    > > > >>
    > > > >> David may want or not to re-open the issue - I am not going to ask
    > > > >> for it.
    > > > >>
    > > > >> Regards,
    > > > >> Julo
    > > > >>
    > > > >> "Mallikarjun C." <cbm@rose.hp.com> on 28/03/2001 00:45:02
    > > > >>
    > > > >> Please respond to cbm@rose.hp.com
    > > > >>
    > > > >> To:   Black_David@emc.com
    > > > >> cc:   Julian Satran/Haifa/IBM@IBMIL, cbm@rose.hp.com, someshg@yahoo.com,
    > > > >>       steph@cs.uchicago.edu, John Hufferd/San Jose/IBM@IBMUS,
    > > > >>       ldalleore@snapserver.com, venkat@rhapsodynetworks.com
    > > > >> Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > > >>
    > > > >>
    > > > >>
    > > > >>
    > > > >> David and Julian,
    > > > >>
    > > > >> I appreciate both your views, and should I say that they're
    > > > >> along predicted lines :-)
    > > > >>
    > > > >> - David's right in saying that the situation is akin to FC's.
    > > > >>   However, I would like to point out that FC is an unreliable
    > > > >>   transport, and hence is forced to pick up a lot of the transport
    > > > >>   baggage (at least in FCP-2, as I understand), in addition
    > > > >>   to being a SCSI encapsulation layer.  Unfortunately, even with
    > > > >>   TCP being the "reliable" transport, iSCSI is going along the
    > > > >>   same lines - ie. transport baggage + SCSI encapsulation.  My
    > > > >>   point is - if this is indeed a necessary evil, why don't we
    > > > >>   complete iSCSI's transport functionality by data-ACKs?
    > > > >>
    > > > >> - If data SACK is introduced mostly to make up for TCP's shortcomings,
    > > > >>   we're making its usage (and implementation) drastically less
    > > > >>   appealing since the only way error recovery algorithms can *rely*
    > > > >>   on data SACK is when replay is supported (or, "ReplaySupport=yes"
    > > > >>   in my proposal), which is extremely expensive.  IOW, we're defining
    > > > >>   data SACK in the draft and not providing any incentives to
    > > > >>   implement and use it!
    > > > >>
    > > > >> - I submit that since iSCSI is being hailed as the ideal SCSI Transport
    > > > >>   protocol in its definition so far (and I believe, rightly so -
    > > > >>   mandating command ordering, bi-di support, SCSI CRN support to name
    > > > >>   a few examples), the perfectly SCSI-legal R/W interactions that
    > > > >>   break in other transports *do not* have to break in iSCSI.
    > > > >>
    > > > >> - A last idea (may seem radical at this point) in regards to iSCSI
    > > > >>   being a "full transport". This provides us an opportunity to "cast
    > > > >>   off" the transport baggage in future when we truly move to a
    > > > >>   "reliable" transport (perhaps TCP with CRCs/SCTP ?) - if we do a
    > > > >>   good job of keeping the encapsulation stuff separate from the
    > > > >>   transport stuff. (Julian, I heard from Randy that ideas similar to
    > > > >>   this were explored in your Haifa meeting.  And yes, he recalls they
    > > > >>   were given up since TCP was supposed to be reliable and granularity
    > > > >>   of recovery was deemed one I/O.)
    > > > >>
    > > > >> With that said, may I request David (with his co-chair hat on, :-))
    > > > >> to add some binding comments/observations on this discussion?
    > > > >>
    > > > >> If we decide to leave data SACKs as unattractive to implement, the
    > > > >> draft should in the least add a statement like - "Note that
    > > > >> satisfying all possible data SACK requests for a task with an
    > > > >> unacknowledged status implies implementing the I/O replay buffer on
    > > > >> the part of targets."
    > > > >> --
    > > > >> Mallikarjun
    > > > >>
    > > > >>
    > > > >> Mallikarjun Chadalapaka
    > > > >> Networked Storage Architecture
    > > > >> Network Storage Solutions Organization
    > > > >> MS 5668   Hewlett-Packard, Roseville.
    > > > >> cbm@rose.hp.com
    > > > >>
    > > > >>
    > > > >>
    > > > >>
    > > > >> >I think Julian's basically right -- I would point
    > > > >> >out that any case of write after read that breaks
    > > > >> >over iSCSI will also break over Fibre Channel.
    > > > >> >On FC, the scenario starts with a frame CRC failure
    > > > >> >on read data at the Initiator, so applications
    > > > >> >have to cope and typically do so by enforcing
    > > > >> >ordering at the app rather than using SCSI task
    > > > >> >ordering.
    > > > >> >
    > > > >> >While SCSI has clever tools like ACA and task
    > > > >> >ordering that appear to allow dependent operations
    > > > >> >to be sent to the target concurrently, in practice
    > > > >> >they don't work and/or aren't used (funny thing,
    > > > >> >those two reinforce each other ;-) ).  Hence
    > > > >> >a minimal approach to them is in order:
    > > > >> >- Make sure the result will interoperate.
    > > > >> >- Make sure T10 doesn't ding us for leaving something
    > > > >> >    completely out.
    > > > >> >- Don't specify anything not needed for the above.
    > > > >> >
    > > > >> >My 0.02,
    > > > >> >--David
    > > > >> >
    > > > >> >> -----Original Message-----
    > > > >> >> From:  julian_satran@il.ibm.com [SMTP:julian_satran@il.ibm.com]
    > > > >> >> Sent:  Tuesday, March 27, 2001 9:23 AM
    > > > >> >> To:    cbm@rose.hp.com
    > > > >> >> Cc:    someshg@yahoo.com; steph@cs.uchicago.edu; hufferd@us.ibm.com;
    > > > >> >> cbm@rose.hp.com; ldalleore@snapserver.com; Venkat Rangan;
    > > > >> >> Black_David@emc.com
    > > > >> >> Subject:    Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > > >> >>
    > > > >> >>
    > > > >> >>
    > > > >> >> Mallikarjun,
    > > > >> >>
    > > > >> >> I commiserate with you at the lack of ack for data but the Orlando
    > > > >> >> meeting stated - no.  Recall that I kept the number only as a
    > > > >> >> mechanism to detect missing packets.
    > > > >> >>
    > > > >> >> You can achieve the effect you want by keeping around data for a
    > > > >> >> while (you determine how long and then discard).
    > > > >> >>
    > > > >> >> If a SACK comes and you can recover - fine. If not you either
    > > > >> >> reaccess the media (if you know how) or reject and let the
    > > > >> >> initiator retry.
    > > > >> >>
    > > > >> >> You should not worry about R/W conflicts as programs bound to have
    > > > >> >> such conflicts either:
    > > > >> >>
    > > > >> >> 1)can live with them or
    > > > >> >> 2)protect themselves through some locks and rely on
    > > > >> >> "operation-end-status" to keep results deterministic.
    > > > >> >>
    > > > >> >> Regards,
    > > > >> >> Julo
    > > > >> >>
    > > > >> >>
    > > > >> >>
    > > > >> >> "Mallikarjun C." <cbm@rose.hp.com> on 27/03/2001 03:34:16
    > > > >> >>
    > > > >> >> Please respond to cbm@rose.hp.com
    > > > >> >>
    > > > >> >> To:   cbm@rose.hp.com, someshg@yahoo.com, steph@cs.uchicago.edu,
    > > > >> >>       Julian Satran/Haifa/IBM@IBMIL, John Hufferd/San Jose/IBM@IBMUS
    > > > >> >> cc:   Black_David@emc.com
    > > > >> >> Subject:  iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > > >> >>
    > > > >> >>
    > > > >> >>
    > > > >> >>
    > > > >> >> Hi Error Recovery Team,
    > > > >> >>
    > > > >> >> iSCSI can discard PDUs because of digest errors and request
    > > > >> >> retransmissions using the iSCSI data SACK.  To deal with such
    > > > >> >> an eventuality, targets that want to support data SACK have
    > > > >> >> the following options:
    > > > >> >>
    > > > >> >> (A) maintain a complete "replay" buffer for the entire I/O since
    > > > >> >>   a SACK could come anytime before the status is ack'ed by the
    > > > >> >>   initiator. [simple, but extremely expensive in memory resources]
    > > > >> >>
    > > > >> >> (B) (re-introduce data-ACKs into the draft, and) implement
    > > > >> >>   data-ACKs. This enables keeping only those I/O buffers that
    > > > >> >>   haven't been ack'ed by the initiator. IOW, become a real full
    > > > >> >>   transport! [everyone disliked it earlier...]
    > > > >> >>
    > > > >> >> (C) re-access the medium for data retransmission requests. Now
    > > > >> >>   there are 3 sub-cases in this to handle the changed data on the
    > > > >> >>   medium in a write-after-read scenario.  (SEE NOTE.1 at the bottom
    > > > >> >>   on how it is legal.)
    > > > >> >>      (1) On seeing any write, stall till status is ack'ed for all
    > > > >> >>          the previous reads (basically drain the pipe). [simple,
    > > > >> >>          but incurs an additional roundtrip delay for all writes]
    > > > >> >>      (2) A variation of the above, keep an eye only on the prior
    > > > >> >>          overlapping reads. [more BW efficient, but complicated to
    > > > >> >>          resolve the block dependencies in a stream of reads
    > > > >> >>          followed by writes]
    > > > >> >>      (3) Document the caveat and leave it up to the applications
    > > > >> >>          to avoid this case since this leads to data integrity
    > > > >> >>          issues. [pushing to apps since the transport can't get
    > > > >> >>          it right!]
    > > > >> >>
    > > > >> >> My first preference is (B), followed by (A), and I suggest we not
    > > > >> >> go to (C) at all with its inherent dangers.
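The memory trade-off between options (A) and (B) above can be sketched with a small target-side buffer. This is a hedged illustration, not anything taken from the draft: the class `ReplayBuffer` and its method names are hypothetical, and it assumes one buffer per task, keyed by DataSN, with cumulative data-ACKs.

```python
class ReplayBuffer:
    """Target-side store of already-sent Data PDU payloads for one task."""

    def __init__(self):
        self._pdus = {}  # DataSN -> payload bytes

    def record(self, data_sn, payload):
        """Remember a payload at send time so it can be replayed later."""
        self._pdus[data_sn] = payload

    def ack_through(self, data_sn):
        """Option (B): a cumulative data-ACK frees everything up to data_sn."""
        for sn in [s for s in self._pdus if s <= data_sn]:
            del self._pdus[sn]

    def status_acked(self):
        """Option (A): nothing is freed until the status itself is ack'ed."""
        self._pdus.clear()

    def replay(self, first_sn, count):
        """Answer a data SACK from memory, or return None if any requested
        PDU was already freed (the target must then re-read or reject)."""
        wanted = range(first_sn, first_sn + count)
        if any(sn not in self._pdus for sn in wanted):
            return None
        return [self._pdus[sn] for sn in wanted]

    def held_bytes(self):
        """Memory currently pinned by this task's replay data."""
        return sum(len(p) for p in self._pdus.values())


buf = ReplayBuffer()
for sn in range(4):
    buf.record(sn, b"x" * 4)
print(buf.held_bytes())   # option (A) holds all of this until status is ack'ed
buf.ack_through(1)
print(buf.held_bytes())   # option (B) releases ack'ed PDUs early
```

With only `status_acked` available, the buffer grows to the size of the whole I/O. With `ack_through`, it holds only the window of unacknowledged PDUs, at the price of extra ACK traffic and state on both ends.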
    > > > >> >>
    > > > >> >> Doing (B) naturally completes the transport job that iSCSI has
    > > > >> >> taken on itself in view of TCP's claimed unreliable checksum.
    > > > >> >> That is the right thing to do architecturally instead of being a
    > > > >> >> "semi-transport"!
    > > > >> >>
    > > > >> >> Comments?
    > > > >> >> --
    > > > >> >> Mallikarjun
    > > > >> >>
    > > > >> >>
    > > > >> >> Mallikarjun Chadalapaka
    > > > >> >> Networked Storage Architecture
    > > > >> >> Network Storage Solutions Organization
    > > > >> >> MS 5668   Hewlett-Packard, Roseville.
    > > > >> >> cbm@rose.hp.com
    > > > >> >>
    > > > >> >>
    > > > >> >> __________________________________________________________________________
    > > > >> >> Note.1: A Read followed by a Write (to the same blocks) is perfectly
    > > > >> >>         legal if SCSI sets the ORDERED task attribute on both the
    > > > >> >>         commands AND sets the NACA bit to one to indicate that Write
    > > > >> >>         shall be executed only if the Read did not fail (result in a
    > > > >> >>         Check Condition).
    > > > >> >>
    > > > >> >>         In the current case, since Read completed just fine from
    > > > >> >>         SCSI's point of view, SCSI is moving on to execute Write.
    > > > >> >>         Those read buffers had been freed up since iSCSI received an
    > > > >> >>         ACK at the TCP level, and since iSCSI has no other way to
    > > > >> >>         have the data ack'ed!
    > > >
    > > >
    > >
    > >
    > > _________________________________________________________
    > > Do You Yahoo!?
    > > Get your free @yahoo.com address at http://mail.yahoo.com
    > >
    > >
    > >
    
    
    
    


Last updated: Tue Sep 04 01:05:10 2001