
    RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"



    Assuming that the packet corruption escape rate is 1 in 10 billion
    and (roughly) 1 KB per packet, we get 1 escaped packet every
    10 trillion bytes of data transfer. It seems to me that if I
    had to retransmit 1 MByte because recovery happens at the
    command level rather than at a more granular level, that does
    not pose much of an additional burden (1 MB out of 10 trillion
    bytes). Also, assuming each I/O is 1 MByte in size, you would
    have to do recovery for 1 in every 10 million transactions.
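The arithmetic above can be checked with a short Python sketch. It uses only the message's own assumptions (1 escape per 10 billion packets, about 1 KB per packet, 1 MB per I/O); the 1 Gb/s link rate used to turn bytes into time is a hypothetical figure added for illustration, and the function names are illustrative only.

```python
# Sketch of the escape-rate arithmetic above. The three constants are the
# message's own assumptions; the 1 Gb/s link rate used below is hypothetical.
PACKETS_PER_ESCAPE = 10_000_000_000  # 1 escaped corruption per 10 billion packets
PACKET_BYTES = 1_000                 # roughly 1 KB per packet
IO_BYTES = 1_000_000                 # 1 MByte per I/O (one SCSI command)


def bytes_per_escape(packets_per_escape: int, packet_bytes: int) -> int:
    """Average number of bytes transferred between two escaped corruptions."""
    return packets_per_escape * packet_bytes


def ios_per_recovery(packets_per_escape: int, packet_bytes: int, io_bytes: int) -> int:
    """Average number of I/Os completed per command-level recovery."""
    return bytes_per_escape(packets_per_escape, packet_bytes) // io_bytes


def retry_overhead(packets_per_escape: int, packet_bytes: int, io_bytes: int) -> float:
    """Extra traffic, as a fraction, from resending one whole I/O per escape."""
    return io_bytes / bytes_per_escape(packets_per_escape, packet_bytes)


def seconds_per_escape(packets_per_escape: int, packet_bytes: int, link_bps: int) -> float:
    """Mean time between escapes on a link kept busy at link_bps bits/s."""
    return bytes_per_escape(packets_per_escape, packet_bytes) * 8 / link_bps


if __name__ == "__main__":
    print(bytes_per_escape(PACKETS_PER_ESCAPE, PACKET_BYTES))            # 10000000000000
    print(ios_per_recovery(PACKETS_PER_ESCAPE, PACKET_BYTES, IO_BYTES))  # 10000000
    print(retry_overhead(PACKETS_PER_ESCAPE, PACKET_BYTES, IO_BYTES))    # 1e-07
    hours = seconds_per_escape(PACKETS_PER_ESCAPE, PACKET_BYTES, 1_000_000_000) / 3600
    print(f"{hours:.1f} hours between escapes at 1 Gb/s")                # 22.2 hours ...
```

On a fully used 1 Gb/s link this works out to roughly one escape per day, which is one way to express Julo's "1 per 10 sec, 10 h or 10 y" question in time units; a slower or less busy link stretches the interval proportionally.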
    
    I don't know how realistic the 1 in 10 billion packet corruption
    escape rate is but I am using the number from past discussions.
    
    Somesh
    
    > -----Original Message-----
    > From: julian_satran@il.ibm.com [mailto:julian_satran@il.ibm.com]
    > Sent: Wednesday, April 04, 2001 11:56 AM
    > To: someshg@yahoo.com
    > Cc: ips@ece.cmu.edu
    > Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    >
    >
    >
    >
    > What are the numbers you are looking at:
    >
    > 1 per 10 sec, 1 per 10 h, or 1 per 10 years?
    >
    > Julo
    >
    > "Somesh Gupta" <someshg@yahoo.com> on 04/04/2001 20:15:53
    >
    > Please respond to someshg@yahoo.com
    >
    > To:   Julian Satran/Haifa/IBM@IBMIL, ips@ece.cmu.edu
    > cc:
    > Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    >
    >
    >
    >
    >
    >
    > > -----Original Message-----
    > > From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
    > > julian_satran@il.ibm.com
    > > Sent: Wednesday, April 04, 2001 7:32 AM
    > > To: ips@ece.cmu.edu
    > > Subject: Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > >
    > >
    > >
    > >
    > > SNACK is here for two reasons - Status retry (which is cheap) and Data
    > > retry as a side benefit.
    >
    >   Unless there is a clear benefit (i.e. the event is frequent enough
    >   to justify recovery at this level), the entire mechanism should be
    >   dropped; it is neither cheap nor free. If the event is relatively
    >   infrequent, recovery at the command level should be a sufficient
    >   mechanism.
    >
    > > CRC errors are not that rare (although we don't have real data, the
    > > simulations with file systems seem to indicate that the rate could
    > > be as high as 0.0002%). A restart of the link is expensive (slow
    > > start), and even if the rates are far lower, for many applications
    > > a slow start is a painful event.
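To see what the 0.0002% figure means for the two recovery granularities being debated, here is a rough Python sketch comparing the expected retransmitted traffic of command-level retry with per-PDU (SNACK-style) retry. It assumes, none of which comes from the thread, that 0.0002% is a per-PDU error rate, that errors hit PDUs independently, that a 1 MiB command is carried in 128 data PDUs of 8 KiB, and that only the first retransmission is counted.

```python
# Compare expected retransmission traffic for command-level retry versus
# per-PDU (SNACK-style) retry. Assumptions, not from the thread: errors hit
# PDUs independently, a 1 MiB command is split into 128 data PDUs of 8 KiB,
# and only the first retransmission is counted.
P_PDU = 0.0002 / 100      # the "as high as 0.0002%" figure, read as per PDU
PDUS_PER_COMMAND = 128    # hypothetical: 1 MiB command / 8 KiB per data PDU


def resend_fractions(p_pdu: float, pdus_per_command: int) -> tuple[float, float]:
    """Return (command_level, pdu_level) resent bytes as a fraction of bytes sent."""
    # Command-level recovery resends the whole command if any PDU is bad.
    command_level = 1.0 - (1.0 - p_pdu) ** pdus_per_command
    # PDU-level recovery resends only the bad PDUs: n * p * size / (n * size).
    pdu_level = p_pdu
    return command_level, pdu_level


if __name__ == "__main__":
    cmd, pdu = resend_fractions(P_PDU, PDUS_PER_COMMAND)
    print(f"command-level: {cmd:.4%}, PDU-level: {pdu:.4%}")
    # command-level: 0.0256%, PDU-level: 0.0002%
```

Under these assumptions both overheads are small as a fraction of bandwidth, while command-level retry costs roughly PDUS_PER_COMMAND times more; the cost of a connection restart (slow start) that Julo mentions is not modeled here.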
    >
    >   Intuitively, it seems that the combination of link-level CRC, TCP
    >   checksum, and good hardware (ECC, parity, etc.) should lead to a
    >   much lower level of errors caught by the iSCSI CRC algorithm. We have
    >   to separate error detection (i.e. what if I have bad hardware or
    >   some vendor makes a bad/buggy intermediate system) from recovery
    >   mechanisms, which should not be based on assumptions of bugs in the
    >   hardware/software of specific implementations (market forces will
    >   weed out such vendors).
    >
    > >
    > > Removing them from the spec is not a path we should take lightly.
    >
    >   I would phrase it the other way. We should not keep adding things
    >   unless there is very clear proof that the additional feature is
    >   beneficial and does not have negative side effects (and there is
    >   some consensus on adding it).
    > >
    > > Julo
    > >
    > > "Jon Hall" <jhall@emc.com> on 02/04/2001 16:13:35
    > >
    > > Please respond to "Jon Hall" <jhall@emc.com>
    > >
    > > To:   ips@ece.cmu.edu
    > > cc:
    > > Subject:  Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > >
    > >
    > >
    > >
    > >
    > > I agree with Somesh.  And would go farther -- the complexity
    > > that results from retaining enough target-side state to respond
    > > to a SACK/SNACK request is non-trivial and needs clear justification.
    > > Intuitively, a CRC that discovers an error in an iSCSI pdu header
    > > (that the TCP cksum missed) seems like it should be a rare event.
    > >
    > > What is the frequency of this event?  IMO the answer to this
    > > question should be written into the protocol spec -- assuming
    > > that it substantiates the benefit of SACK/SNACK.  Otherwise, the
    > > SACK/SNACK pdu should be removed.
    > >
    > > -Jon
    > >
    > > julian_satran@il.ibm.com writes:
    > > >
    > > >Somesh,
    > > >
    > > >As I stated earlier, the DataSN was created to detect missing data PDUs.
    > > >SNACK is needed to recover a missing StatusSN; recovering a missing
    > > >DataSN is only a bonus if the target wants to support it. It is a
    > > >trivial mechanism and I think it should stay.
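Detecting missing data PDUs from DataSN, as described above, reduces to spotting gaps in a per-task sequence. A minimal initiator-side sketch in Python, assuming DataSN starts at 0 for each task and increases by one per data PDU; the class and method names are hypothetical.

```python
class DataSNTracker:
    """Initiator-side sketch: spot missing data PDUs from gaps in DataSN.

    Assumes DataSN starts at 0 for each task and increments by one per data
    PDU. A PDU discarded for a digest error looks exactly like a missing one.
    """

    def __init__(self) -> None:
        self.next_expected = 0
        self.missing: set[int] = set()

    def on_data_pdu(self, data_sn: int) -> list[int]:
        """Record an arriving DataSN and return any newly detected gap."""
        if data_sn < self.next_expected:
            # A late or retransmitted (e.g. SNACKed) PDU fills an earlier hole.
            self.missing.discard(data_sn)
            return []
        gap = list(range(self.next_expected, data_sn))
        self.missing.update(gap)
        self.next_expected = data_sn + 1
        return gap


if __name__ == "__main__":
    tracker = DataSNTracker()
    for sn in (0, 1, 3):            # the PDU with DataSN 2 was dropped
        print(sn, tracker.on_data_pdu(sn))
    # 0 []
    # 1 []
    # 3 [2]
    tracker.on_data_pdu(2)          # recovered copy arrives
    print(tracker.missing)          # set()
```

The gap list is what an initiator would ask the target to resend; whether the target can honor that request is the replay-buffer question discussed further down the thread.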
    > > >
    > > >Julo
    > > >
    > > >"Somesh Gupta" <someshg@yahoo.com> on 31/03/2001 02:25:52
    > > >
    > > >Please respond to someshg@yahoo.com
    > > >
    > > >To:   Julian Satran/Haifa/IBM@IBMIL, ips@ece.cmu.edu
    > > >cc:
    > > >Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > >
    > > >
    > > >
    > > >
    > > >Sorry to have been missing for a while. Hope you will
    > > >appreciate my being back in action :-). It was a fairly
    > > >clear consensus in Orlando that applications broke up
    > > >their transfers into reasonably small chunks i.e. they
    > > >did not have very long running transfers.
    > > >
    > > >Therefore the consensus was that a command level recovery
    > > >mechanism was sufficient instead of an ack/sack for each
    > > >data PDU.
    > > >
    > > >The SACK mechanism was a post Orlando invention. Without
    > > >an ack mechanism (for every data PDU), the SACK mechanism
    > > >just imposes additional burden on either end of the session,
    > > >without really much benefit.
    > > >
    > > >The benefit of having SACK is that it saves bandwidth in case
    > > >the data part of the data PDU failed an integrity check
    > > >(but passed TCP checksum). This is a rare enough case that
    > > >as a percentage, the bandwidth loss from retransmitting
    > > >all the data associated with a read or write command is
    > > >very very small.
    > > >
    > > >In addition, it avoids the complexity of restarting
    > > >something from the middle, as compared to from the beginning.
    > > >
    > > >To me it seems that there is significant simplicity (from
    > > >implementation, reliability and recovery process) from
    > > >having smaller data transfer per command.
    > > >
    > > >I would really like to get rid of the SACK command.
    > > >
    > > >Somesh
    > > >
    > > >> -----Original Message-----
    > > >> From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
    > > >> julian_satran@il.ibm.com
    > > >> Sent: Wednesday, March 28, 2001 6:57 AM
    > > >> To: ips@ece.cmu.edu
    > > >> Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > >>
    > > >>
    > > >>
    > > >>
    > > >> Mallikarjun,
    > > >>
    > > >> Last summer I thought that recovery within a connection should be left
    > > >> to TCP. It is simple and could be made available through IPsec (if no
    > > >> new option of any form can be added).
    > > >>
    > > >> Two things killed this:
    > > >>
    > > >>    The requirement to have a data encapsulation that can pass through
    > > >>    application proxies (like a storage router)
    > > >>    The "NO WAY" message we got from IESG-Security on a CRC-only IPsec
    > > >>    header
    > > >>
    > > >>
    > > >> As for the ACK, I am very much in favor of it (it is a no-brainer),
    > > >> and implementations are in fact allowed to drop even unacked data.
    > > >>
    > > >> I am bound by the Orlando meeting decision to drop it. Except for the
    > > >> regular "oppose everything" crowd, the two vocal opponents were Somesh
    > > >> Gupta and Matt Wakeley.
    > > >>
    > > >> David may or may not want to re-open the issue; I am not going to ask
    > > >> for it.
    > > >>
    > > >> Regards,
    > > >> Julo
    > > >>
    > > >> "Mallikarjun C." <cbm@rose.hp.com> on 28/03/2001 00:45:02
    > > >>
    > > >> Please respond to cbm@rose.hp.com
    > > >>
    > > >> To:   Black_David@emc.com
    > > >> cc:   Julian Satran/Haifa/IBM@IBMIL, cbm@rose.hp.com, someshg@yahoo.com,
    > > >>       steph@cs.uchicago.edu, John Hufferd/San Jose/IBM@IBMUS,
    > > >>       ldalleore@snapserver.com, venkat@rhapsodynetworks.com
    > > >> Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > >>
    > > >>
    > > >>
    > > >>
    > > >> David and Julian,
    > > >>
    > > >> I appreciate both your views, and should I say that they're
    > > >> along predicted lines :-)
    > > >>
    > > >> - David's right in saying that the situation is akin to FC's.
    > > >>   However, I would like to point out that FC is an unreliable
    > > >>   transport, and hence is forced to pick up a lot of the transport
    > > >>   baggage (at least in FCP-2, as I understand), in addition
    > > >>   to being a SCSI encapsulation layer.  Unfortunately, even with
    > > >>   TCP being the "reliable" transport, iSCSI is going along the
    > > >>   same lines, i.e. transport baggage + SCSI encapsulation.  My
    > > >>   point is: if this is indeed a necessary evil, why don't we
    > > >>   complete iSCSI's transport functionality with data-ACKs?
    > > >>
    > > >> - If data SACK is introduced mostly to make up for TCP's shortcomings,
    > > >>   we're making its usage (and implementation) drastically less
    > > >>   appealing, since the only way error recovery algorithms can *rely*
    > > >>   on data SACK is when replay is supported (or "ReplaySupport=yes" in
    > > >>   my proposal), which is extremely expensive.  IOW, we're defining
    > > >>   data SACK in the draft and not providing any incentives to implement
    > > >>   and use it!
    > > >>
    > > >> - I submit that since iSCSI is being hailed as the ideal SCSI Transport
    > > >>   protocol in its definition so far (and I believe rightly so, given
    > > >>   mandated command ordering, bi-di support, and SCSI CRN support, to
    > > >>   name a few examples), the perfectly SCSI-legal R/W interactions that
    > > >>   break in other transports *do not* have to break in iSCSI.
    > > >>
    > > >> - A last idea (which may seem radical at this point) regarding iSCSI
    > > >>   being a "full transport": this provides us an opportunity to "cast
    > > >>   off" the transport baggage in the future, when we truly move to a
    > > >>   "reliable" transport (perhaps TCP with CRCs, or SCTP?), if we do a
    > > >>   good job of keeping the encapsulation stuff separate from the
    > > >>   transport stuff.  (Julian, I heard from Randy that ideas similar to
    > > >>   this were explored in your Haifa meeting.  And yes, he recalls they
    > > >>   were given up since TCP was supposed to be reliable and the
    > > >>   granularity of recovery was deemed to be one I/O.)
    > > >>
    > > >> With that said, may I request David (with his co-chair hat on, :-))
    > > >> to add some binding comments/observations on this discussion?
    > > >>
    > > >> If we decide to leave data SACKs as unattractive to implement, the
    > > >> draft should at the least add a statement like: "Note that satisfying
    > > >> all possible data SACK requests for a task with an unacknowledged
    > > >> status implies implementing the I/O replay buffer on the part of
    > > >> targets."
    > > >> --
    > > >> Mallikarjun
    > > >>
    > > >>
    > > >> Mallikarjun Chadalapaka
    > > >> Networked Storage Architecture
    > > >> Network Storage Solutions Organization
    > > >> MS 5668   Hewlett-Packard, Roseville.
    > > >> cbm@rose.hp.com
    > > >>
    > > >>
    > > >>
    > > >>
    > > >> >I think Julian's basically right -- I would point
    > > >> >out that any case of write after read that breaks
    > > >> >over iSCSI will also break over Fibre Channel.
    > > >> >On FC, the scenario starts with a frame CRC failure
    > > >> >on read data at the Initiator, so applications
    > > >> >have to cope and typically do so by enforcing
    > > >> >ordering at the app rather than using SCSI task
    > > >> >ordering.
    > > >> >
    > > >> >While SCSI has clever tools like ACA and task
    > > >> >ordering that appear to allow dependent operations
    > > >> >to be sent to the target concurrently, in practice
    > > >> >they don't work and/or aren't used (funny thing,
    > > >> >those two reinforce each other ;-) ).  Hence
    > > >> >a minimal approach to them is in order:
    > > >> >- Make sure the result will interoperate.
    > > >> >- Make sure T10 doesn't ding us for leaving something
    > > >> >    completely out.
    > > >> >- Don't specify anything not needed for the above.
    > > >> >
    > > >> >My 0.02,
    > > >> >--David
    > > >> >
    > > >> >> -----Original Message-----
    > > >> >> From:  julian_satran@il.ibm.com [SMTP:julian_satran@il.ibm.com]
    > > >> >> Sent:  Tuesday, March 27, 2001 9:23 AM
    > > >> >> To:    cbm@rose.hp.com
    > > >> >> Cc:    someshg@yahoo.com; steph@cs.uchicago.edu; hufferd@us.ibm.com;
    > > >> >> cbm@rose.hp.com; ldalleore@snapserver.com; Venkat Rangan;
    > > >> >> Black_David@emc.com
    > > >> >> Subject:    Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > >> >>
    > > >> >>
    > > >> >>
    > > >> >> Mallikarjun,
    > > >> >>
    > > >> >> I commiserate with you over the lack of ack for data, but the
    > > >> >> Orlando meeting said no.  Recall that I kept the number only as a
    > > >> >> mechanism to detect missing packets.
    > > >> >>
    > > >> >> You can achieve the effect you want by keeping data around for a
    > > >> >> while (you determine how long, and then discard it).
    > > >> >>
    > > >> >> If a SACK comes and you can recover, fine. If not, you either
    > > >> >> re-access the media (if you know how) or reject the request and let
    > > >> >> the initiator retry.
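Julo's suggestion above (keep sent data for a target-chosen time, serve a SACK if the data is still there, otherwise re-read or reject) can be sketched as a time-bounded retention buffer. This is a minimal Python illustration under stated assumptions: PDUs are keyed by an initiator task tag plus DataSN, and the class name, method names, and injectable clock are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class RetentionBuffer:
    """Target-side sketch: keep sent data PDUs for a while to answer SNACKs.

    The class and method names are hypothetical. retention_s is the target's
    own choice ("you determine how long"); after that a PDU is discarded.
    """

    def __init__(self, retention_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.retention_s = retention_s
        self.clock = clock
        # (initiator task tag, DataSN) -> (time sent, payload)
        self._kept: Dict[Tuple[int, int], Tuple[float, bytes]] = {}

    def record_sent(self, task_tag: int, data_sn: int, payload: bytes) -> None:
        self._kept[(task_tag, data_sn)] = (self.clock(), payload)

    def _expire(self) -> None:
        now = self.clock()
        stale = [k for k, (sent, _) in self._kept.items()
                 if now - sent > self.retention_s]
        for key in stale:
            del self._kept[key]

    def on_snack(self, task_tag: int, data_sn: int) -> Optional[bytes]:
        """Return the payload to resend, or None if it is gone.

        On None the target must re-access the media (if it knows how) or
        reject the request and let the initiator retry the command.
        """
        self._expire()
        entry = self._kept.get((task_tag, data_sn))
        return entry[1] if entry is not None else None
```

The retention time trades memory against the chance of having to fall back to re-reading or rejecting; a clock can be injected so the behavior is testable without waiting.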
    > > >> >>
    > > >> >> You should not worry about R/W conflicts, as programs that are bound
    > > >> >> to have such conflicts either:
    > > >> >>
    > > >> >> 1) can live with them, or
    > > >> >> 2) protect themselves through some locks and rely on
    > > >> >> "operation-end-status" to keep results deterministic.
    > > >> >>
    > > >> >> Regards,
    > > >> >> Julo
    > > >> >>
    > > >> >>
    > > >> >>
    > > >> >> "Mallikarjun C." <cbm@rose.hp.com> on 27/03/2001 03:34:16
    > > >> >>
    > > >> >> Please respond to cbm@rose.hp.com
    > > >> >>
    > > >> >> To:   cbm@rose.hp.com, someshg@yahoo.com, steph@cs.uchicago.edu,
    > > >> >>       Julian Satran/Haifa/IBM@IBMIL, John Hufferd/San Jose/IBM@IBMUS
    > > >> >> cc:   Black_David@emc.com
    > > >> >> Subject:  iSCSI ERT: data SACK/replay buffer/"semi-transport"
    > > >> >>
    > > >> >>
    > > >> >>
    > > >> >>
    > > >> >> Hi Error Recovery Team,
    > > >> >>
    > > >> >> iSCSI can discard PDUs because of digest errors and request
    > > >> >> retransmissions using the iSCSI data SACK.  To deal with such
    > > >> >> an eventuality, targets that want to support data SACK have
    > > >> >> the following options:
    > > >> >>
    > > >> >> (A) Maintain a complete "replay" buffer for the entire I/O, since
    > > >> >>     a SACK could come any time before the status is ack'ed by the
    > > >> >>     initiator. [simple, but extremely expensive in memory resources]
    > > >> >>
    > > >> >> (B) (Re-introduce data-ACKs into the draft, and) implement data-ACKs.
    > > >> >>     This enables keeping only those I/O buffers that haven't been
    > > >> >>     ack'ed by the initiator. IOW, become a real full transport!
    > > >> >>     [everyone disliked it earlier...]
    > > >> >>
    > > >> >> (C) Re-access the medium for data retransmission requests. There
    > > >> >>     are 3 sub-cases in this to handle the changed data on the medium
    > > >> >>     in a write-after-read scenario.  (SEE NOTE.1 at the bottom on how
    > > >> >>     it is legal.)
    > > >> >>     (1) On seeing any write, stall till the status is ack'ed for all
    > > >> >>         the previous reads (basically drain the pipe). [simple, but
    > > >> >>         incurs an additional round-trip delay for all writes]
    > > >> >>     (2) A variation of the above: keep an eye only on the prior
    > > >> >>         overlapping reads. [more BW-efficient, but complicated to
    > > >> >>         resolve the block dependencies in a stream of reads followed
    > > >> >>         by writes]
    > > >> >>     (3) Document the caveat and leave it up to the applications
    > > >> >>         to avoid this case, since it leads to data integrity
    > > >> >>         issues. [pushing it to apps since the transport can't get
    > > >> >>         it right!]
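The memory difference between options (A) and (B) above comes down to when a target may free a sent PDU. A minimal Python sketch of a per-task replay buffer, assuming a cumulative data-ACK that names the lowest DataSN not yet acknowledged; the draft under discussion has no such ACK, so on_data_ack and the other names are hypothetical.

```python
from typing import Dict, Optional


class TaskReplayBuffer:
    """Per-task replay buffer contrasting options (A) and (B) above.

    (A) Nothing is freed until the initiator acknowledges the status.
    (B) A data-ACK (hypothetical here, since the draft has none) frees every
        PDU below the acknowledged DataSN, so less memory is pinned.
    """

    def __init__(self) -> None:
        self.pdus: Dict[int, bytes] = {}  # DataSN -> payload

    def record_sent(self, data_sn: int, payload: bytes) -> None:
        self.pdus[data_sn] = payload

    def bytes_held(self) -> int:
        return sum(len(p) for p in self.pdus.values())

    def on_data_ack(self, acked_below: int) -> None:
        """Option (B): free all PDUs with DataSN < acked_below (cumulative ACK)."""
        for sn in [sn for sn in self.pdus if sn < acked_below]:
            del self.pdus[sn]

    def on_status_ack(self) -> None:
        """Both options: once the status is acknowledged no SACK can follow."""
        self.pdus.clear()

    def replay(self, data_sn: int) -> Optional[bytes]:
        """Payload for a retransmission request, or None if already freed."""
        return self.pdus.get(data_sn)
```

Under (A), bytes_held() grows to the full I/O size and stays there until on_status_ack(); under (B), it is bounded by the data sent since the last data-ACK.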
    > > >> >>
    > > >> >> My first preference is (B), followed by (A), and I suggest we not go
    > > >> >> to (C) at all, given its inherent dangers.
    > > >> >>
    > > >> >> Doing (B) naturally completes the transport job that iSCSI has taken
    > > >> >> on itself in view of TCP's claimed unreliable checksum.  That is the
    > > >> >> right thing to do architecturally, instead of being a "semi-transport"!
    > > >> >>
    > > >> >> Comments?
    > > >> >> --
    > > >> >> Mallikarjun
    > > >> >>
    > > >> >>
    > > >> >> Mallikarjun Chadalapaka
    > > >> >> Networked Storage Architecture
    > > >> >> Network Storage Solutions Organization
    > > >> >> MS 5668   Hewlett-Packard, Roseville.
    > > >> >> cbm@rose.hp.com
    > > >> >>
    > > >> >>
    > > >>
    > > >> >> __________________________________________________________________________
    > > >> >> Note.1: A Read followed by a Write (to the same blocks) is perfectly
    > > >> >>         legal if SCSI sets the ORDERED task attribute on both
    > > >> >>         commands AND sets the NACA bit to one to indicate that the
    > > >> >>         Write shall be executed only if the Read did not fail
    > > >> >>         (result in a Check Condition).
    > > >> >>
    > > >> >>         In the current case, since the Read completed just fine from
    > > >> >>         SCSI's point of view, SCSI is moving on to execute the Write.
    > > >> >>         Those read buffers had been freed up since iSCSI received an
    > > >> >>         ACK at the TCP level, and since iSCSI has no other way to have
    > > >> >>         the data ack'ed!
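Option (C)(2) amounts to a dependency check between a new write and earlier reads whose status is not yet acknowledged, which is the write-after-read case Note.1 describes. A minimal Python sketch, assuming block ranges are tracked per task tag; BlockRange, WriteAfterReadGuard and their methods are hypothetical names, not anything defined in the draft.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BlockRange:
    lba: int     # first logical block address
    blocks: int  # number of blocks

    def overlaps(self, other: "BlockRange") -> bool:
        return self.lba < other.lba + other.blocks and other.lba < self.lba + self.blocks


class WriteAfterReadGuard:
    """Sketch of option (C)(2): hold a write while any overlapping read could
    still be SACKed, i.e. its status has not yet been acknowledged.

    Names are hypothetical. Option (C)(1) is the special case that treats
    every unacknowledged read as overlapping.
    """

    def __init__(self) -> None:
        self.unacked_reads: Dict[int, BlockRange] = {}  # task tag -> blocks read

    def read_completed(self, task_tag: int, blocks: BlockRange) -> None:
        self.unacked_reads[task_tag] = blocks

    def read_status_acked(self, task_tag: int) -> None:
        self.unacked_reads.pop(task_tag, None)

    def blocking_reads(self, write: BlockRange) -> List[int]:
        """Task tags of reads the write must wait for (empty means go ahead)."""
        return [tag for tag, rng in self.unacked_reads.items() if rng.overlaps(write)]
```

The bookkeeping itself is small; the complexity Mallikarjun points to comes from keeping this state consistent across a stream of interleaved reads and writes, and from deciding how long a held write may wait.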
    > >
    > >
    >
    >
    > _________________________________________________________
    > Do You Yahoo!?
    > Get your free @yahoo.com address at http://mail.yahoo.com
    >
    >
    >
    
    
    
    



Last updated: Tue Sep 04 01:05:11 2001
6315 messages in chronological order