RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"

To: someshg@yahoo.com
Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
From: julian_satran@il.ibm.com
Date: Wed, 4 Apr 2001 23:22:56 +0200
cc: someshg@yahoo.com, ips@ece.cmu.edu
Content-Disposition: inline
Content-type: text/plain; charset=us-ascii
Sender: owner-ips@ece.cmu.edu


Somesh,

Can you give us a reference for those rates?  Where do they come from?

Regards,
Julo

"Somesh Gupta" <someshg@yahoo.com> on 04/04/2001 23:02:06

Please respond to someshg@yahoo.com

To:   Julian Satran/Haifa/IBM@IBMIL, someshg@yahoo.com
cc:   ips@ece.cmu.edu
Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"




Assuming that the packet corruption escape rate is 1 in 10 billion,
we have (roughly, assuming 1 KByte per packet) 1 escaped packet every
10 trillion bytes of data transfer. It seems to me that if I
had to retransfer 1 MByte because recovery happens at the
command level rather than at a more granular level, that does
not pose much of an additional burden (1 MByte out of 10 trillion
bytes). Also, assuming each I/O is 1 MByte in size, you would
have to do recovery for only 1 in every 10 million transactions.

I don't know how realistic the 1 in 10 billion packet corruption
escape rate is but I am using the number from past discussions.
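
The arithmetic above can be checked with a short sketch. The constants are
just the assumed figures from this message (1 in 10 billion escape rate,
a round 1000 bytes per packet, 1 MByte per I/O), not measured data, and the
function names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the message.
# Every constant below is an assumption taken from the discussion,
# not a measurement.

PACKETS_PER_ESCAPE = 10_000_000_000  # 1 corrupted packet escapes per 10 billion
PACKET_BYTES = 1_000                 # "roughly 1K byte per packet" (rounded)
IO_BYTES = 1_000_000                 # each I/O assumed to be 1 MByte


def bytes_between_escapes(packets_per_escape: int, packet_bytes: int) -> int:
    """Bytes transferred, on average, per undetected-by-TCP corruption."""
    return packets_per_escape * packet_bytes


def ios_between_escapes(bytes_per_escape: int, io_bytes: int) -> int:
    """Number of I/Os, on average, per command-level recovery."""
    return bytes_per_escape // io_bytes


def retransmit_overhead(io_bytes: int, bytes_per_escape: int) -> float:
    """Fraction of extra traffic if a whole I/O is resent per escape."""
    return io_bytes / bytes_per_escape


bytes_per_escape = bytes_between_escapes(PACKETS_PER_ESCAPE, PACKET_BYTES)
ios_per_escape = ios_between_escapes(bytes_per_escape, IO_BYTES)
overhead = retransmit_overhead(IO_BYTES, bytes_per_escape)

print(f"bytes between escapes: {bytes_per_escape:,}")
print(f"I/Os between escapes:  {ios_per_escape:,}")
print(f"retransmit overhead:   {overhead:.1e}")
```

Changing PACKETS_PER_ESCAPE is the quickest way to see how sensitive the
conclusion is to the assumed escape rate.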

Somesh

> -----Original Message-----
> From: julian_satran@il.ibm.com [mailto:julian_satran@il.ibm.com]
> Sent: Wednesday, April 04, 2001 11:56 AM
> To: someshg@yahoo.com
> Cc: ips@ece.cmu.edu
> Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>
>
>
>
> What are the numbers you are looking at:
>
> 1 per 10 sec, 1 per 10 h, or 1 per 10 y?
>
> Julo
>
> "Somesh Gupta" <someshg@yahoo.com> on 04/04/2001 20:15:53
>
> Please respond to someshg@yahoo.com
>
> To:   Julian Satran/Haifa/IBM@IBMIL, ips@ece.cmu.edu
> cc:
> Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
>
>
>
>
>
>
> > -----Original Message-----
> > From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
> > julian_satran@il.ibm.com
> > Sent: Wednesday, April 04, 2001 7:32 AM
> > To: ips@ece.cmu.edu
> > Subject: Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
> >
> >
> >
> >
> > SNACK is here for two reasons - Status retry (which is cheap) and Data
> > retry as a side benefit.
>
>   Unless there is clear benefit (i.e. the event is frequent enough
>   to justify recovery at this level), the entire mechanism should be
>   dropped - it is neither cheap nor free. If it is relatively
>   infrequent, the recovery at the command level should be a sufficient
>   mechanism.
>
> > CRC errors are not that rare (although we don't have real data, the
> > simulations with file systems seem to indicate that the numbers could
> > be as high as 0.0002%). A restart of the link is expensive (slow start),
> > and even if the error rates are far lower, for many applications a slow
> > start is a painful event.
>
>   Intuitively, it seems that the combination of link level CRC, TCP
>   checksum, and good hardware (ECC, parity etc.) should lead to a
>   much lower level of errors caught by the iSCSI CRC algorithm. We have
>   to separate error detection (i.e. what if I have bad hardware or
>   some vendor makes a bad/buggy intermediate system) from recovery
>   mechanisms, which should not be based on assumptions of bugs in the
>   hardware/software of specific implementations (market forces
>   will weed out such a vendor).
>
> >
> > Removing them from the spec is not a path we should take lightly.
>
>   I would phrase it the other way. We should not keep adding things
>   unless there is very clear proof that the additional feature is
>   beneficial and does not have negative side effects (and there is
>   some consensus on adding it).
> >
> > Julo
> >
> > "Jon Hall" <jhall@emc.com> on 02/04/2001 16:13:35
> >
> > Please respond to "Jon Hall" <jhall@emc.com>
> >
> > To:   ips@ece.cmu.edu
> > cc:
> > Subject:  Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
> >
> >
> >
> >
> >
> > I agree with Somesh.  And would go farther -- the complexity
> > that results from retaining enough target-side state to respond
> > to a SACK/SNACK request is non-trivial and needs clear justification.
> > Intuitively, a CRC that discovers an error in an iSCSI pdu header
> > (that the TCP cksum missed) seems like it should be a rare event.
> >
> > What is the frequency of this event?  IMO the answer to this
> > question should be written into the protocol spec -- assuming
> > that it substantiates the benefit of SACK/SNACK.  Otherwise, the
> > SACK/SNACK pdu should be removed.
> >
> > -Jon
> >
> > julian_satran@il.ibm.com writes:
> > >
> > >Somesh,
> > >
> > >As I stated earlier - the DataSN was created to detect missing data PDUs.
> > >SNACK is needed to recover missing StatusSN and missing dataSN is only a
> > >bonus if the target wants to support it.  It is a trivial mechanism and I
> > >think it should stay.
> > >
> > >Julo
> > >
> > >"Somesh Gupta" <someshg@yahoo.com> on 31/03/2001 02:25:52
> > >
> > >Please respond to someshg@yahoo.com
> > >
> > >To:   Julian Satran/Haifa/IBM@IBMIL, ips@ece.cmu.edu
> > >cc:
> > >Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
> > >
> > >
> > >
> > >
> > >Sorry to have been missing for a while. Hope you will
> > >appreciate my being back in action :-). It was a fairly
> > >clear consensus in Orlando that applications broke up
> > >their transfers into reasonably small chunks i.e. they
> > >did not have very long running transfers.
> > >
> > >Therefore the consensus was that a command level recovery
> > >mechanism was sufficient instead of an ack/sack for each
> > >data PDU.
> > >
> > >The SACK mechanism was a post Orlando invention. Without
> > >an ack mechanism (for every data PDU), the SACK mechanism
> > >just imposes additional burden on either end of the session,
> > >without really much benefit.
> > >
> > >The benefit of having SACK is of saving bandwidth in case
> > >the data part of the data PDU failed an integrity check
> > >(but passed TCP checksum). This is a rare enough case that
> > >as a percentage, the bandwidth loss from retransmitting
> > >all the data associated with a read or write command is
> > >very very small.
> > >
> > >In addition, it avoids the complexity of restarting
> > >something from the middle, as compared to from the beginning.
> > >
> > >To me it seems that there is significant simplicity (from
> > >implementation, reliability and recovery process) from
> > >having smaller data transfer per command.
> > >
> > >I would really like to get rid of the SACK command.
> > >
> > >Somesh
> > >
> > >> -----Original Message-----
> > >> From: owner-ips@ece.cmu.edu [mailto:owner-ips@ece.cmu.edu]On Behalf Of
> > >> julian_satran@il.ibm.com
> > >> Sent: Wednesday, March 28, 2001 6:57 AM
> > >> To: ips@ece.cmu.edu
> > >> Subject: RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
> > >>
> > >>
> > >>
> > >>
> > >> Mallikarjun,
> > >>
> > >> Last summer I thought that recovery within a connection should be left to
> > >> TCP. It is simple and could be made available through IPsec (if no new
> > >> option of any form can be added).
> > >>
> > >> Two things killed this:
> > >>
> > >>    - The requirement to have a data encapsulation that can pass through
> > >>      application proxies (like a storage router)
> > >>    - The "NO WAY" message we got from IESG-Security on a CRC only IPSec
> > >>      header
> > >>
> > >>
> > >> As for the ACK - I am very much in favor of it (it is a no brainer) and
> > >> implementations are in fact allowed to drop even unacked data.
> > >>
> > >> I am bound by the Orlando meeting decision to drop it. Except for the regular
> > >> "oppose everything" crowd, the two vocal opponents were Somesh Gupta and
> > >> Matt Wakeley.
> > >>
> > >> David may or may not want to re-open the issue - I am not going to ask for it.
> > >>
> > >> Regards,
> > >> Julo
> > >>
> > >> "Mallikarjun C." <cbm@rose.hp.com> on 28/03/2001 00:45:02
> > >>
> > >> Please respond to cbm@rose.hp.com
> > >>
> > >> To:   Black_David@emc.com
> > >> cc:   Julian Satran/Haifa/IBM@IBMIL, cbm@rose.hp.com, someshg@yahoo.com,
> > >>       steph@cs.uchicago.edu, John Hufferd/San Jose/IBM@IBMUS,
> > >>       ldalleore@snapserver.com, venkat@rhapsodynetworks.com
> > >> Subject:  RE: iSCSI ERT: data SACK/replay buffer/"semi-transport"
> > >>
> > >>
> > >>
> > >>
> > >> David and Julian,
> > >>
> > >> I appreciate both your views, and should I say that they're
> > >> along predicted lines :-)
> > >>
> > >> - David's right in saying that the situation is akin to FC's.
> > >>   However, I would like to point out that FC is an unreliable
> > >>   transport, and hence is forced to pick up a lot of the transport
> > >>   baggage (at least in FCP-2, as I understand), in addition
> > >>   to being a SCSI encapsulation layer.  Unfortunately, even with
> > >>   TCP being the "reliable" transport, iSCSI is going along the
> > >>   same lines - ie. transport baggage + SCSI encapsulation.  My
> > >>   point is - if this is indeed a necessary evil, why don't we
> > >>   complete iSCSI's transport functionality by data-ACKs?
> > >>
> > >> - If data SACK is introduced mostly to make up for TCP's shortcomings,
> > >>   we're making its usage (and implementation) drastically less appealing
> > >>   since the only way error recovery algorithms can *rely* on data SACK
> > >>   is when replay is supported (or, "ReplaySupport=yes"  in my proposal),
> > >>   which is extremely expensive.  IOW, we're defining data SACK in the
> > >>   draft and not providing any incentives to implement and use it!
> > >>
> > >> - I submit that since iSCSI is being hailed as the ideal SCSI Transport
> > >>   protocol in its definition so far (and I believe, rightly so - mandating
> > >>   command ordering, bi-di support, SCSI CRN support to name a few examples),
> > >>   the perfectly SCSI-legal R/W interactions that break in other transports
> > >>   *do not* have to break in iSCSI.
> > >>
> > >> - A last idea (may seem radical at this point) in regards to iSCSI
> > >>   being a "full transport". This provides us an opportunity to "cast
> > >>   off" the transport baggage in future when we truly move to a "reliable"
> > >>   transport (perhaps TCP with CRCs/SCTP ?) - if we do a good job of
> > >>   keeping the encapsulation stuff separate from the transport stuff.
> > >>   (Julian, I heard from Randy that ideas similar to this were explored
> > >>   in your Haifa meeting.  And yes, he recalls they were given up since
> > >>   TCP was supposed to be reliable and granularity of recovery was deemed
> > >>   one I/O.)
> > >>
> > >> With that said, may I request David (with his co-chair hat on, :-))
> > >> to add some binding comments/observations on this discussion?
> > >>
> > >> If we decide to leave data SACKs as unattractive to implement, the draft
> > >> should at the least add a statement like - "Note that satisfying all
> > >> possible data SACK requests for a task with an unacknowledged status
> > >> implies implementing the I/O replay buffer on the part of targets."
> > >> --
> > >> Mallikarjun
> > >>
> > >>
> > >> Mallikarjun Chadalapaka
> > >> Networked Storage Architecture
> > >> Network Storage Solutions Organization
> > >> MS 5668   Hewlett-Packard, Roseville.
> > >> cbm@rose.hp.com
> > >>
> > >>
> > >>
> > >>
> > >> >I think Julian's basically right -- I would point
> > >> >out that any case of write after read that breaks
> > >> >over iSCSI will also break over Fibre Channel.
> > >> >On FC, the scenario starts with a frame CRC failure
> > >> >on read data at the Initiator, so applications
> > >> >have to cope and typically do so by enforcing
> > >> >ordering at the app rather than using SCSI task
> > >> >ordering.
> > >> >
> > >> >While SCSI has clever tools like ACA and task
> > >> >ordering that appear to allow dependent operations
> > >> >to be sent to the target concurrently, in practice
> > >> >they don't work and/or aren't used (funny thing,
> > >> >those two reinforce each other ;-) ).  Hence
> > >> >a minimal approach to them is in order:
> > >> >- Make sure the result will interoperate.
> > >> >- Make sure T10 doesn't ding us for leaving something
> > >> >    completely out.
> > >> >- Don't specify anything not needed for the above.
> > >> >
> > >> >My 0.02,
> > >> >--David
> > >> >
> > >> >> -----Original Message-----
> > >> >> From:  julian_satran@il.ibm.com [SMTP:julian_satran@il.ibm.com]
> > >> >> Sent:  Tuesday, March 27, 2001 9:23 AM
> > >> >> To:    cbm@rose.hp.com
> > >> >> Cc:    someshg@yahoo.com; steph@cs.uchicago.edu; hufferd@us.ibm.com;
> > >> >> cbm@rose.hp.com; ldalleore@snapserver.com; Venkat Rangan;
> > >> >> Black_David@emc.com
> > >> >> Subject:    Re: iSCSI ERT: data SACK/replay buffer/"semi-transport"
> > >> >>
> > >> >>
> > >> >>
> > >> >> Mallikarjun,
> > >> >>
> > >> >> I commiserate with you at the lack of ack for data but the Orlando meeting
> > >> >> stated - no.  Recall that I kept the number only as a mechanism to detect
> > >> >> missing packets.
> > >> >>
> > >> >> You can achieve the effect you want by keeping around data for a while
> > >> >> (you determine how long and then discard).
> > >> >>
> > >> >> If a SACK comes and you can recover - fine. If not you either reaccess the
> > >> >> media (if you know how) or reject
> > >> >> and let the initiator retry.
> > >> >>
> > >> >> You should not worry about R/W conflicts as programs bound to have such
> > >> >> conflicts either:
> > >> >>
> > >> >> 1)can live with them or
> > >> >> 2)protect themselves through some locks and rely on "operation-end-status"
> > >> >> to keep results deterministic.
> > >> >>
> > >> >> Regards,
> > >> >> Julo
> > >> >>
> > >> >>
> > >> >>
> > >> >> "Mallikarjun C." <cbm@rose.hp.com> on 27/03/2001 03:34:16
> > >> >>
> > >> >> Please respond to cbm@rose.hp.com
> > >> >>
> > >> >> To:   cbm@rose.hp.com, someshg@yahoo.com, steph@cs.uchicago.edu, Julian
> > >> >>       Satran/Haifa/IBM@IBMIL, John Hufferd/San Jose/IBM@IBMUS
> > >> >> cc:   Black_David@emc.com
> > >> >> Subject:  iSCSI ERT: data SACK/replay buffer/"semi-transport"
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >> Hi Error Recovery Team,
> > >> >>
> > >> >> iSCSI can discard PDUs because of digest errors and request
> > >> >> retransmissions using the iSCSI data SACK.  To deal with such
> > >> >> an eventuality, targets that want to support data SACK have
> > >> >> the following options:
> > >> >>
> > >> >> (A) maintain a complete "replay" buffer for the entire I/O since
> > >> >>   a SACK could come anytime before the status is ack'ed by the
> > >> >>   initiator. [ simple, but extremely expensive in memory resources]
> > >> >>
> > >> >> (B) (re-introduce data-ACKs into the draft, and) implement data-ACKs.
> > >> >>   This enables keeping only those I/O buffers that haven't been ack'ed
> > >> >>   by the initiator. IOW, become a real full transport! [ everyone disliked
> > >> >>   it earlier...]
> > >> >>
> > >> >> (C) re-access the medium for data retransmission requests.  Now there
> > >> >>   are 3 sub-cases in this to handle the changed data on the medium in a
> > >> >>   write-after-read scenario.  (SEE NOTE.1 at the bottom on how it is
> > >> >>   legal.)
> > >> >>      (1) On seeing any write, stall till status is ack'ed for all the
> > >> >>             previous reads (basically drain the pipe). [simple, but incurs
> > >> >>             an additional roundtrip delay for all writes].
> > >> >>      (2) A variation of the above, keep an eye only on the prior
> > >> >>             overlapping reads. [more BW efficient, but complicated to
> > >> >>             resolve the block dependencies in a stream of reads followed
> > >> >>             by writes]
> > >> >>      (3) Document the caveat and leave it up to the applications
> > >> >>             to avoid this case since this leads to data integrity issues.
> > >> >>             [pushing to apps since the transport can't get it right!]
> > >> >>
> > >> >> My first preference is (B), followed by (A), and I suggest we not go
> > >> >> to (C) at all with its inherent dangers.
> > >> >>
> > >> >> Doing (B) naturally completes the transport job that iSCSI has taken
> > >> >> on itself in view of TCP's claimed unreliable checksum.  That is the
> > >> >> right thing to do architecturally instead of being a "semi-transport"!
> > >> >>
> > >> >> Comments?
> > >> >> --
> > >> >> Mallikarjun
> > >> >>
> > >> >>
> > >> >> Mallikarjun Chadalapaka
> > >> >> Networked Storage Architecture
> > >> >> Network Storage Solutions Organization
> > >> >> MS 5668   Hewlett-Packard, Roseville.
> > >> >> cbm@rose.hp.com
> > >> >>
> > >> >>
> > >> >> __________________________________________________________________________
> > >> >> Note.1: A Read followed by a Write (to the same blocks) is perfectly legal
> > >> >>         if SCSI sets the ORDERED task attribute on both the commands AND
> > >> >>         sets the NACA bit to one to indicate that Write shall be executed
> > >> >>         only if the Read did not fail (result in a Check Condition).
> > >> >>
> > >> >>         In the current case, since Read completed just fine from SCSI's
> > >> >>         point of view, SCSI is moving on to execute Write.  Those read buffers
> > >> >>         had been freed up since iSCSI received an ACK at the TCP level, and
> > >> >>         since iSCSI has no other way to have the data ack'ed!