F2F Mtg Summary for Framing and Error Recovery

To: "'ips@ece.cmu.edu'" <ips@ece.cmu.edu>
Subject: F2F Mtg Summary for Framing and Error Recovery
From: "WENDT,JIM (HP-Roseville,ex1)" <jim_wendt@hp.com>
Date: Wed, 18 Jul 2001 12:18:22 -0400
Cc: "'Jeff Chase (E-mail)'" <chase@cs.duke.edu>, "HAAGENS,RANDY (HP-Roseville,ex1)" <randy_haagens@hp.com>, "WENDT,JIM (HP-Roseville,ex1)" <jim_wendt@hp.com>
Content-Type: text/plain;charset="iso-8859-1"
Sender: owner-ips@ece.cmu.edu

Here is a quick summary of the outcome from the June 27-28 Design
Teams Face-to-Face meeting in Palo Alto in the two focal areas:
Framing and Error Recovery. The bulk of the meeting was spent
discussing framing scenarios, requirements, and alternatives.

As soon as the slide decks make it onto Julian's web site,
I'll email out that info.

Regards,
Jim Wendt
Networked Storage Architecture / NSSO
Hewlett-Packard Company
jim_wendt@hp.com 916-785-5198

---------------------------------------------------------------
F2F Meeting Summary for Framing and Error Recovery
Design Teams Face-to-Face Meeting / June 27-28 Palo Alto


---------------- Framing: The Bottom Line ----------------------

To cut to the chase, the following rough proposal was generated
for handling ULP Framing for iSCSI:

A) Proposed changes to "ULP Framing for TCP" I-D are:
    1) Modify I-D to include two framing modes:
        - "Marker mode" for unmodified TCP stacks
        - "PDU-alignment mode" for modified TCP stacks

    2) ULP is responsible for negotiating use of framing protocol
       and enabling framing behavior on the TCP connection in an
       unambiguous manner

    3) The framing protocol usage, framing mode, and framing
       operational parameters are negotiated separately in each
       direction on a TCP connection. Thus there are "Senders"
       and "Receivers" on a framing TCP connection. An iSCSI
       Initiator or Target is both a Sender and a Receiver
       with respect to a framing TCP connection.

    4) ULP is responsible for negotiating use of a specific
       framing mode over the TCP connection by having the 
       receiver request highest framing mode desired from sender
       (first PDU-alignment, then Marker, then none) and having
       the sender comply:
         - if receiver requests, and sender supports, PDU-alignment
           mode, then sender MUST enable PDU-alignment mode
         - else if receiver requests, and sender supports, Marker
           mode, then sender MUST enable Marker mode
         - else don't use framing protocol

    5) ULP is responsible for negotiating framing operational
       parameters:
         - Marker period (in Marker mode)
         - Receiver maximum PDU size (in Marker mode)
         - Framing keys (in PDU-alignment mode)
         - ULP packing behavior (in PDU-alignment mode)

    6) Change the marker fields to be 16-bits rather than 32-bits
       (and refer to as "offsets" rather than "pointers")

    *) An updated version of the "ULP Framing for TCP I-D"
       reflecting these changes has been posted (7/9/01) to TSVWG
       for comments (draft-ietf-tsvwg-tcp-ulp-frame-00)

B) Proposed changes to iSCSI spec are:
    1) Remove Markers appendix from iSCSI spec (Appendix D. Synch
       and Steering with Fixed Interval Markers)

    2) iSCSI spec adds wording to the effect of:
       * iSCSI initiator and target framing behavior over a TCP
         connection is defined in draft-ietf-tsvwg-tcp-ulp-frame-00
         (or eventual RFC#)
       * an iSCSI initiator or target is both a sender and receiver
         with respect to framing behavior
       * an iSCSI framing sender MUST implement Marker mode, and
         MAY implement PDU-alignment mode, as defined in <I-D>
       * an iSCSI framing receiver MAY implement PDU-alignment
         mode, or Marker mode, or both, or none as defined in <I-D>
       * an iSCSI receiver on a framing TCP connection dictates
         use of the highest framing mode desired from sender as
         follows:
         - if receiver requests, and sender supports, PDU-alignment
           mode, then sender MUST enable PDU-alignment mode
         - else if receiver requests, and sender supports, Marker
           mode, then sender MUST enable Marker mode
         - else framing behavior is disabled
       * Perhaps there is some description of probable framing
         scenarios capturing the most likely combinations of
         the following attributes:
         - initiator or target
         - software implementation or hardware implementation
         - unmodified or modified TCP stack
         - sender AND receiver framing behaviors (no framing, 
           or Marker mode, or PDU-Alignment mode)
         - values for framing operational parameters

    3) > Still need to determine iSCSI mechanism for turning on
         Framing protocol Marker mode operation

    4) > Still need to determine iSCSI mechanism for negotiating
         framing operational parameters:
            - Framing mode (if both Marker and PDU Alignment mode
              are supported)
            - Marker period (in Marker mode)
            - Receiver maximum PDU size (in Marker mode)
            - Framing keys (in PDU-alignment mode, if supported)
            - ULP packing behavior (in PDU-alignment mode, if
              supported)
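
As an illustration only, the per-direction mode selection described in
A.4 and B.2 above can be sketched in Python. The names FramingMode and
select_framing_mode are hypothetical, and modeling the receiver's request
as a set of acceptable modes is an assumption; neither the I-D nor the
iSCSI spec defines this interface.

```python
from enum import IntEnum


class FramingMode(IntEnum):
    """Framing modes, ordered from least to most preferred (hypothetical names)."""
    NONE = 0
    MARKER = 1
    PDU_ALIGNMENT = 2


def select_framing_mode(receiver_requests, sender_supports):
    """Return the mode the sender enables for ONE direction of a connection.

    receiver_requests: set of FramingMode values the receiver asks for.
    sender_supports:   set of FramingMode values the sender implements.
    """
    # Try the most preferred mode first, as in the proposal:
    # PDU-alignment, then Marker, then no framing at all.
    for mode in (FramingMode.PDU_ALIGNMENT, FramingMode.MARKER):
        if mode in receiver_requests and mode in sender_supports:
            return mode
    return FramingMode.NONE
```

Because each direction is negotiated separately, an initiator or target
would evaluate this once in its role as sender and once as receiver.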

-----------
The reasoning for these proposed changes is as follows:

    1) Re: Merge Marker mode into "ULP Framing for TCP" I-D
         a) The TCP-related framing work already has mindshare
            in TSVWG and this work is embodied in the current
            framing I-D. Rather than dilute the framing effort 
            with additional I-Ds, all framing related work
            should be collected into a modified version of the
            existing framing I-D.

         b) Other ULPs may also find Marker mode useful in
            software-only unmodified-TCP client scenarios

         c) The framing I-D appears to be a reasonable literary
            vehicle for documenting the collection of framing
            schemes. The I-D could be extended in the future to
            include a byte or word stuffing frame marker method
            such as COBS.

         d) A single framing I-D may help to encourage a single
            consistent interface with the ULP regardless of which
            framing mode is employed.

         e) The iSCSI spec can simply reference the one framing I-D.

    2) Re: Make Marker mode mandatory for all iSCSI implementations,
       and PDU-Alignment mode optional for all iSCSI implementations.

        a) This allows interoperation of software-only,
           UNMODIFIED-TCP-stack clients with hardware-accelerated, 
           small-buffer-memory storage arrays. This applies to both
           1Gbps-client/1Gbps-array and 1Gbps-client/10Gbps-array 
           scenarios.

        b) One potentially compelling application for iSCSI involves 
           software-only implementations on mainstream desktops and 
           laptops operating over unmodified TCP stacks to access 
           centralized storage arrays.

        c) Software implementations are likely to exist far into the 
           future. Individual software-only clients may not operate 
           at 10Gbps, but will be combined together with other clients 
           that aggregate to 10Gbps.

        d) The only framing mechanisms that can operate completely 
           above a client TCP and not require any modification to 
           the client's standard TCP stack are the interval-based 
           (Marker mode, periodic PDU alignment, fixed length PDU)
           and byte-stuffing (COBS) framing schemes. All other 
           framing mechanisms (including PDU-Alignment mode)
           require modification to the client's TCP stack.

        e) The processing overhead for a client software 
           implementation to insert Markers is small compared to 
           the processing overhead of a byte-stuffing scheme.

        f) Receivers are allowed to dictate the sender's framing
           behavior because it is the receiver that is impacted 
           by the presence or absence of framing behavior on the 
           connection. 

        g) Hardware-accelerated receivers can be implemented with 
           minimal buffer memory (that is, relying entirely on 
           framing-based direct data placement) only if it is known 
           in advance that every client the receiver could 
           potentially interoperate with is capable of providing the 
           necessary framing behavior. These hardware-accelerated 
           receivers will request, and expect, that the sender 
           insert markers (or use PDU-Alignment mode if supported). 

        h) Since a software-implemented receiver may incur extra 
           data movements in processing markers, these receivers 
           can request, and expect that, a sender NOT insert 
           markers, if desired.

        i) Marker mode doesn't completely eliminate the need for 
           buffer memory on the receiver. The receiver still needs 
           to use "eddy buffers" that temporarily hold incoming data
           after a dropped segment containing a ULP header up until 
           the next ULP header is located in the packet stream, and 
           which exist for as long as the original ULP header segment 
           is outstanding. But Marker mode does greatly reduce the 
           amount of memory needed as compared to a traditional TCP
           receiver's reassembly memory requirements (often equal to 
           number-of-connections X round-trip-pipe-size). The Marker 
           mode small memory requirements are dependent upon the 
           period of the marker, and the size of the ULP PDUs being 
           restricted to a reasonably small value. The larger that 
           either one is, the larger the eddy buffer memory 
           requirements. Also, an eddy buffer is required each time 
           a ULP header is dropped, so that multiple ULP header drops
           in close proximity may cause multiple eddy buffers to be 
           temporarily pending on a connection.

        j) The PDU-alignment framing mode is preferred. However, it 
           may be several years before all of the different software 
           TCP/IP implementations will be able to support framing 
           behavior.
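
To make the memory comparison in item 2(i) above concrete, here is a
rough Python sketch. The reassembly formula is the one quoted in the
text (number-of-connections X round-trip-pipe-size). The eddy buffer
bound (marker period plus maximum PDU size per outstanding dropped
header) is an assumption chosen for illustration, not a figure from the
meeting; the function names are hypothetical.

```python
def reassembly_bytes(connections, bandwidth_bps, rtt_seconds):
    """Traditional TCP reassembly memory: connections x round-trip pipe size."""
    pipe_bytes = bandwidth_bps * rtt_seconds / 8  # bits in flight -> bytes
    return connections * pipe_bytes


def eddy_buffer_bytes(marker_period, max_pdu_size, pending_header_drops=1):
    """ASSUMED upper bound on eddy buffer memory for one connection.

    One eddy buffer per dropped ULP header that is still outstanding, each
    taken to be no larger than marker_period + max_pdu_size bytes.
    """
    return pending_header_drops * (marker_period + max_pdu_size)


if __name__ == "__main__":
    # 100 connections, each filling a 1Gbps pipe with a 10msec RTT.
    print(reassembly_bytes(100, 1e9, 0.010))
    # 2k marker period and 8k PDUs, two header drops pending at once.
    print(eddy_buffer_bytes(2048, 8192, pending_header_drops=2))
```

Both inputs to the second function grow the result linearly, which
matches the observation that a larger marker period or a larger PDU size
increases the eddy buffer requirement.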

-----------
Open Issues:

    1) Acceptability of the PDU-Alignment framing mode's reliance
       on "key+length" matching across resegmenting middleboxes
         - In PDU-Alignment mode each TCP segment payload contains
           one complete framing PDU (consisting of an 8 byte
           framing header followed by one or more complete ULP
           PDUs). Thus, every TCP segment has the TCP header 
           followed immediately by the framing header.
         - In certain cases a single framing PDU must be broken 
           across multiple TCP segments (such as dynamic Path MTU
           reductions), resulting in TCP segments where a framing
           header doesn't immediately follow the TCP header.
         - The framing I-D defines sender behaviors that allow
           PDU-alignment mode to function deterministically and
           correctly in all cases where the TCP segmentation
           flowing from sender to receiver is not altered.
         - If the TCP segmentation from sender to receiver is
           altered by an intermediary (resegmenting middlebox),
           and a framing-header-containing segment drop or 
           reordering has occurred such that the receiver is
           attempting to locate the next framing header in the
           segment stream, then the receiver must examine the 
           first 8 bytes of each incoming TCP segment payload for
           a valid framing header containing valid Key(6B) and 
           Length(2B) fields.
         - A false-positive occurs if, upon resegmentation by a 
           middlebox, the receiver gets a TCP segment in which
           the first 8 bytes of the payload indicate a valid
           framing header (the first 6 bytes match the
           previously exchanged random key value, and the next 
           2 bytes contain a valid length), yet the TCP segment 
           payload isn't actually a framing header.
         - While it is felt that the probability of a
           false-positive in these resegmenting-middlebox scenarios
            will be sufficiently low, further analysis work may be
            required in this area.
         - Note that this mechanism is NOT a scanning technique
           for locating start-of-frame across an arbitrary byte
           stream. It only provides an indication of PDU
           alignment or not. The first 8 bytes of the TCP segment
           payload are examined to determine if the segment
           contains the start of a ULP PDU.

    2) None of the current framing schemes take TCP data integrity
       into account. It needs to be decided either:
         a) how to detect when a data integrity problem occurs
            within a framing header, and what to do about it 
            (even if it just kills the TCP connection),
         b) or that a sufficient level of data integrity needs
            to be provided for all protocols running over TCP
            via a more holistic approach.

    3) Do Markers work at 10Gbps?
         - The feasibility of markers at 10Gbps has been questioned.
           It would be beneficial to hear specifics regarding why 
           Markers won't work at 10Gbps. Markers don't allow for a
           no-memory direct data placement NIC since eddy-buffers 
           are required. So, support for clients with unmodified 
           TCP stacks comes at a cost, which is the cost of 
           supporting eddy buffers on the NIC.
         - One question is whether the eddy buffers can be contained 
           entirely in the ASIC or need to be in off-chip memory.
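
For open issue 1 above, the "key+length" test on the first 8 bytes of a
TCP segment payload can be sketched as follows. The 6-byte key followed
by a 2-byte length comes from the summary; the big-endian byte order,
the rule that a valid length lies in 1..max_length, and the function
names are assumptions made for this sketch. The probability estimate
further assumes uniformly random payload bytes, which real traffic is
not, so it is only a starting point for the analysis mentioned above.

```python
import struct

FRAMING_HEADER_LEN = 8  # 6-byte key + 2-byte length, per the summary


def looks_like_framing_header(payload, key, max_length):
    """Check the first 8 bytes of a segment payload for a plausible header.

    This is an alignment test on the start of one segment only; it does
    not scan the byte stream for a start-of-frame.
    """
    if len(key) != 6:
        raise ValueError("key must be 6 bytes")
    if len(payload) < FRAMING_HEADER_LEN:
        return False
    # ASSUMPTION: network byte order for the length field.
    candidate_key, length = struct.unpack("!6sH", payload[:FRAMING_HEADER_LEN])
    # ASSUMPTION: a length is "valid" when it falls in 1..max_length.
    return candidate_key == key and 1 <= length <= max_length


def false_positive_probability(max_length):
    """Chance that 8 uniformly random bytes pass the check above."""
    key_match = 2.0 ** -48              # all 48 key bits must match
    length_ok = max_length / 2 ** 16    # fraction of 16-bit values accepted
    return key_match * length_ok
```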


---------------- Error Recovery: The Bottom Line ----------------------

1) Information was presented regarding estimated iSCSI header and
   data digest error rates, and possible approaches to iSCSI
   error recovery. The error rates info is summarized as follows:

        a) "Good Internet"
              - 1500 byte MTU / 8192 byte iSCSI PDU
              - TCP checksum mismatch 1 in 90,000
              - Checksum escape 1 in 135M to 1 in 10B
              - For bandwidth of 30Mbps @ 100msec RTT
                  > 8 to 600 digest errors per year
                  > 1 header digest error every 2 months to 10 years
              - For bandwidth of 300Mbps @ 10msec RTT
                  > 80 to 6,000 digest errors per year
                  > 1 to 70 header digest errors per year

        b) "Bad Internet"
              - 1500 byte MTU / 8192 byte iSCSI PDU
              - TCP checksum mismatch 1 in 11,000
              - Checksum escape 1 in 16M to 1 in 1B
              - For bandwidth of 10Mbps @ 100msec RTT
                  > .5 to 33 digest errors per week
                  > 26 to 1,650 digest errors per year
                  > 0.3 to 20 header digest errors per year
              - For bandwidth of 100Mbps @ 10msec RTT
                  > 5 to 335 digest errors per week
                  > 260 to 16,500 digest errors per year
                  > 3 to 200 header digest errors per year

        c) 1Gbps iSCSI connection
              - assuming current TCP yields 70% bandwidth utilization
              - at 100msec
                  > packet loss less than 1 in 50M
                  > .5 to 40 digest errors per year
                  > 1 header digest error every 2..100 years
              - at 10msec
                  > packet loss less than 1 in 500,000
                  > 60 to 4,000 digest errors per year
                  > 1 to 40 header digest errors per year

        d) 10Gbps iSCSI connection
              - assuming current TCP yields 70% bandwidth utilization
              - at 100msec
                  > packet loss less than 1 in 5 billion
                  > 1 digest error every 3 mo to 10 years
                  > 1 header digest error every 20 to 1000 years
              - at 10msec
                  > packet loss less than 1 in 50M
                  > 6 to 400 digest errors per year
                  > 1 header digest error every 3 mo to 12 years

        e) The frequency of framing header corruption escaping the
           TCP checksum mechanism is on the order of the frequency
           of the iSCSI header escaping, but depends on the 
           mechanism used as well as the MSS and iSCSI PDU sizes:
             - Markers (at 2k intervals) - 1/3 as likely as iSCSI 
               headers.
             - Framing (w/o chunking) - 1.5 times as likely
             - Framing (w/ chunks) - 1/2 as likely as iSCSI headers.
           These assumed an 8k iSCSI PDU size, except for Framing
           (w/o chunking), which assumed a 1k iSCSI PDU to fit in a 
           single segment.  All schemes had a framing header size 
           of 8 bytes, and assumed an MSS of 1460.
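
The digest error counts in (a) and (b) above appear to follow from
simple arithmetic: packets sent per year multiplied by the checksum
escape probability. The Python sketch below is a reconstruction under
that assumption (a fully utilized link sending MTU-sized packets); it is
not taken from the spreadsheet, the function name is hypothetical, and
it does not model the header digest figures.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def digest_errors_per_year(bandwidth_bps, mtu_bytes, escapes_one_in):
    """Expected digest errors per year on a fully used link.

    escapes_one_in: one packet in this many carries an error that the TCP
    checksum fails to catch (the "checksum escape" figure).
    """
    packets_per_second = bandwidth_bps / (8 * mtu_bytes)
    return packets_per_second * SECONDS_PER_YEAR / escapes_one_in


if __name__ == "__main__":
    # "Good Internet": 30Mbps, 1500 byte MTU, escape 1 in 10B and 1 in 135M.
    print(digest_errors_per_year(30e6, 1500, 10e9))
    print(digest_errors_per_year(30e6, 1500, 135e6))
```

With the "Good Internet" inputs this gives about 7.9 and about 584
errors per year, in line with the "8 to 600" range quoted above. The
"Bad Internet" inputs (10Mbps, 1 in 1B and 1 in 16M) give about 26 and
about 1,640.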

2) No definitive conclusions were reached during the F2F with regard
   to Error Recovery mechanisms.
      - Further work needs to be done in this area.
      - Mallikarjun Chadalapaka, Mark Bakke, and others can help
        move the work forward in this area.

3) It would be valuable to collect information regarding TCP
   checksum mismatch rates on production systems. If anyone has
   access to fairly busy systems and can collect the following 
   information, you can forward it to Mark Bakke (mbakke@cisco.com). 
   You'll want to collect three data items:
      a) sysUpTime
      b) tcpInSegs - total number of inbound TCP segments
      c) tcpInErrs - total number of inbound TCP segments with errors
         (most likely checksum mismatches, but some implementations 
         may count other error discards here as well)
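
One way to turn two samples of these counters into a "1 in N" mismatch
figure comparable to the estimates above is sketched below in Python.
The function names are hypothetical, and the sketch assumes the counters
are 32-bit values that wrapped at most once between samples; sysUpTime
is used only to notice that the system restarted between samples.

```python
COUNTER32_MODULUS = 2 ** 32  # ASSUMPTION: 32-bit counters, at most one wrap


def counter_delta(start, end):
    """Difference between two counter readings, tolerating a single wrap."""
    return (end - start) % COUNTER32_MODULUS


def checksum_mismatch_one_in(sample_start, sample_end):
    """Return N such that roughly 1 in N inbound segments was in error.

    Each sample is a tuple (sysUpTime, tcpInSegs, tcpInErrs).
    """
    up0, segs0, errs0 = sample_start
    up1, segs1, errs1 = sample_end
    if up1 <= up0:
        raise ValueError("sysUpTime did not advance; system may have restarted")
    segs = counter_delta(segs0, segs1)
    errs = counter_delta(errs0, errs1)
    if errs == 0:
        return float("inf")  # no errors seen in this interval
    return segs / errs
```

As noted above, some implementations count other error discards in
tcpInErrs, so the result is an upper bound on the checksum mismatch rate.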

---------------- Slide Decks ------------------------------------

Sorry, but these materials aren't on the web yet. Hopefully
they will be in the next week or two. I'll email when they are 
available on a server somewhere.

* "iSCSI Framing Presentation" - slides/spreadsheet - Matt Wakeley
* "TCP Framing Discussion" - slides - Jim Wendt
* "Recovering From iSCSI Digest Errors" - slides - Mark Bakke
* "Expected iSCSI digest error rates on Internet connections"
  - spreadsheet - Mark Bakke
* CRC and checksum performance - slides - Jonathan Stone

---------------- Framing Discussion Summary ----------------------

/iSCSI usage scenarios
A wide variety of usage scenarios was strongly represented:
* High-speed short-distance storage LANs
* High-speed long-distance storage WANs
* Multitudes of low-end clients using software iSCSI client 
  implementations and unmodified software TCP stacks
* Multiple first-generation 1Gbps clients aggregating to 
  next-generation 10Gbps storage arrays
* A variety of IP networks and paths with potential for both 
  TCP-level resegmenting middleboxes and dynamic changes in Path MTU.

/Memory-based solutions
* It was felt that 1Gbps memory-based solutions are feasible and 
  may be cost-competitive (e.g. there is no usage of direct data 
  placement nor framing mechanisms in this case)
* There were different opinions regarding whether 10Gbps memory-based 
  solutions would be feasible or cost-competitive.
* There was discussion regarding the comparative cost of memory-based 
  and no-memory iSCSI HBAs and infrastructure relative to Fibre Channel
* There were concerns regarding next-generation 10Gbps storage arrays 
  that want to support first-generation clients. The next-generation 
  10Gbps storage arrays can only implement no-memory solutions if the
  first-generation clients were mandated to implement support for 
  framing (thus making direct data placement possible on the storage 
  array).
* Hybrid schemes were discussed where a next-generation 10Gbps 
  storage array would contain a moderate amount of memory to handle
  non-framing first-generation clients while using full framing and
  direct data placement with no memory buffers for 10Gbps clients
* Matt Wakeley has created a spreadsheet for high-speed memory 
  subsystem costs

/Direct data placement alternatives
* Discussion of various levels at which direct data placement 
  information can ride:
    - Above TCP (iSCSI task tags, RDMA protocol)
    - At transport (TCP RDMA option)
    - Below transport (TAF)

/iSCSI layering scenarios and evolution
* iSCSI can be layered:
    - over normal TCP
    - over Markers over normal TCP
    - over Framing TCP
    - over RDMA+chunking over Framing TCP
* Layering iSCSI over RDMA+chunking doesn't seem likely for 
  first-generation iSCSI implementations

/Framing alternatives
* Framing mechanism classes:
    - Intervalic (Periodic Markers, Periodically aligned headers,
      Fixed size ULP PDUs)
    - Framing-aware TCP (a la "ULP Framing for TCP" I-D)
    - TCP message boundary indications (Reserved bit, TCP option, 
      URG pointer, PSH bit, etc)
    - Byte stuffing (COBS, 7B/8B, etc)
* Framing mechanism characteristics:
    - sender TCP modifications required?
    - receiver memory requirements (full TCP receive window,
      eddy buffers, IP reassembly buffers)
    - level of TCP changes (none, behavioral, header fields)
    - support ULP PDU > TCP MSS
    - software processing overhead
    - hardware implementation complexity
    - handle dynamic Path MTU changes
    - handle resegmenting TCP middlebox
    - require [dynamic] chunking above TCP
    - emit short segments more often than typical
    - added protocol bytes overhead
    - tied to TCP sequence number processing
    - increase probability of segment drops
    - TCP aesthetics

/Markers and ULP Framing merge
* Proposal to merge Marker mechanism into current "ULP Framing for 
  TCP I-D" and have iSCSI mandate implementation of Marker mode
* See "Framing: The Bottom Line" section above

---------------- Error Recovery Discussion Summary ---------------

/Mark Bakke slides and spreadsheet
* Mark presented slides and a spreadsheet discussing:
    - expected iSCSI header and data digest error rates given 
      link bandwidth, RTT, probability of segment drop, and 
      probability of TCP checksum escape
    - recommended iSCSI error handling approaches for Header
      digest and data digest errors
    - <slides link>
    - <spreadsheet link>

/Discussion regarding iSCSI error recovery complexity
* It was felt that 90% of the recovery complexity already exists
  for the sake of session recovery (after a TCP connection failure)
  and if only "within-command" recovery was eliminated, it wouldn't 
  substantially simplify the protocol or its specification. 
  However, this assertion needs to be validated.
* It was felt that complete command recovery would probably be
  adequate for the expected error incidence (not a noticeable impact), 
  but it hasn't been shown how adopting this approach would reduce
  complexity.

/Discussion re: IPSec SA's
* There was some discussion regarding use of IPSec and concerns
  that the set of Security Associations would not fit into on-chip
  memory, forcing the SAs to be cached in off-chip memory.

/Jonathan Stone slides
* Jonathan presented data analysis from his soon-to-be-completed
  dissertation regarding the nature of empirically-observed 
  transport-level errors, and the error detection performance of 
  CRC and checksum algorithms on such.

---------------- List of Attendees -----------------------

Mark Bakke, Stephen Bailey, Uri Elzur, Somesh Gupta, Randy Haagens, 
John Hufferd, Jim Pinkerton, Venkat Rangan, Allyn Romanow, 
Costa Sapuntzakis, Julian Satran, Jonathan Stone, Matt Wakeley, 
Jim Wendt, Jim Williams
