
    Re: Multiple TCP connections



    From: "HAAGENS,RANDY (HP-Roseville,ex1)" <randy_haagens@hp.com>
    To: "IPS (E-mail)" <ips@ece.cmu.edu>
    Subject: Re: Multiple TCP connections
    Date: Sat, 5 Aug 2000 23:12:27 -0600
    
    
    This memo recaps some of the reasons for the iSCSI design committee's
    choice of multiple TCP connections and the session concept.  Also
    discussed is the question of whether TCP connections should be related
    directly to LUNs.
    
    We chose to support multiple TCP connections in order to benefit from
    concurrency in the fabric (primarily) and also in end node implementations
    (hardware and software).  This is related to the stated requirement for
    bandwidth aggregation.  The notion is that no matter how fast an individual
    link (100 Mbps, 1Gbps or 10 Gbps), it will always be desirable to build end
    nodes and fabrics that can use multiple links, in parallel, for aggregated
    bandwidth.
    
    The existence of the 802.3ad link aggregation standard is evidence that the
    Ethernet community values bandwidth aggregation.  Unfortunately,
    802.3ad-compliant networks will achieve parallel flows on link trunks only
    for traffic from different "conversations" (see Pat Thaler's memo dated
    8/03).  Our understanding is that for today, at least, all level-2 and -3
    switches will forward the frames from a single TCP connection over the same
    link of a multilink trunk.  This is because the hash key used to assign a
    frame to a trunk is based on a combination of the MAC and IP source and
    destination addresses, plus the TCP source and destination port numbers.
    (The more sophisticated the switch, the more of these values may be used in
    the hash key.)  For a single TCP connection, all of these values remain
    identical for all of the frames in that connection.  Hence, all of the
    frames of that connection will take the same route through the L3 or L2
    switched Ethernet fabric.
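
    To see why, consider a rough sketch of such a hash (Python, purely
    illustrative -- real switches use their own hash functions, and the
    addresses and port numbers below are made up):

        import hashlib

        def trunk_link(src_mac, dst_mac, src_ip, dst_ip,
                       src_port, dst_port, num_links):
            # Hash the fields a switch might use to choose a link in the trunk.
            key = f"{src_mac}{dst_mac}{src_ip}{dst_ip}{src_port}{dst_port}"
            digest = hashlib.md5(key.encode()).digest()
            return int.from_bytes(digest[:4], "big") % num_links

        # Every frame of one TCP connection carries identical field values,
        # so every frame is assigned to the same link:
        print(trunk_link("a", "b", "10.0.0.1", "10.0.0.2", 40001, 5003, 4))
        print(trunk_link("a", "b", "10.0.0.1", "10.0.0.2", 40001, 5003, 4))

        # A second TCP connection (different source port) may hash to a
        # different link; that is what gives us parallel flows:
        print(trunk_link("a", "b", "10.0.0.1", "10.0.0.2", 40002, 5003, 4))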
    
    Pat alludes to the possibility of discriminating at the session layer,
    where session-layer connection IDs would in fact be different.  This
    doesn't solve the problem, however, because it would result in all the
    frames of a session taking the same link in a trunk.  That's not what we
    want.
    
    Our understanding is that to leverage existing infrastructure, and achieve
    parallel flows through the Ethernet fabric, we must use different TCP
    connections (and therefore different port numbers), at the very least.
    This practice will allow L4 switches to assign different TCP conversations
    to different 802.3ad links.  While we're at it, it's helpful also to use
    different IP and MAC addresses, so that L3 and L2 switches also will do
    the right thing.
    
    For a moment, assume that the IP/Ethernet fabric were able to support
    multi-link concurrency for a single TCP stream.  Then, the question of
    in-order arrival occurs.  Unquestionably, in-order arrival would be
    preferable, as it would ease the TCP segment re-assembly process.
    Arguably, however, out-of-order arrival could be reasonably handled by a
    TCP hardware engine, provided that the time skew of the arrivals was
    tightly controlled.  (This limits the amount of memory required for the
    reassembly buffer.)  On the other hand, early hardware implementations of
    TCP will likely assume in-order arrival for the fast-path implementation,
    and escalate to firmware for handling out-of-order segment arrival, which
    should normally happen only in the error case (dropped segment or IP route
    change).  Allowing the routine arrival of segments out of order is
    probably not a wise choice.
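
    As a back-of-the-envelope illustration of why the skew bound matters (the
    figures below are assumptions for the sake of argument, not anything
    specified by iSCSI):

        # The reassembly buffer must hold roughly link_rate * max_skew
        # worth of data per stream.
        link_rate_bps = 1e9      # assume a 1 Gbps link
        max_skew_s    = 0.001    # assume arrival skew is bounded at 1 ms
        buffer_bytes  = link_rate_bps / 8 * max_skew_s
        print(buffer_bytes)      # 125000.0 -> about 125 KB per stream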
    
    Alternatively, it's conceivable that switches could be designed that would
    distribute TCP frames across multiple links, while maintaining the order of
    their reassembly at the receiving switch ports.  Note that the end nodes,
    with their multiple links, would also have to participate in such a new
    aggregation protocol.  This new class of switches, if they were to emerge,
    would make it feasible to consider limiting iSCSI sessions to a single
    TCP/IP connection, at least for local area Ethernet fabrics.  Similar
    developments would be required in wide-area switching.  Even assuming these
    developments, one possible problem would remain: the TCP engine at the two
    ends of the link would have to handle the aggregated traffic of several
    links.  Aggregating TCP connections into a session allows us to deploy
    multiple TCP engines, typically one per IP address, and requires only that
    the TCP engine implementation scales with the link speed.
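
    To make the structural idea concrete, here is a minimal sketch (Python,
    purely illustrative; the class names and the round-robin policy are mine,
    not part of any iSCSI draft) of a session that binds several connections,
    each living on its own interface, IP address and TCP engine:

        class Connection:
            def __init__(self, local_ip, remote_ip):
                self.local_ip = local_ip     # one IP (and TCP engine) per link
                self.remote_ip = remote_ip

            def send(self, pdu):
                print(f"{self.local_ip} -> {self.remote_ip}: {pdu}")

        class Session:
            """Binds n connections into one SCSI transport."""
            def __init__(self, connections):
                self.connections = connections
                self.next = 0

            def send_command(self, pdu):
                # Any connection of the session may carry a command; round-robin
                # here simply stands in for a real scheduling policy.
                conn = self.connections[self.next % len(self.connections)]
                self.next += 1
                conn.send(pdu)

        session = Session([Connection("192.168.1.10", "192.168.1.100"),
                           Connection("192.168.2.10", "192.168.2.100")])
        session.send_command("WRITE LUN 3")
        session.send_command("READ LUN 7")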
    
    The next question is, given multiple TCP connections per end node, how
    many should we support?  The iSCSI design committee concluded that the
    right answer was "several".  Consider the case of a multiport storage
    controller (or host computer).  To use each of the ports, we certainly
    need one TCP connection per host per port at a minimum.  If 100 host
    computers, each with 16 connections to the Ethernet fabric, share a
    storage array that also has 16 connections to the Ethernet fabric, then
    the storage array needs to support 1600 connections, which is reasonable.
    If the hosts actually use one connection group (aka "session") for writes,
    and a second one for reads, in order to allow reads that are unordered
    with respect to those writes, then 3200 connections are needed.  Still
    reasonable for a large storage array.
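
    Spelled out, the arithmetic behind those figures (Python, just to make the
    scaling explicit):

        hosts                = 100
        connections_per_host = 16               # one per host Ethernet port
        print(hosts * connections_per_host)     # 1600 connections at the array
        print(hosts * connections_per_host * 2) # 3200 with separate read and
                                                # write sessions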
    
    Some have suggested a single connection per LU.  This might be reasonable
    for a disk that contains only a single LU.  But a storage controller today
    contains 1024 LUs, and in the future, perhaps 10,000 LUs.  Sometimes an LU
    will be shared between multiple hosts, meaning that the number of
    connections per LU will be greater than one.  Assume that 128 hosts are
    arranged in 16 clusters of 8 hosts, running a distributed file or database
    system between them.  Then each LU will have to support 8 host
    connections.  Assume further that a second connection per host is needed
    for asynchronous reads.  That makes 16 connections per LU, or 160,000
    connections in total.  If each connection state record is 64 B (a totally
    wild guess), this amounts to 10 MB of memory needed for state records.  As
    a point of comparison, first-generation TCP hardware accelerators are
    planned with support for approximately 1000 connections.
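
    Working through those numbers:

        lus                = 10_000
        connections_per_lu = 16     # 8 cluster hosts x 2 (writes + async reads)
        state_record_bytes = 64     # the "totally wild guess" above
        total_connections = lus * connections_per_lu
        print(total_connections)                             # 160000
        print(total_connections * state_record_bytes / 1e6)  # ~10 MB of state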
    
    If this weren't bad enough, it turns out that one (or two, in the case of
    asynch reads) connection per LU isn't enough to meet performance
    requirements.  While the large number of TCP connections required for the
    many LUs certainly will deliver enough aggregate throughput for unrelated
    traffic, only one (or two) connections are available for a single LU.
    Bear in mind that for storage controllers, writes to an LU really are
    writes to storage controller memory, and not to disk.  (A background
    process destages data to disk, typically at a much lower rate than data is
    delivered to cache, due to the benefits that the cache provides, which are
    too nuanced to go into here.)  Today's storage controllers can absorb
    write bursts at typically 1 GB (that's Gigabyte) per second, which would
    require the aggregation of eight 1 Gbps Ethernet links.  By the time 10
    GbE emerges, storage controller bandwidth will have scaled up to the 10
    GBps range.
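
    For the bandwidth side of the argument:

        burst_rate_Bps = 1e9    # 1 GB/s of write burst into controller cache
        link_rate_bps  = 1e9    # one Gigabit Ethernet link
        print(burst_rate_Bps * 8 / link_rate_bps)  # 8.0 links must be aggregated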
    
    Conclusion: one (or two) TCP connections per LU is both too many (resulting
    in too much memory devoted to state records) and too few (insufficient
    bandwidth for high-speed IO to controller cache).  Decoupling the number of
    TCP connections from the number of LUs is the necessary result.
    
    If you still don't buy this argument, consider the evolution to
    object-based storage, where SCSI LUs are replaced by objects.  Objects may
    be used for the same purposes that LUs are today (to contain a file
    system, for example); or they may be used to contain file subtrees,
    individual files, or even file extents.  They will be much more numerous
    than LUs.
    
    iSCSI allows the host to bind n TCP connections together into an iSCSI
    session, which provides the SCSI "transport" function.  The connections of
    this session typically will use n different Ethernet interfaces and their
    respective TCP engines.  The session is connected to an abstract iSCSI
    "target", which is a collection of SCSI LUNs named by a URL.  Within the
    session, thousands of IOs may be outstanding at a given time, involving
    perhaps 1600 or so LUs (128 hosts are organized into 16 clusters; the
    10,000 LUs are divided among the 16 clusters of hosts).
    
    Because the iSCSI session is a SCSI transport, we've chosen to support
    ordered command delivery within the iSCSI session.  SCSI requires this
    functionality of any transport, so that the SCSI attributes "ordered" and
    "simple" will have some meaning.  This mechanism dates to the SCSI bus
    (which is a link), which always delivers commands in order.  Under the
    assumption of in-order command delivery, the SCSI device server can
    meaningfully use the task attributes to control the order of task
    execution.  (Actually, SCSI SAM-2 equivocates on whether ordered command
    delivery is a requirement; this is probably a compromise to permit FCP-1,
    which doesn't support ordered command delivery, to be a legal SCSI
    transport.  Notably, FCP-2 has adopted a command numbering scheme similar
    to our own, for in-order command delivery.)
    
    Command ordering is accomplished by numbering the commands.  Command
    numbering has two additional benefits: (1) we can apply flow control to
    command delivery, in order to prevent the hosts from overrunning the
    storage array; (2) we can know, through a cumulative acknowledgement
    mechanism, that a command has been received at the storage controller.  A
    similar mechanism is used for response message delivery, so that the
    target can know that its response (status) message was received at the
    initiator, and that command retry will not be subsequently attempted by
    the host.  This permits the target to discard its command replay buffer.
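
    The mechanism can be pictured as an ordinary sliding window.  A minimal
    sketch follows (the names are generic, not the exact field names of the
    draft): the initiator stamps a number on each command; the target returns
    the next number it expects (a cumulative acknowledgement) and the highest
    number it is currently willing to accept (flow control).

        class CommandWindow:
            def __init__(self):
                self.next_cmd = 1    # number to stamp on the next command
                self.exp_cmd  = 1    # target's cumulative acknowledgement
                self.max_cmd  = 32   # highest number the target will accept

            def can_send(self):
                return self.next_cmd <= self.max_cmd

            def issue(self):
                # Called when a command leaves the SCSI layer.
                assert self.can_send(), "would overrun the storage array"
                n = self.next_cmd
                self.next_cmd += 1
                return n             # carried in the command PDU

            def on_response(self, exp_cmd, max_cmd):
                # Everything numbered below exp_cmd is known to have arrived,
                # so the corresponding retry state can be discarded.
                self.exp_cmd = exp_cmd
                self.max_cmd = max_cmd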
    
    Sequencing of commands was chosen by the design committee after lengthy
    consideration of an alternative: numbering every iSCSI session-layer PDU.
    The latter approach actually would have made recovery after TCP connection
    failure a lot easier, at least conceptually, since it would be handled at
    the iSCSI PDU (message) level, and not at the higher SCSI task (command)
    level.  But there was a problem in the implementation: the central iSCSI
    session layer would need to be involved in numbering every iSCSI PDU sent
    by any of the iSCSI/TCP engines.  This would require an undesirable amount
    of communication between these engines.  The method we've chosen requires
    only that commands be numbered as they leave the SCSI layer, and
    similarly, that response window variables be updated only when response
    messages are returned to the SCSI layer.  This ensures that iSCSI code in
    the host will run only when SCSI code runs, during startIO and completion
    processing.
    
    R
    
    Randy Haagens
    Director, Networked Storage Architecture
    Storage Organization
    Hewlett-Packard Co.
    tel. +1 916 785 4578
    e-mail: Randy_Haagens@hp.com
    
    
    

