Subject: Re: Multiple TCP connections
From: "HAAGENS,RANDY (HP-Roseville,ex1)" <randy_haagens@hp.com>
To: "IPS (E-mail)" <ips@ece.cmu.edu>
Date: Sat, 5 Aug 2000 23:12:27 -0600

This memo recaps some of the reasons for the iSCSI design committee's choosing multiple TCP connections and the session concept. Also discussed is the question of whether TCP connections should be related directly to LUNs.

We chose to support multiple TCP connections in order to benefit from concurrency in the fabric (primarily) and also in end-node implementations (hardware and software). This is related to the stated requirement for bandwidth aggregation. The notion is that no matter how fast an individual link (100 Mbps, 1 Gbps or 10 Gbps), it will always be desirable to build end nodes and fabrics that can use multiple links, in parallel, for aggregated bandwidth. The existence of the 802.3ad link aggregation standard is evidence that the Ethernet community values bandwidth aggregation.

Unfortunately, 802.3ad-compliant networks will achieve parallel flows on link trunks only for traffic from different "conversations" (see Pat Thaler's memo dated 8/03). Our understanding is that for today, at least, all level-2 and level-3 switches will forward the frames from a single TCP connection over the same link of a multilink trunk. This is because the hash key used to assign a frame to a trunk link is based on a combination of the MAC and IP source and destination addresses, plus the TCP source and destination port numbers. (The more sophisticated the switch, the more of these values may be used in the hash key.) For a single TCP connection, all of these values remain identical for all of the frames in that connection. Hence, all of the frames of that connection will take the same route through the L3 or L2 switched Ethernet fabric.

Pat alludes to the possibility of discriminating at the session layer, where session-layer connection IDs would in fact be different. This doesn't solve the problem, however, because it would result in all the frames of a session taking the same link in a trunk. That's not what we want.

Our understanding is that to leverage existing infrastructure and achieve parallel flows through the Ethernet fabric, we must use different TCP connections (and therefore different port numbers), at the very least. This practice will allow L4 switches to assign different TCP conversations to different 802.3ad links. While we're at it, it's helpful also to use different IP and MAC addresses, so that L3 and L2 switches will also do the right thing.
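To make the trunking behavior concrete, here is a minimal Python sketch of how a frame's flow identity maps to a trunk link. It is purely illustrative -- real switches use their own hash functions and field selections -- but it shows why every frame of a single TCP connection, carrying the same addresses and port numbers, always hashes to the same link, while a second connection that differs only in source port can land on a different one.

    # Illustrative only -- not any particular switch's algorithm.  Every frame
    # of one TCP connection presents the same MAC/IP/port values, so it always
    # hashes to the same trunk link.
    import zlib

    def trunk_link(src_mac, dst_mac, src_ip, dst_ip, sport, dport, n_links):
        """Pick a trunk link from the (assumed) hashed flow fields."""
        key = f"{src_mac}|{dst_mac}|{src_ip}|{dst_ip}|{sport}|{dport}".encode()
        return zlib.crc32(key) % n_links

    # One connection: the tuple never changes, so neither does the link.
    print(trunk_link("00:a0:c9:01", "00:a0:c9:02", "10.0.0.1", "10.0.0.2", 40001, 5003, 4))
    # A second connection differs only in source port, yet may take another link.
    print(trunk_link("00:a0:c9:01", "00:a0:c9:02", "10.0.0.1", "10.0.0.2", 40002, 5003, 4))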
For a moment, assume that the IP/Ethernet fabric were able to support multi-link concurrency for a single TCP stream. Then the question of in-order arrival arises. Unquestionably, in-order arrival would be preferable, as it would ease the TCP segment reassembly process. Arguably, however, out-of-order arrival could be reasonably handled by a TCP hardware engine, provided that the time skew of the arrivals was tightly controlled. (This limits the amount of memory required for the reassembly buffer.) On the other hand, early hardware implementations of TCP will likely assume in-order arrival for the fast path, and escalate to firmware for handling out-of-order segment arrival, which should normally happen only in the error case (dropped segment or IP route change). Allowing the routine arrival of segments out of order is probably not a wise choice.

Alternatively, it's conceivable that switches could be designed to distribute TCP frames across multiple links while maintaining their order at the receiving switch ports. Note that the end nodes, with their multiple links, would also have to participate in such a new aggregation protocol. This new class of switches, if it were to emerge, would make it feasible to consider limiting iSCSI sessions to a single TCP/IP connection, at least for local-area Ethernet fabrics. Similar developments would be required in wide-area switching. Even assuming these developments, one problem would remain: the TCP engine at the two ends of the link would have to handle the aggregated traffic of several links. Aggregating TCP connections into a session instead allows us to deploy multiple TCP engines, typically one per IP address, and requires only that the TCP engine implementation scale with the link speed.

The next question is: given multiple TCP connections per end node, how many should we support? The iSCSI design committee concluded that the right answer was "several". Consider the case of a multiport storage controller (or host computer). To use each of the ports, we certainly need one TCP connection per host per port at a minimum. If 100 host computers, each with 16 connections to the Ethernet fabric, share a storage array that also has 16 connections to the Ethernet fabric, then the storage array needs to support 1600 connections, which is reasonable. If the hosts actually use one connection group (aka "session") for writes and a second one for reads, in order to allow reads that are unordered with respect to those writes, then 3200 connections are needed. That is still reasonable for a large storage array.

Some have suggested a single connection per LU. This might be reasonable for a disk that contains only a single LU. But a storage controller today contains 1024 LUs, and in the future, perhaps 10,000 LUs. Sometimes an LU will be shared between multiple hosts, meaning that the number of connections per LU will be greater than one. Assume that 128 hosts are arranged in 16 clusters of 8 hosts, running a distributed file or database system between them. Then each LU will have to support 8 host connections. Assume further that a second connection per host is needed for asynchronous reads. That makes 16 connections per LU, or 160,000 connections in total. If each connection state record is 64 B (a totally wild guess), this amounts to 10 MB of memory needed for state records. As a point of comparison, first-generation TCP hardware accelerators are planned with support for approximately 1000 connections.
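The connection-count arithmetic above can be laid out explicitly; the 64-byte state record is the memo's own rough guess, carried through here only to show the scale.

    # Worked arithmetic for the connection counts quoted above.

    # Sessions per array port: 100 hosts, 16 fabric ports on the array.
    hosts, array_ports = 100, 16
    print(hosts * array_ports)        # 1600 connections
    print(hosts * array_ports * 2)    # 3200 with a separate read session

    # One connection per LU instead: 10,000 LUs, each shared by an 8-host
    # cluster, two connections per host (writes plus asynchronous reads).
    lus, hosts_per_lu, conns_per_host = 10_000, 8, 2
    total = lus * hosts_per_lu * conns_per_host
    print(total)                      # 160,000 connections
    print(total * 64 / 1e6)           # ~10 MB of 64-byte state records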
As if this weren't bad enough, it turns out that one connection per LU (or two, in the case of asynchronous reads) isn't enough to meet performance requirements. While the large number of TCP connections required for the many LUs certainly will deliver enough aggregate throughput for unrelated traffic, only one (or two) connections are available for a single LU. Bear in mind that for storage controllers, writes to an LU really are writes to storage-controller memory, not to disk. (A background process destages data to disk, typically at a much lower rate than data is delivered to cache, due to the benefits that the cache provides, which are too nuanced to go into here.) Today's storage controllers can absorb write bursts at typically 1 GB (that's gigabyte) per second, which would require the aggregation of eight 1 Gbps Ethernet links. By the time 10 GbE emerges, storage controller bandwidth will have scaled up to the 10 GBps range.

Conclusion: one (or two) TCP connections per LU is both too many (resulting in too much memory devoted to state records) and too few (insufficient bandwidth for high-speed I/O to controller cache). Decoupling the number of TCP connections from the number of LUs is the necessary result. If you still don't buy this argument, consider the evolution to object-based storage, where SCSI LUs are replaced by objects. Objects may be used for the same purposes that LUs are today (to contain a file system, for example); or they may be used to contain file subtrees, individual files, or even file extents. They will be much more numerous than LUs.

iSCSI allows the host to bind n TCP connections together into an iSCSI session, which provides the SCSI "transport" function. The connections of this session typically will use n different Ethernet interfaces and their respective TCP engines. The session is connected to an abstract iSCSI "target", which is a collection of SCSI LUNs named by a URL. Within the session, thousands of I/Os may be outstanding at a given time, involving perhaps 1600 or so LUs (128 hosts are organized into 16 clusters; the 10,000 LUs are divided among the 16 clusters of hosts).

Because the iSCSI session is a SCSI transport, we've chosen to support ordered command delivery within the iSCSI session. SCSI requires this functionality of any transport, so that the SCSI attributes "ordered" and "simple" will have some meaning. This mechanism dates to the SCSI bus (which is a link), which always delivers commands in order. Under the assumption of in-order command delivery, the SCSI device server can meaningfully use the task attributes to control the order of task execution. (Actually, SCSI SAM-2 equivocates on whether ordered command delivery is a requirement; this is probably a compromise to permit FCP-1, which doesn't support ordered command delivery, to be a legal SCSI transport. Notably, FCP-2 has adopted a command numbering scheme similar to our own, for in-order command delivery.)

Command ordering is accomplished by numbering the commands. Command numbering has two additional benefits: (1) we can apply flow control to command delivery, in order to prevent the hosts from overrunning the storage array; (2) we can know, through a cumulative acknowledgement mechanism, that a command has been received at the storage controller. A similar mechanism is used for response message delivery, so that the target can know that its response (status) message was received at the initiator and that command retry will not subsequently be attempted by the host; this permits the target to discard its command replay buffer.
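As a rough illustration of the numbering described above, here is a minimal initiator-side sketch in Python: commands carry a sequence number, the target advertises a window (flow control), and a cumulative acknowledgement lets retry and replay state be released. The names and window mechanics are assumptions for illustration, not the draft's actual fields or state machine.

    # Minimal sketch, not the draft's state machine: numbered commands,
    # a target-advertised window, and cumulative acknowledgement.
    class CommandWindow:
        def __init__(self, window=32):
            self.next_sn = 1        # number given to the next new command
            self.acked_sn = 0       # highest cumulatively acknowledged command
            self.max_sn = window    # highest number the target will accept

        def can_issue(self):
            # Flow control: stop when the target's advertised window is used up.
            return self.next_sn <= self.max_sn

        def issue(self):
            sn = self.next_sn       # carried in the command, delivered in order
            self.next_sn += 1
            return sn

        def on_response(self, cum_ack_sn, new_max_sn):
            # Cumulative ack: all commands up to cum_ack_sn reached the target,
            # so their retry state (and the target's replay copy) can be freed.
            self.acked_sn = max(self.acked_sn, cum_ack_sn)
            self.max_sn = max(self.max_sn, new_max_sn)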
Sequencing of commands was chosen by the design committee after lengthy consideration of an alternative: numbering every iSCSI session-layer PDU. The latter approach actually would have made recovery after TCP connection failure a lot easier, at least conceptually, since it would be handled at the iSCSI PDU (message) level, and not at the higher SCSI task (command) level. But there was a problem in the implementation: the central iSCSI session layer would need to be involved in numbering every iSCSI PDU sent by any of the iSCSI/TCP engines. This would require an undesirable amount of communication between these engines. The method we've chosen requires only that commands be numbered as they leave the SCSI layer, and similarly, that response window variables be updated only when response messages are returned to the SCSI layer. This ensures that iSCSI code in the host will run only when SCSI code runs, during startIO and completion processing.

Randy Haagens
Director, Networked Storage Architecture
Storage Organization
Hewlett-Packard Co.
tel. +1 916 785 4578
e-mail: Randy_Haagens@hp.com