RE: An IPS Transport Protocol (was A Transport Protocol Without ACK)

> From: stewrtrs@stewart.chicago.il.us
>
> Any transport protocol proposal is ok, as long as it can be seen and
> reviewed. So far I have seen only two: TCP and SCTP.
>
> Oh, a little side note: any transport protocol proposed MUST be able to
> show TCP-like behavior in the face of congestion. And I think, IMHO, that
> this means that if it is NOT using RFC2581 procedures it MUST show that
> it does back off and share with TCP. It also has a HEAVY burden of proof
> to show this facility, at least in my mind, and I would think in the
> IESG's mind as well...

I will try to describe a transport protocol for iSCSI herein. This proposal addresses RFC2581-style congestion management as well as queuing and resource management for iSCSI initiator and target devices. I will call it the IPS (IP Storage) Protocol, a hybrid between the FCP protocol of Fibre Channel and the TCP protocol of IP. The way this email is written, it is not a formal proposal by any stretch of the imagination. I am a career adapter designer and I don't do RFCs or windows and floors, so if, in describing this IPS Protocol, I misuse any words that have specific meanings in the RFCs, my sincere apology to this working group.

Herein I assume the iSCSI IETF effort can be broken into two parts: one for mapping a SCSI request and response to one or more iSCSI PDUs, and another for accommodating a transport protocol such as TCP, SCTP, or this proposed protocol, IPS. This proposal addresses the second effort. If this assumption is wrong, hit the delete key now so you won't waste any more time.

1. The Needs

Signal propagation takes about 5 us per kilometer, or 8 us per mile. With 3000 miles between New York and Los Angeles, the round-trip time (RTT) is 3000 x 8 x 2, or 48 msec, not counting the queuing and delays in the switches and routers. Compared with the latency of just a few microseconds for locally attached devices, an iSCSI device needs an appropriate transport protocol that deals with this long latency in order to be a meaningful alternative. Furthermore, congestion in the Internet, which drops and duplicates datagrams, demands efficient and reliable error detection and retransmission. Finally, given that TCP/IP is a well accepted and proven transport protocol, iSCSI must support TCP/IP.

2. Executive Summary

For those who do not have time to read this long posting: this IPS proposal describes the processing -- both creating and parsing -- of an iSCSI PDU encapsulated within an Internet TCP/IP datagram. Hence the proposal complements the current IETF effort that defines the iSCSI PDUs. An iSCSI PDU starts with a media header such as Ethernet or Fibre Channel, followed by an IP header, a TCP header, an iSCSI header, and, finally, the data payload with a CRC. An iSCSI service provider, either an iSCSI driver running on top of a plain old-fashioned NIC or a sophisticated Fibre-Channel-like iSCSI adapter with a large amount of microcode and local memory, performs this protocol processing. This proposal describes the processing -- the semantics -- that addresses the iSCSI needs above. Since the iSCSI PDU has a TCP/IP header, this proposal does not preclude the use of the TCP/IP protocol for iSCSI.
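As a rough illustration of that encapsulation, here is a minimal C sketch of one PDU as it might sit in an adapter's buffer. The field names and fixed sizes (Ethernet header, IPv4 and TCP headers without options, a 48-byte iSCSI header, one 2K payload) are my own assumptions for illustration, not taken from any draft.

    /* Illustrative on-the-wire layout of one iSCSI PDU, per the summary above. */
    struct ips_pdu {
        unsigned char media_hdr[14];  /* media header, e.g. Ethernet MAC header */
        unsigned char ip_hdr[20];     /* IPv4 header, no options                */
        unsigned char tcp_hdr[20];    /* TCP header, no options                 */
        unsigned char iscsi_hdr[48];  /* iSCSI basic header segment             */
        unsigned char payload[2048];  /* data payload, one 2K segment           */
        unsigned int  crc;            /* CRC over the payload                   */
    };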
This IPS protocol addresses congestion management in the spirit of RFC2581, which describes the "good citizenship" behavior of a protocol: how to start up and how to retransmit data segments on a busy network. This protocol modifies RFC2581 to deal with the long Internet latency in the delivery of datagrams while still ensuring efficient and reliable delivery. By borrowing some ideas from Fibre Channel adapters, which are now targeted at 50,000 I/Os per second, this protocol also describes the creation of an exchange table that handles thousands of concurrent iSCSI requests and responses without deadlock or resource-allocation problems.

3. Terms

A segment -- a term used in RFC2581; the same as an iSCSI PDU.
ACK and ACK-0 -- an acknowledge PDU. For ACK-0, refer to the FC-PH spec.
An Exchange -- roughly like a session as defined by the working group, except that it is executed on a single TCP connection.
An iSCSI Request/Response Message -- an API call to an iSCSI provider describing the sending/receiving of an iSCSI request/response.
BB-Credit -- refer to the FC-PH spec.
cwnd and rwnd -- the congestion and receive windows, terms used in RFC2581. They have the same value in this protocol.
SOCKET, CONNECT, BIND system calls -- same meaning as in the TCP/IP implementation.
Delay Constant -- the time inserted between transfers of consecutive sequences.
Data Descriptors -- a memory handle or a scatter/gather list in an iSCSI request/response, used for sending/receiving segments.
DMA -- Direct Memory Access, used to transfer iSCSI data payloads to/from iSCSI application software using the data descriptors inside an iSCSI request/response message.
EE-Credit -- refer to the FC-PH spec.
Exchange ID -- OX_ID and RX_ID; please refer to the FC-PH spec.
iSCSI Provider -- an iSCSI driver together with an old-fashioned NIC adapter or a modern super-fast iSCSI adapter.
iSCSI PDU -- as defined by this working group.
Sequences -- an exchange has many sequences, each of which has many segments.
Tag Queuing -- refer to the SCSI SAM spec.
TCP Connection -- an IP address and TCP port pair that uniquely identifies an application process that transmits/receives iSCSI PDUs.
Retransmission -- the part of error recovery that retransmits a lost sequence.

4. Congestion Management

RFC2581 is not specific to TCP. It should be followed by every transport protocol sharing the network, although the authors of the RFC based their experiments and conclusions on TCP. The RFC covers four specific topics: slow start, congestion avoidance, fast retransmit, and fast recovery. If other protocols on the network do not follow the same rules, then while a TCP client/server using slow start waits patiently on a congested network, the other protocols will continue to flood the network with new data segments, defeating the congestion management.

RFC2581 is definitely not the best thing for a network with extremely long latency. Let me use an example to describe the problem before describing the solution. Assume the round-trip time between two iSCSI devices in N.Y. and L.A. is 50 msec, and assume the data segment size is 2K. Using the slow start algorithm of RFC2581, a sender sends only two segments at the beginning and waits for the ACKs before increasing its cwnd. After waiting 50 msec, the sender increases its cwnd to 3, sends 3 segments, and waits again. On a not-so-busy network, to send one MB of data, or about 500 segments, the sender, being a good citizen on the network, will repeat this wait about 32 times before all 500 2K segments are sent. The total time for delivering one MB of data is 50 msec times 32, or about 1.6 seconds.
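A quick back-of-the-envelope check of that figure, as a minimal C sketch. It assumes, per the example above, a 50 msec RTT, 2K segments, an initial cwnd of two segments, and a cwnd that grows by one segment per round trip; nothing here comes from RFC2581 itself.

    #include <stdio.h>

    int main(void)
    {
        int rtt_ms = 50;           /* assumed N.Y. to L.A. round-trip time   */
        int total_segments = 500;  /* one MB of data in 2K segments          */
        int cwnd = 2;              /* initial congestion window, in segments */
        int sent = 0, rounds = 0;

        while (sent < total_segments) {
            sent += cwnd;          /* send one window, then wait one RTT     */
            cwnd += 1;             /* the window grows by one per round trip */
            rounds++;
        }
        /* Prints 31 rounds and 1550 msec: about 1.6 seconds, in line with
         * the estimate above. */
        printf("%d rounds, %d msec total\n", rounds, rounds * rtt_ms);
        return 0;
    }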
One may argue that, given enough time, the cwnd can be increased to 500 and the whole one MB of data can be transferred at once. However, on any lost packet or out-of-order delivery -- which we assume happens often and is the reason for having slow start in the first place -- the sender, seeing the duplicate ACKs, slows down immediately by reducing its cwnd quickly. Furthermore, the RFC also calls for a slow start after some idle time, because the congestion status of the network is no longer known after an idle period. In this super-fast Internet era, when we are designing adapters to process each Fibre Channel request in 20 microseconds and deliver 50,000 I/Os per second, the 50 msec waits and the 1.6 seconds for moving one MB of data under slow start simply sound awful. The problem becomes much worse when the MTU is not 2K but is reduced to 512 bytes; in that case there are 2000 segments in a one MB transfer. I don't need to challenge your imagination for the case where iSCSI is used to back up one TB of data.

Now the solution. The IPS protocol breaks the 1 MB of data down into 25 sequences of 40K each; each sequence has twenty 2K segments, and each sequence is acknowledged individually. We define a Delay Constant between the transfers of two consecutive sequences. On a not-so-busy network, the delay should be zero, so the sender sends all 25 sequences, or 500 segments, without delay. Using a 1 Gb adapter, the whole 1 MB of data goes out in about 10 msec. 25 msec later it arrives at the destination, and each sequence is acknowledged individually; 25 msec after that, all 25 ACKs have come back to the sender. The whole 1 MB is transferred in about 60 msec, not 1.6 seconds. Compared with the 10 msec transfer on a local network, 60 msec is not great, but it is the best we can do, because the 50 msec delay is contributed by the speed of light. A thousand TCP connections will not get rid of the 50 msec delay.

If we decide not to keep this IPS Protocol simple and stupid, we can make the ACK a little more specific by indicating which particular segments are missing, so that only the missing segments are retransmitted. We can even bundle the missing segments from different sequences by defining a retransmission sequence that contains only retransmitted segments. As an adapter designer, I prefer keeping it simple and stupid by retransmitting the whole sequence, and fine-tuning instead by changing the size of a sequence. When a retransmission is necessary, the sender acts as a good citizen by increasing the Delay Constant between sequences; on successful transmission, the sender decreases the Delay Constant. Exactly how aggressively we should back away from a congested network -- that is, how large a jump in the Delay Constant to take -- is left for simulation. I do believe the result will depend on the segment sizes and the latency values. Note that the performance of this protocol does not depend on the MTU size, because it is designed to stream the segments.

Notice that this IPS protocol takes an optimistic view of Internet traffic, i.e., it assumes the traffic is light; if that is not true, it backs off quickly. I believe this is necessary for a network with a long latency delay, because we cannot afford the slow start. A second point about this IPS Protocol is that one ACK is generated per sequence instead of per segment; using a bulk ACK on a busy network with long latency reduces the ACK traffic. The third point is that the IPS assumes the receiver is intelligent enough to generate the bulk ACK. Of course, if an ACK is missing, the missing sequence is detected by timeout and must be retransmitted.
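Here is a minimal C sketch of the sender-side behavior just described: stream every sequence of an exchange with the Delay Constant between sequences, detect un-ACKed sequences by timeout, retransmit whole sequences, and adjust the Delay Constant up on a retransmission and down on success. The function names, the additive adjustment values, and the single check-after-streaming pass are my own simplifying assumptions, not part of any draft; a real provider would process ACKs as they arrive.

    #include <stdbool.h>
    #include <unistd.h>

    #define SEGS_PER_SEQ 20            /* twenty 2K segments per sequence       */

    /* Stubs standing in for the provider's real transmit and ACK machinery.   */
    static void send_sequence(int seq, int nsegs) { (void)seq; (void)nsegs; }
    static bool acked_within(int seq, int ms)     { (void)seq; (void)ms; return true; }

    static int delay_us = 0;           /* the Delay Constant, starts at zero    */

    void send_exchange(int nseq)
    {
        for (int seq = 0; seq < nseq; seq++) {
            send_sequence(seq, SEGS_PER_SEQ);
            if (delay_us)
                usleep(delay_us);      /* zero on a quiet network: full streaming */
        }
        for (int seq = 0; seq < nseq; seq++) {
            if (acked_within(seq, 50)) {
                if (delay_us >= 50)
                    delay_us -= 50;    /* success: creep back toward zero         */
            } else {
                delay_us += 200;       /* congestion suspected: back away         */
                send_sequence(seq, SEGS_PER_SEQ);  /* retransmit the whole sequence */
            }
        }
    }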
We should also use the ACK-0 of Fibre Channel to signal the sender that everything is OK even when some ACKs were not received by the sender. ACK-0 will greatly reduce the retransmissions caused by a missing ACK.

5. Queuing Management

An IPS request/response message is transaction-oriented, i.e., the whole "iSCSI session" is described in a single request/response message to the IPS provider. Within a request, the SCSI command, one or more endpoints (i.e., IP address and TCP port pairs), data descriptors in the form of a memory handle or a scatter/gather list, and other needed variables are provided. The IPS request/response message is sent to an iSCSI provider that is responsible for creating outgoing PDUs and receiving incoming PDUs. To the provider, each message is an exchange between two endpoints; the initiator gives it an OX_ID and the target gives it an RX_ID. Each exchange is executed atomically, i.e., the IPS provider is responsible for sequencing the SCSI command, data, and status.

There are no command queuing or head-of-queue deadlock problems, because the IPS provider maintains a giant exchange table. Whenever a data PDU is received, the IPS provider uses the OX_ID or RX_ID to find the exchange and refers to the exchange table to determine what to do. Data PDUs are served on demand, hence there is no head-of-queue blocking problem. Outgoing data PDUs are broken down into sequences; after transferring each sequence the IPS provider can switch to another exchange to avoid a long delay behind a large exchange. For those who are familiar with a Fibre Channel adapter, executing an IPS request is like executing an FCP request, except for the congestion management described earlier. If more than one endpoint is given in the iSCSI request/response message, the IPS provider can take the liberty of selecting another endpoint to transmit or retransmit. However, when a different endpoint is used, the whole message, or session, is repeated, and a task management PDU such as ABORT may be needed to avoid confusion on the receiver side.

I do appreciate that some people will implement the iSCSI provider on the old-fashioned, stream-oriented TCP protocol instead of this IPS protocol. I have no problem with the working group trying to solve their problems. Personally, I will never implement an iSCSI provider using the stream-oriented TCP protocol; I will implement the aforementioned congestion management in a Fibre Channel adapter today as an IPS provider. As long as an IPS provider deals with the PDUs correctly, it should always interoperate with another node that uses the stream-oriented TCP protocol. Of course, how the two endpoints generate the ACKs must be uniform. When dealing with an IPS provider that uses TCP, the concept of a transfer sequence disappears: each sequence is a single segment, which is ACK'ed individually. By the way, I will never consider multiple TCP paths to reduce latency, because an IPS provider, like a Fibre Channel adapter, is targeted to deliver 50,000 I/Os per second, going to 100,000 I/Os in the near future. The context-switch time between multiple TCP paths would make the 100,000 I/Os impossible. Keeping the segments streaming on the same connection path is the only good solution for a long latency delay.
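Before moving on to resource management, here is a rough C sketch of what one entry in that exchange table might look like, and how an incoming data PDU could be matched to its exchange by OX_ID. The field names, the fixed table size, and the linear lookup are my own illustrative assumptions; a real adapter would size and index the table very differently.

    #include <stdint.h>
    #include <stddef.h>

    struct sg_element { void *addr; uint32_t len; };   /* one data descriptor  */

    /* One entry in the IPS provider's exchange table. A request/response
     * message from the application creates an entry; incoming data PDUs are
     * matched to it by OX_ID/RX_ID. */
    struct exchange {
        uint16_t ox_id;               /* assigned by the initiator             */
        uint16_t rx_id;               /* assigned by the target                */
        uint8_t  scsi_cdb[16];        /* the SCSI command for this exchange    */
        struct sg_element *sgl;       /* data descriptors from the application */
        uint32_t sgl_entries;
        uint32_t next_seq;            /* next sequence to transmit             */
        uint32_t bytes_moved;         /* progress, so the provider can resume  */
                                      /* after switching to another exchange   */
        int      in_use;
    };

    #define MAX_EXCHANGES 4096        /* "a giant exchange table"              */
    static struct exchange exchange_table[MAX_EXCHANGES];

    static struct exchange *find_exchange(uint16_t ox_id)
    {
        for (size_t i = 0; i < MAX_EXCHANGES; i++)
            if (exchange_table[i].in_use && exchange_table[i].ox_id == ox_id)
                return &exchange_table[i];
        return NULL;                  /* no entry: the PDU is unsolicited      */
    }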
6. Resource Management

There are three layers of resource management. First, the BB credit takes care of two nodes connected point-to-point or on the same arbitrated loop. Using BB credit, one node can never overrun the incoming buffer of another node. This does not apply to an iSCSI device connected to Ethernet, where the media access protocol gives one no control over the sender of the incoming segments.

Second, the EE credit is equivalent to the rwnd variable of RFC2581; it governs how many segments a receiver is willing to receive. The EE credit concept is impractical on a network with long latency. Using the earlier example of the one MB transfer, if the EE credit is small, the sender must wait whenever its EE credit is exhausted, and only ACKs can replenish the EE credit; the wait is 50 msec each time. In fact, it is imperative for an IPS provider to use DMA to empty incoming segments from its buffer in lieu of EE-credit management. Using EE credit to slow down the sender on a network with long latency makes the performance unacceptable.

Third, the number of SCSI commands that can be sent to a target device is governed by the SCSI tag queuing concept. The initiator is always aware of the number of SCSI commands that can be sent to a target; it simply does not make sense to send ten commands to a target that can only accept five. After command #6 is rejected with a queue-full status, there is no guarantee that command #7 will also be rejected, because command #1 could complete before #7 arrives. If #7 is not rejected, then #6 and #7 will be executed out of order, which is not acceptable.

With the exception of SCSI tag queuing, an IPS provider cannot use either BB or EE credits; it must use DMA to empty the incoming segments quickly. For those who implement the IPS provider on TCP, EE credit can be used, but then one must pay the price of a network with a long latency delay.

Last, but not least, in the IPS protocol an IPS provider never needs to allocate cache memory to receive PDUs, because it uses the memory supplied by the application software through the data descriptors in the request/response message. Each message sets up one exchange table entry, which saves the data descriptors. When a PDU is received and there is no exchange table entry for it, the segment is unsolicited and is thrown away. In other words, the IPS provider is not responsible for an incoming segment when there is no application program waiting for it. This is like TCP receiving an incoming segment with an invalid port number. Just as with setting up a TCP port, an application program must always instruct the IPS provider to create an exchange table entry in order to receive incoming iSCSI segments.

It is OK to send data to a target right after a SCSI command, without waiting for the Ready-To-Transfer (R2T) from the target. This is known as a streaming transfer. When a target uses an IPS message to receive a SCSI command, it also has the option of providing data descriptors to receive the streamed data without returning an R2T first. The streaming transfer is agreed upon when a connection is made.
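A minimal sketch of that receive path, building on the hypothetical exchange table sketched at the end of section 5: look up the entry, drop the segment if nobody is waiting for it, otherwise DMA the payload straight into the application's data descriptors. The PDU fields and the dma_to_descriptors() helper are illustrative assumptions standing in for the adapter's real parsing and DMA engine.

    struct iscsi_data_pdu {
        uint16_t   ox_id;             /* exchange this segment belongs to      */
        uint32_t   offset;            /* byte offset within the exchange       */
        uint32_t   length;
        const void *payload;
    };

    /* Stand-in for the adapter's DMA engine: copy len bytes into the
     * application's scatter/gather list at the given offset. */
    extern void dma_to_descriptors(struct sg_element *sgl, uint32_t entries,
                                   uint32_t offset, const void *src, uint32_t len);

    void on_data_pdu(const struct iscsi_data_pdu *pdu)
    {
        struct exchange *x = find_exchange(pdu->ox_id);

        if (x == NULL)
            return;                   /* unsolicited: no one is waiting, drop it */

        /* No provider-side buffering: the payload goes directly into the
         * memory the application supplied, then progress is updated.           */
        dma_to_descriptors(x->sgl, x->sgl_entries, pdu->offset,
                           pdu->payload, pdu->length);
        x->bytes_moved += pdu->length;
    }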
7. Multiple NICs

We certainly do not exclude multiple IPS providers. I believe a wedge driver sitting on top of the IPS providers may choose a different one for load balancing, as long as they can reach the same destination. Note that since each IPS request/response message is executed atomically by one IPS provider, no synchronization between providers is needed. On the receiving end, the application software can set up multiple IPS providers to receive incoming requests. I don't know enough about this area to make meaningful comments.

8. Multiple Paths to the Same Destination

The IPS protocol uses the SOCKET, CONNECT, and BIND system calls to make a TCP connection. It is assumed that when there are multiple IP addresses that reach the same destination, the SOCKET data structure will provide that information, which in turn will be given to the IPS provider for retransmission consideration.

Y.P. Cheng, CTO, ConnectCom Solutions Corp.