RE: A Transport Protocol Without ACK

(My apology for this long reply. I hope it is worth your reading.)

> From: randall@stewart.chicago.il.us
> [mailto:randall@stewart.chicago.il.us]
>
> I am a bit confused by the above Y.P. you state "by the returning of
> status PDU."... Both SCTP and TCP will carry a piggyback ACK with
> that PDU, so you end up accomplishing the same thing. What are you
> trying to say that I am missing???

The piggybacked ACK saves extra PDUs but does not solve the buffer
requirement for long latency. In my example, if we have 20 milliseconds
of round-trip time on an IP network with a gigabit backbone, then to
keep the data streaming on the net we must have 2MB of buffer just in
case we need to retransmit the data.

By the way, in SCSI read/write data transfer, the receiver sends
nothing back until the status phase. Therefore, there is nothing to
piggyback on. The SCSI protocol for data transfer is basically
half-duplex, not full-duplex. Sending ACKs requires extra PDUs.

> Y.P. please enumerate the protocols that have this property that also
> provide TCP friendly congestion control. If you could enumerate the
> exact protocols and pointers to the specifications I would be more
> than glad to have a look at these and see if I can support them.
> Making vague references to "not limiting itself to TCP/IP" does not
> do anything for me and I think nothing for the WG. We need specific
> transport protocols listed that are capable of transporting iSCSI AND
> have TCP friendly congestion control principles built into them...

I am not an expert at making a transport protocol proposal. However,
let me use a bottom-up approach by describing how an iSCSI transport
layer should work. Other people can help in making it an IETF proposal.
I participated in this discussion with the intention of providing the
working group information on the latest NIC adapter technology, so that
the iSCSI proposal can better serve the NIC adapter industry as well as
the community that uses TCP/IP.

In this response, I will address two topics:

1) The inefficiency of using TCP/IP to implement iSCSI
2) What we can do in the transport layer to overcome the inefficiency
   of iSCSI on TCP/IP (here I am stealing ideas from VI, TCP/RDMA, and
   FCP)

(Disclaimer: my apology in advance if my view on TCP/IP is incorrect
herein. After all, I am a career adapter designer.)

For iSCSI to use TCP/IP, it uses SOCKET, CONNECT, or BIND to first make
a connection point, which is an (IP address, TCP port) pair. The
asymmetric model provides a second TCP port; a multi-path to another
node has a second IP address. After connecting -- with one or more
connection points and paths -- the iSCSI layer creates multiple PDUs:
command, data, and status. A SCSI initiator uses a WRITE call to tell
an IP NIC to send the PDUs. The iSCSI driver is aware that there could
be multiple NIC cards. A SCSI target LISTENs for the incoming PDUs. It
may listen on multiple NIC cards.

I will not repeat the queuing and blocking problems of the iSCSI driver
in dealing with multiple application programs with many TCP/IP ports,
or the issues of connecting to multiple targets or initiators. We will
address only the performance issue of the stream- and
connection-oriented delivery of TCP/IP.
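To make the buffer arithmetic concrete, here is a small C sketch of the
bandwidth-delay calculation. It assumes the round numbers used in this
post -- roughly 100 MB/s of payload on a 1 gigabit link and a 20
millisecond round trip -- not measured values:

    /* Back-of-the-envelope check of the buffer figures in this post. */
    #include <stdio.h>

    int main(void)
    {
        double payload_rate = 100e6;  /* bytes/sec, assumed effective rate */
        double rtt          = 0.020;  /* seconds, round-trip time */
        double frame_size   = 1024.0; /* bytes per data frame */

        /* Bandwidth-delay product: data in flight awaiting an ACK. */
        double bdp = payload_rate * rtt;
        printf("retransmit buffer needed: ~%.0f KB (~%.0f 1K frames)\n",
               bdp / 1024.0, bdp / frame_size);

        /* With only 200 1K buffers, the sender stalls waiting for ACKs. */
        double window      = 200 * frame_size;  /* bytes in flight */
        double utilization = window / bdp;      /* fraction of link used */
        printf("200-buffer window runs at %.0f%% of maximum\n",
               utilization * 100.0);
        return 0;
    }

This prints roughly 2MB (about 2000 1K frames) and 10%, matching the
figures used in the example below.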
As in the example of my previous posting, to keep write data streaming
on a 1 gigabit connection with a 20 millisecond round-trip latency, the
initiator must have 2000 1K buffers hanging around for retransmitting
lost data packets. If it has only 200 1K buffers allocated for a
target, the initiator can send 200K of data in 2 milliseconds and must
then wait 18 milliseconds for the first ACK to come back. Therefore, it
runs at 10% of the possible maximum throughput. A target uses R2T to
control how much of its resources each initiator can consume. However,
it has no choice but to provide 2000 1K buffers to receive the incoming
data for the maximum possible performance.

To get TCP/IP data, the target uses READs to get data from the IP NIC
cards. The memory-to-memory copying done while processing the TCP
stack, looking for a TCP port number in each IP packet, is the greatest
culprit of all in the TCP/IP performance problem. Companies like
Alacritech build special TCP adapters that address this by doing the
port lookup in the adapter. The good news is that the above performance
problem has already been addressed by the VI and FCP implementations in
the latest NIC adapters.

Here is my proposed iSCSI transport layer protocol: a TRANSACTION
ORIENTED WITH BULK ACKNOWLEDGMENT protocol.

Instead of using READs and WRITEs on data streams as with TCP/IP, an
iSCSI driver should send a SCSI request or response to the transport
layer using SEND-REQUEST and RECEIVE-RESPONSE messages. These messages
contain the IP end-point connection, the SCSI command bytes, and data
buffer descriptors supplied by the application software. Each message
describes a transaction EXCHANGE, which can have an exchange ID. (iSCSI
calls this the Initiator Task Tag, although a task can have multiple
SCSI commands.) The iSCSI driver still uses SOCKET, CONNECT, and BIND
to create connections.

It is true that if we used a totally connectionless protocol like UDP
to transmit 10 megabytes of data on a busy Internet, we would be
forever retransmitting due to lost-frame errors. However, instead of
sending an ACK for every data frame, we can steal the idea from Fibre
Channel of breaking a transaction exchange down into data sequences,
each a collection of data frames. The receiver needs only to
acknowledge a sequence, which has a unique sequence ID. A sequence with
a lost data frame will be retransmitted. Using a sliding window,
multiple sequences can be in flight at once. This is how we keep data
frames streaming on a network with a long latency time. The size of a
data sequence is, of course, network dependent.

Having the data descriptors provided by the application software is the
greatest benefit of this proposal for a transport layer. There is no
data buffering as with TCP/IP. The transport layer does not have to
allocate a huge buffer to keep data frames streaming on a network with
a long latency delay. It uses the buffers provided by the application
software. It can always retransmit a data frame, because the
application software must stay around until the transaction exchange is
complete. In VI, the application software allocates a memory segment,
gives it a handle, and passes the handle to a remote node to allow
remote DMA. Therefore, the data descriptors of this transport protocol
can simply be a memory handle for a memory segment previously created.
TCP/RDMA is copying this idea.

Each transaction exchange is executed by the NIC driver atomically.
Hundreds or even thousands of SEND-REQUESTs and RECEIVE-RESPONSEs can
be outstanding in the driver.
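For illustration only, here is one way the SEND-REQUEST message and its
descriptors might be laid out in C. Every name and field here is
hypothetical -- a sketch of the idea, not taken from any iSCSI, VI, or
FCP specification:

    #include <stdint.h>

    /* VI-style handle to a memory segment that the application has
     * registered in advance, so the NIC can DMA directly to or from
     * the application's own buffers -- no transport-layer copy. */
    typedef struct {
        uint32_t mem_handle; /* handle returned at registration time */
        uint64_t offset;     /* offset into the registered segment */
        uint32_t length;     /* bytes covered by this descriptor */
    } buf_desc_t;

    /* One transaction exchange, handed to the NIC driver atomically. */
    typedef struct {
        uint32_t   exchange_id;  /* like the Initiator Task Tag */
        uint32_t   endpoint;     /* index of an (IP address, port) pair */
        uint8_t    scsi_cdb[16]; /* the SCSI command bytes */
        uint16_t   n_desc;       /* number of data buffer descriptors */
        buf_desc_t desc[8];      /* application-supplied buffers */
    } send_request_msg_t;

    /* Bulk acknowledgment: the receiver ACKs whole sequences, never
     * individual frames; a bad sequence is retransmitted as a unit. */
    typedef struct {
        uint32_t exchange_id; /* which transaction exchange */
        uint32_t sequence_id; /* the sequence being acknowledged */
        uint8_t  ok;          /* 0 = frame lost, retransmit sequence */
    } sequence_ack_t;

Because the descriptors point into memory the application registered up
front, the driver can retransmit any sequence at any time without
keeping its own copy of the data.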
After sending the SCSI command PDUs, the data and status PDUs are
handled on demand by the NIC driver. There are no queuing and deadlock
problems. The detection of lost data frames is a function of the
transport layer, which specifies the QoS (Quality of Service). Flow
control is done by EE-credit granting by a receiver, so no one can
overflow its resources. This is the same as the Max---RN discussed in
iSCSI. Congestion control is managed by alternative NICs or IP
endpoints. Both should be a part of the transport protocol.

I don't claim any credit for this transport layer protocol. Every Fibre
Channel and InfiniBand adapter designer knows about this protocol --
although there is no standard. I am sure the TCP accelerator cards are
doing the same. This protocol is a great alternative to the use of
TCP/IP and should be incorporated into iSCSI.
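P.S. As a sketch of the EE-credit idea above, with hypothetical names
throughout: the receiver grants credits as its buffers free up, and the
sender may only put a sequence on the wire while it holds an unused
credit, so it can never overflow the receiver's resources:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t granted;  /* total credits granted by the receiver */
        uint32_t consumed; /* sequences this sender has transmitted */
    } ee_credit_t;

    /* Sender side: may another sequence be transmitted right now? */
    static bool credit_available(const ee_credit_t *c)
    {
        return c->consumed < c->granted;
    }

    static void consume_credit(ee_credit_t *c)
    {
        c->consumed++; /* one credit per sequence on the wire */
    }

    /* Receiver side: grant more credits as receive buffers free up. */
    static void grant_credits(ee_credit_t *c, uint32_t freed_buffers)
    {
        c->granted += freed_buffers;
    }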