RE: a vote for asymmetric connections in a session

Dear Mr. Cheng,

I have some trouble with your note, as it contains many items: some that were repeatedly discussed and already agreed upon, and some with smaller or larger misunderstandings (like the duplex issue - the links are in fact used in duplex mode even for iSCSI, as R2T and data can flow on the same links, and outbound and inbound data can flow simultaneously for different commands; deadlocks are not caused by execution speed, or the lack of it; an NT miniport serves a port driver, which in turn serves a class driver; UDP is not more efficient than TCP - although it has a better-matching datagram model, it lacks reliable delivery and congestion control; etc.).

I will try to summarize your subject-line position for my records (and our list's) - and please correct me if I am wrong:

- you are against the asymmetric model as it requires more work to execute a SCSI command than the symmetric model.

Regards,
Julo

"Y P Cheng" <ycheng@advansys.com> on 11/09/2000 02:13:02

Please respond to "Y P Cheng" <ycheng@advansys.com>

To: John Hufferd/San Jose/IBM@IBMUS, Julian Satran/Haifa/IBM@IBMIL, black_david@emc.com
cc: ips@ece.cmu.edu
Subject: RE: a vote for asymmetric connections in a session

(I apologize for this long response. However, I hope it is worth your reading.)

John Wrote:
>I think I understood what you said in the context
>of the Symmetric model, but could you please take
>me through how this would occur in the Asymmetric
>when you have at least two connections?

Juliano Wrote:
>A note of caution. The most serious dead-lock situation
>we are aware of stems from a mix of RTT (or should we
>call it R2T to accommodate Doug Ottis?) and unsolicited
>immediate data. If channels are full with unsolicited
>data and the target requests something else - that something
>else will not get through. This dead-lock, as far as I can
>tell exists in all transports.
>A target should be able to
>detect it and iSCSI has provided for the target
>to be able to drop data and reclaim them later with R2T.

An asymmetric connection, while doable, makes NO sense with the technologies used in today's NIC adapters. Before people flame me for such a statement, I will support my position with my understanding of today's NICs, their drivers, and the iSCSI protocol. I don't know enough about mainframe NIC design, so in a different context my position could be wrong. I welcome teaching on mainframe NIC designs from anyone.

Beyond the position on asymmetric connections, I also take the position that failover is a function of a protocol and could be incorporated in iSCSI. Load balancing is a function of a NIC adapter and its driver. Detection of an incomplete SCSI session due to traffic congestion or lack of resources is the responsibility of a SCSI initiator, not a SCSI target.

Here are the reasons for my positions. To understand my arguments, one must understand the context of my analysis. Let's start with the terms used herein.

"Transport Connection" -- It is a unique pair of endpoints (IP address, port number), one sending and one receiving. A SCSI initiator may have many connections to send commands to different SCSI targets, which in turn have many connections to receive commands from different initiators. The SOCKET system call returns a handle and a data structure which stores an IP address and port number. The BIND and CONNECT system calls duplicate the socket structure and return a unique port number. (Note, this is a software port number. Later, we will mention the hardware ports on a NIC adapter.) By duplicating the sockets and their handles, an initiator or target supports multiple connections. Often, people mistake the SCSI transport connection -- a socket -- for a TCP connection. In fact, UDP is a better connection protocol for iSCSI, as I shall argue later.
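The endpoint-pair idea can be illustrated with a minimal sketch. It assumes TCP over the loopback interface and Python's standard `socket` module, neither of which appears in the original message, and the function names are hypothetical. Note that in the BSD-style socket API used here, it is `accept()` on the listening side that yields a new socket for each connection, while `connect()` causes the OS to assign the client an ephemeral port.

```python
import socket

def open_loopback_connection():
    """Return (client, server_side, listener) sockets connected over loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())   # OS assigns the client's ephemeral port
    server_side, _ = listener.accept()       # new socket dedicated to this connection
    return client, server_side, listener

def endpoint_pair(sock):
    """Return ((local_ip, local_port), (remote_ip, remote_port)) for a connected socket."""
    return sock.getsockname(), sock.getpeername()
```

Each side sees the same two (IP address, port number) endpoints in opposite order, and it is that pair which distinguishes one connection from every other connection sharing the target's listening port.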
"Server Client Protocol" -- A SCSI target is a server which enters a passive state listening to incoming SCSI commands. It does so by the SOCKET and BIND system calls which establish a receiving endpoint. A SCSI initiator is a client which enters an active state to send SCSI commands. It does so by the SOCKET and CONNECT system calls which establish a sending endpoint. The domain name to IP address conversion in the BIND and CONNECT are provided by an IP name server or the Address Resolution Protocol (ARP). "Peer to Peer" -- A SCSI target may become an initiator to start third party SCSI commands. Acting as an initiator, it does so by starting a new transport connection to another SCSI target. An iSCSI endpoint is either sending or receiving, but never both. Therefore, a SCSI storage device uses one connection to receive commands and another connection for send a third party copy commands. If the sending initiator can also act as a target, then, we have the appearance of peer-to-peer with two transport connections. Note, SCSI is never a full-duplex protocol. "Multi-path Connection" -- If a SCSI target can be reached by more than one IP addresses, the CONNECT system call on a SCSI initiator returns a list of addresses in the socket data structure. This list will be used for load balance and failover recovery. "SCSI Session" -- It is a stateless transaction between an initiator and a target. The session has a request and response relationship. The request is a SCSI command and the response is data-transfer-and-status. SCSI commands like mode select and sense and iSCSI messages like security and authentication create state information. But, they are uninteresting in this scope of this discussion. "iSCSI driver" -- A NIC adapter driver supports one or more NIC adapters who in turn support iSCSI protocols. For a legacy NIC adapter like old Ethernet cards, the driver much build the iSCSI messages. 
For a new NIC adapter like the fibre channel adapters or even gigabit Ethernet, the driver simply sends a SCSI request to the adapter, which in turn builds the iSCSI messages. The new NIC adapter can accept a few hundred or even thousands of SCSI requests. The iSCSI driver is a miniport driver -- in Windows/NT terminology -- running under the SCSI class driver, which sends SCSI requests to SCSI devices. Needless to say, application software or file system code sends requests to SCSI devices.

"iSCSI NIC adapter" -- A NIC adapter with one or more functional interfaces and one or more ports connecting to the IP gateways executes the iSCSI requests and responses. It sends requests for a SCSI initiator and receives them for a SCSI target. Therefore, a NIC adapter is running in initiator mode, in target mode, or in both. A multi-function NIC adapter can accept FCP requests from one functional interface, iSCSI from another, and even VI from a third. A dual-channel adapter will have two ports connecting to two different physical paths, say, one to the intranet and another to the internet. A NIC adapter has transmit and receive buffers for outgoing and incoming SCSI messages and data. When the receive buffer is full, incoming messages will be dropped and lost.

Now, here is my argument for why asymmetric connections make NO sense in the context of the NIC adapter technologies that I understand.

For an asymmetric connection, if the iSCSI driver is running on old legacy NIC adapters, it must send the SCSI command to one adapter and set up the data transfer on another. While with great difficulty one may make these two adapters talk to each other to coordinate the command and data sequences, the newer NIC adapters execute hundreds or even thousands of SCSI requests and responses "atomically." Therefore, there is no deadlock problem between processing commands and data in the context of either a SCSI initiator or a target.
Furthermore, even with NIC adapters built with the latest technology, having two functional interfaces accepting command and data requests separately, there is nothing gained because the SCSI requests are executed atomically by the adapters. In the era of a NIC adapter that executes a whole iSCSI request in 25 microseconds, it does not make sense to have two NIC adapters split the command and data processing, with the coordination itself taking much more time.

The problem of lost SCSI messages due to traffic congestion must be detected by the sender who times out the responses, in this case a SCSI initiator. The congestion problem cannot be managed by the BB credit used in FCP. For an end-to-end connection to a switch, or within an arbitrated loop, one can use BB credit to manage the traffic. But there is no way to manage that in an Ethernet connection because of the collision avoidance protocol. In addition, the gateway can lose packets too. When an initiator is in New York and a target is in Los Angeles, one can't afford a zero initial BB credit due to the long latency. With a non-zero initial BB credit, hundreds of initiators around the world may send requests at the same time. Therefore, traffic jams and loss of packets must be handled by smooth recovery in iSCSI.

Only the initiator sends requests, and the target only returns responses. Therefore, it is very easy for the initiator to detect the loss of messages by setting proper timeout values. A target must accept at least one request from an initiator; it must manage its resource allocation with RTT (R2T). Once a timeout on a SCSI request is detected by an initiator, the microcode on the NIC adapter is quite capable of sending the request again, even on another path for failover recovery -- if the adapter has a second port to reach the same target. If not, the NIC iSCSI driver can try another path. In resending the request, yes, the issue of duplicated requests must be managed.
However, this is a well-understood problem when retry is allowed in a protocol. Notice, I never say the SCSI target will initiate a retry. If necessary, the target always sends a status message requesting the initiator to retry.

Similar to retry, for load balancing the NIC adapter microcode and the iSCSI driver of an initiator are quite capable of selecting a different port or NIC adapter to send a SCSI request, as long as the adapter or the driver is made aware of the multiple paths in the socket data structure, which was filled in at the time of making the connection. To keep the design simple, the target does not, should not, and must not take on the responsibility of load balancing.

On striping the data transfers across multiple connections, I do believe we are using four 2.5 gigabit MAC chips to get the 10 gigabit fibre channel, Ethernet, and InfiniBand connections. In fact, the 12x option of InfiniBand stripes the data across 12 MAC chips to get a three gigabytes per second data rate. Striping data across multiple NIC adapters would be too difficult for the poor adapter designers to do.

On using UDP instead of TCP for iSCSI, I am having trouble with TCP because it is stream based. The READ and WRITE system calls are extremely inefficient for block-oriented SCSI data transfer. On the other hand, the UDP datagram, using the SEND and RECEIVE system calls, is better suited for iSCSI. In fact, I believe NFS is built on UDP. The request and response relationship between SCSI initiator and target makes the connectionless UDP protocol possible. FCP is implemented using the class 3 fibre channel protocol, which is designed for datagrams.

Finally, comments on the resources used by the initiator and target. A SCSI initiator has self-regulating resource allocation. When there are no resources to start new processes to initiate new SCSI requests, the SCSI requests cease. For each SCSI request, the required resources are pre-allocated, waiting for responses from a target.
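The self-regulating behaviour of the initiator can be pictured as a fixed pool of pre-allocated request slots. The sketch below is only an illustration under that assumption. The class name, method names, and the use of a semaphore are hypothetical and do not come from any iSCSI draft or driver.

```python
import threading

class RequestSlots:
    """Fixed pool of pre-allocated request slots on an initiator (hypothetical)."""

    def __init__(self, depth):
        # One permit per pre-allocated request context.
        self._free = threading.BoundedSemaphore(depth)

    def try_start(self):
        """Claim a slot for a new SCSI request; False means no resources are left."""
        return self._free.acquire(blocking=False)

    def complete(self):
        """Release the slot once the target's response (or a timeout) arrives."""
        self._free.release()
```

With a depth of two, a third `try_start()` fails until one outstanding request completes, so new requests simply cease when the initiator runs out of resources.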
A SCSI target receives requests from everyone on the net. While it must have room to accept new SCSI requests -- which can be done at login by specifying the queue depth -- it needs RTT (R2T) to control the buffer space for data transfers. This position has already been expressed by many storage controller people. I need not repeat the position here.

Y.P. Cheng, CTO, ConnectCom Solutions Corp.