Re: single vs multiple channels for iSCSI commands

I wanted to respond with a few concerns I had regarding the following proposal, but I must begin with two warnings: (1) I did not attend the meeting in Haifa, so I may be unaware of substantive discussions in this area, and (2) I am leaving on vacation shortly and will be unable to sustain a dialog based on my comments until I return on July 10th.

That said, my primary concern is that the focus on an iSCSI-enabled NIC seems to be diverging from "legacy" implementation focus. More precisely, the focus of this Proposal and related discussions is on identifying physical paths on the basis of a TCP connection to allow iSCSI-NIC offloads at the TCP-level, but this path/virtual-circuit association is not established or required by normal TCP functionality. Stated another way, it is entirely possible for a single TCP connection to be distributed across multiple physical NICs in existing networking implementations.

As a result of this, I see the following constraints, summarized here and also noted in the Proposal text below:

1) In the iSCSI work it is imperative that a Channel be a virtual entity, tightly bound to the TCP virtual circuit concept. A Channel should not be confused with a physical connection.

2) The specification should enable offloading and traffic distribution, but not require it. Each end must be capable of operating in full legacy mode, regardless of the configuration of the other end.

   a) It is not possible for the recipient to pre-configure DMA buffers for a Data Channel transfer (as stated in the Proposal) unless that DMA configuration is applicable across all recipient NICs or duplicated across those NICs. MAC-level load balancing will ensure that the packets comprising a Data Transfer are spread across receiver ports.

   b) Recipient offload functionality will be limited by this effect.
      The recipient host system (/embedded controller/whatever) will need to be involved to successfully collate all the information needed to reconstruct and manage a Data Transfer.

3) Whilst a separate Control Channel sounds like a good idea from the perspective of preventing data transfers from perturbing management capabilities, I have become less convinced of the usefulness on further inspection, and I can foresee a large number of problems for the recipient due to this separation (note that the recipient is the target in a WRITE or the initiator in a READ).

   a) What happens to the recipient from the Data perspective if a Command is sent over the Control Channel, subsequently cancelled via the Control Channel, and then a new Command is issued? The recipient must have some way of determining if the Data (read from a separate TCP stream on a separate controller with separate buffering and data servicing) is associated with the cancelled Command or the new Command. A CRN, Exchange ID, or similar identification tag will help, but not entirely. Imagining multiple Gigabit Ethernet connections and keeping in mind a target of 30000 - 40000 SCSI operations per second and the aforementioned Gigabit Ethernet wraparound duration of just a few seconds, it's very conceivable that lots and lots of management operations have happened before the data manages to work its way into the processor's attention.

      Fibre Channel can deal with most of this (although there are still issues and active, current discussions on FC reflectors about those issues) because it has a tightly-coupled low-latency configuration. It's usually very possible for the initiator to know whether a target has a SCSI operation or is in the process of a Data Transfer on behalf of that specific SCSI operation, so cancelling or otherwise affecting that SCSI operation is fairly deterministic.
      For IP, and especially WAN considerations, with large NIC and socket buffers, the hysteresis window becomes very large and it becomes very difficult to deterministically control individual SCSI operations.

   b) I'm not sure the Command/Data channel focus addresses the correct issues. In an implementation, an iSCSI packet should not sit for a long period of time on inbound NIC queues or sockets unless the receiving host processor is becoming overloaded. If the receiver is becoming overloaded, expediting Command operation handling will probably only increase the overload (i.e. the recipient will get further and further behind). Instead, standard network flow control seems appropriate to indicate the recipient's overload to the originator.

A Task Management Channel does seem appropriate to allow operations like Target Reset and such to be performed independently of command processing, with expedited handling of Task Management Channel communications. This is fairly infrequent traffic, however.

I'll be glad to address any responses when I return on the 10th of July.

Regards,
  Kevin

________________________________________________________________________
Kevin Quick          Interphase Product Development     Project UDI
kquick@iphase.com    Dallas, Texas                      Chairman
+1 214 654 5173      www.iphase.com                     www.projectudi.org

On Tue, 27 Jun 2000 meth@il.ibm.com wrote:

: Date: Tue, 27 Jun 2000 11:49:42 +0300
: From: meth@il.ibm.com
: To: ips@ece.cmu.edu, scsi-tcp@external.cisco.com
: Subject: single vs multiple channels for iSCSI commands
:
: Proposal to support single Control Channel with multiple Data Channels
: in the iSCSI protocol.
:
: by Kalman Meth
: 27 June 2000
:
: In our discussions on the iSCSI protocol, we came to the conclusion that
: we needed to send data over multiple channels in order to make best use
: of the available network resources.
: We also were inclined to
: have all of the channels acting in a symmetric manner so as to simplify
: the protocol by not having to deal differently with some channels.
: This allows vendors to introduce uniform iSCSI NICs for all of the
: network connections that will be exploited by iSCSI.
:
: We decided on allowing commands to be sent over any of the multiple
: connections, with the command's data and status being sent in the same
: channel that was used to issue the command.

I like this model. Each command is wholly contained within a Channel, therefore ordering and management sequencing is preserved. I'm not sure if this was discussed in Haifa, but I would think that the best focus would be to associate a Channel with a LU. All SCSI operations to that LU must be performed over that Channel; operations for other LU's might share this channel or use a unique channel. This preserves command ordering for the target LU.

: The use of multiple channels to pass commands introduced a complication
: of servicing the commands on the receiving end in the original order
: that the commands were issued. We had a further complication when one
: of the connections failed; how do we determine which command got lost
: on a broken connection, and what actions are required to recover from
: the failed connection. The solution we found to these problems
: (introducing a Command Reference Number and placing the commands back
: in order on the receiver's end) introduced flow control problems,
: such as maintaining a window on commands to ensure that we don't overrun
: the reference count, and that we don't block up all of the channels
: just because one channel failed and its lost command causes us to fill
: up the command queue on the target (while we wait for the lost command
: to arrive).
:
: I would like us to go back and consider a variation of the model we
: originally proposed with one Command Channel and multiple Data Channels.
: Some ideas that came up during our discussions are included below and
: also apply to the symmetric model.
:
: Session establishment: as in existing draft.
: Naming: as in existing draft with adjustments from design discussions.
: security: as decided in design discussions.
:    (0) none
:    (1) challenge/response
:    (2) IPSec or SSL
:
: Normal case:
:
: An iSCSI session between an initiator and a target consists of a
: number of TCP connections. Each TCP connection between initiator and
: target requires an iSCSI login. The first established connection of a
: session between initiator and target (numbered 0) is the Control
: Connection (also called Control Channel).
: Subsequent connections between the same initiator and target can be
: added to an existing session upon request of the initiator during login.
: These connections are numbered 1,2,3, etc, and are called Data
: Connections (also called Data Channels).
: An initiator may establish several sessions with the same target, each
: session having its own Control Channel and its own set of Data Channels.
:
: All SCSI commands and task management messages will go over the Control
: Connection. Order is maintained within a single session by virtue of all
: commands going through the same TCP connection.
: The iSCSI packets for RTT and Data may go over any of the channels.
: iSCSI Login must be performed on each of the connections.
: iSCSI Ping may be performed over any of the connections.
:
: It is recommended that large data transfers be performed on the Data
: Channels (rather than the Control Channel) so as to ensure that the
: Control Channel is always free. It is permissible, however, to
: establish a single connection and perform all iSCSI operations on that
: single channel.
:
: On a READ or WRITE command, the initiator specifies on which channel it
: expects to perform the data transfer. This gives the initiator and
: target a chance to set up buffers for DMA ahead of time.
Legacy networking considerations would seem to prevent this unless the DMA setup was replicated across all recipient NICs or unless the platform DMA characteristics allow a single mapping for multiple NICs.

: Once a data transfer for a particular SCSI command begins on a
: particular Data Channel, all subsequent data that is transferred for the
: same SCSI command is to be transferred over the same Data Channel.

For this model: why? As long as the response isn't sent until all the data is confirmed as received, this doesn't seem necessary as a requirement, although it may be desirable as an option (e.g. for an iSCSI-NIC).

: On RTT, the target confirms on which channel it is expecting the data
: transfer. An RTT request will be sent over the same channel as the
: expected data transfer (as was specified by the initiator).

In a READ operation, I would think the initiator would want to control the data channel use.

: If the target decides (for whatever reason) that it wants to receive the
: data transfer on another channel, it sends the RTT over the Control
: Channel with an indication as to which Data Channel it wants to use.
: It is understood that this may entail a performance
: cost on the initiator's side to now move the data transfer to another
: Data Channel (which may be another NIC, thus requiring DMA to be set
: up all over again). A target will usually change the connection for
: a data transfer only in case of some problem it has with the originally
: specified connection (unresponsive connection, or couldn't handle
: large amount of data on specified connection, etc).

I'm not sure I understand the purposes for changing a connection, especially from a recipient's perspective. Generally, the sender is much more aware of any network difficulties than the recipient (designing a NIC that can't handle a large amount of data transfer is an implementation weakness IMO).
The sender is usually aware of connectivity or responsiveness problems, so as long as the receiver (or the specification/protocol) *doesn't* impose any unnecessary restrictions on the data transfer, it seems like it should be the sender's prerogative to determine how the data is sent.

: Commands may be sent with immediate data (in the Control Channel) if the
: immediate data is small (say less than 8K), thereby avoiding the need to
: later match up the data with the corresponding command. A bit in the
: iSCSI command header indicates that there is immediate data.
: An initiator may also send unsolicited data (no RTT) over the Data
: Channels, in case the initiator and target have agreed (during login
: on the Control Channel) to not use RTT.

SCSI commands have a desirable attribute of being small. They can usually be received in a single packet and receiver buffer, even for WAN. Sending an entire 8K of data along with the command imposes significant resource requirements on the recipient and is at cross-purposes with the flow control inherent in the SCSI XFER-READY phase. Again, don't forget WAN issues.

: The initiator and target may renegotiate the use (or non-use) of RTT
: between commands, using an iSCSI Text command.
: The initiator sends the request to the target and does not send any
: other commands to the target until the target has responded.
: The change in using RTT will take effect with the command following the
: response of the target.

I'm not sure I fully understand what an RTT is, but I don't much like the statement above. It sounds like all SCSI operations handling is stalled while this RTT renegotiation is performed. Viewing one end of the spectrum as a couple of GE connections between two machines that are about 3 feet apart and probably crunching through 30000-40000 SCSI operations/second, a full command stall sounds pretty expensive.
From the other end of the spectrum, with a 24Kbaud WAN connection halfway across the globe, anything that isn't queued and requires a complete round-trip before continuing sounds pretty expensive. Maybe this RTT renegotiation isn't too frequent, but the "between commands" text makes it sound like it is.

: The status of a READ command is sent with the last data packet,
: thus allowing hardware implementations to perform a single interrupt
: when the entire data transfer has completed.
: Similarly, a flag in a data packet sent from initiator to target
: indicates the last data buffer in an unsolicited WRITE operation.

SCSI is a command/response protocol, so I'm not sure what an unsolicited WRITE is. Perhaps another RTT issue that I'm not aware of?

Are you doing away with the SCSI Status packet? I would think that an iSCSI-NIC would be capable of generating a single interrupt on receipt of the Status packet, regardless of the number of intervening data packets. This is how most Fibre Channel cards operate.

: If the initiator sends unsolicited data for a WRITE operation
: (i.e. without an RTT) over one of the Data Channels, it is possible
: that the data will arrive before the command arrived on the Control
: Channel. It is also possible that the target will not have enough
: buffers to receive the unsolicited data. The target has the option of
: placing the unsolicited data in reserve buffers or of completely
: discarding the data. If the target discards the data, the target will
: later issue an RTT to instruct the initiator to resend the data.

OK, I'm surmising that RTT = Ready_To_Transfer and is different from SCSI's XFER_RDY operation in that RTT applies across multiple commands (requiring command stalling as above). I'm still not clear as to why RTT would be issued except as a response to a Command.
An unsolicited WRITE sounds like a SCSI Write with "Auto-XFER_RDY" mode, which is much easier to handle (and grant) from a target's perspective if the data follows the command in the same channel. If separate channels occur, you have the possible data-lead problem you noted above... to me this is yet another complication of separating Command and Data channels.

: Multiple iSCSI NICs:
:
: One argument to support the symmetric model was to allow having
: identical iSCSI NICs to handle all iSCSI connections. In the
: symmetric model, since all channels look alike, all of the (identical)
: NICs can be fully utilized.
:
: We argue that even in the model with one Control connection and many
: Data Connections that we can still utilize the NICs to their maximum.
:
: The main operations to be implemented by iSCSI NICs will be to send
: data packets and RTTs. Data Channels can be spread across these iSCSI
: NICs. The less frequent iSCSI operations (and especially recovery)
: can be performed in software in a device driver.
: Note also that a Control Channel and a Data Channel can
: go over the same wire (NIC) even if they are different TCP connections.
: In order to handle additional iSCSI operations in hardware,
: vendors can introduce fancier NICs that also handle some other iSCSI
: operations.
:
: A target may use one NIC to handle the Control channel from one
: initiator, and another NIC to handle the Control channel from another
: initiator. Thus, even if all NICs can handle the entire iSCSI set,

I'm not sure how the target would be capable of directing initiators in this way.

: they can still be utilized to the maximum by using each NIC for the
: Control Channel of a different session. Similarly, if an initiator has
: devices on several targets, it can use each NIC to handle the Control
: Connection of a different session.
: An initiator can also open multiple sessions with the same target
: using a different NIC for the Control Channel of the different sessions.
Your method would require NIC-to-NIC communications to complete individual iSCSI operations if full host offload is desired. OTOH, because of MAC-level load-balancing causing inbound traffic distribution across multiple NICs, offloading in a system with multiple NICs is problematic under either model.

: Recovery:
:
: An initiator must hold on to data it has sent via a WRITE operation
: until it has received the status for the corresponding command.
: Even if the initiator sends immediate data (in the Control Channel) or
: unsolicited data (in one of the Data Channels), the target may discard
: the data in case it didn't have the resources to handle the data at that
: instant. The target may then request that the data be resent with an
: RTT.
: A target need not keep a copy of the data buffers it has sent, if
: such data can be regenerated from the storage device.
: However, the target must keep around the status information until it has
: been acknowledged by the initiator. The initiator sends Status Ack info
: (a new iSCSI message type) over the Control Channel.
: If strict ordering between commands is needed (such as reading and
: writing of the same device) then the application must perform the
: proper synchronization by not issuing the second command until it has
: received the status of the first command (as in linked commands).

Changing heuristics for applications will cause compatibility issues.

: If it seems that a connection has stopped functioning, then either
: the initiator or the target may issue an iSCSI Ping command to determine
: if the connection is still alive. (A bit in the Ping header determines
: which side initiated the ping operation.) If the Ping operation times
: out, then it may be assumed that the connection is not functioning
: properly. When a Ping operation fails, the connection should immediately
: be closed.
:
: Note: It is not required to support iSCSI level recovery.
: It is sufficient for the initiator to report failure for the commands
: that did not complete and let the upper layer protocol handle the
: recovery.
: In this case, all channels of the session should be closed, all data
: structures should be cleaned up, and a new session may be established
: between the initiator and target.
:
: There is an advanced recovery mechanism that MAY be implemented by
: the initiator and target, as described below.

[elided]
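
For anyone wanting to see concern 3(a) in concrete terms, here is a minimal Python sketch of a recipient that gets commands and cancels on the Control Channel and matches late data from a Data Channel by tag alone. Everything in it is an assumption made for illustration: the `Receiver` class, the tag values, and the 16-bit tag width are hypothetical and are not defined by the proposal or by any iSCSI draft. It only shows why a tag "will help, but not entirely": a late packet is dropped while its tag stays unused, and is misattributed as soon as the tag is reused.

```python
from typing import Dict, Optional


class Receiver:
    """Toy recipient: commands arrive on the Control Channel, data elsewhere."""

    def __init__(self) -> None:
        self.active: Dict[int, str] = {}  # hypothetical tag -> command name

    def command(self, tag: int, name: str) -> None:
        self.active[tag] = name

    def cancel(self, tag: int) -> None:
        self.active.pop(tag, None)

    def data(self, tag: int) -> Optional[str]:
        """Return the command a data packet is attributed to, or None if dropped."""
        return self.active.get(tag)


def late_data_owner(tag_reused: bool) -> Optional[str]:
    """Command A (tag 5) is cancelled, B is issued, then A's data shows up late."""
    rx = Receiver()
    rx.command(5, "A")
    rx.cancel(5)
    rx.command(5 if tag_reused else 6, "B")
    return rx.data(5)  # this packet was written for A


def seconds_until_tag_reuse(tag_bits: int, ops_per_second: int) -> float:
    """Time for a sequentially assigned tag of the given width to wrap."""
    return 2 ** tag_bits / ops_per_second


print(late_data_owner(tag_reused=False))   # None: stale data is dropped
print(late_data_owner(tag_reused=True))    # B: stale data is misattributed
print(seconds_until_tag_reuse(16, 40000))  # 1.6384
```

Under these assumptions a 16-bit tag at 40000 operations per second comes around again in well under two seconds, which is short compared to the buffering delays discussed above for WAN paths. A wider tag pushes the reuse point out but does not remove it.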