Re: single vs multiple channels for iSCSI commands

To: meth@il.ibm.com
Subject: Re: single vs multiple channels for iSCSI commands
From: Kevin Quick <kquick@iphase.com>
Date: Tue, 27 Jun 2000 07:45:34 -0500 (CDT)
cc: ips@ece.cmu.edu, scsi-tcp@external.cisco.com
Content-Type: TEXT/PLAIN; charset=US-ASCII
In-Reply-To: <C125690B.00308256.00@d12mta02.de.ibm.com>
Sender: owner-ips@ece.cmu.edu

I wanted to respond with a few concerns I had regarding the following
proposal, but I must begin with two warnings:  (1) I did not attend the
meeting in Haifa, so I may be unaware of substantive discussions in this
area, and (2) I am leaving on vacation shortly and will be unable to
sustain a dialog based on my comments until I return on July 10th.


That said, my primary concern is that the focus on an iSCSI-enabled NIC
seems to be diverging from "legacy" implementation focus.  More precisely,
the focus of this Proposal and related discussions is in identifying
physical paths on the basis of a TCP connection to allow iSCSI-NIC
offloads at the TCP-level, but this path/virtual-circuit association is
not established or required by normal TCP functionality.

Stated another way, it is entirely possible for a single TCP connection to
be distributed across multiple physical NICs in existing networking
implementations.

As a result of this, I see the following constraints, summarized here and 
also noted in the Proposal text below:

1) In the iSCSI work it is imperative that a Channel be a virtual entity,
   tightly bound to the TCP virtual circuit concept.  A Channel should not
   be confused with a physical connection.

2) The specification should enable offloading and traffic distribution,
   but not require it.  Each end must be capable of operating in full
   legacy mode, regardless of the configuration of the other end.

   a) It is not possible for the recipient to pre-configure DMA buffers
      for a Data Channel transfer (as stated in the Proposal) unless that
      DMA configuration is applicable across all recipient NICs or
      duplicated across those NICs.  MAC-level load balancing will ensure
      that the packets comprising a Data Transfer are spread across
      receiver ports.

   b) Recipient offload functionality will be limited by this effect.  The
      recipient host system (/embedded controller/whatever) will need to
      be involved to successfully collate all the information needed to
      reconstruct and manage a Data Transfer.

3) Whilst a separate Control Channel sounds like a good idea from the
   perspective of preventing data transfers from perturbing management
   capabilities, I have become less convinced of the usefulness on further
   inspection, and I can foresee a large number of problems for the
   recipient due to this separation (note that the recipient is the
   target in a WRITE or the initiator in a READ). 

   a) What happens to the recipient from the Data perspective if a Command
      is sent over the Control Channel, subsequently cancelled via the
      Control Channel, and then a new Command is issued?  The recipient
      must have some way of determining if the Data (read from a separate
      TCP stream on a separate controller with separate buffering and
      data servicing) is associated with the cancelled Command or the new
      Command.  A CRN, Exchange ID, or similar identification tag will
      help, but not entirely.  Imagining multiple Gigabit Ethernet
      connections and keeping in mind a target of 30000 - 40000 SCSI
      operations per second and the aforementioned Gigabit Ethernet
      wraparound duration of just a few seconds, it's very conceivable
      that lots and lots of management operations have happened before the
      data manages to work its way into the processor's attention.

      Fibre Channel can deal with most of this (although there are still
      issues and active, current discussions on FC reflectors about those
      issues) because it has a tightly-coupled low-latency configuration.
      It's usually very possible for the initiator to know whether a
      target has a SCSI operation or is in the process of a Data Transfer
      on behalf of that specific SCSI operation, so cancelling or
      otherwise affecting that SCSI operation is fairly deterministic.

      For IP, and especially WAN considerations, with large NIC and socket
      buffers, the hysteresis window becomes very large and it becomes
      very difficult to deterministically control individual SCSI
      operations.

   b) I'm not sure the Command/Data channel focus addresses the correct
      issues.  In an implementation, an iSCSI packet should not sit for a
      long period of time on inbound NIC queues or sockets unless the
      receiving host processor is becoming overloaded.  If the receiver is
      becoming overloaded, expediting Command operation handling will
      probably only increase the overload (i.e. the recipient will get
      further and further behind).  Instead, standard network flow control
      seems appropriate to indicate the recipient's overload to the
      originator.

      A Task Management Channel does seem appropriate to allow operations
      like Target Reset and such to be performed independently of command
      processing, with expedited handling of Task Management Channel
      communications.  This is fairly infrequent traffic, however.
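
The cancel-then-reissue case in 3a can be sketched in Python.  This is
only an illustration under stated assumptions: the Recipient class, the
reuse of tag 7, and the per-tag generation counter are all hypothetical
and come from neither the Proposal nor any draft.  It shows one reading
of why a tag helps "but not entirely": if a tag value is reused, data
that arrives late for the cancelled Command still matches the new one.

```python
from dataclasses import dataclass, field


@dataclass
class Recipient:
    """Tracks live commands by tag; every name here is hypothetical."""
    live: dict = field(default_factory=dict)      # tag -> generation of live command
    next_gen: dict = field(default_factory=dict)  # tag -> next generation to assign

    def command(self, tag):
        gen = self.next_gen.get(tag, 0)
        self.next_gen[tag] = gen + 1
        self.live[tag] = gen
        return gen

    def cancel(self, tag):
        self.live.pop(tag, None)

    def accept_by_tag(self, tag):
        # Tag-only matching: any data carrying a live tag is accepted.
        return tag in self.live

    def accept_by_tag_and_gen(self, tag, gen):
        # Accept only data that belongs to the live instance of the tag.
        return self.live.get(tag) == gen


r = Recipient()
old = r.command(tag=7)   # Command issued on the Control Channel
r.cancel(tag=7)          # ...then cancelled on the Control Channel
new = r.command(tag=7)   # new Command reuses the same tag

# Data for the cancelled Command now arrives late on a Data Channel.
stale_by_tag = r.accept_by_tag(7)
stale_by_gen = r.accept_by_tag_and_gen(7, old)
fresh_by_gen = r.accept_by_tag_and_gen(7, new)
print(stale_by_tag, stale_by_gen, fresh_by_gen)  # True False True
```

The generation counter is just one possible disambiguator; the sketch
does not address how long the recipient must remember old generations
when buffering delays are large.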


I'll be glad to address any responses when I return on the 10th of July.

Regards,
  Kevin

________________________________________________________________________
Kevin Quick         Interphase Product Development           Project UDI
kquick@iphase.com           Dallas,  Texas                      Chairman
+1 214 654 5173             www.iphase.com            www.projectudi.org


On Tue, 27 Jun 2000 meth@il.ibm.com wrote:

: Date: Tue, 27 Jun 2000 11:49:42 +0300
: From: meth@il.ibm.com
: To: ips@ece.cmu.edu, scsi-tcp@external.cisco.com
: Subject: single vs multiple  channels for iSCSI commands
: 
: Proposal to support single Control Channel with multiple Data Channels
: in the iSCSI protocol.
: 
: by Kalman Meth
: 27 June 2000
: 
: In our discussions on the iSCSI protocol, we came to the conclusion that
: we needed to send data over multiple channels in order to make best use
: of the available network resources. We also were inclined to
: have all of the channels acting in a symmetric manner so as to simplify
: the protocol by not having to deal differently with some channels.
: This allows vendors to introduce uniform iSCSI NICs for all of the
: network connections that will be exploited by iSCSI.
: 
: We decided on allowing commands to be sent over any of the multiple
: connections, with the command's data and status being sent in the same
: channel that was used to issue the command.

I like this model.  Each command is wholly contained within a Channel,
therefore ordering and management sequencing is preserved.  I'm not sure
if this was discussed in Haifa, but I would think that the best focus
would be to associate a Channel with a LU.  All SCSI operations to that LU
must be performed over that Channel; operations for other LUs might share
this channel or use a unique channel.  This preserves command ordering for
the target LU.
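
A minimal Python sketch of the Channel-per-LU association described
above.  ChannelMap and its first-come assignment policy are hypothetical
names and choices made for illustration; the only property taken from
the text is that every operation for a given LU uses one Channel, while
several LUs may share a Channel.

```python
class ChannelMap:
    """Pins each LU to one channel; several LUs may share a channel."""

    def __init__(self, num_channels):
        self.num_channels = num_channels
        self.lu_to_channel = {}

    def channel_for(self, lun):
        # First use of an LU picks a channel; later commands reuse it,
        # so all commands for that LU stay ordered on one TCP stream.
        if lun not in self.lu_to_channel:
            self.lu_to_channel[lun] = len(self.lu_to_channel) % self.num_channels
        return self.lu_to_channel[lun]


cmap = ChannelMap(num_channels=2)
sent = [(lun, cmap.channel_for(lun)) for lun in [0, 1, 0, 2, 1, 0]]
print(sent)  # [(0, 0), (1, 1), (0, 0), (2, 0), (1, 1), (0, 0)]
```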


: The use of multiple channels to pass commands introduced a complication
: of servicing the commands on the receiving end in the original order
: that the commands were issued. We had a further complication when one
: of the connections failed; how do we determine which command got lost
: on a broken connection, and what actions are required to recover from
: the failed connection. The solution we found to these problems
: (introducing a Command Reference Number and placing the commands back
: in order on the receiver's end) introduced flow control problems,
: such as maintaining a window on commands to ensure that we don't overrun
: the reference count, and that we don't block up all of the channels
: just because one channel failed and its lost command causes us to fill
: up the command queue on the target (while we wait for the lost command
: to arrive).
: 
: I would like us to go back and consider a variation of the model we
: originally proposed with one Command Channel and multiple Data Channels.
: Some ideas that came up during our discussions are included below and
: also apply to the symmetric model.
: 
: Session establishment: as in existing draft.
: Naming: as in existing draft with adjustments from design discussions.
: security: as decided in design discussions.
:      (0) none
:      (1) challenge/response
:      (2) IPSec or SSL
: 
: 
: 
: Normal case:
: 
: An iSCSI session between an initiator and a target consists of a
: number of TCP connections. Each TCP connection between initiator and
: target requires an iSCSI login. The first established connection of a
: session between initiator and target (numbered 0) is the Control
: Connection (also called Control Channel).
: Subsequent connections between the same initiator and target can be
: added to an existing session upon request of the initiator during login.
: These connections are numbered 1,2,3, etc, and are called Data
: Connections (also called Data Channels).
: An initiator may establish several sessions with the same target, each
: session having its own Control Channel and its own set of Data Channels.
: 
: All SCSI commands and task management messages will go over the Control
: Connection. Order is maintained within a single session by virtue of all
: commands going through the same TCP connection.
: The iSCSI packets for RTT and Data may go over any of the channels.
: iSCSI Login must be performed on each of the connections.
: iSCSI Ping may be performed over any of the connections.
: 
: It is recommended that large data transfers be performed on the Data
: Channels (rather than the Control Channel) so as to ensure that the
: Control Channel is always free. It is permissible, however, to
: establish a single connection and perform all iSCSI operations on that
: single channel.
: 
: On a READ or WRITE command, the initiator specifies on which channel it
: expects to perform the data transfer. This gives the initiator and
: target a chance to set up buffers for DMA ahead of time.

Legacy networking considerations would seem to prevent this unless the DMA
setup was replicated across all recipient NICs or unless the platform DMA
characteristics allow a single mapping for multiple NICs.
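
The effect can be sketched in Python.  Round-robin distribution is used
here purely as a stand-in for MAC-level load balancing (real policies
vary, and many keep a flow on one port); the NIC names and segment
numbers are hypothetical.  The point illustrated is only that a DMA
mapping configured on one NIC covers just the segments that happen to
arrive there, unless the mapping is replicated.

```python
from itertools import cycle


def distribute(segments, nics):
    """Spread the segments of one TCP connection across NICs (round-robin)."""
    arrivals = {nic: [] for nic in nics}
    for seg, nic in zip(segments, cycle(nics)):
        arrivals[nic].append(seg)
    return arrivals


def placed_directly(arrivals, nics_with_mapping):
    """Segments that land on a NIC holding the pre-configured DMA mapping."""
    return sorted(s for nic in nics_with_mapping for s in arrivals[nic])


nics = ["nic0", "nic1"]
arrivals = distribute(list(range(6)), nics)
only_nic0 = placed_directly(arrivals, ["nic0"])   # mapping on one NIC only
replicated = placed_directly(arrivals, nics)      # mapping on every NIC
print(only_nic0, replicated)  # [0, 2, 4] [0, 1, 2, 3, 4, 5]
```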

: Once a data transfer for a particular SCSI command begins on a
: particular Data Channel, all subsequent data that is transferred for the
: same SCSI command is to be transferred over the same Data Channel.

For this model: why?  As long as the response isn't sent until all the
data is confirmed as received, this doesn't seem necessary as a
requirement, although it may be desirable as an option (e.g. for an
iSCSI-NIC).
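
A rough Python sketch of that alternative: the recipient tracks which
bytes of a command's data have arrived, ignores which channel carried
them, and treats the transfer as complete only when every byte is
present.  WriteTracker and its fields are hypothetical, and a set of
byte offsets is used only to keep the example short; a real
implementation would track ranges.

```python
class WriteTracker:
    """Collects data for one command regardless of the arrival channel."""

    def __init__(self, expected_length):
        self.expected_length = expected_length
        self.received = set()  # byte offsets seen so far

    def on_data(self, channel, offset, length):
        # The channel is deliberately ignored: only the byte range matters.
        self.received.update(range(offset, offset + length))

    def complete(self):
        # The response (status) would be sent only once this is true.
        return len(self.received) == self.expected_length


t = WriteTracker(expected_length=8)
t.on_data(channel=1, offset=0, length=4)
before = t.complete()
t.on_data(channel=2, offset=4, length=4)
after = t.complete()
print(before, after)  # False True
```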

: On RTT, the target confirms on which channel it is expecting the data
: transfer. An RTT request will be sent over the same channel as the
: expected data transfer (as was specified by the initiator).

In a READ operation, I would think the initiator would want to control the
data channel use.

: If the target decides (for whatever reason) that it wants to receive the
: data transfer on another channel, it sends the RTT over the Control
: Channel with an indication as to which Data Channel it wants to use.
: It is understood that this may entail a performance
: cost on the initiator's side to now move the data transfer to another
: Data Channel (which may be another NIC, thus requiring DMA to be set
: up all over again). A target will usually change the connection for
: a data transfer only in case of some problem it has with the originally
: specified connection (unresponsive connection, or couldn't handle
: large amount of data on specified connection, etc).

I'm not sure I understand the purposes for changing a connection,
especially from a recipient's perspective.  Generally, the sender is much
more aware of any network difficulties than the recipient (designing a NIC
that can't handle a large amount of data transfer is an implementation
weakness IMO).  The sender is usually aware of connectivity or
responsiveness problems, so as long as the receiver (or the
specification/protocol) *doesn't* impose any unnecessary restrictions on
the data transfer, it seems like it should be the sender's prerogative to
determine how the data is sent.

: 
: Commands may be sent with immediate data (in the Control Channel) if the
: immediate data is small (say less than 8K), thereby avoiding the need to
: later match up the data with the corresponding command. A bit in the
: iSCSI command header indicates that there is immediate data.
: An initiator may also send unsolicited data (no RTT) over the Data
: Channels, in case the initiator and target have agreed (during login
: on the Control Channel) to not use RTT.

SCSI commands have a desirable attribute of being small.  They can
usually be received in a single packet and receiver buffer, even for WAN.

Sending an entire 8K of data along with the command imposes significant
resource requirements on the recipient and is at cross-purposes with the
flow control inherent in the SCSI XFER-READY phase.  Again, don't forget
WAN issues.
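
A back-of-the-envelope Python calculation of that resource cost.  It
combines the Proposal's 8K figure with the 30000 operations/second
figure used elsewhere in this message, and assumes (as an upper bound
only) that every command carries a full 8K of immediate data.  The 256
outstanding commands is a hypothetical queue depth.

```python
IMMEDIATE_DATA_BYTES = 8 * 1024  # "say less than 8K" in the Proposal


def control_channel_bits_per_second(ops_per_second,
                                    immediate_bytes=IMMEDIATE_DATA_BYTES):
    """Immediate-data load on the Control Channel, headers excluded."""
    return ops_per_second * immediate_bytes * 8


def buffer_bytes(outstanding_commands, immediate_bytes=IMMEDIATE_DATA_BYTES):
    """Buffer space the recipient must be ready to commit up front."""
    return outstanding_commands * immediate_bytes


load = control_channel_bits_per_second(30_000)
buffers = buffer_bytes(256)
print(load, buffers)  # 1966080000 2097152
```

Under those assumptions the immediate data alone approaches 2 Gbit/s on
the single Control Channel, and the recipient commits 2 MiB of buffers
before any flow control has a chance to act.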

: 
: The initiator and target may renegotiate the use (or non-use) of RTT
: between commands, using an iSCSI Text command.
: The initiator sends the request to the target and does not send any
: other commands to the target until the target has responded.
: The change in using RTT will take effect with the command following the
: response of the target.

I'm not sure I fully understand what an RTT is, but I don't much like the
statement above.  It sounds like all SCSI operations handling is stalled
while this RTT renegotiation is performed.

Viewing one end of the spectrum as a couple of GE connections between two
machines that are about 3 feet apart and probably crunching through
30000-40000 SCSI operations/second, a full command stall sounds pretty
expensive.

From the other end of the spectrum, with a 24Kbaud WAN connection halfway
across the globe, anything that isn't queued and requires a complete
round-trip before continuing sounds pretty expensive.

Maybe this RTT renegotiation isn't too frequent, but the "between
commands" text makes it sound like it is.
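
To put a rough number on the stall, here is a small Python calculation.
The round-trip times are hypothetical; the operation rates are the
30000-40000 per second quoted above.  It counts how many commands would
otherwise have been issued while the initiator waits one round trip for
the target's response.

```python
def commands_stalled(ops_per_second, round_trip_microseconds):
    """Commands that could have been issued during one round-trip wait."""
    return ops_per_second * round_trip_microseconds // 1_000_000


lan_low = commands_stalled(30_000, 100)      # assumed 100 us LAN round trip
lan_high = commands_stalled(40_000, 1_000)   # assumed 1 ms LAN round trip
print(lan_low, lan_high)  # 3 40
```

On a WAN the rate is far lower but the round trip far longer, so the
cost appears as latency added to every queued command rather than as
lost throughput.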

: 
: The status of a READ command is sent with the last data packet,
: thus allowing hardware implementations to perform a single interrupt
: when the entire data transfer has completed.
: Similarly, a flag in a data packet sent from initiator to target
: indicates the last data buffer in an unsolicited WRITE operation.

SCSI is a command/response protocol, so I'm not sure what an unsolicited
WRITE is.  Perhaps another RTT issue that I'm not aware of?

Are you doing away with the SCSI Status packet?  I would think that an
iSCSI-NIC would be capable of generating a single interrupt on receipt of
the Status packet, regardless of the number of intervening data packets.
This is how most Fibre Channel cards operate.


: If the initiator sends unsolicited data for a WRITE operation
: (i.e. without an RTT) over one of the Data Channels, it is possible
: that the data will arrive before the command arrived on the Control
: Channel. It is also possible that the target will not have enough
: buffers to receive the unsolicited data. The target has the option of
: placing the unsolicited data in reserve buffers or of completely
: discarding the data. If the target discards the data, the target will
: later issue an RTT to instruct the initiator to resend the data.

OK, I'm surmising that RTT = Ready_To_Transfer and is different from
SCSI's XFER_RDY operation in that RTT applies across multiple commands
(requiring command stalling as above).  I'm still not clear as to why RTT
would be issued except as a response to a Command.

An unsolicited WRITE sounds like a SCSI Write with "Auto-XFER_RDY" mode,
which is much easier to handle (and grant) from a target's perspective if
the data follows the command in the same channel.  If separate channels
occur, you have the possible data-lead problem you noted above... to me
this is yet another complication of separating Command and Data channels.
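
The data-lead case from the quoted text can be sketched in Python.  The
Target class, the tags, and the single reserve slot are hypothetical.
The sketch covers only data that arrives before its command: the target
either parks it in a reserve buffer or discards it and later issues an
RTT so the initiator resends.

```python
class Target:
    """Holds unsolicited data that arrives before its command, or drops it."""

    def __init__(self, reserve_slots):
        self.reserve_slots = reserve_slots
        self.early_data = {}    # tag -> data that arrived ahead of its command
        self.discarded = set()  # tags whose early data was dropped

    def on_data(self, tag, data):
        if len(self.early_data) < self.reserve_slots:
            self.early_data[tag] = data
        else:
            self.discarded.add(tag)

    def on_command(self, tag):
        # Action taken once the command arrives on the Control Channel.
        if tag in self.early_data:
            return ("write", self.early_data.pop(tag))
        self.discarded.discard(tag)
        return ("send_rtt", None)


target = Target(reserve_slots=1)
target.on_data(tag=1, data=b"aaaa")  # fits in the reserve buffer
target.on_data(tag=2, data=b"bbbb")  # no room: discarded
first = target.on_command(1)
second = target.on_command(2)
print(first, second)  # ('write', b'aaaa') ('send_rtt', None)
```

With the data following the command on the same channel, none of this
bookkeeping is needed, because the command is always seen first.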

: 
: 
: Multiple iSCSI NICs:
: 
: One argument to support the symmetric model was to allow having
: identical iSCSI NICs to handle all iSCSI connections. In the
: symmetric model, since all channels look alike, all of the (identical)
: NICs can be fully utilized.
: 
: We argue that even in the model with one Control connection and many
: Data Connections we can still utilize the NICs to their maximum.
: 
: The main operations to be implemented by iSCSI NICs will be to send
: data packets and RTTs. Data Channels can be spread across these iSCSI
: NICs. The less frequent iSCSI operations (and especially recovery)
: can be performed in software in a device driver.
: Note also that a Control Channel and a Data Channel can
: go over the same wire (NIC) even if they are different TCP connections.
: In order to handle additional iSCSI operations in hardware,
: vendors can introduce fancier NICs that also handle some other iSCSI
: operations.
: 
: A target may use one NIC to handle the Control channel from one
: initiator, and another NIC to handle the Control channel from another
: initiator. Thus, even if all NICs can handle the entire iSCSI set,

I'm not sure how the target would be capable of directing initiators in
this way.

: they can still be utilized to the maximum by using each NIC for the
: Control Channel of a different session. Similarly, if an initiator has
: devices on several targets, it can use each NIC to handle the Control
: Connection of a different session.
: An initiator can also open multiple sessions with the same target
: using a different NIC for the Control Channel of the different sessions.

Your method would require NIC-to-NIC communications to complete individual
iSCSI operations if full host offload is desired.

OTOH, because of MAC-level load-balancing causing inbound traffic
distribution across multiple NICs, offloading in a system with
multiple NICs is problematic under either model.


: 
: Recovery:
: 
: An initiator must hold on to data it has sent via a WRITE operation
: until it has received the status for the corresponding command.
: Even if the initiator sends immediate data (in the Control Channel) or
: unsolicited data (in one of the Data Channels), the target may discard
: the data in case it didn't have the resources to handle the data at that
: instant. The target may then request that the data be resent with an
: RTT.
: A target need not keep a copy of the data buffers it has sent, if
: such data can be regenerated from the storage device.
: However, the target must keep around the status information until it has
: been acknowledged by the initiator. The initiator sends Status Ack info
: (a new iSCSI message type) over the Control Channel.
: If strict ordering between commands is needed (such as reading and
: writing of the same device) then the application must perform the
: proper synchronization by not issuing the second command until it has
: received the status of the first command (as in linked commands).

Changing heuristics for applications will cause compatibility issues.
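
For reference, the retention rules in the quoted Recovery text can be
sketched in Python.  The class and method names are hypothetical; the
behaviour is taken from the quoted text only: the initiator keeps WRITE
data until it receives status, and the target keeps status until it
receives a Status Ack.

```python
class Initiator:
    def __init__(self):
        self.pending_writes = {}  # tag -> data kept for a possible resend

    def write(self, tag, data):
        self.pending_writes[tag] = data

    def on_rtt(self, tag):
        # Target asked for the data (again): resend the retained copy.
        return self.pending_writes[tag]

    def on_status(self, tag):
        # Status received: release the data and acknowledge the status.
        del self.pending_writes[tag]
        return ("status_ack", tag)


class TargetStatus:
    def __init__(self):
        self.unacked = {}  # tag -> status kept until acknowledged

    def complete(self, tag, status):
        self.unacked[tag] = status

    def on_status_ack(self, tag):
        self.unacked.pop(tag, None)


ini, tgt = Initiator(), TargetStatus()
ini.write(5, b"payload")
resent = ini.on_rtt(5)            # target discarded the data, asks again
tgt.complete(5, "GOOD")
ack = ini.on_status(5)            # initiator may now drop its copy
tgt.on_status_ack(ack[1])         # target may now drop the status
print(resent, ini.pending_writes, tgt.unacked)  # b'payload' {} {}
```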

: 
: If it seems that a connection has stopped functioning, then either
: the initiator or the target may issue an iSCSI Ping command to determine
: if the connection is still alive. (A bit in the Ping header determines
: which side initiated the ping operation.) If the Ping operation times
: out, then it may be assumed that the connection is not functioning
: properly. When a Ping operation fails, the connection should immediately
: be closed.
: 
: Note: It is not required to support iSCSI level recovery.
: It is sufficient for the initiator to report failure for the commands
: that did not complete and let the upper layer protocol handle the
: recovery.
: In this case, all channels of the session should be closed, all data
: structures should be cleaned up, and a new session may be established
: between the initiator and target.
: 
: There is an advanced recovery mechanism that MAY be implemented by
: the initiator and target, as described below.

[elided]