RE: a vote for asymmetric connections in a session

To: "John Hufferd/San Jose/IBM" <hufferd@us.ibm.com>, <julian_satran@il.ibm.com>, <black_david@emc.com>
Subject: RE: a vote for asymmetric connections in a session
From: "Y P Cheng" <ycheng@advansys.com>
Date: Sun, 10 Sep 2000 16:13:02 -0700
Cc: <ips@ece.cmu.edu>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;charset="iso-8859-1"
Importance: Normal
In-Reply-To: <OF5E486E89.B40B8302-ON88256955.005EA3DF@LocalDomain>
Sender: owner-ips@ece.cmu.edu
(I apologize for this long response.  However, I hope it is worth your
reading.)
John wrote:
>I think I understood what you said in the context
>of the Symmetric model, but could you please take
>me through how this would occur in the Asymmetric
>when you have at least two connections?
Julian wrote:
>A note of caution. The most serious dead-lock situation
>we are aware of stems from a mix of RTT (or should we
>call it R2T to accommodate Doug Ottis?) and unsolicited
>immediate data. If channels are full with unsolicited
>data and the target requests something else - that something
>else will not get through. This dead-lock, as far as I can
>tell exists in all transports. A target should be able to
>detect it and iSCSI has provided for the target
>to be able to drop data and reclaim them later with R2T.

An asymmetric connection, while doable, makes NO sense given the technology
used in today's NIC adapters.  Before people flame me for such a statement,
I will support my position with my understanding of today's NICs, their
drivers, and the iSCSI protocol. I don't know enough about mainframe NIC
design.  Therefore, in a different context, my position could be wrong, and
I welcome being taught about mainframe NIC designs by someone.  Beyond my
position on asymmetric connections, I also take the position that failover
is a function of a protocol and could be incorporated in iSCSI.  Load
balancing is a function of a NIC adapter and its driver. Detection of an
incomplete SCSI session due to traffic congestion or lack of resources is
the responsibility of a SCSI initiator, not a SCSI target.  Here are the
reasons for my positions:

To understand my arguments, one must understand the context of my
analysis.  Let's start with the terms used herein.
"Transport Connection" -- It is a unique pair of endpoints (IP address, port
number), one sending and one receiving.  A SCSI initiator may have many
connections to send commands to different SCSI targets, which in turn have
many connections to different initiators to receive commands.  The SOCKET
system call returns a handle and a data structure which stores an IP address
and port number.  The BIND and CONNECT system calls duplicate the socket
structure and return a unique port number. (Note, this is a software port
number.  Later, we will mention the hardware ports on a NIC adapter.)  By
duplicating the sockets and their handles, an initiator or target supports
multiple connections.   Often, people mistake the SCSI transport connection
-- a socket -- for a TCP connection.  In fact, UDP is a better connection
protocol for iSCSI, as I shall argue later.

"Server Client Protocol" -- A SCSI target is a server which enters a passive
state listening for incoming SCSI commands.  It does so by the SOCKET and
BIND system calls which establish a receiving endpoint.  A SCSI initiator is
a client which enters an active state to send SCSI commands.  It does so by
the SOCKET and CONNECT system calls which establish a sending endpoint.  The
domain name to IP address conversion needed for BIND and CONNECT is provided
by an IP name server; the Address Resolution Protocol (ARP) then maps the IP
address to a hardware address.
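
The passive and active roles described above can be sketched with the
Berkeley sockets calls as Python exposes them.  This is only an
illustration on the loopback interface: TCP is used because it makes the
two endpoints easy to see, the OS-chosen port stands in for whatever port
iSCSI ends up using, and the function name is made up for this example.

```python
import socket

def open_loopback_connection():
    """Open one connection on loopback and return the endpoints involved."""
    # Target: passive endpoint via SOCKET + BIND (+ LISTEN, since this is TCP).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0 = let the OS choose (assumption)
    listener.listen(1)
    target_endpoint = listener.getsockname()

    # Initiator: active endpoint via SOCKET + CONNECT.
    initiator = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    initiator.connect(target_endpoint)
    accepted, peer = listener.accept()

    # The connection is identified by the pair (initiator endpoint, target endpoint).
    initiator_endpoint = initiator.getsockname()
    for s in (accepted, initiator, listener):
        s.close()
    return initiator_endpoint, target_endpoint, peer

initiator_endpoint, target_endpoint, peer = open_loopback_connection()
```

The address the target sees as its peer is the initiator's (IP address,
port number) pair, which is the "unique pair of endpoints" meant above.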

"Peer to Peer" -- A SCSI target may become an initiator to start third party
SCSI commands.  Acting as an initiator, it does so by starting a new
transport connection to another SCSI target.  An iSCSI endpoint is either
sending or receiving, but never both.  Therefore, a SCSI storage device uses
one connection to receive commands and another connection to send
third-party copy commands.  If the sending initiator can also act as a
target, then we have the appearance of peer-to-peer with two transport
connections.
Note, SCSI is never a full-duplex protocol.

"Multi-path Connection" -- If a SCSI target can be reached by more than one
IP address, the CONNECT system call on a SCSI initiator returns a list of
addresses in the socket data structure.  This list will be used for load
balancing and failover recovery.
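
A rough sketch of how an initiator might obtain and use such an address
list follows.  Note the assumptions: in the sockets API as Python exposes
it, the list comes from name resolution (getaddrinfo) rather than from the
connect call itself, port 3260 is only a placeholder, and the helper names
and the 10.0.0.x addresses are invented for the example.

```python
import socket

def candidate_paths(host, port):
    """Return the distinct (address, port) pairs that name resolution yields."""
    paths = []
    for _family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, family=socket.AF_INET, type=socket.SOCK_STREAM):
        path = (sockaddr[0], sockaddr[1])
        if path not in paths:
            paths.append(path)
    return paths

def first_working_path(paths, try_connect):
    """Failover: return the first path for which try_connect(path) succeeds."""
    for path in paths:
        try:
            try_connect(path)
            return path
        except OSError:
            continue    # this path is down; fall through to the next one
    raise OSError("no path to target")

def demo_connect(path):
    """Hypothetical stand-in for a real connect: pretend 10.0.0.1 is down."""
    if path[0] == "10.0.0.1":
        raise OSError("path down")

resolved = candidate_paths("127.0.0.1", 3260)
chosen = first_working_path([("10.0.0.1", 3260), ("10.0.0.2", 3260)], demo_connect)
```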

"SCSI Session" -- It is a stateless transaction between an initiator and a
target.  The session has a request and response relationship.  The request
is a SCSI command and the response is data-transfer-and-status.  SCSI
commands like mode select and sense and iSCSI messages like security and
authentication create state information.  But, they are uninteresting in
the scope of this discussion.

"iSCSI driver" -- A NIC adapter driver supports one or more NIC adapters who
in turn support iSCSI protocols.  For a legacy NIC adapter like old Ethernet
cards, the driver must build the iSCSI messages.  For a new NIC adapter like
the fibre channel adapters or even gigabit Ethernet, the driver simply sends
a SCSI request to the adapter which in turn builds the iSCSI messages.  The
new NIC adapter can accept a few hundred or even thousands of SCSI
requests.  The iSCSI driver is a miniport driver -- in Windows/NT
terminology -- running under the SCSI class driver which sends SCSI requests
to SCSI devices.  Needless to say, application software or file system code
sends requests to SCSI devices.

"iSCSI NIC adapter" -- A NIC adapter with one or more functional interfaces
and one or more ports connecting to the IP gateways executes the iSCSI
requests and responses.  It sends requests for a SCSI initiator and receives
them for a SCSI target.  Therefore, a NIC adapter is running in either
initiator mode or target mode or in both.  A multi-function NIC adapter
can accept FCP requests from one functional interface and iSCSI from another
and even VI from a third interface.  A dual-channel adapter will have two
ports connecting to two different physical paths, say, one to an intranet
and another to the Internet.  A NIC adapter has transmit and receive buffers
for outgoing and incoming SCSI messages and data. When the receive buffer is
full, incoming messages will be dropped and lost.

Now, here is my argument for why asymmetric connections make NO sense in
the context of the NIC adapter technologies that I understand.  For an
asymmetric connection, if the iSCSI driver is running on old legacy NIC
adapters, it must send the SCSI command to one adapter and set up the data
transfer on another.  While with great difficulty one may make these two
adapters talk to each other to coordinate the command and data sequences,
the newer NIC adapters execute hundreds or even thousands of SCSI requests
and responses "atomically."  Therefore, there is no deadlock problem between
processing commands and data in the context of either a SCSI initiator or a
target.  Furthermore, even with NIC adapters built with the latest
technology, having two functional interfaces accepting command and data
requests separately, there is nothing gained because the SCSI requests are
executed atomically by the adapters.  In the era of a NIC adapter that
executes a whole iSCSI request in 25 microseconds, it does not make sense to
have two NIC adapters split the command and data processing when the
coordination itself takes much more time.

The problem of lost SCSI messages due to traffic congestion must be
detected by the sender who times out the responses, in this case, a SCSI
initiator.   The congestion problem cannot be managed by the BB credit used
in FCP.  For a direct connection to a switch or on an arbitrated loop,
one can use BB credit to manage the traffic. But there is no way to manage
that in an Ethernet connection because of its collision detection protocol
(CSMA/CD).  In addition, gateways can lose packets too.  When an initiator
is in New York and a target in Los Angeles, one can't afford a zero initial
BB credit due to the long latency.  With a non-zero initial BB credit,
hundreds of initiators around the world may send requests at the same time.
Therefore, iSCSI must recover smoothly from traffic jams and loss of
packets.  Only the initiator sends requests and the target only returns
responses.  Therefore, it is very easy for the initiator to detect the loss
of messages by setting proper timeout values.  A target must accept at least
one request from an initiator; it must manage its resource allocation with
RTT (R2T).
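
To put a rough number on the latency argument, here is a small
back-of-the-envelope calculation.  The figures are assumptions chosen only
for illustration (a 40 ms coast-to-coast round trip and a 1 Gbit/s link);
neither comes from a measurement.

```python
def bytes_idle_while_waiting(round_trip_ms, link_bits_per_s):
    """Bytes the link could have carried during one round trip spent waiting."""
    return round_trip_ms * link_bits_per_s // (1000 * 8)

# Assumed figures, for illustration only: 40 ms round trip, 1 Gbit/s link.
idle = bytes_idle_while_waiting(40, 1_000_000_000)
print(idle, "bytes of capacity unused per round trip of waiting")
```

With a zero initial credit, every request would first pay one such round
trip before any data could move, which is the cost referred to above.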

Once a timeout on a SCSI request is detected by an initiator, the microcode
on the NIC adapter is quite capable of sending the request again, even on
another path for failover recovery -- if the adapter has a second port to
reach the same target.  If not, the NIC iSCSI driver can try another path.
In resending the request, yes, the issue of duplicated requests must be
managed.  However, this is a well-understood problem when retry is allowed
in a protocol.  Notice, I never said the SCSI target will initiate a retry.
If necessary, the target always sends a status message requesting the
initiator to retry.
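
The retry and duplicate-request handling described above might look like
the following simulation.  It is a sketch under stated assumptions: the
task tag used to recognise duplicates, the class names, and the "drop the
first N responses" loss model are all invented for the example and are not
taken from the iSCSI draft.

```python
class Target:
    """Executes each tagged request once; replays the status for duplicates."""
    def __init__(self):
        self.completed = {}    # task tag -> status already returned
        self.executions = 0

    def handle(self, tag, command):
        if tag in self.completed:
            return self.completed[tag]    # duplicate: do not execute again
        self.executions += 1
        status = "GOOD:" + command
        self.completed[tag] = status
        return status

class LossyPath:
    """A path to the target that loses its first `drop_responses` responses."""
    def __init__(self, target, drop_responses):
        self.target = target
        self.drops_left = drop_responses

    def send(self, tag, command):
        status = self.target.handle(tag, command)
        if self.drops_left > 0:
            self.drops_left -= 1
            return None                   # the initiator sees a timeout
        return status

def issue(paths, tag, command, max_attempts=4):
    """Initiator side: on timeout, resend the same tag on the next path."""
    for attempt in range(max_attempts):
        status = paths[attempt % len(paths)].send(tag, command)
        if status is not None:
            return status, attempt + 1
    raise TimeoutError("no response for tag %r" % (tag,))

target = Target()
paths = [LossyPath(target, drop_responses=1), LossyPath(target, drop_responses=0)]
status, attempts = issue(paths, tag=7, command="READ")
```

Here the first response is lost, the initiator retries on the second path,
and the target answers from its record instead of executing the command
twice.  The target never starts a retry on its own.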

Similar to retry, for load balancing the NIC adapter microcode and the iSCSI
driver of an initiator are quite capable of selecting a different port or
NIC adapter to send a SCSI request, as long as the adapter or the driver is
made aware of the multiple paths in the socket data structure which was
filled in at the time of making the connection.  To keep the design simple,
the target does not, should not, and must not take on the responsibility of
load balancing.
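
As one possible illustration of initiator-side load balancing, the sketch
below sends each new request on the path with the fewest outstanding
requests.  The "least outstanding" policy, the event format, and the port
names are assumptions made for the example; a real adapter or driver could
use any policy.

```python
def balance(events, paths):
    """Assign each sent request to the path with the fewest outstanding ones.

    events is a list of ("send", request) and ("done", request) tuples.
    Returns a dict mapping each request to the path it was sent on.
    """
    outstanding = {p: 0 for p in paths}
    placed = {}
    for kind, req in events:
        if kind == "send":
            # min() keeps the earliest path on ties, so order in `paths` matters.
            path = min(paths, key=lambda p: outstanding[p])
            outstanding[path] += 1
            placed[req] = path
        elif kind == "done":
            outstanding[placed[req]] -= 1
    return placed

placed = balance(
    [("send", "a"), ("send", "b"), ("done", "b"), ("send", "c")],
    ["port0", "port1"],
)
```

All of the bookkeeping lives on the initiator; the target only sees
requests arriving on whichever path was chosen.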

On striping the data transfers across multiple connections, I do believe we
are using four 2.5 gigabit MAC chips to get the 10 gigabit fibre channel,
Ethernet, and InfiniBand connections.  In fact, the 12x option of InfiniBand
stripes the data across 12 MAC chips to get a three gigabyte per second data
rate.  Striping data across multiple NIC adapters would be too difficult for
the poor adapter designers to do.

On using UDP instead of TCP for iSCSI, I am having trouble with TCP
because it is stream based.  The READ and WRITE system calls are extremely
inefficient for block-oriented SCSI data transfer.  On the other hand, the
UDP datagram, using the SEND and RECEIVE system calls, is better suited for
iSCSI.  In fact, I believe NFS is built on UDP.  The request and response
relationship between SCSI initiator and target makes the connectionless UDP
protocol possible.  FCP is implemented using the class 3 fibre channel
protocol which is designed for datagrams.
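
A minimal sketch of the request and response exchange over datagrams, run
on the loopback interface, is shown below.  The message contents, the
one-second timeout, and the function name are placeholders invented for
the example; they are not iSCSI message formats.

```python
import socket

def udp_exchange(command, timeout_s=1.0):
    """Send one request datagram and get one response datagram back."""
    target = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    target.bind(("127.0.0.1", 0))         # OS-chosen port (assumption)
    target.settimeout(timeout_s)
    initiator = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    initiator.settimeout(timeout_s)       # a lost datagram surfaces as a timeout
    try:
        # Each sendto/recvfrom moves one whole message; there is no byte
        # stream to re-frame as there would be with TCP.
        initiator.sendto(command, target.getsockname())
        request, peer = target.recvfrom(2048)
        target.sendto(b"STATUS GOOD", peer)
        response, _ = initiator.recvfrom(2048)
        return request, response
    finally:
        initiator.close()
        target.close()

request, response = udp_exchange(b"READ lba=0 blocks=1")
```

If either datagram were lost, recvfrom would raise a timeout on the
initiator, which is where the retry logic would take over.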

Finally, some comments on the resources used by the initiator and target.  A
SCSI initiator has self-regulating resource allocation.  When there are no
resources to start new processes to initiate new SCSI requests, the SCSI
requests cease.  For each SCSI request, the required resources are
pre-allocated waiting for responses from a target.  A SCSI target receives
requests from everyone on the net.  While it must have room to accept new
SCSI requests -- which can be done at login by specifying the queue depth --
it needs RTT (R2T) to control the buffer space for data transfers.  This
position has already been expressed by many storage controller people.  I
need not repeat the position here.
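
The division of labour between queue depth and R2T could be sketched as
follows.  This is a simulation under assumptions: the class, its method
names, and the "grant in arrival order when the buffer fits" policy are
invented for illustration and are not how any particular controller works.

```python
from collections import deque

class TargetBuffers:
    """Accept commands up to a queue depth; grant R2Ts only when buffers fit."""
    def __init__(self, queue_depth, buffer_blocks):
        self.queue_depth = queue_depth
        self.free = buffer_blocks
        self.pending = deque()    # (tag, blocks) accepted, waiting for an R2T
        self.active = {}          # tag -> blocks reserved for incoming data

    def accept_command(self, tag, blocks):
        if len(self.pending) + len(self.active) >= self.queue_depth:
            return False          # queue depth agreed at login is exhausted
        self.pending.append((tag, blocks))
        return True

    def issue_r2ts(self):
        """Grant R2Ts in arrival order while buffer space lasts."""
        granted = []
        while self.pending and self.pending[0][1] <= self.free:
            tag, blocks = self.pending.popleft()
            self.free -= blocks
            self.active[tag] = blocks
            granted.append(tag)
        return granted

    def data_received(self, tag):
        self.free += self.active.pop(tag)   # buffers return to the pool

t = TargetBuffers(queue_depth=3, buffer_blocks=8)
accepted = [t.accept_command(1, 6), t.accept_command(2, 4),
            t.accept_command(3, 2), t.accept_command(4, 1)]
first_grants = t.issue_r2ts()
t.data_received(1)
second_grants = t.issue_r2ts()
```

The fourth command is refused by the queue depth, and the data for
commands 2 and 3 is only requested once the buffers used by command 1 are
free again.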

Y.P. Cheng, CTO, ConnectCom Solutions Corp.