RE: iSCSI: Flow Control

To: Matt Wakeley <matt_wakeley@agilent.com>, IPS Reflector <ips@ece.cmu.edu>
Subject: RE: iSCSI: Flow Control
From: "GUPTA,SOMESH (HP-Cupertino,ex1)" <somesh_gupta@am.exch.hp.com>
Date: Mon, 16 Oct 2000 13:52:52 -0600
Content-Type: text/plain;charset="iso-8859-1"
Sender: owner-ips@ece.cmu.edu

Matt,

I will try to explain below.

Somesh

> -----Original Message-----
> From: Matt Wakeley [mailto:matt_wakeley@agilent.com]
> Sent: Saturday, October 14, 2000 7:18 PM
> To: IPS Reflector
> Subject: Re: iSCSI: Flow Control
> 
> 
> Somesh,
> 
> I still don't understand what you are trying to solve.
> 
> With the iSCSI session wide command credit method, there is a portion
> of the iSCSI layer that sits right below the SCSI layer.  It receives
> the commands from the SCSI layer and passes the results of each I/O
> from each NIC back to the SCSI layer. The MaxCmdRn indicates how many
> commands the target (as a whole) can "buffer". The iSCSI layer will
> "scatter" the commands to the NICs until it has used up the MaxCmdRn
> buffers. Each NIC, once iSCSI has posted a command to it, will attempt
> to send the command as long as the TCP window is open. Practically
> every message sent from the target to the initiator contains the new
> MaxCmdRn.  Each initiator NIC that receives a message passes this
> (new) value to the common iSCSI.  This value does NOT have to be sent
> to every other NIC, since once a command is posted to a NIC, it is
> committed to send it.

What you describe is a good model for the initiator side (even
though there could be some implementation optimizations). As the
iSCSI host driver (IHD) receives commands from the SCSI layer, it has
to check the following before it can post a command:

Check whether there is space in the host queues for each NIC (i.e.
the host memory designated for posting commands to a NIC, which may
be limited by the NIC or by host memory). There may be models where
there is no such limit. This is also the point at which (Mike's
comment) the scatter is done according to some algorithm; that step
is independent of the flow control model.

In the session-wide flow control model: The IHD has to perform the
additional check of whether MaxCmdRn is being exceeded.

In a connection-wide model: No such check has to be performed as the
NIC should be able to handle that on its own.

NOTE: Each of these checks has a cost in SMP servers when multiple
processors are involved: a lock must be taken, and the variable
moves from cache to cache.
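
A minimal sketch of the two posting paths, in Python for brevity. All
names here (Nic, SessionCredit, post_connection_wide, post_session_wide)
are hypothetical, and MaxCmdRn is simplified to a count of commands that
may be outstanding. It is only meant to show where the extra shared
check and lock sit, not how a real IHD would be written.

```python
import threading
from collections import deque


class Nic:
    """Hypothetical NIC with a bounded host-side posting queue."""

    def __init__(self, depth):
        self.depth = depth
        self.queue = deque()

    def has_space(self):
        return len(self.queue) < self.depth


class SessionCredit:
    """Session-wide credit shared by all connections (simplified MaxCmdRn)."""

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding
        self.outstanding = 0
        self.lock = threading.Lock()  # contended by every CPU that posts

    def try_take(self):
        with self.lock:
            if self.outstanding >= self.max_outstanding:
                return False
            self.outstanding += 1
            return True


def post_connection_wide(nic, cmd):
    """Queue-space check only; the NIC enforces its own connection credit."""
    if not nic.has_space():
        return False
    nic.queue.append(cmd)
    return True


def post_session_wide(nic, cmd, credit):
    """Queue-space check plus the shared check against the session credit."""
    if not nic.has_space():
        return False
    if not credit.try_take():
        return False
    nic.queue.append(cmd)
    return True


nics = [Nic(depth=4), Nic(depth=4)]
credit = SessionCredit(max_outstanding=3)
results = [post_session_wide(nics[i % 2], f"cmd{i}", credit) for i in range(5)]
print(results)  # [True, True, True, False, False]
```

The last two commands are refused even though both NIC queues have
space, so the driver has to park them in a host queue and post them
later.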

--
Now in cases where the command cannot be posted to the NIC queue,
it must be left in another queue in the host, which is then processed
when the blocking condition clears. The condition clears when a
command status is received and the host is interrupted. (An RTT could
also clear it, but that is useless if the model assumes interrupting
the host - you really don't want to interrupt the host on RTT.)

In a connection-wide model: The interrupt processing routine checks
the NIC's command posting queue (or equivalent status) and, if it had
been full, knows to check the common queue for more commands. If not,
then it knows there is nothing to do for command posting.

In a session-wide model: Update the global location of MaxCmdRn
(take a lock, release the lock, and thrash the cache if multiple CPUs
are active). Then always check whether there are commands
waiting to be posted (again by checking variables and locks etc).
If yes, then post those commands - repeating the algorithm that
was used when the upper layer posted a command.

NOTE: If we feel that the SCSI layer will generate commands faster
than the session-wide credit allows, then the session-wide credit
will cause extra processing. It is much more straightforward to
post from the top half than to try to post from the top half and
then actually post from the bottom half. If there is a significant
credit issue, the outbound command queues will be starved at times.
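
A rough sketch of the two status-interrupt paths, in Python. The names
(Nic, Session, on_status_connection_wide, on_status_session_wide) are
hypothetical, MaxCmdRn is again simplified to a count of outstanding
commands, and reposting goes to the same NIC instead of repeating the
full scatter algorithm. The point is only that the session-wide path
takes the shared lock on every completion.

```python
import threading
from collections import deque


class Nic:
    """Hypothetical NIC with a bounded host-side posting queue."""

    def __init__(self, depth, queued=()):
        self.depth = depth
        self.queue = deque(queued)


class Session:
    """Shared session state; max_outstanding stands in for MaxCmdRn."""

    def __init__(self, max_outstanding, outstanding):
        self.lock = threading.Lock()
        self.max_outstanding = max_outstanding
        self.outstanding = outstanding


def on_status_connection_wide(nic, pending):
    """Touch the common queue only if this NIC's queue had been full."""
    was_full = len(nic.queue) >= nic.depth
    nic.queue.popleft()  # the completed command frees one slot
    if not was_full:
        return 0  # nothing was held back because of this NIC
    moved = 0
    while pending and len(nic.queue) < nic.depth:
        nic.queue.append(pending.popleft())
        moved += 1
    return moved


def on_status_session_wide(nic, pending, session, new_max):
    """Always update the shared credit and always check for waiters."""
    nic.queue.popleft()
    moved = 0
    with session.lock:  # taken on every completion, on every connection
        session.max_outstanding = new_max
        session.outstanding -= 1
        while (pending and len(nic.queue) < nic.depth
               and session.outstanding < session.max_outstanding):
            nic.queue.append(pending.popleft())
            session.outstanding += 1
            moved += 1
    return moved


idle = on_status_connection_wide(Nic(2, ["a"]), deque(["c"]))
refill = on_status_connection_wide(Nic(2, ["a", "b"]), deque(["c"]))
shared = on_status_session_wide(
    Nic(4, ["a", "b"]), deque(["c", "d"]), Session(2, 2), new_max=3)
print(idle, refill, shared)  # 0 1 2
```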


> 
> Each Target NIC will have a pool of buffers to receive asynchronous
> (non DATA) iSCSI messages.  As each (small) command message is
> received, it is placed into one of these buffers, processed by common
> iSCSI and the CDB is passed to the SCSI layer which stores it into its
> command buffer. The message buffer is then given back to the NIC for
> further messages.

The question is how much credit you are going to hand out to the
remote side. If there are N buffers posted per card and M cards, will
you make a credit of N available (underutilization) or N * M (which
assumes that the sender will send evenly and is risky if there
is sudden congestion on one or more connections)? Also, the
same discussion of the system cost of calculating and using a
centralized value of MaxCmdRn applies if arrays have multiple
processors.
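
To put numbers on that trade-off, here is a small Python sketch. The
values N = 8 and M = 4 and the arrival patterns are made up for
illustration, the function name overflow is hypothetical, and buffer
recycling is ignored, so it only shows the worst case of commands
arriving before any buffer is reposted.

```python
def overflow(buffers_per_nic, arrivals_per_nic):
    """Commands that find no free buffer on each NIC (no recycling)."""
    return [max(0, a - buffers_per_nic) for a in arrivals_per_nic]


N, M = 8, 4            # assumed: 8 buffers per card, 4 cards
conservative = N       # safe on any card, but uses at most N of N*M buffers
aggressive = N * M     # uses every buffer only if arrivals are spread evenly

even = [aggressive // M] * M   # 32 commands spread evenly over the cards
skewed = [20, 4, 4, 4]         # same 32 commands, one connection favoured

print(overflow(N, even))          # [0, 0, 0, 0]
print(overflow(N, skewed))        # [12, 0, 0, 0]
print(conservative / (N * M))     # 0.25
```

With a credit of N only a quarter of the posted buffers can ever be in
use; with N * M the same total load overflows one card as soon as the
sender stops spreading commands evenly.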

> 
> "GUPTA,SOMESH (HP-Cupertino,ex1)" wrote:
> 
> > Yes I am trying to describe the synchronization pts and software
> > intervention caused by a session wide flow control model
> 
> But I still don't understand the "problem" that the credit per
> connection solves over the credit per session model.
> 
> In your description, the initiator still "scatters" the commands to
> the NICs, then the NICs have the burden of trying to figure out if
> they can send the command or not.  Furthermore, if some NICs have open
> TCP windows, but don't have command credit, the command can't be sent.

Look at it as an opportunity to differentiate and streamline
performance rather than as a burden. It would definitely be a feature
for multi-port NICs where all the ports used for a session
are on the same NIC. It saves host CPU cycles, thereby improving
the attractiveness of the solution :-)

> 
> In the iSCSI session wide credit model, the initiator will not post
> commands to any NIC if it doesn't have credit.  Any commands posted to
> a NIC will be sent as long as its TCP window is open.
> 
> > 1. Post a large enough number at each NIC. OK. The window opens up
> > (indicated through a new MaxCmdRn received on one connection). This
> > value now must be communicated to the other connections, so that
> > they too are no longer flow controlled. Or the new value must be
> > received on each connection.
> 
> As I indicated above, the goal is to not overflow the SCSI command
> buffer, so the command is not discarded causing a lot of error
> recovery.  A command CDB is only 16 bytes.  It does not make sense to
> allocate 16 byte buffers to NICs for command reception. As I indicated
> above, the NIC receives the message, the iSCSI layer strips out the
> CDB and hands it to SCSI, then reposts the message buffer to the NIC.
> 
> > Also since you have posted a large enough number at each NIC,
> > you are really not having any benefit at all from the session-wide
> > value - what is the advantage?
> 
> Having a session wide MaxCmdRn allows the initiator to stop sending
> SCSI commands, while still enabling non command messages to be sent.
> They are received by each NIC and passed to iSCSI for processing, but
> since they are not passed up to SCSI, nothing is overflowed.

Again, there is no benefit over what a connection-wide flow control
would provide. So that is a tie.

In terms of being flow controlled by the TCP window, the ability to
scatter commands across the connections appropriately, not overflowing,
or letting data/status packets continue flowing, there is no difference.
> 
> 
> > 2. Have the NICs grab them from a pool through an atomic bus
> > transaction. That has got to be tougher to implement than it
> > looks, and the bus performance issues due to the need to maintain
> > ordering etc?
> 
> As indicated above, each NIC passes the iSCSI messages to a central
> iSCSI message processor that sends the appropriate SCSI messages to
> SCSI.
> 
> -Matt
> 
>