Proposed Connection Recovery Additions for Draft 03

To: ips@ece.cmu.edu
Subject: Proposed Connection Recovery Additions for Draft 03
From: Mark Bakke <mark.bakke@nuspeed.com>
Date: Thu, 29 Jun 2000 08:10:36 -0500
Sender: owner-ips@ece.cmu.edu

Draft 03 seems to be pretty close as far as dealing with
connection recovery.  Here are a few additions that we
(NuSpeed) think will help complete the picture, along with
what we believe are some of the requirements.

Other than hopefully clarifying initiator and target behavior,
this scheme adds one field to the Command Request, and one event
type to the Asynchronous Event message.

I've attempted to include some of our reasoning behind this.

Assumptions

- iSCSI is only a transport for SCSI.  Its recovery scheme
  does not attempt to retry failed commands.  However, it
  is a reliable transport for SCSI, and must deliver commands
  within a session in order to the target.
  
- TCP handles any losses from a given connection.  The target
  end of the byte stream is either valid, or the connection is
  lost.  Within a connection, there is no such thing as losing
  a packet.  We will, however, need to deal with stronger error
  checking over the SCSI data, but that's (mostly) orthogonal
  to connection recovery.

- This scheme will work with either single or multiple
  connections per session.  It is up to the implementation
  of the initiator and target whether either one supports
  multiple connections.  The initiator can simply not use
  more than one connection per session; the target can deny
  the login if the client requests more than one connection
  per session.

- It is also up to the initiator to determine whether to
  multiplex access to targets and LUNs over a single session,
  or to use a session for each, or some combination.  If the
  initiator chooses to use multiple sessions to the same
  device, it must be prepared to deal with multipath command
  ordering issues itself.

- We can be fairly optimistic about the longevity of TCP
  connections; if the network is so slow, overloaded, or
  poorly designed as to lose connections regularly, it is
  not likely a good candidate for storage access.  Connection
  recovery should still handle these cases, especially if
  the problems are transient, but need not be optimized for
  these cases.  If a connection fails, it should be acceptable
  to re-send write data with the re-sent command, and to re-send
  read data with the re-sent status.  There may, however, be
  simple optimizations to avoid this, too, especially when
  transporting larger blocks, such as tape reads and writes.

- Some commands, such as FORMAT UNIT and REWIND, may take several
  minutes or more to complete.  Thousands of operations may complete
  before status is returned for these commands.  (Are 16-bit
  reference numbers enough?)

- If multiple connections are used, they are symmetrical (no special
  control or data connections).  Command-Data-Status connection
  allegiance is also assumed, and CmdRN and StatRN are used to
  ensure in-order delivery.

- CmdRN and StatRN are implemented as in Draft 03, as 16-bit,
  per-session incrementing counters.
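
On the 16-bit question above, a quick back-of-the-envelope check may
help.  The sketch below (Python; the command rates are purely
illustrative assumptions, not measurements) computes how long a 16-bit
per-session counter lasts before it wraps:

```python
def seconds_until_wrap(commands_per_second, bits=16):
    """Time for an incrementing counter of the given width to wrap once."""
    return (1 << bits) / commands_per_second

# Hypothetical session-wide command rates (assumptions for illustration).
for rate in (100, 1000, 10000):
    print(rate, "cmds/s ->", round(seconds_until_wrap(rate), 1), "s until wrap")
```

At an assumed 1000 commands per second the counter wraps in roughly
65 seconds, which is shorter than a several-minute FORMAT UNIT; that
is why the question seems worth asking.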


Requirements for Connection Recovery

- Protocol fields associated with the connection recovery
  scheme will work with either a single connection per
  session, or multiple connections per session.

- iSCSI must preserve ordered delivery within a session.

- The transport may re-send commands, data, and status at
  any time, but must not attempt to re-try the actual command
  at the target without involving the upper (SCSI) layer for
  recovery.  This means that, as in section 4.1 of draft 03,
  the client should keep sufficient information handy to re-send
  commands and data until status is received.

- We can generally make an exception to the above for commands
  issued to a block device (disk); reads and writes are idempotent,
  as long as the commands are re-issued in the original order.

- Commands must be issued at the target end of an iSCSI session
  in-order, but status may, of course, be returned from the iSCSI
  target to the initiator in any order.

- Either the initiator or target may decide to terminate a
  connection.  It is the responsibility of the initiator to
  reconnect if it so chooses.

- A connection must be recoverable quickly.  In the worst case, the
  whole cycle (the connection fails, the failure is detected, the
  connection is restarted, commands are reissued, and status comes
  back, except for high-latency commands) must complete within a
  portion of the normal SCSI timeout window (30 seconds).
  The actual time for this depends on the network, the commands
  issued, etc.  At any rate, connection recovery must be as 
  transparent as possible to the end user or application.

- Connection recovery should work for target reboot or failover.

- Basically, we have to handle the following steps for each
  connection:

  1. Detection - deciding when a connection is down, or should be.
  2. Disconnection - terminating a connection.
  3. Reconnection - re-connecting to the target.
  4. Resend - re-sending commands to the target.

  Besides these procedures, normal mechanisms such as reference
  numbers and response caching will be in place to support these
  procedures when they are needed.

Support Mechanisms

  The initiator and target must keep some state around in order to
  support connection recovery and resending of commands, data, and
  status that may have been lost.  Their responsibilities are
  outlined in section 4.1 of Draft 03.

  Basically, an Initiator must:
  
  - Increment CmdRN for each new command request sent.

  - Keep information required to rebuild and resend each command
    with its data until the matching command response is received
    from the target.

  - Acknowledge command responses soon after they are received
    from the target.

  A Target must:

  - Increment StatRN for each new status response sent.

  - Keep a cache of responses (status & sense data) until the
    StatRN is acknowledged by the initiator.

  - For non-disk devices, keep data response (read data) along
    with the cached command response (although this might be
    difficult with large-block devices).

  Reclaiming Cached Responses - section 4.1 already mentioned most
  of the above; however, there was no mechanism for notifying the
  target that its cached responses were no longer needed.  In this
  scheme, an AckStatRN is sent from the initiator to the target,
  as the highest (honoring wrap) consecutive value received for
  StatRN in a response on any connection in the session.  All
  cached responses up to and including this StatRN value may be
  safely de-allocated.
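
  As a rough illustration of the cumulative ack described above, here
  is a minimal Python sketch.  The class and function names are
  hypothetical, and the half-window comparison used to "honor wrap" is
  an assumption on my part (it is ordinary serial-number arithmetic);
  the post does not say how wrap should be handled.

```python
MOD = 1 << 16  # StatRN is a 16-bit counter in Draft 03

def seq_leq(a, b):
    """True if a is at or before b in 16-bit serial order (honoring wrap)."""
    return ((b - a) % MOD) < MOD // 2

class InitiatorAckTracker:
    """Tracks the highest consecutive StatRN seen on any connection."""

    def __init__(self, first_expected=0):
        self.next_expected = first_expected % MOD
        self.pending = set()   # StatRNs received ahead of a gap
        self.highest = None    # value to send as AckStatRN, if any

    def on_response(self, stat_rn):
        stat_rn %= MOD
        # Ignore duplicates of responses that were already counted.
        if seq_leq(stat_rn, (self.next_expected - 1) % MOD):
            return
        self.pending.add(stat_rn)
        # Advance over every consecutive value now available.
        while self.next_expected in self.pending:
            self.pending.remove(self.next_expected)
            self.highest = self.next_expected
            self.next_expected = (self.next_expected + 1) % MOD

class TargetResponseCache:
    """Holds cached responses until a cumulative AckStatRN covers them."""

    def __init__(self):
        self.cached = {}  # StatRN -> cached status and sense data

    def store(self, stat_rn, response):
        self.cached[stat_rn % MOD] = response

    def on_ack(self, ack_stat_rn):
        # Everything up to and including ack_stat_rn can be released.
        for rn in [rn for rn in self.cached if seq_leq(rn, ack_stat_rn)]:
            del self.cached[rn]
```

  With responses arriving as 65535, 65534, 1, 0 on a tracker that
  expects 65534 first, the ack value moves to 65535 after the second
  response and to 1 after the fourth; the gap at 0 holds it back in
  between.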

Detecting Connection Failure

  During an initiator, target, or intervening network outage, whether
  temporary or permanent, TCP connections will normally be retried
  for much longer than most SCSI drivers can handle.  In many cases,
  new connections can be made and started long before the old
  connection times out.  For this reason, we have to detect connections
  that have gone away.  Both the initiator and the target may detect
  these conditions, and should detect them in a timely manner (let's
  say 5 seconds for now, but we need to think about this).

  From the initiator's point of view, the connection can fail for
  several reasons (temporary or permanent):

  - Target powered down or removed from network
  - Target reboot or failover
  - Lost network route
  - Backed-off (slow) TCP connection
  - Unexpected message fields received (software error on target)?

  If no responses are being received from the target, and there are
  outstanding commands, the initiator will periodically send a ping
  request, and expect a ping response within a small amount of time.
  If no ping response is received, the connection is considered
  to have failed.  This is mentioned in section 4.1 as well.

  From the target's point of view, the connection can fail for
  several reasons (temporary or permanent):

  - Initiator powered down or removed from network
  - Initiator reboot or failover
  - Lost network route
  - Backed-off (slow) TCP connection
  - Unexpected message fields received (software error on initiator)?

  Since the target does not send requests, it could do one of two
  things:

  1. During the login phase, negotiate a maximum inactivity time 
     for the incoming target connection.  If this time will be
     exceeded, the client promises to send an iSCSI ping request
     on the connection to keep it alive.  If the inactivity timer
     expires on the target, the connection is assumed to have failed.

  2. Add an asynchronous event requesting that the initiator ping
     the target.  Send this when approaching the target's
     maximum inactivity time; if the timer expires anyway, the
     connection is assumed to have failed.

  In any case, the target must detect connection failure to avoid
  having connections from powered-down clients hang around for
  long periods of time.
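
  Either option for the target boils down to an inactivity timer.  A
  minimal Python sketch follows; the class name, the state strings,
  and the warn_margin parameter are all hypothetical, and time is
  passed in explicitly so the logic can be checked without a real
  clock.  Setting warn_margin to zero gives option 1; a non-zero
  margin gives the "please ping me" event of option 2.

```python
class InactivityMonitor:
    """Target-side idle timer for one connection (illustrative only)."""

    def __init__(self, max_idle, warn_margin, now):
        self.max_idle = max_idle        # negotiated maximum inactivity time
        self.warn_margin = warn_margin  # how early to ask for a ping
        self.last_activity = now

    def on_traffic(self, now):
        # Any request or ping from the initiator counts as activity.
        self.last_activity = now

    def state(self, now):
        idle = now - self.last_activity
        if idle >= self.max_idle:
            return "failed"        # timer expired: treat connection as lost
        if idle >= self.max_idle - self.warn_margin:
            return "request_ping"  # option 2: send the async event now
        return "ok"
```

  For example, with max_idle=5.0 and warn_margin=1.0, a connection
  idle for 4 seconds would prompt the async event, and one idle for
  5 seconds would be considered failed.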


Disconnecting

  When a connection fails, the initiator, target, or both will
  close it.  The initiator can generally not wait around for the
  close to complete before starting a new connection; the target
  will need to accept a new (recovered) connection from an
  initiator, even if the target has not realized the original
  connection's failure.  These are implementation issues.

  The initiator may disconnect for reasons other than failure:

  - Normal host shutdown (reboot or power off)
  - Application (and disk) failover to another host (e.g. using
    HP, Veritas, or other application failover software).

  The target may also disconnect for reasons other than failure:
  
  - If the target is to be rebooted or failed over to another
    physical unit, it may wish to gracefully shut down the connection
    before restarting another.

  To make target reboot or failover more graceful, a target should
  attempt to send an asynchronous event, "connection shutdown", to
  the initiator on each connection.  This new event contains two
  values:

  - MaxUpTime - the number of seconds (can be zero) before this
    connection is expected to cease functioning.  The initiator
    should not attempt to issue more commands than can be expected
    to complete and receive status within this amount of time.
    The target will wait this amount of time before it shuts
    its connections down.

  - MinHoldTime - the number of seconds (can also be zero) after
    MaxUpTime before this entity will be available for re-connection.
    After this, the initiator has a good chance of reconnecting
    to the target.  This should be set to the amount of time the
    server is expected to take to fail over, reboot, etc.  We should
    probably define a value (-1?) for "never".  Note that an
    initiator could just reconnect right away.  However, it might
    then either connect to the running server just before it
    reboots, or lose several SYN segments while waiting for the
    server, in which case exponential backoff would make the
    eventual connection take longer.
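
  To make the two values concrete, here is a small Python sketch of
  how an initiator might act on the event.  The function names and the
  dictionary layout are hypothetical, and NEVER = -1 is only the
  tentative sentinel floated above, not a defined value.

```python
NEVER = -1  # tentative "never" sentinel; an assumption, not defined anywhere

def plan_after_shutdown_event(now, max_up_time, min_hold_time):
    """Turn MaxUpTime/MinHoldTime (seconds) into absolute deadlines."""
    stop_issuing_at = now + max_up_time
    if min_hold_time == NEVER:
        reconnect_at = None  # target does not expect to come back
    else:
        reconnect_at = stop_issuing_at + min_hold_time
    return {"stop_issuing_at": stop_issuing_at, "reconnect_at": reconnect_at}

def may_issue(now, expected_completion_secs, plan):
    """Only issue commands expected to return status before shutdown."""
    return now + expected_completion_secs <= plan["stop_issuing_at"]
```

  An event received at t=100 with MaxUpTime=10 and MinHoldTime=30
  would mean no new work that cannot finish by t=110, and a first
  reconnection attempt at t=140.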

Reconnection

  The initiator always handles reconnection.  During the new
  connection's login phase, the initiator specifies that it is
  replacing a failed connection by including the non-zero CID
  of the old connection in the RecoverCID field.

  If a target supports stateful recovery (meaning it still has
  the cached responses for the session), it accepts the login.

  If the target does not support stateful recovery, or the
  target has rebooted and lost its state, or the target has
  dropped the cached responses due to an excessive amount of
  time passing (perhaps 60 seconds), it rejects the login with
  a "reject recovery" status.  The initiator then performs
  a new login, and does stateless recovery.
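
  The login decision above could be sketched roughly as follows, in
  Python.  Everything here is illustrative: the function names, the
  "accept" string, the session dictionary, and the use of RecoverCID
  zero to mean "not a recovery login" are assumptions; the 60-second
  limit is just the "perhaps 60 seconds" figure from the text.

```python
def target_login(recover_cid, session, now,
                 supports_stateful=True, max_cache_age=60.0):
    """Decide how a target answers a login (illustrative only)."""
    if recover_cid == 0:
        return "accept"  # ordinary login, nothing to recover
    if not supports_stateful or session is None:
        return "reject recovery"  # no state kept, or lost in a reboot
    if now - session["last_activity"] > max_cache_age:
        return "reject recovery"  # cached responses already dropped
    return "accept"

def initiator_reconnect(login, old_cid):
    """Try stateful recovery first; fall back to a fresh login."""
    if login(old_cid) == "accept":
        return "stateful"
    login(0)  # new login without RecoverCID
    return "stateless"
```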

1. Stateful Recovery

  In a stateful recovery, the initiator resends all commands for
  which it has not received status.  If a command has already
  completed, the cached response is returned.  If a command has
  already been issued and is in progress, it is not re-issued;
  the resent request is simply queued to wait for status.  If
  a command had not been received by the target (or was
  incompletely received and thrown away), it is issued as normal.
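
  The three cases above map onto a simple lookup at the target.  A
  minimal Python sketch, assuming the target keys its state by CmdRN
  (the function name, the tuple return values, and the issue callback
  are hypothetical):

```python
def handle_resent_command(cmd_rn, completed, in_progress, issue):
    """Target-side handling of one resent command during stateful recovery.

    completed:   dict of CmdRN -> cached response (status and sense data)
    in_progress: set of CmdRNs already issued to the device
    issue:       callback that actually issues a command to the device
    """
    if cmd_rn in completed:
        # Already done: return the cached response, do not re-execute.
        return ("cached_response", completed[cmd_rn])
    if cmd_rn in in_progress:
        # Already running: just wait for its status.
        return ("wait_for_status", None)
    # Never (or only partially) received before: issue as normal.
    issue(cmd_rn)
    in_progress.add(cmd_rn)
    return ("issued", None)
```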

2. Stateless Recovery

  By default, stateless recovery means that all outstanding
  commands are terminated (to the SCSI layer) (check condition?);
  higher layers must perform recovery.


Non-Recovery

  Let's face it; there are times when things just can't be recovered
  at this level.  However, there are many higher-level entities that
  may recover for us:

  - Tape backup software (reload into a different tape drive)
  - Volume managers (break mirrors)
  - Multipath SCSI drivers (find alternate path or controller)
  - Host application clusters (move app to host with connectivity)

  This should be handled as specified in section 4.3.


Optimizations:


1. If RTT is in use, and a write request is re-sent to a target, and
   the target has already written the data, the target could send the
   Command Response back instead of the RTT.  The initiator would just
   accept this as the final status, and would not have to send the
   write data again.


iSCSI Draft 03 Message Modifications:

1. Remove StatRN from the Data Response, or make it equal the StatRN
   for the matching Command Response.  There should be no need for it
   to increment separately from the Command Response, since this scheme
   assumes that if the response was not received, the data will be
   re-sent anyway.  The current draft does not specify how StatRN
   is used in a Data Response.

2. Add an AckStatRN field to Command Request, to acknowledge the
   highest (honoring wrap) consecutive StatRN received for the
   session.

3. Add a new event (Event Indicator 5) specifying that the connection
   will be closed by the target.  This event sends two parameters
   (using some of the reserved fields):

   - MaxUpTime - the number of seconds the target intends to keep
     the connection alive.
   - MinHoldTime - the number of seconds the initiator should wait
     before establishing a new connection.



Alternative Implementations

1. We considered a separate message to send the AckStatRN, but
   since this is generally done for every command, it seemed simpler
   to just piggyback it on the next command request.

2. CmdRN and StatRN are assumed to be per-session.  If they were
   made per-LUN for any reason, the initiator and target would
   simply have to demux requests and responses based on LUN + RN.


A Few Alternatives that Didn't Quite Work

   Here are some alternatives we went through, and why we did not 
   choose them:

1. We considered just tossing out StatRN, and acking the CmdRN
   (or the ITT) matching the status last received instead.  However,
   the target needed to keep track of status in the order in which
   it was sent (and NOT in the order in which the original command
   was received) to avoid trouble with commands which incur a long
   response delay (REWIND et al).  Keeping StatRN and acking it
   is much simpler, and makes it easier to preserve response
   ordering if multiple connections are used (and if response
   ordering is required).

2. Target Retries - if one just assumes disk, the target could avoid
   caching responses, and the initiator could avoid acking them; the
   target could just retry any requests, in order, sent over the
   new connection.  However, this excludes most of the SCSI Peripheral
   Device Types, and will likely not work in every case even for
   disks.  In these cases, ALL connection recovery would be pushed
   up to SCSI to handle.  By caching status, we remain a truer
   transport, and will work better for these devices (especially tape).

3. Selective StatRN acks - one could individually acknowledge each
   response, to free its resources on the target.  However, if a
   selective ack is lost during a connection recovery, its resources
   would then hang around forever on the target (unless, of course,
   we wanted to ack the ack).  Cumulative acks may be lost at the
   end of a connection; the next command sent will just re-ack
   everything anyway.


-- 
Mark A. Bakke
NuSpeed, Inc.
mark.bakke@nuspeed.com
763.398.1054