Proposed Connection Recovery Additions for Draft 03
Draft 03 seems to be pretty close as far as dealing with
connection recovery. Here are a few additions that we
(NuSpeed) think will help complete the picture, along with
what we believe are some of the requirements.
Other than hopefully clarifying initiator and target behavior,
this scheme adds one field to the Command Request, and one event
type to the Asynchronous Event message.
I've attempted to include some of our reasoning behind this.
Assumptions
- iSCSI is only a transport for SCSI. Its recovery scheme
does not attempt to retry failed commands. However, it
is a reliable transport for SCSI, and must deliver commands
within a session in order to the target.
- TCP handles any losses from a given connection. The target
end of the byte stream is either valid, or the connection is
lost. Within a connection, there is no such thing as losing
a packet. We will, however, need to deal with stronger error
checking over the SCSI data, but that's (mostly) orthogonal
to connection recovery.
- This scheme will work with either single or multiple
connections per session. It is up to the implementation
of the initiator and target whether either one supports
multiple connections. The initiator can simply not use
more than one connection per session; the target can deny
the login if the client requests more than one connection
per session.
- It is also up to the initiator to determine whether to
multiplex access to targets and LUNs over a single session,
or to use a session for each, or some combination. If the
initiator chooses to use multiple sessions to the same
device, it must be prepared to deal with multipath command
ordering issues itself.
- We can be fairly optimistic about the longevity of TCP
connections; if the network is so slow, overloaded, or
poorly designed as to lose connections regularly, it is
not likely a good candidate for storage access. Connection
recovery should still handle these cases, especially if
the problems are transient, but need not be optimized for
these cases. If a connection fails, it should be acceptable
to re-send write data with the re-sent command, and to re-send
read data with the re-sent status. There may, however, be
simple optimizations to avoid this, too, especially when
transporting larger blocks, such as tape reads and writes.
- Some commands, such as FORMAT UNIT and REWIND, may take several
minutes or more to complete. Thousands of operations may complete
before status is returned for these commands. (Are 16-bit
reference numbers enough?)
- If multiple connections are used, they are symmetrical (no special
control or data connections). Command-Data-Status connection
allegiance is also assumed, and CmdRN and StatRN are used to
ensure in-order delivery.
- CmdRN and StatRN are implemented as in Draft 03, as 16-bit,
per-session incrementing counters.
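
As a rough illustration of what 16-bit incrementing counters imply for
comparisons, here is a minimal Python sketch of wrap-aware arithmetic on
reference numbers. The function names are hypothetical, and Draft 03 does
not prescribe this method; the sketch assumes fewer than 2**15 numbers are
ever outstanding at once, which bears on whether 16 bits are enough.

```python
RN_MODULUS = 1 << 16  # CmdRN and StatRN are 16-bit counters


def rn_next(rn: int) -> int:
    """Return the reference number that follows rn, wrapping at 2**16."""
    return (rn + 1) % RN_MODULUS


def rn_distance(older: int, newer: int) -> int:
    """Number of increments needed to get from older to newer, modulo 2**16."""
    return (newer - older) % RN_MODULUS


def rn_is_after(a: int, b: int) -> bool:
    """True if a was issued after b, assuming fewer than 2**15 numbers separate them."""
    return 0 < rn_distance(b, a) < RN_MODULUS // 2
```

With this convention, rn_is_after(2, 65534) holds even though 2 is
numerically smaller, because the counter has wrapped in between.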
Requirements for Connection Recovery
- Protocol fields associated with the connection recovery
scheme will work with either a single connection per
session, or multiple connections per session.
- iSCSI must preserve ordered delivery within a session.
- The transport may re-send commands, data, and status at
any time, but must not attempt to re-try the actual command
at the target without involving the upper (SCSI) layer for
recovery. This means that, as in section 4.1 of draft 03,
the client should keep sufficient information handy to re-send
commands and data until status is received.
- We can generally make an exception to the above for commands
issued to a block device (disk); reads and writes are idempotent,
as long as the commands are re-issued in the original order.
- Commands must be issued at the target end of an iSCSI session
in-order, but status may, of course, be returned from the iSCSI
target to the initiator in any order.
- Either the initiator or target may decide to terminate a
connection. It is the responsibility of the initiator to
reconnect if it so chooses.
- A connection must be recoverable quickly. The whole sequence
(the connection fails, is detected as failed, is restarted, has
its commands reissued, and gets status back, except on
high-latency commands) must fit within a portion of the normal
SCSI timeout window (30 seconds).
The actual time for this depends on the network, the commands
issued, etc. At any rate, connection recovery must be as
transparent as possible to the end user or application.
- Connection recovery should work for target reboot or failover.
- Basically, we have to handle the following steps for each
connection:
1. Detection - deciding when a connection is down, or should be.
2. Disconnection - terminating a connection.
3. Reconnection - re-connecting to the target.
4. Resend - re-sending commands to the target.
Besides these procedures, normal mechanisms such as reference
numbers and response caching will be in place to support these
procedures when they are needed.
Support Mechanisms
The initiator and target must keep some state around in order to
support connection recovery and resending of commands, data, and
status that may have been lost. Their responsibilities are
outlined in section 4.1 of Draft 03.
Basically, an Initiator must:
- Increment CmdRN for each new command request sent.
- Keep information required to rebuild and resend each command
with its data until the matching command response is received
from the target.
- Acknowledge command responses soon after they are received
from the target.
A Target must:
- Increment StatRN for each new status response sent.
- Keep a cache of responses (status & sense data) until the
StatRN is acknowledged by the initiator.
- For non-disk devices, keep data response (read data) along
with the cached command response (although this might be
difficult with large-block devices).
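
The initiator duties listed above could be sketched roughly as follows.
This is only an illustration: the class and method names are made up,
commands are keyed by ITT as an assumption, and the ITT is left as an
unbounded integer for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

RN_MODULUS = 1 << 16  # CmdRN is a 16-bit per-session counter


@dataclass
class PendingCommand:
    cmd_rn: int              # CmdRN assigned when the command was first sent
    cdb: bytes               # SCSI command descriptor block
    write_data: bytes = b""  # kept so the data can be re-sent with the command


@dataclass
class InitiatorSession:
    next_cmd_rn: int = 0
    next_itt: int = 0
    # ITT -> command; dicts keep insertion order, so this is also issue order.
    pending: Dict[int, PendingCommand] = field(default_factory=dict)

    def send_command(self, cdb: bytes, write_data: bytes = b"") -> int:
        """Assign a CmdRN and an ITT, and remember the command until status arrives."""
        itt = self.next_itt
        self.next_itt += 1
        self.pending[itt] = PendingCommand(self.next_cmd_rn, cdb, write_data)
        self.next_cmd_rn = (self.next_cmd_rn + 1) % RN_MODULUS
        return itt

    def on_response(self, itt: int) -> Optional[PendingCommand]:
        """Status received: the command no longer needs to be kept."""
        return self.pending.pop(itt, None)

    def commands_to_resend(self) -> List[PendingCommand]:
        """Everything still awaiting status, in original issue order."""
        return list(self.pending.values())
```

After a connection failure, commands_to_resend() yields exactly the
commands for which no status has been seen, in the order they were
first issued.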
Reclaiming Cached Responses - section 4.1 already mentioned most
of the above; however, there was no mechanism for notifying the
target that its cached responses were no longer needed. In this
scheme, an AckStatRN is sent from the initiator to the target,
as the highest (honoring wrap) consecutive value received for
StatRN in a response on any connection in the session. All
cached responses up to and including this StatRN value may be
safely de-allocated.
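
To make "highest (honoring wrap) consecutive value" concrete, here is a
minimal sketch of both ends. The names AckTracker and ResponseCache are
hypothetical, and the sketch assumes fewer than 2**15 responses are ever
unacknowledged at once.

```python
RN_MODULUS = 1 << 16  # StatRN is a 16-bit per-session counter


class AckTracker:
    """Initiator side: computes AckStatRN from StatRNs seen on any connection."""

    def __init__(self, first_stat_rn: int = 0):
        self.expected = first_stat_rn  # next StatRN needed to extend the run
        self.held = set()              # StatRNs received ahead of the run
        self.started = False

    def receive(self, stat_rn: int) -> None:
        behind = (self.expected - stat_rn) % RN_MODULUS
        if 0 < behind <= RN_MODULUS // 2:
            return  # duplicate of a response that is already covered
        self.held.add(stat_rn)
        while self.expected in self.held:
            self.held.remove(self.expected)
            self.expected = (self.expected + 1) % RN_MODULUS
            self.started = True

    def ack_stat_rn(self):
        """Value to place in AckStatRN, or None if nothing can be acked yet."""
        return (self.expected - 1) % RN_MODULUS if self.started else None


class ResponseCache:
    """Target side: cached responses, released by a cumulative AckStatRN."""

    def __init__(self):
        self.cached = {}  # StatRN -> response bytes

    def store(self, stat_rn: int, response: bytes) -> None:
        self.cached[stat_rn] = response

    def reclaim(self, ack_stat_rn: int) -> None:
        """Drop every cached response at or before ack_stat_rn."""
        for rn in list(self.cached):
            if (ack_stat_rn - rn) % RN_MODULUS < RN_MODULUS // 2:
                del self.cached[rn]
```

Because the ack is cumulative, a lost AckStatRN costs nothing: the next
command request carries a value that covers it.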
Detecting Connection Failure
During an initiator, target, or intervening network outage, whether
temporary or permanent, TCP connections will normally be retried
for much longer than most SCSI drivers can handle. In many cases,
new connections can be made and started long before the old
connection times out. For this reason, we have to detect connections
that have gone away. Both the initiator and the target may detect
these conditions, and should detect them in a timely manner (let's
say 5 seconds for now, but we need to think about this).
From the initiator's point of view, the connection can fail for
several reasons (temporary or permanent):
- Target powered down or removed from network
- Target reboot or failover
- Lost network route
- Backed-off (slow) TCP connection
- Unexpected message fields received (software error on target)?
If no responses are being received from the target, and there are
outstanding commands, the initiator will periodically send a ping
request, and expect a ping response within a small amount of time.
If no ping response is received, the connection is considered
to have failed. This is mentioned in section 4.1 as well.
From the target's point of view, the connection can fail for
several reasons (temporary or permanent):
- Initiator powered down or removed from network
- Initiator reboot or failover
- Lost network route
- Backed-off (slow) TCP connection
- Unexpected message fields received (software error on initiator)?
Since the target does not send requests, it could do one of two
things:
1. During the login phase, negotiate a maximum inactivity time
for the incoming target connection. If this time is about to be
exceeded, the client promises to send an iSCSI ping request
on the connection to keep it alive. If the inactivity timer
expires on the target, the connection is assumed to have failed.
2. Add an asynchronous event requesting that the initiator ping
the target. Send this when approaching the target's
maximum inactivity time; if the timer expires anyway, the
connection is assumed to have failed.
In any case, the target must detect connection failure to avoid
having connections from powered-down clients hang around for
long periods of time.
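
The inactivity timer described above might look something like this on
the target. It is a sketch only: the class name, the ping_margin
parameter, and the returned strings are assumptions, and the caller
passes in the current time so the logic stays independent of any
particular clock.

```python
class InactivityMonitor:
    """Tracks idle time on one connection against a negotiated maximum."""

    def __init__(self, max_idle: float, ping_margin: float, now: float):
        self.max_idle = max_idle        # negotiated maximum inactivity time (seconds)
        self.ping_margin = ping_margin  # how early to ask for a ping (assumed knob)
        self.last_activity = now

    def on_traffic(self, now: float) -> None:
        """Any request or ping received on the connection resets the timer."""
        self.last_activity = now

    def check(self, now: float) -> str:
        idle = now - self.last_activity
        if idle >= self.max_idle:
            return "failed"        # timer expired: treat the connection as failed
        if idle >= self.max_idle - self.ping_margin:
            return "request_ping"  # option 2: send the async event asking for a ping
        return "ok"
```

For example, with max_idle=5.0 and ping_margin=1.0, a connection idle
for 4.5 seconds would get a ping request, and one idle for 5 seconds or
more would be considered failed.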
Disconnecting
When a connection fails, the initiator, target, or both will
close it. The initiator generally cannot wait around for the
close to complete before starting a new connection; the target
will need to accept a new (recovered) connection from an
initiator, even if the target has not realized the original
connection's failure. These are implementation issues.
The initiator may disconnect for reasons other than failure:
- Normal host shutdown (reboot or power off)
- Application (and disk) failover to another host (e.g. using
HP, Veritas, or other application failover software).
The target may also disconnect for reasons other than failure:
- If the target is to be rebooted or failed over to another
physical unit, it may wish to gracefully shut down the connection
before restarting another.
To make target reboot or failover more graceful, a target should
attempt to send an asynchronous event "connection shutdown", to
the initiator on each connection. This new event contains two
values:
- MaxUpTime - the number of seconds (can be zero) before this
connection is expected to cease functioning. The initiator
should not attempt to issue more commands than can be expected
to complete and receive status within this amount of time.
The target will wait this amount of time before it shuts
its connections down.
- MinHoldTime - the number of seconds (can also be zero) after
MaxUpTime before this entity will be available for re-connection.
After this, the initiator has a good chance of reconnecting
to the target. This should be set to the amount of time the
server is expected to take to fail over, reboot, etc. We should
probably define a value (-1?) for "never". Note that an
initiator could just reconnect right away; however, it could
then either connect to the running server just before it
reboots, or lose several SYN segments while waiting for the
server, causing exponential backoff to make the eventual
connection take longer.
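
As a sketch of how an initiator might act on these two values, assuming
the event's arrival time is recorded and that -1 ends up meaning
"never" (which is only floated above, not decided):

```python
from typing import Optional

NEVER = -1  # tentative "never" value for MinHoldTime; not settled


def reconnect_not_before(received_at: float, max_up_time: int,
                         min_hold_time: int) -> Optional[float]:
    """Earliest time worth attempting a reconnect, or None if the target is not coming back."""
    if min_hold_time == NEVER:
        return None
    return received_at + max_up_time + min_hold_time


def may_issue(now: float, received_at: float, max_up_time: int,
              expected_duration: float) -> bool:
    """True if a command is expected to return status before the connection goes away."""
    return now + expected_duration <= received_at + max_up_time
```

Both function names and the expected_duration estimate are hypothetical;
how an initiator predicts command duration is left open here.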
Reconnection
The initiator always handles reconnection. During the new
connection's login phase, the initiator specifies that it is
replacing a failed connection by including the non-zero CID
of the old connection in the RecoverCID field.
If a target supports stateful recovery (meaning it still has
the cached responses for the session), it accepts the login.
If the target does not support stateful recovery, or the
target has rebooted and lost its state, or the target has
dropped the cached responses due to an excessive amount of
time passing (perhaps 60 seconds), it rejects the login with
a "reject recovery" status. The initiator then performs
a new login, and does stateless recovery.
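
The target's login decision above could be sketched as follows. The
status strings, the SessionState type, and the 60-second default are
placeholders (the text only says "perhaps 60 seconds"), and a RecoverCID
of zero is assumed to mean an ordinary login.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionState:
    last_active: float  # when the session last had a working connection


def recovery_login_status(recover_cid: int,
                          supports_stateful: bool,
                          state: Optional[SessionState],
                          now: float,
                          cache_lifetime: float = 60.0) -> str:
    """Return the login status a target would give for a (possibly recovering) login."""
    if recover_cid == 0:
        return "accept"           # not a recovery attempt
    if not supports_stateful:
        return "reject recovery"  # target never keeps cached responses
    if state is None:
        return "reject recovery"  # target rebooted and lost its state
    if now - state.last_active > cache_lifetime:
        return "reject recovery"  # cached responses were dropped as too old
    return "accept"               # stateful recovery can proceed
```

On "reject recovery" the initiator would log in again without a
RecoverCID and fall back to stateless recovery.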
1. Stateful Recovery
In a stateful recovery, the initiator resends all commands for
which it has not received status. If a command has already
completed, the cached response is returned. If a command has
already been issued and is in progress, it is not re-issued,
and will just be queued somewhere to wait for status. If
a command had not been received by the target (or was incompletely
received and thrown away), it will be issued as normal.
2. Stateless Recovery
By default, stateless recovery means that all outstanding
commands are terminated (to the SCSI layer) (check condition?);
higher layers must perform recovery.
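
The three cases a target faces for a re-sent command during stateful
recovery can be sketched like this. Matching re-sent commands by CmdRN
is an assumption made for the example, as are the function name and the
returned action strings.

```python
from typing import Dict, Optional, Set, Tuple


def handle_resent_command(
    cmd_rn: int,
    cached_responses: Dict[int, bytes],
    in_progress: Set[int],
) -> Tuple[str, Optional[bytes]]:
    """Decide what the target does with a command re-sent after reconnection."""
    if cmd_rn in cached_responses:
        # Already completed: return the cached response, do not run it again.
        return ("return_cached_response", cached_responses[cmd_rn])
    if cmd_rn in in_progress:
        # Already issued and still running: just wait for its status.
        return ("wait_for_status", None)
    # Never (or only partially) received: issue it as a normal command.
    return ("issue", None)
```

Only the last case causes the command to be executed, which is what
keeps non-idempotent commands (tape, for instance) from running twice.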
Non-Recovery
Let's face it; there are times when things just can't be recovered
at this level. However, there are many higher-level entities that
may recover for us:
- Tape backup software (reload into a different tape drive)
- Volume managers (break mirrors)
- Multipath SCSI drivers (find alternate path or controller)
- Host application clusters (move app to host with connectivity)
This should be handled as specified in section 4.3.
Optimizations:
1. If RTT is in use, and a write request is re-sent to a target, and
the target has already written the data, the target could send the
Command Response back instead of the RTT. The initiator would just
accept this as the final status, and would not have to send the
write data again.
iSCSI Draft 03 Message Modifications:
1. Remove StatRN from the Data Response, or make it equal the StatRN
for the matching Command Response. There should be no need for it
to increment separately from the Command Response, since this scheme
assumes that if the response was not received, the data will be
re-sent anyway. The current draft does not specify how StatRN
is used in a Data Response.
2. Add an AckStatRN field to Command Request, to acknowledge the
highest (honoring wrap) consecutive StatRN received for the
session.
3. Add a new event (Event Indicator 5) specifying that the connection
will be closed by the target. This event sends two parameters
(using some of the reserved fields):
- MaxUpTime - the number of seconds the target intends to keep
the connection alive.
- MinHoldTime - the number of seconds the initiator should wait
before establishing a new connection.
Alternative Implementations
1. We considered a separate message to send the AckStatRN, but
since this is generally done for every command, it seemed simpler
to just piggyback it on the next command request.
2. CmdRN and StatRN are assumed to be per-session. If they were
made per-LUN for any reason, the initiator and target would
simply have to demux requests and responses based on LUN + RN.
A Few Alternatives that Didn't Quite Work
Here are some alternatives we went through, and why we did not
choose them:
1. We considered just tossing out StatRN, and acking the CmdRN
(or the ITT) matching the status last received instead. However,
the target needed to keep track of status in the order in which
it was sent (and NOT in the order in which the original command
was received) to avoid trouble with commands which incur a long
response delay (REWIND et al). Keeping StatRN and acking it
is much simpler, and makes it easier to preserve response
ordering if multiple connections are used (and if response
ordering is required).
2. Target Retries - if one just assumes disk, the target could avoid
caching responses, and the initiator could avoid acking them; the
target could just retry any requests, in order, sent over the
new connection. However, this excludes most of the SCSI Peripheral
Device Types, and will likely not work in every case for disk
either. In these cases, ALL connection recovery would be pushed
up to SCSI to handle. By caching status, we remain a truer
transport, and will work better for these devices (especially tape).
3. Selective StatRN acks - one could individually acknowledge each
response, to free its resources on the target. However, if a
selective ack is lost during a connection recovery, its resources
would then hang around forever on the target (unless, of course,
we wanted to ack the ack). Cumulative acks may be lost at the
end of a connection; the next command sent will just re-ack
everything anyway.
--
Mark A. Bakke
NuSpeed, Inc.
mark.bakke@nuspeed.com
763.398.1054