Proposed Connection Recovery Additions for Draft 03

Draft 03 seems to be pretty close as far as dealing with connection recovery. Here are a few additions that we (NuSpeed) think will help complete the picture, along with what we believe are some of the requirements. Other than hopefully clarifying initiator and target behavior, this scheme adds one field to the Command Request, and one event type to the Asynchronous Event message. I've attempted to include some of our reasoning behind this.

Assumptions

- iSCSI is only a transport for SCSI. Its recovery scheme does not attempt to retry failed commands. However, it is a reliable transport for SCSI, and must deliver commands within a session in order to the target.
- TCP handles any losses from a given connection. The target end of the byte stream is either valid, or the connection is lost. Within a connection, there is no such thing as losing a packet. We will, however, need to deal with stronger error checking over the SCSI data, but that's (mostly) orthogonal to connection recovery.
- This scheme will work with either single or multiple connections per session. It is up to the implementation of the initiator and target whether either one supports multiple connections. The initiator can simply not use more than one connection per session; the target can deny the login if the client requests more than one connection per session.
- It is also up to the initiator to determine whether to multiplex access to targets and LUNs over a single session, or to use a session for each, or some combination. If the initiator chooses to use multiple sessions to the same device, it must be prepared to deal with multipath command ordering issues itself.
- We can be fairly optimistic about the longevity of TCP connections; if the network is so slow, overloaded, or poorly designed as to lose connections regularly, it is not likely a good candidate for storage access. Connection recovery should still handle these cases, especially if the problems are transient, but need not be optimized for these cases. If a connection fails, it should be acceptable to re-send write data with the re-sent command, and to re-send read data with the re-sent status. There may, however, be simple optimizations to avoid this, too, especially when transporting larger blocks, such as tape reads and writes.
- Some commands, such as FORMAT UNIT and REWIND, may take several minutes or more to complete. Thousands of operations may complete before status is returned for these commands. (Are 16-bit reference numbers enough?)
- If multiple connections are used, they are symmetrical (no special control or data connections). Command-Data-Status connection allegiance is also assumed, and CmdRN and StatRN are used to ensure in-order delivery.
- CmdRN and StatRN are implemented as in Draft 03, as 16-bit, per-session incrementing counters.

Requirements for Connection Recovery

- Protocol fields associated with the connection recovery scheme will work with either a single connection per session, or multiple connections per session.
- iSCSI must preserve ordered delivery within a session.
- The transport may re-send commands, data, and status at any time, but must not attempt to re-try the actual command at the target without involving the upper (SCSI) layer for recovery. This means that, as in section 4.1 of Draft 03, the client should keep sufficient information handy to re-send commands and data until status is received.
- We can generally make an exception to the above for commands issued to a block device (disk); reads and writes are idempotent, as long as the commands are re-issued in the original order.
- Commands must be issued at the target end of an iSCSI session in-order, but status may, of course, be returned from the iSCSI target to the initiator in any order.
- Either the initiator or target may decide to terminate a connection. It is the responsibility of the initiator to reconnect if it so chooses.
- A connection must be recoverable quickly. At most, a connection must fail, be detected as failed, be restarted, have commands reissued, and get status back (except on high-latency commands) within a portion of the normal SCSI timeout window (30 seconds). The actual time for this depends on the network, the commands issued, etc. At any rate, connection recovery must be as transparent as possible to the end user or application.
- Connection recovery should work for target reboot or failover.
- Basically, we have to handle the following steps for each connection:

  1. Detection - deciding when a connection is down, or should be.
  2. Disconnection - terminating a connection.
  3. Reconnection - re-connecting to the target.
  4. Resend - re-sending commands to the target.

Besides these procedures, normal mechanisms such as reference numbers and response caching will be in place to support these procedures when they are needed.

Support Mechanisms

The initiator and target must keep some state around in order to support connection recovery and resending of commands, data, and status that may have been lost. Their responsibilities are outlined in section 4.1 of Draft 03. Basically, an Initiator must:

- Increment CmdRN for each new command request sent.
- Keep information required to rebuild and resend each command with its data until the matching command response is received from the target.
- Acknowledge command responses soon after they are received from the target.

A Target must:

- Increment StatRN for each new status response sent.
- Keep a cache of responses (status & sense data) until the StatRN is acknowledged by the initiator.
- For non-disk devices, keep data response (read data) along with the cached command response (although this might be difficult with large-block devices).

Reclaiming Cached Responses

Section 4.1 already mentioned most of the above; however, there was no mechanism for notifying the target that its cached responses were no longer needed. In this scheme, an AckStatRN is sent from the initiator to the target, as the highest (honoring wrap) consecutive value received for StatRN in a response on any connection in the session. All cached responses up to and including this StatRN value may be safely de-allocated.

Detecting Connection Failure

During an initiator, target, or intervening network outage, whether temporary or permanent, TCP connections will normally be retried for much longer than most SCSI drivers can handle. In many cases, new connections can be made and started long before the old connection times out. For this reason, we have to detect connections that have gone away. Both the initiator and the target may detect these conditions, and should detect them in a timely manner (let's say 5 seconds for now, but we need to think about this).

From the initiator's point of view, the connection can fail for several reasons (temporary or permanent):

- Target powered down or removed from network
- Target reboot or failover
- Lost network route
- Backed-off (slow) TCP connection
- Unexpected message fields received (software error on target)?

If no responses are being received from the target, and there are outstanding commands, the initiator will periodically send a ping request, and expect a ping response within a small amount of time. If no ping response is received, the connection is considered to have failed. This is mentioned in section 4.1 as well.
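
To make the AckStatRN rule from "Reclaiming Cached Responses" concrete, here is a minimal Python sketch of both ends. It is not taken from Draft 03: the class names, the method names, and the use of serial-number-style comparison for "honoring wrap" are assumptions made for illustration. The only detail from the draft is that StatRN is a 16-bit per-session counter.

```python
from typing import Dict, List, Optional, Set

MOD = 1 << 16  # StatRN is a 16-bit counter in Draft 03


def serial_le(a: int, b: int) -> bool:
    """True if a is at or before b, treating values as a 16-bit circle.

    Assumption: fewer than 2**15 responses are ever outstanding at once.
    """
    return ((b - a) % MOD) < MOD // 2


class AckTracker:
    """Initiator side (hypothetical): computes the AckStatRN to piggyback."""

    def __init__(self, first_expected: int = 0) -> None:
        self._next = first_expected % MOD   # StatRN needed to extend the run
        self._early: Set[int] = set()       # received ahead of a gap
        self._acked_any = False

    def received(self, stat_rn: int) -> None:
        stat_rn %= MOD
        # Ignore duplicates of responses already covered by the ack.
        if self._acked_any and serial_le(stat_rn, (self._next - 1) % MOD):
            return
        self._early.add(stat_rn)
        while self._next in self._early:
            self._early.remove(self._next)
            self._next = (self._next + 1) % MOD
            self._acked_any = True

    def ack_stat_rn(self) -> Optional[int]:
        """Highest consecutive StatRN received, or None if none yet."""
        return (self._next - 1) % MOD if self._acked_any else None


class ResponseCache:
    """Target side (hypothetical): cached responses awaiting an ack."""

    def __init__(self) -> None:
        self._cached: Dict[int, bytes] = {}

    def store(self, stat_rn: int, response: bytes) -> None:
        self._cached[stat_rn % MOD] = response

    def ack(self, ack_stat_rn: int) -> int:
        """Free everything up to and including ack_stat_rn; return the count."""
        done = [rn for rn in self._cached if serial_le(rn, ack_stat_rn)]
        for rn in done:
            del self._cached[rn]
        return len(done)

    def pending(self) -> List[int]:
        return sorted(self._cached)
```

Because the ack is cumulative, a response that arrives out of order (say, on a second connection) does not advance AckStatRN until the gap before it is filled, so the target never frees a response the initiator might still need re-sent.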

From the target's point of view, the connection can fail for several reasons (temporary or permanent):

- Initiator powered down or removed from network
- Initiator reboot or failover
- Lost network route
- Backed-off (slow) TCP connection
- Unexpected message fields received (software error on initiator)?

Since the target does not send requests, it could do one of two things:

1. During the login phase, negotiate a maximum inactivity time for the incoming target connection. If this time will be exceeded, the client promises to send an iSCSI ping request on the connection to keep it alive. If the inactivity timer expires on the target, the connection is assumed to have failed.
2. Add an asynchronous event requesting that the initiator ping the target. Send this when approaching the target's maximum inactivity time; if the timer expires anyway, the connection is assumed to have failed.

In any case, the target must detect connection failure to avoid having connections from powered-down clients hang around for long periods of time.

Disconnecting

When a connection fails, the initiator, target, or both will close it. The initiator generally cannot wait around for the close to complete before starting a new connection; the target will need to accept a new (recovered) connection from an initiator, even if the target has not realized the original connection's failure. These are implementation issues.

The initiator may disconnect for reasons other than failure:

- Normal host shutdown (reboot or power off)
- Application (and disk) failover to another host (e.g. using HP, Veritas, or other application failover software).

The target may also disconnect for reasons other than failure:

- If the target is to be rebooted or failed over to another physical unit, it may wish to gracefully shut down the connection before restarting another.
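
As a rough illustration of the target-side detection options listed earlier (a negotiated inactivity time, optionally preceded by an asynchronous "please ping" event), here is a small Python sketch. The class, the state names, and the `warn_fraction` parameter are hypothetical, not from Draft 03; times are passed in explicitly so the logic can be exercised without waiting.

```python
import time
from enum import Enum
from typing import Optional


class LinkState(Enum):
    OK = "ok"
    REQUEST_PING = "request_ping"  # option 2: ask the initiator to ping
    FAILED = "failed"              # inactivity timer expired


class InactivityMonitor:
    """Per-connection idle tracking on the target (hypothetical helper)."""

    def __init__(self, max_idle_s: float, warn_fraction: float = 0.5,
                 now: Optional[float] = None) -> None:
        self.max_idle_s = max_idle_s                    # negotiated at login
        self.warn_after_s = max_idle_s * warn_fraction  # assumed threshold
        self.last_rx = time.monotonic() if now is None else now

    def on_receive(self, now: Optional[float] = None) -> None:
        """Call for anything received on the connection, pings included."""
        self.last_rx = time.monotonic() if now is None else now

    def check(self, now: Optional[float] = None) -> LinkState:
        now = time.monotonic() if now is None else now
        idle = now - self.last_rx
        if idle >= self.max_idle_s:
            return LinkState.FAILED
        if idle >= self.warn_after_s:
            return LinkState.REQUEST_PING
        return LinkState.OK
```

A target using only option 1 would ignore `REQUEST_PING` and act on `FAILED` alone; a target using option 2 would send the asynchronous event when `REQUEST_PING` first appears and still close the connection on `FAILED`.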

To make target reboot or failover more graceful, a target should attempt to send an asynchronous event, "connection shutdown", to the initiator on each connection. This new event contains two values:

- MaxUpTime - the number of seconds (can be zero) before this connection is expected to cease functioning. The initiator should not attempt to issue more commands than can be expected to complete and receive status within this amount of time. The target will wait this amount of time before it shuts its connections down.
- MinHoldTime - the number of seconds (can also be zero) after MaxUpTime before this entity will be available for re-connection. After this, the initiator has a good chance of reconnecting to the target. This should be set to the amount of time the server is expected to take to fail over, reboot, etc. We should probably define a value (-1?) for "never".

Note that an initiator could just reconnect right away; however, it could then either connect to the running server just before it reboots, or lose several SYN segments while waiting for the server, causing exponential backoff to make the ultimate connection take longer.

Reconnection

The initiator always handles reconnection. During the new connection's login phase, the initiator specifies that it is replacing a failed connection by including the non-zero CID of the old connection in the RecoverCID field.

If a target supports stateful recovery (meaning it still has the cached responses for the session), it accepts the login. If the target does not support stateful recovery, or the target has rebooted and lost its state, or the target has dropped the cached responses due to an excessive amount of time passing (perhaps 60 seconds), it rejects the login with a "reject recovery" status. The initiator then performs a new login, and does stateless recovery.

1. Stateful Recovery

In a stateful recovery, the initiator resends all commands for which it has not received status.
If a command has already completed, the cached response is returned. If a command has already been issued and is in progress, it is not re-issued, and will just be queued somewhere to wait for status. If a command had not been received by the target (or was incompletely received and thrown away), it will be issued as normal.

2. Stateless Recovery

By default, stateless recovery means that all outstanding commands are terminated (to the SCSI layer) (check condition?); higher layers must perform recovery.

Non-Recovery

Let's face it; there are times when things just can't be recovered at this level. However, there are many higher-level entities that may recover for us:

- Tape backup software (reload into a different tape drive)
- Volume managers (break mirrors)
- Multipath SCSI drivers (find alternate path or controller)
- Host application clusters (move app to host with connectivity)

This should be handled as specified in section 4.3.

Optimizations:

1. If RTT is in use, and a write request is re-sent to a target, and the target has already written the data, the target could send the Command Response back instead of the RTT. The initiator would just accept this as the final status, and would not have to send the write data again.

iSCSI Draft 03 Message Modifications:

1. Remove StatRN from the Data Response, or make it equal the StatRN for the matching Command Response. There should be no need for it to increment separately from the Command Response, since this scheme assumes that if the response was not received, the data will be re-sent anyway. The current draft does not specify how StatRN is used in a Data Response.
2. Add an AckStatRN field to Command Request, to acknowledge the highest (honoring wrap) consecutive StatRN received for the session.
3. Add a new event (Event Indicator 5) specifying that the connection will be closed by the target.
   This event sends two parameters (using some of the reserved fields):

   - MaxUpTime - the number of seconds the target intends to keep the connection alive.
   - MinHoldTime - the number of seconds the initiator should wait before establishing a new connection.

Alternative Implementations

1. We considered a separate message to send the AckStatRN, but since this is generally done for every command, it seemed simpler to just piggyback it on the next command request.
2. CmdRN and StatRN are assumed to be per-session. If they were made per-LUN for any reason, the initiator and target would simply have to demux requests and responses based on LUN + RN.

A Few Alternatives that Didn't Quite Work

Here are some alternatives we went through, and why we did not choose them:

1. We considered just tossing out StatRN, and acking the CmdRN (or the ITT) matching the status last received instead. However, the target needed to keep track of status in the order in which it was sent (and NOT in the order in which the original command was received) to avoid trouble with commands which incur a long response delay (REWIND et al). Keeping StatRN and acking it is much simpler, and makes it easier to preserve response ordering if multiple connections are used (and if response ordering is required).
2. Target Retries - if one just assumes disk, the target could avoid caching responses, and the initiator could avoid acking them; the target could just retry any requests, in order, sent over the new connection. However, this excludes most of the SCSI Peripheral Device Types, and will likely not work in every case even for disk. In these cases, ALL connection recovery would be pushed up to SCSI to handle. By caching status, we remain a truer transport, and will work better for these devices (especially tape).
3. Selective StatRN acks - one could individually acknowledge each response, to free its resources on the target.
   However, if a selective ack is lost during a connection recovery, its resources would then hang around forever on the target (unless, of course, we wanted to ack the ack). Cumulative acks may be lost at the end of a connection; the next command sent will just re-ack everything anyway.

--
Mark A. Bakke
NuSpeed, Inc.
mark.bakke@nuspeed.com
763.398.1054