Re: iSCSI: error recovery

To: ips@ece.cmu.edu
Subject: Re: iSCSI: error recovery
From: julian_satran@il.ibm.com
Date: Thu, 9 Nov 2000 17:25:49 +0200
Content-Disposition: inline
Content-type: text/plain; charset=us-ascii
Sender: owner-ips@ece.cmu.edu


Pierre,

Interesting scenario - but ENTIRELY WRONG.
A more careful reading of the draft would have solved your problem.
After a failed connection the two parties (I & T) are supposed to do some
cleanup.
In the old draft that was accomplished by having the initiator indicate in
the new login
what old connection it is replacing.

In the new draft there is an explicit logout that is required before
resending unacked commands.

This mechanism was carefully designed to help avoid ghost commands
appearing at the target.

Nevertheless - as David Black has suggested - you are encouraged to look
for holes.
Whether to publish or not is entirely a question of taste.
I would certainly expect the problems to be real or at least harder to
crack than this one
(no pun intended).

Regards,
Julo

Pierre Labat <pierre_labat@hp.com> on 07/11/2000 02:25:28

Please respond to Pierre Labat <pierre_labat@hp.com>

To:   ips@ece.cmu.edu
cc:
Subject:  Re: iSCSI: error recovery




Hello,


Some suggestions to simplify/secure the error recovery.

Regards,

Pierre




Using several TCP connections gives an unreliable medium.
Requests, responses and data can be lost, duplicated or become ghosts
because TCP connection(s) can drop.


Trying to do a recovery can lead to some problems.
The following scenarios describe some of the problems
we will have.
I am sure one can find other ones.

Scenario 1
----------
In this first scenario the recovery is delayed
unnecessarily: the retry of a command will fail.

Initiator_ExpCmdRN = 1
Target_ExpCmdRN = 4

1) Cmd 5 and Cmd 6 sent over NIC1 on the way to the target

2) NIC1 fails

3) Initiator, detecting that NIC1 failed, retries Cmd 5 and Cmd 6
   on another NIC and TCP connection
   with their unchanged CmdRN (5 and 6) because 5 and 6 are greater
   than Initiator_ExpCmdRN. (It is the algorithm described in the draft)

4) Cmd 5 and Cmd 6 (sent from the failed NIC1) enter the target.
   Target_ExpCmdRN is updated to 7. These commands have no chance to
   complete correctly because their TCP connection has been dropped on
   the initiator side.

5) The retries of the commands enter the target (through another TCP
   connection), but their CmdRN (5 and 6) are less than Target_ExpCmdRN.
   Hence they are dropped by the target.

6) The retry mechanism fails. The initiator will have to wait for
   the timeout of the commands 5 and 6 to try another recovery.
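
The sequence above can be replayed with a toy model of the target's
command window. This is only a sketch: the Target class, its fields and
the connection labels are hypothetical and not taken from the draft, and
it assumes that command 4 arrives on a healthy connection so that
Target_ExpCmdRN can move from 4 to 7 as step 4 describes.

```python
class Target:
    """Toy model of the target's command window (hypothetical names)."""

    def __init__(self, exp_cmd_rn, max_cmd_rn):
        self.exp_cmd_rn = exp_cmd_rn
        self.max_cmd_rn = max_cmd_rn
        self.pending = {}    # CmdRN -> connection, received but not yet in order
        self.accepted = []   # (CmdRN, connection) in the order handed to SCSI
        self.dropped = []    # (CmdRN, connection) silently discarded

    def receive(self, cmd_rn, conn):
        in_window = self.exp_cmd_rn <= cmd_rn <= self.max_cmd_rn
        if not in_window or cmd_rn in self.pending:
            self.dropped.append((cmd_rn, conn))
            return
        self.pending[cmd_rn] = conn
        # Advance ExpCmdRN over every command now available in sequence.
        while self.exp_cmd_rn in self.pending:
            conn_for_cmd = self.pending.pop(self.exp_cmd_rn)
            self.accepted.append((self.exp_cmd_rn, conn_for_cmd))
            self.exp_cmd_rn += 1


def replay_scenario_1():
    target = Target(exp_cmd_rn=4, max_cmd_rn=100)
    target.receive(4, "conn2")   # assumption: Cmd 4 arrives on a healthy connection
    target.receive(5, "conn1")   # originals, sent before NIC1 failed
    target.receive(6, "conn1")
    target.receive(5, "conn2")   # retries with unchanged CmdRN
    target.receive(6, "conn2")
    return target
```

In this model the originals are accepted on the dead connection and
both retries end up in the dropped list, which is the failure the
scenario describes.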




Scenario 2
----------

Initiator_ExpCmdRN = 1
Target_ExpCmdRN = 4
Imagine the session has 4 TCP connections.

1) Initiator sends a command with CmdRN = 7 over the TCP connection 1.
   Commands 5 and 6 are in flight between the initiator and
   the target (on the TCP connection 4 for example).

2) The command 7 is blocked somewhere on the network because of congestion.

3) The TCP connection 1 fails unexpectedly on the initiator side (for
   whatever reason: hardware, software, cable disconnected...) and the
   target can't be notified.

4) The initiator (as specified in the draft) sends a retry with CmdRN
   unchanged (CmdRN=7) on the TCP connection 2.

5) The TCP connection 2 fails unexpectedly on the initiator side (for
   whatever reason: hardware, software, cable disconnected...) and the
   target can't be notified.

6) The initiator (as specified in the draft) sends a retry with CmdRN
   unchanged (CmdRN=7) on the TCP connection 3.


7) The target receives the retry from the connection 3, then the retry
   from the connection 2, then the original command from the connection 1.
   With no luck, it receives things in the inverse of the order in which
   the initiator sent them. All these retries/commands have the same
   CmdRN (=7) and the same initiator task tag, hence the target gets
   several retries for the same command and has no clue how to re-order
   them.
   When the target receives the second retry (from connection 2) it
   doesn't know what to do with it. If it supersedes the first retry, the
   retry will fail because the completion will be sent on the connection 2,
   which has failed on the initiator side. If it doesn't supersede, the
   retry would have failed too had the retries come in order.


Scenario 3
----------

1) Cmd 1 sent to the target but blocked in TCP connection 1

2) The initiator sends plenty of commands on other TCP connection(s)
   that are OK.

3) TCP connection 1 fails on initiator side

4) Abort of Cmd 1 sent on TCP connection 2. The Abort is non-numbered
   (CmdRN=0).
   The abort is received by the target,
   which returns "function rejected" because there is no
   matching task tag.

5) At this point the initiator doesn't know what to do, because it
   doesn't know if the command has been lost or if it will arrive
   at the target later.

6) The command 1 finally reaches the target (ghost IO), and is not aborted.


Scenario 4
----------
In this scenario, the whole traffic of a session is blocked
when one command fails.

1) Cmd 10 sent to the target but blocked in TCP connection 1
   and will never reach the target.

2) The initiator sends plenty of commands on other TCP connection(s)
   that are OK.

3) TCP connection 1 fails on initiator side

4) Abort of Cmd 10 sent on TCP connection 2. The Abort is numbered
   using a new CmdRN.
   The abort is received by the target but not processed because
   the CmdRN of the abort is greater than Target_ExpCmdRN,
   which is blocked on 10.

5) The entire command processing (through all TCP connections) is blocked
   on the target at Target_ExpCmdRN = 10
   until SCSI retries the command 10 with the same CmdRN
   (that can take several seconds). And if SCSI doesn't
   retry with the same CmdRN (10) we have a deadlock.

Scenario 5
----------
Initiator_ExpCmdRN=Target_ExpCmdRN=5
Initiator_MaxCmdRN=Target_MaxCmdRN=100
Two TCP connections are used.

1) The initiator sends the command CmdRN=5 over the connection 1
   then the commands CmdRN=6 to CmdRN=100 over the connection 2

2) The initiator can send no more commands because
   current CmdRN = MaxCmdRN

3) The TCP connection 1 breaks on the initiator side and the
   command 5 will never reach the target.

4) The initiator wants to do a recovery with numbered commands
   (abort task for example), but can't send it because CmdRN = MaxCmdRN.

5) The target doesn't want to increment MaxCmdRN because it has already
   buffered commands up to 100 and has no extra buffer space. It waits to
   receive command 5. This could be because it allocated a maximum amount
   of memory space for the out-of-order commands it receives.

6) The initiator waits for MaxCmdRN to increase and the target waits for
   command 5 to come or be aborted. We have a deadlock.
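
The circular wait in this scenario can be written down as two
conditions. The sketch below is illustrative only: the Initiator and
Target classes and their fields are hypothetical, and the model ignores
everything except the two waits described in step 6.

```python
from dataclasses import dataclass, field


@dataclass
class Initiator:
    last_cmd_rn: int   # highest CmdRN already sent
    max_cmd_rn: int    # last MaxCmdRN learned from the target

    def can_send_numbered(self):
        # A numbered PDU needs a CmdRN that is still inside the window.
        return self.last_cmd_rn < self.max_cmd_rn


@dataclass
class Target:
    exp_cmd_rn: int
    buffered: set = field(default_factory=set)   # CmdRNs held out of order

    def can_make_progress(self):
        # The target only moves on once the expected command is present.
        return self.exp_cmd_rn in self.buffered


def is_deadlocked(initiator, target):
    return not initiator.can_send_numbered() and not target.can_make_progress()


# State after step 3: commands 6..100 are buffered, command 5 is lost.
initiator = Initiator(last_cmd_rn=100, max_cmd_rn=100)
target = Target(exp_cmd_rn=5, buffered=set(range(6, 101)))
print(is_deadlocked(initiator, target))
```

Neither side can break the wait with a numbered PDU, which is why the
recovery message has to be able to travel outside the window.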


Scenario 6
----------

1) the initiator sends the command CmdRN=1 on
   a TCP connection

2) the command is stuck in the network

3) The command times out on the initiator

4) the initiator "retries" the command on the same
   TCP connection and the retry command is in the network

5) the target receives the original command, executes it,
   and sends the completion.

6) the initiator receives the completion, it doesn't know
   if it is from the original command or from the
   "retry" command because the same initiator task tag is used
   in both commands


Solving these problems
======================
To get rid of all these corner cases and have a basic, simple
and robust recovery mechanism that avoids or manages
lost, duplicated or ghost commands we could:

- keep the fact that every numbered command with a CmdRN out
  of the window [Target_ExpCmdRN,Target_MaxCmdRN]
  is discarded silently.

- always recover commands by doing an abort, then
  sending the command again with a new CmdRN
  and a new initiator task tag.

- modify the abort slightly: send it non numbered
  and change a little bit the way non numbered messages are coded.

Below are listed the modifications:

Modification of the coding of the headers for non numbered commands
-------------------------------------------------------------------

Add a bit in the iSCSI header to indicate
whether the transaction is numbered or not. When the command is
non numbered, this allows the CmdRN field to be used
to reference the command the transaction is targeted to.
Currently, to indicate that a command is non numbered, CmdRN
must be set to 0.
When the non numbered bit is set, the target doesn't
discard the request if CmdRN is out of the window
[Target_ExpCmdRN,Target_MaxCmdRN].
CmdRN indicates the command the non numbered
transaction is targeted to. If the non numbered
transaction is not targeted to any specific command,
CmdRN is set to 0.
Doing that makes the Abort more robust (see below).
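
As a rough illustration of the proposed bit, the window check on the
target could look like the sketch below. The Header type and both
function names are hypothetical, and no real PDU layout is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Header:
    cmd_rn: int
    non_numbered: bool = False   # the proposed new header bit


def accept(header, exp_cmd_rn, max_cmd_rn):
    """Return True if the target should process the PDU, False to discard."""
    if header.non_numbered:
        # CmdRN is only a reference here, so the window does not apply.
        return True
    return exp_cmd_rn <= header.cmd_rn <= max_cmd_rn


def referenced_cmd_rn(header):
    """CmdRN that a non numbered PDU is targeted to, or None if there is none."""
    if header.non_numbered and header.cmd_rn != 0:
        return header.cmd_rn
    return None
```

With Target_ExpCmdRN = 4, a numbered PDU with CmdRN = 3 is discarded,
while a non numbered PDU carrying CmdRN = 3 is processed and read as
referring to command 3.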


Modification of Abort task:
---------------------------
The abort is sent non numbered (with the non numbered bit set).
The CmdRN is set to the value corresponding
to the command to abort.

When the target receives an abort:

- If there is no task associated with CmdRN and
  CmdRN is out of the window
  [Target_ExpCmdRN,Target_MaxCmdRN],
  the abort returns immediately with success.

- If there is no task associated with CmdRN but
  CmdRN is in the window [Target_ExpCmdRN,Target_MaxCmdRN],
  the target marks CmdRN as "jump". It means that
  when Target_ExpCmdRN reaches CmdRN, it will simply
  jump to CmdRN+1. It prevents a deadlock if the command
  to abort never comes to the target.

- If there is a task associated with CmdRN,
  the target aborts the task or cleans the resources
  if the task was not yet in a task set, marks CmdRN as "jump",
  and returns successfully.
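
The three cases can be sketched as follows. This is a minimal model and
not a definitive implementation: the Target class and its fields are
hypothetical, the "jump" mark is only recorded while CmdRN is still in
the window (a mark below Target_ExpCmdRN would have no effect), and
discarding a late command whose CmdRN is already marked "jump" is an
assumption of the sketch.

```python
class Target:
    def __init__(self, exp_cmd_rn, max_cmd_rn):
        self.exp_cmd_rn = exp_cmd_rn
        self.max_cmd_rn = max_cmd_rn
        self.tasks = {}     # CmdRN -> state of every command received so far
        self.jump = set()   # CmdRNs that ExpCmdRN must skip over

    def in_window(self, cmd_rn):
        return self.exp_cmd_rn <= cmd_rn <= self.max_cmd_rn

    def advance(self):
        # Move ExpCmdRN past every received or "jump"-marked CmdRN.
        while self.exp_cmd_rn in self.tasks or self.exp_cmd_rn in self.jump:
            self.jump.discard(self.exp_cmd_rn)
            self.exp_cmd_rn += 1

    def receive(self, cmd_rn):
        """Numbered command; returns True if accepted, False if discarded."""
        if not self.in_window(cmd_rn):
            return False
        if cmd_rn in self.jump or cmd_rn in self.tasks:
            return False    # assumption: aborted or duplicate CmdRN is dropped
        self.tasks[cmd_rn] = "received"
        self.advance()
        return True

    def abort(self, cmd_rn):
        """Non numbered abort referencing cmd_rn; always reports success."""
        self.tasks.pop(cmd_rn, None)    # abort the task / free resources, if any
        if self.in_window(cmd_rn):
            self.jump.add(cmd_rn)       # fill the hole in the CmdRN sequence
            self.advance()
        return "success"
```

Applied to scenario 4, with Target_ExpCmdRN blocked on 10 and commands
11 and 12 already received, abort(10) lets Target_ExpCmdRN move on to
13, and a ghost command 10 arriving later falls outside the window.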



The recovery mechanism "retrying" the commands
==============================================

Besides the basic abort/new command recovery,
the more sophisticated "retry" may be faster.

The initiator (instead of doing an abort and sending again
the command with a new initiator task tag and a new CmdRN)
can send a "retry" message.

To avoid the problems described in the scenarios, the "retry"
message must be more sophisticated than simply setting the
retry bit as specified in the draft.
It must combine a part of the job of an "abort task"
(to fill the holes in the CmdRN sequence to allow
Target_ExpCmdRN to make progress) and the job of sending
again the command.

Modification of the "retry"
---------------------------
This "retry" message has the format of the SCSI command pdu
except:
- a "referenced initiator task tag" field is added. It
  references the command to "retry"
- a "timestamp" field (integer) is added.


When the initiator sends a "retry" it:

- sets the retry bit and the non numbered bit
- updates the CmdRN field with the value of the CmdRN of
  the command to retry
- generates a new initiator task tag(not the
  one of the task to retry)
- updates the "referenced initiator task tag" with
  the one of the command to retry.
- sets the timestamp to 0.

For the following "retry(s)" of the same command
(in the case the first one failed) the initiator
generates a new initiator task tag and increments
the timestamp.
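
A sketch of the initiator side of these rules follows. The RetryPdu
fields and the RetryBuilder helper are hypothetical names. The sketch
assumes that task tags from first_free_tag upward are unused, and that
every retry of a command references the tag of the original command.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPdu:
    cmd_rn: int                # CmdRN of the command being retried
    task_tag: int              # fresh initiator task tag for this retry
    referenced_task_tag: int   # tag of the command to retry
    timestamp: int             # 0 for the first retry, then incremented
    retry: bool = True         # retry bit
    non_numbered: bool = True  # non numbered bit


class RetryBuilder:
    def __init__(self, first_free_tag):
        self._tags = itertools.count(first_free_tag)
        self._next_timestamp = {}   # original task tag -> next timestamp

    def build(self, cmd_rn, original_task_tag):
        timestamp = self._next_timestamp.get(original_task_tag, 0)
        self._next_timestamp[original_task_tag] = timestamp + 1
        return RetryPdu(cmd_rn, next(self._tags), original_task_tag, timestamp)


builder = RetryBuilder(first_free_tag=1000)
first = builder.build(cmd_rn=7, original_task_tag=42)
second = builder.build(cmd_rn=7, original_task_tag=42)
print(first)
print(second)
```

Because each retry carries its own task tag, a completion can be
attributed to one specific PDU, which addresses scenario 6.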

The target, when receiving a retry:
   - checks if there is a task already associated with CmdRN.
   - if NO (the command has been lost or will come later (ghost))
     the target acts as if it was receiving the original
     command. It records the timestamp.
   - if YES the target checks the timestamp. If the one
     in the retry is older than the one in the target, the
     "retry" is discarded silently. If the timestamp in the "retry"
     is newer than the one in the target associated with the command,
     the current task is stopped and restarted, and the new
     timestamp is recorded by the target.
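
The target side of the rule can be sketched in the same spirit. The
names are hypothetical, and three points are assumptions of the sketch
rather than part of the proposal: a command that arrived as a plain
PDU is ranked with timestamp -1 so that any retry supersedes it, a
retry with an equal timestamp is treated as a duplicate, and an
original that arrives after a retry has started the task is discarded.

```python
ORIGINAL = -1   # assumed rank of a command received as a plain (non retry) PDU


class Target:
    def __init__(self):
        # CmdRN -> (recorded timestamp, task tag of the PDU being served)
        self.tasks = {}

    def receive_original(self, cmd_rn, task_tag):
        if cmd_rn in self.tasks:
            return "discarded"       # a retry already started this command
        self.tasks[cmd_rn] = (ORIGINAL, task_tag)
        return "started"

    def receive_retry(self, cmd_rn, task_tag, timestamp):
        if cmd_rn not in self.tasks:
            # Lost or late original: act as if this were the original command.
            self.tasks[cmd_rn] = (timestamp, task_tag)
            return "started"
        recorded, _ = self.tasks[cmd_rn]
        if timestamp <= recorded:
            return "discarded"       # older (or duplicate) retry, drop silently
        # Newer retry: stop the current task and restart it for this PDU.
        self.tasks[cmd_rn] = (timestamp, task_tag)
        return "restarted"
```

Replaying scenario 2 in inverse arrival order (retry with timestamp 1,
then retry with timestamp 0, then the original), only the newest retry
is served and the two older PDUs are discarded.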

Sending the retry non numbered allows the "retry" to reach
the target even if the command window is closed. That can
prevent the kind of deadlock described in scenario 5.
That solves scenario 1 too.

In the case the first retry doesn't work and the
initiator needs to send another one (for the same command),
sending the retries with different "initiator task tags"
allows the initiator to match
the retry PDUs with their completions.
In general, as the main goal of the initiator task tag
is to allow the initiator to match
requests with responses, it is cleaner
for each initiator request to generate a new initiator
task tag.

Having a timestamp avoids the problems described
in scenario 2. The target knows how to distinguish
new PDUs from the ghost ones.

Using the CmdRN to reference the command to retry
allows the target to fill the holes in the CmdRN sequence,
even if the original command never reached the target,
so that Target_ExpCmdRN can make progress.


An initiator must not send a "retry" if it has acknowledged
the Status of the corresponding command.
The target can forget the CmdRN of a command as soon as
the corresponding status has been acknowledged.
If the target receives a retry with a CmdRN
that is not in the window [Target_ExpCmdRN,MaxCmdRN]
and that doesn't correspond to any task whose status
has not yet been acknowledged by the initiator,
the target answers with an iSCSI status of the kind
"out of range".

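The check behind the "out of range" answer can be sketched as a single
function. The function name, the returned labels and the
unacked_cmd_rns set (the CmdRNs of tasks whose status the initiator has
not yet acknowledged) are hypothetical.

```python
def classify_retry(cmd_rn, exp_cmd_rn, max_cmd_rn, unacked_cmd_rns):
    """Decide how the target treats the CmdRN carried by a retry."""
    if cmd_rn in unacked_cmd_rns:
        return "known task"      # compare timestamps, maybe restart the task
    if exp_cmd_rn <= cmd_rn <= max_cmd_rn:
        return "in window"       # handle as if it were the original command
    return "out of range"        # status already acknowledged, or never valid
```

For example, with a window of [10,100] and unacknowledged tasks 8 and 9,
a retry for CmdRN 3 is answered with "out of range".
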
It seems to me that these three modifications (non numbered command,
abort task, retry) allow a robust recovery,
eliminating the problems generated by duplicate,
ghost and missing iSCSI PDUs. The target always knows exactly what to do,
it is specified, and the target is never blocked.




The StatRN is useful only if "retry" is used.