Re: iSCSI: more on StatRN

To: ips@ece.cmu.edu
Subject: Re: iSCSI: more on StatRN
From: Stephen Bailey <steph@cs.uchicago.edu>
Date: Fri, 20 Oct 2000 11:53:22 -0500
In-Reply-To: Message from "Prasenjit Sarkar/Almaden/IBM" <psarkar@almaden.ibm.com> of "Thu, 19 Oct 2000 21:24:00 PDT." <OFBC8C6BED.E8F7B85B-ON8825697E.001788EA@LocalDomain>
Sender: owner-ips@ece.cmu.edu

> While I know the values for certain operating systems, I would like to hear
> from people who can assert confidently that the TCP fail connection
> time < SCSI command failure time.

The SCSI task timeout depends wildly upon the task.  Something like an
INQUIRY might have a 10 second timeout (gee, probably going to have to
diddle with this value for iSCSI).  Disk read and write operations
traditionally have a ~60 second timeout.  Format operations can have
multihour timeouts.  Exabyte Joe also pointed out that some tape
operations can take tens of minutes.

Clearly, you cannot make the claim that TCP connection failure time <
SCSI command failure time.
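
To make the point concrete, here is a minimal sketch that compares
per-task timeouts against a TCP connection failure time.  The table
values are rough illustrations of the figures above, and the names
(SCSI_TIMEOUTS, TAPE_OP, tasks_that_time_out_first) are hypothetical,
not taken from any specification.

```python
# Illustrative per-task SCSI timeouts in seconds. These are assumed
# round numbers based on the figures in the text, not spec values.
SCSI_TIMEOUTS = {
    "INQUIRY": 10,
    "READ": 60,
    "WRITE": 60,
    "FORMAT UNIT": 2 * 60 * 60,  # "multihour"
    "TAPE_OP": 20 * 60,          # hypothetical long tape operation
}


def tasks_that_time_out_first(tcp_fail_seconds):
    """Return the tasks whose SCSI timeout expires before TCP gives up."""
    return sorted(
        task for task, secs in SCSI_TIMEOUTS.items() if secs < tcp_fail_seconds
    )
```

For any TCP failure time between the shortest and longest entries,
some tasks time out before the connection does and some after, so no
single ordering holds.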

The approach we took in SST was to view transport connection and SCSI
operation viability as two different things.  When a SCSI operation
times out, the first step is to determine transport connection
viability with some form of keep-alive handshake.  If the transport
connection is viable, `recovery' (in the sense of state recovery on
the target) is attempted by following the abort protocol on the timed
out SCSI operation.  If the connection is dead, the connection close
protocol is followed, and ALL outstanding SCSI operations on the
connection are declared dead.
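
The decision described above can be sketched roughly as follows.  This
is not SST code: the function name and the three callables
(connection_alive, abort_task, close_connection) are hypothetical
stand-ins for the keep-alive handshake, the abort protocol and the
connection close protocol.

```python
def recover_timed_out_task(task, outstanding, connection_alive,
                           abort_task, close_connection):
    """Handle a timed-out SCSI task; return the tasks declared dead.

    outstanding      -- set of tasks in flight on this connection
    connection_alive -- callable, True if the keep-alive handshake succeeds
    abort_task       -- callable running the abort protocol on one task
    close_connection -- callable running the connection close protocol
    """
    if connection_alive():
        # Transport is viable: recover only the task that timed out.
        abort_task(task)
        outstanding.discard(task)
        return [task]
    # Transport is dead: close it and fail everything outstanding on it.
    close_connection()
    dead = sorted(outstanding)
    outstanding.clear()
    return dead
```

The point of the split is that a task timeout alone never closes the
connection; only a failed viability check does.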

As Julian points out, iSCSI has many fewer cases where per command
recovery is required.  The reliable transport ensures that the data
will get there or the connection will be broken.  For
implementation/consistency errors, I'd claim that summarily closing
the connection is an acceptable behavior.

Then the question remains of whether an individual timed out command
should be recovered using an abort protocol.  Traditionally, this
recovery mechanism has been used to deal with the infrequent transport
layer failures on relatively reliable media.  I have heard mention of
commands that just go off into the weeds on the mechanism, but you
could make the case that that also implies broken implementation
(either buggy or defective target), and the big hammer connection
close is OK there too.  I personally would NOT make that case.  I
still believe that it is appropriate to include an individual command
timeout recovery path in iSCSI.  It may not be used much, but the
architecture is already there at the SCSI layer, and it's well
understood how it should operate.  Not allowing it would create an
awkward mismatch between the SCSI layer semantics and what iSCSI
provides.

Carefully defining the set of recovery steps for these cases is quite
important.  FCP had difficulty early on because it didn't do this.

> For very few - in which a recovery action must be done - there will
> be a check condition.

This is clearly not the case for SCSI operation timeouts.  In fact,
this type of recovery traditionally sits outside the SCSI
architecture proper.  In other words, when you receive SCSI status on an
operation, from the transport's viewpoint, that's a successful
operation completion.  What recovery would need to be done in this
case?

> For most the errors that result from an error in the initiator
> or even a malicious initiator the action taken will be to discard
> the PDU and (after a number of them) to close the session. The same
> is true for an initiator with regard to a "bad" target.

Discarding the PDU seems inadequate, and why, with a reliable
transport, would you want to pull the plug only after a `number' of
them (unless the number = 1)?

Unreliable SCSI transports have had constants like this > 1 because
you need to ensure that you didn't get into the funky state because of
data loss in the transport.  I don't see how that applies to iSCSI.
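
As a rough illustration of the threshold question, here is a sketch of
a bad-PDU policy where the count is a parameter.  The class and its
names are hypothetical; the default of 1 reflects the argument above
that a reliable transport gives no reason to tolerate more.

```python
class BadPduPolicy:
    """Decide when to close a connection after malformed PDUs."""

    def __init__(self, max_bad_pdus=1):
        # Unreliable transports have used values > 1 to ride out data
        # loss; over a reliable transport 1 is assumed to be enough.
        self.max_bad_pdus = max_bad_pdus
        self.bad_seen = 0

    def on_bad_pdu(self):
        """Record one bad PDU; return True if the connection should close."""
        self.bad_seen += 1
        return self.bad_seen >= self.max_bad_pdus
```

With the default, the first bad PDU closes the connection; a value
like 3 would let two through first.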

Steph
