
    Re: iSCSI: more on StatRN



    
    
    Steph,
    
    Thanks for your thoughtful comments.
    
    The reason I suggested dropping the connection only after several format
    errors was tolerance of software "glitches".
    
    The Check Condition is meant for cases in which SCSI can act - and yes,
    from the transport POV the command has finished.
    
    Dropping PDUs will help us avoid DoS attacks that use badly formed PDUs.
    
    And I will suggest activating the TCP keep-alive option for early detection
    of link failures.
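
    A minimal sketch of what turning on TCP keep-alive could look like on a
    socket, assuming a Python sockets API. The helper name enable_keepalive
    and the timer values are illustrative assumptions, not values proposed
    for iSCSI. The per-connection timer options are platform-specific, so
    the sketch sets them only where the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=30, interval_s=10, probes=3):
    """Turn on TCP keep-alive and tune its timers where the platform allows.

    The function name and default values are illustrative assumptions.
    """
    # SO_KEEPALIVE is the portable switch that enables keep-alive probes.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The timer options below do not exist on every platform, so each one
    # is applied only if the socket module exposes it.
    tunables = (
        ("TCP_KEEPIDLE", idle_s),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_s),  # time between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tunables:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

    Without tuning, the default idle time before the first probe is
    typically on the order of hours, which is why the timers matter for
    early detection.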
    
    Julo
    
    Stephen Bailey <steph@cs.uchicago.edu> on 20/10/2000 19:53:22
    
    Please respond to Stephen Bailey <steph@cs.uchicago.edu>
    
    To:   ips@ece.cmu.edu
    cc:
    Subject:  Re: iSCSI: more on StatRN
    
    
    
    
    > While I know the values for certain operating systems, I would like to
    > hear from people who can assert confidently that the TCP fail
    > connection time < SCSI command failure time.
    
    The SCSI task timeout depends wildly upon the task.  Something like an
    INQUIRY might have a 10 second timeout (gee, probably going to have to
    diddle with this value for iSCSI).  Disk read and write operations
    traditionally have a ~60 second timeout.  Format operations can have
    multihour timeouts.  Exabyte Joe also pointed out that some tape
    operations can take tens of minutes.
    
    Clearly, you cannot make the claim that TCP connection failure time <
    SCSI command failure time.
    
    The approach we took in SST was to view transport connection and SCSI
    operation viability as two different things.  When a SCSI operation
    times out, the first step is to determine transport connection
    viability with some form of keep-alive handshake.  If the transport
    connection is viable, `recovery' (in the sense of state recovery on
    the target) is attempted by following the abort protocol on the timed
    out SCSI operation.  If the connection is dead, the connection close
    protocol is followed, and ALL outstanding SCSI operations on the
    connection are declared dead.
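
    The decision described above can be sketched as follows. This is only
    an illustration of the two branches, not SST code: the names Action and
    on_task_timeout, and the connection_alive callback standing in for the
    keep-alive handshake, are all hypothetical.

```python
from enum import Enum


class Action(Enum):
    ABORT_TASK = "follow the abort protocol for the timed-out task"
    CLOSE_CONNECTION = "follow the close protocol; all tasks are dead"


def on_task_timeout(task_id, outstanding, connection_alive):
    """Pick a recovery action when one SCSI operation times out.

    task_id          -- identifier of the timed-out operation
    outstanding      -- identifiers of all operations on the connection
    connection_alive -- hypothetical callable that performs the keep-alive
                        handshake and returns True if the peer answered
    Returns (action, set of operations affected by that action).
    """
    if connection_alive():
        # Transport is viable: recover only the operation that timed out.
        return Action.ABORT_TASK, {task_id}
    # Transport is dead: every outstanding operation goes with it.
    return Action.CLOSE_CONNECTION, set(outstanding)
```

    The point of the split is that a slow operation on a healthy
    connection costs one abort, while a dead connection is handled once
    for all of its operations.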
    
    As Julian points out, iSCSI has many fewer cases where per-command
    recovery is required.  The reliable transport ensures that the data
    will get there or the connection will be broken.  For
    implementation/consistency errors, I'd claim that summarily closing
    the connection is an acceptable behavior.
    
    Then the question remains of whether an individual timed out command
    should be recovered using an abort protocol.  Traditionally, this
    recovery mechanism has been used to deal with the infrequent transport
    layer failures on relatively reliable media.  I have heard mention of
    commands that just go off into the weeds on the mechanism, but you
    could make the case that that also implies broken implementation
    (either buggy or defective target), and the big hammer connection
    close is OK there too.  I personally would NOT make that case.  I
    still believe that it is appropriate to include an individual command
    timeout recovery path in iSCSI.  It may not be used much, but the
    architecture is already there at the SCSI layer, and it's well
    understood how it should operate.  Not allowing it would create an
    awkward mismatch between the SCSI layer semantics and what iSCSI
    provides.
    
    Carefully defining the set of recovery steps for these cases is quite
    important.  FCP had difficulty early on because it didn't do this.
    
    > For very few - in which a recovery action must be done - there will
    > be a check condition.
    
    This is clearly not the case for SCSI operation timeouts.  In fact,
    this type of recovery is traditional outside the specific SCSI
    architecture.  In other words, when you receive SCSI status on an
    operation, from the transport's viewpoint, that's a successful
    operation completion.  What recovery would need to be done in this
    case?
    
    > For most of the errors that result from an error in the initiator
    > or even a malicious initiator the action taken will be to discard
    > the PDU and (after a number of them) to close the session. The same
    > is true for an initiator with regard to a "bad" target.
    
    Discarding the PDU seems inadequate, and why, with a reliable
    transport, would you want to pull the plug only after a `number' of
    them (unless the number = 1)?
    
    Unreliable SCSI transports have had constants like this > 1 because
    you need to ensure that you didn't get into the funky state because of
    data loss in the transport.  I don't see how that applies to iSCSI.
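
    To make the two positions concrete, here is a small sketch of a
    format-error policy with a configurable threshold. The class name
    FormatErrorPolicy and its interface are hypothetical. A threshold of 1
    closes on the first malformed PDU; a larger value gives the tolerance
    to "glitches" that was suggested earlier in the thread.

```python
class FormatErrorPolicy:
    """Count malformed PDUs and decide when to close the connection.

    Hypothetical illustration; max_errors=1 means close on the first
    malformed PDU.
    """

    def __init__(self, max_errors=1):
        if max_errors < 1:
            raise ValueError("max_errors must be at least 1")
        self.max_errors = max_errors
        self.errors = 0

    def on_bad_pdu(self):
        """Record one malformed PDU; return True if it is time to close."""
        self.errors += 1
        return self.errors >= self.max_errors
```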
    
    Steph
    
    
    
    



Last updated: Tue Sep 04 01:06:36 2001