Re: iSCSI: more on StatRN

> While I know the values for certain operating systems, I would like to hear
> from people who can assert confidently that the TCP fail connection
> time < SCSI command failure time.

The SCSI task timeout depends wildly upon the task. Something like an INQUIRY might have 10 second timeout (gee, probably going to have to diddle with this value for iSCSI). Disk read and write operations traditionally have ~60 second timeout. Format operations can have multihour timeouts. Exabyte Joe also pointed out that some tape operations can take tens of minutes. Clearly, you can not make the claim that TCP connection failure time < SCSI command failure time.

The approach we took in SST was to view transport connection and SCSI operation viability as two different things. When a SCSI operation times out, the first step is to determine transport connection viability with some form of keep-alive handshake. If the transport connection is viable, `recovery' (in the sense of state recovery on the target) is attempted by following the abort protocol on the timed out SCSI operation. If the connection is dead, the connection close protocol is followed, and ALL outstanding SCSI operations on the connection are declared dead.

As Julian points out iSCSI has many fewer cases where per command recovery is required. The reliable transport ensures that the data will get there or the connection will be broken. For implementation/consistency errors, I'd claim that summarily closing the connection is an acceptable behavior.

Then the question remains of whether an individual timed out command should be recovered using an abort protocol. Traditionally, this recovery mechanism has been used to deal with the infrequent transport layer failures on relatively reliable media.
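
To make the two-level view above concrete, here is a minimal sketch in Python of the decision taken when a single SCSI operation times out. It is not taken from SST or from any iSCSI draft: the `Connection` class, the `keep_alive` probe, the `abort_task` callback and the outcome strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Connection:
    # Hypothetical keep-alive probe: returns True if the peer answered.
    keep_alive: Callable[[], bool]
    # Tags of the SCSI operations currently outstanding on this connection.
    outstanding: List[str] = field(default_factory=list)
    is_open: bool = True


def on_task_timeout(
    conn: Connection, task: str, abort_task: Callable[[str], None]
) -> List[Tuple[str, str]]:
    """Handle a timed-out task; return (task, outcome) pairs."""
    if conn.keep_alive():
        # Transport is viable: recover only the timed-out operation
        # by running the abort protocol on it.
        abort_task(task)
        conn.outstanding.remove(task)
        return [(task, "aborted")]
    # Transport is dead: close the connection and declare ALL
    # outstanding operations on it dead, not just the one that timed out.
    dead = [(t, "dead") for t in conn.outstanding]
    conn.outstanding.clear()
    conn.is_open = False
    return dead
```

The point of the split is that a command timeout alone never closes the connection; only a failed keep-alive does, and in that case every outstanding operation goes with it.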
I have heard mention of commands that just go off into the weeds on the mechanism, but you could make the case that that also implies a broken implementation (either a buggy or a defective target), and that the big hammer connection close is OK there too.

I personally would NOT make that case. I still believe that it is appropriate to include an individual command timeout recovery path in iSCSI. It may not be used much, but the architecture is already there at the SCSI layer, and it's well understood how it should operate. Not allowing it would create an awkward mismatch between the SCSI layer semantics and what iSCSI provides.

Carefully defining the set of recovery steps for these cases is quite important. FCP had difficulty early on because it didn't do this.

> For very few - in which a recovery action must be done - there will
> be a check condition.

This is clearly not the case for SCSI operation timeouts. In fact, this type of recovery is traditional outside the specific SCSI architecture. In other words, when you receive SCSI status on an operation, from the transport's viewpoint, that's a successful operation completion. What recovery would need to be done in this case?

> For most the errors that result from an error in the initiator
> or even a malicious initiator the action taken will be to discard
> the PDU and (after a number of them) to close the session. The same
> is true for an initiator with regard to a "bad" target.

Discarding the PDU seems inadequate, and why, with a reliable transport, would you want to pull the plug only after a `number' of them (unless the number = 1)? Unreliable SCSI transports have used constants > 1 here because you need to ensure that you didn't get into the funky state because of data loss in the transport. I don't see how that applies to iSCSI.

Steph