|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: iSCSI: DLB's Last Call commentsThanks for the careful reading. I will answer to your items as I get to them and list them and proposed resolution in the usual place. You did not leave me to much time tough. Julo
I've spent the better part of 3 days reading -14, well, most of it - I didn't check the state machines thoroughly, and my eyes glazed over on the error handling pseudo-code, so I'll simply have to trust that Mallikarjun and the other folks involved got that right. This message contains my Technical comments and the editorial comments that may involve significant work on the draft. My entire set of 140 editorial comments is headed to Julian under separate cover. These comments are submitted as an individual for further discussion - I am *not* wielding my WG co-chair authority to say that these have to be done or else ... I can be talked out of many of these by suitably coherent and convincing technical arguments to the contrary. Please DO NOT REPLY to this message - there's too much stuff in here. Please send a new message with a new subject line to discuss specific comments below. Thanks, --David -- Technical [T.1] 2.2.2.2 Response/Status Numbering and Acknowledging A large absolute difference between StatSN and ExpStatSN may indi- cate a failed connection. Initiators undertake recovery actions if the difference is greater than an implementation defined constant that SHOULD NOT exceed 2**31-1. That's a requirement, not a description. It needs a "SHOULD" or a "MUST" between "Initiators" and "undertake". [T.2] 2.2.6.1 iSCSI Name Requirements The initiator MUST present both its iSCSI Initiator Name and the iSCSI Target Name to which it wishes to connect in the first login request of a new session or connection. The only exception is if a discovery session (see Section 2.3 iSCSI Session Types) is to be established; the iSCSI Initiator Name is still required, but the iSCSI Target Name may be ignored. "may be ignored" or "omitted"? Section 4.3 appears to indicate that "omitted" is the correct word, making the correct wording "MAY be omitted". [T.3] 2.2.6.1 iSCSI Name Requirements iSCSI names must adhere to the following requirements: Use upper case "MUST" instead of "must". Likewise for name encoding requirements a) through d). [T.4] 2.2.6.3.1 Type "iqn." (iSCSI Qualified Name) - A date code, in yyyy-mm format. This date MUST be a date during which the naming authority owned the domain name used in this format, and SHOULD be the date on which the domain name was acquired by this naming authority. Oh dear, what happens if a domain name changes hands twice in the same month??? One possible way out is to require that the naming authority owned the domain name at 00:01 GMT on the first day of the month. [T.5] 2.8 Message Synchronization and Steering Since the only mechanism in this class currently specified for iSCSI is markers, this general architectural framework needs to be removed (possibly to other drafts, as some of this material is probably appropriate for an RDDP architecture draft). This section needs to be refocused on the marker mechanism that iSCSI describes, and include a discussion of its use when an iSCSI header CRC check fails. I'd like to see something like one paragraph on what markers are/do, a sentence or two on how they can be used to recover from an iSCSI header CRC check failure, and at most one paragraph on how markers can help steer iSCSI PDUs/payloads into preallocated/queued buffers. The current 4+ pages devoted to 2.8 and its subsections is excessive and an invitation to further problems down the line. [T.6] 2.3 iSCSI Session Types b) Discovery-session - a session opened only for target discov- ery; the target MAY accept only text requests with the SendTar- gets key and a logout request with reason "close the session". Change "MAY" to "MUST", and say that other requests MUST be rejected. [T.7] 2.4.1 iSCSI Architecture Model d) Network Portal - a component of a Network Entity that has a TCP/IP network address and that may be used by an iSCSI Node within that Network Entity for the connection(s) within one of its iSCSI sessions. In an initiator, it is identified by its IP address. In a target, it is identified by its IP address and its listening TCP port. Use of just an IP address to identify an entity such as this doesn't work throuogh NAPTs (Network Address Port Translators). This issue needs to be explained at some point, although I don't think it affects protocol operation. [T.8] 2.4.1 iSCSI Architecture Model f) Portals within a Portal Group are expected to have similar hardware characteristics, as SCSI port specific mode pages may affect all portals within a portal group. (See Section 2.4.3.2 SCSI Mode Pages). "similar hardware characteristics" needs to be explained/qualified, as Section 2.4.3.2 does not contain an adequate explanation of what's going on here and its hardware implications. [T.9] 2.4.2 SCSI Architecture Model The SCSI Port Name is mandatory in iSCSI. When used in SCSI parameter data, the SCSI port name MUST be encoded as: - The iSCSI Name in UTF-8 format, followed by - a comma separator (1 byte), followed by - the ASCII character 'i' (for SCSI Initiator Port) or the ASCII character 't' (for SCSI Target Port), followed by - a comma separator (1 byte), followed by - zero to 3 null pad bytes so that the complete format is a multiple of four bytes long, followed by - the 6byte value of the ISID (for SCSI initiator port) or the 2byte value of the portal group tag (for SCSI target port) in network byte order (BigEndian). That's a peculiar format with padding nulls in the middle and a number concatenated after the padding - for example, it can't be passed in iSCSI login without format conversion. How about converting the number to 4 or 12 bytes of hex (ASCII characters) and moving the padding to the end so the result is actually a string, and excluding the padding from the definition of the name? This will increase the maximum length of port names, but produce names that are easier to deal with. [T.10] 2.4.3.2 SCSI Mode Pages Changes via mode pages to the behavior of a portal group via one iSCSI Target Node should not affect the behavior of this portal group with respect to other iSCSI Target Nodes, even if the underlying implementation of a portal group serves multiple iSCSI Target Nodes in the same Network Entity. I believe "should not" should be "SHOULD NOT". [T.11] 4.1 Text Format decimal-constant: an unsigned decimal number - the digit 0 or a string of 1 or more digits starting with a non-zero digit. This encoding is not used for numerical values equal or greater than 2**64. Decimal-constants are used to encode numerical values or binary strings. When used to encode binary strings decimal constants have an implicit byte-length that is the minimum number of bytes needed to represent the base2 encoding of the decimal number. The last sentence is the "decimal binary strings" issue that has been discussed extensively on the list. One example of the problem is that the decimal number 15 represents the binary string 0xF, but not the binary strings 0x0F or 0x000F, and the latter two are not representable in the decimal format as currently defined. If the difference between these three matters (e.g., SRP's salt [s] parameter is one example where this difference does matter) use of decimal format for binary strings can cause things not to work. Forbidding the use of decimal representation for binary strings would be the simplest change. [T.12] 4.1 Text Format large-numerical-value: an unsigned integer larger than or equal to 2**64 encoded as a hex constant, or base64-constant. Unsigned integer arithmetic applies to large-numeric-values. I believe "larger than or equal to" should be changed to "whose value may be larger than or equal to", as the current text forbids large numerical values from representing values less than 2**64, which seems wrong. [T.13] 4.1 Text Format Any iSCSI target or initiator MUST support receiving at least 16384 bytes of key=value data in a negotiation sequence except when indi- cating support for very long authentication items by offering or selecting authentication methods such as public key certificates in which case they MUST support receiving at least 64 kilobytes of key=value data. This comment is classified as technical to catch the 16kB minimum as an open issue. The "such as" text needs to be cleaned up - the point should be that the 64 kB minimum applies when a authentication method using long authentication items is offered or accepted, and should be accompanied by a specific list of authentication methods in this draft that carry that implication. [T.14] 4.2 Text Mode Negotiation During login, and thereafter, some session or connection parameters are either declared or negotiated through an exchange of textual information. How does a declaration differ from a negotiation? It isn't described in this section, and doesn't seem to be mentioned in any of the subsequent sections about login, until one gets to the text key definitions. [T.15] 4.3.1 Login Phase Start The first Login Response PDU during the Login Phase from the iSCSI target SHOULD return the TargetPortalGroupTag key that contains the tag value of the iSCSI portal group servicing the Login Request PDU. If the iSCSI target implementation supports altering the portal group configuration (including adding, deleting, and swapping of portals in a portal group), it MUST return the TargetPortalGroupTag key carry- ing the tag value of the servicing portal group. Let's make returning this key a MUST in all cases - less text, simpler protocol, simpler code at the Initiator. [T.16] 4.3.4 Connection reinstatement Targets should support opening a second connection even when they do not support multiple connections in Full Feature Phase. Looks like that ought to be an upper case "SHOULD". I think this needs further discussion. Section 5.2 appears to use a lower case "must" for this: Whenever a connection state machine (e.g., CSM-C) enters the CLEANUP_WAIT state (S8), it must go through the state transitions additionally described in the connection cleanup state diagram either a) using a separate full-feature phase connection (let's call it CSM- E) in the LOGGED_IN state in the same session, or b) using a new transport connection (let's call it CSM-I) in the FREE state that is to be added to the same session. Also, the implications of needing to spend resources (e.g., new transport connection) to clean up resources need to be described somewhere - in the worst case, resources for a new transport connection may need to be reserved to avoid deadlock. It doesn't look like there's a problem with the transport (TCP) resources themselves, but this could be an issue for iSCSI connection state resources (what happens if all connections are in CLEANUP_WAIT and there's no iSCSI connection state resource available to open a new transport connection?). The answer is likely to involve killing off a session, but I don't see an obvious explanation of it. Then I get to Section 5.2.2, which describes a transition from CLEANUP_WAIT to FREE (M1) that can happen based on a timeout, and presumably recovers the state resources and hence might make this all work. Whether an additional connection is needed to deal with CLEANUP_WAIT and whether it's MUST/SHOULD/MAY needs to be sorted out, and the various descriptions of it made consistent. [T.17] 6.4 Format Errors The following two explicit violations of PDU layout rules are format errors: a) illegal contents of the PDU header (except the Opcode) - for ex., out-of-range values for certain fields b) inconsistent contents - for ex., value of one field conflicts with that of another. Format errors indicate a major implementation flaw in one of the par- ties. The two "for ex."s aren't good enough. Details on what is a format error need to be spelled out explicitly, given the drastic consequences of committing one. [T.18] 6.7 SCSI Timeouts On a ULP timeout for a command (that carried a CmdSN of n), the iSCSI initiator MUST abort the command by either using the Abort Task task management function request, or a "close the connection" Logout if it intends to continue the session. I think the first part is over specified - there are a number of task management commands that abort the task - if the target is really sick, something much more serious than Abort Task may be used that not only aborts this task, but others. It's not clear whether "if it intends to continue the session" applies only to the logout or to both the task management command and the logout, nor is it clear what an initiator should do if it doesn't intend to continue the session. [T.19] 8.1.1 Conservative Reuse of ISIDs To both minimize the disruption of legacy applications and to better facilitate the SCSI features that rely on persistent names for SCSI ports, iSCSI implementations should attempt to provide a stable pre- sentation of SCSI Initiator Ports That's a requirement - change "should" to "SHOULD". This is a technical comment to (re)open the issue of whether conservative reuse ought to be a "MUST" - T10 is defining persistent reservation functionality that will not behave as one might expect in the absence of conservative reuse - at the very least, this consequence of ignoring the SHOULD needs to be stated. [T.20] 8.2 Autosense and Auto Contingent Allegiance (ACA) Autosense refers to the automatic return of sense data to the initia- tor in case a command did not complete successfully. iSCSI initia- tors and targets MUST support autosense. "MUST support" --> "MUST support and use", as this is not just "MUST implement". [T.21] 8.3 iSCSI timeouts iSCSI recovery actions are often dependent on iSCSI time-outs being recognized and acted upon before SCSI time-outs. Determining the right time-outs to use for various iSCSI actions (command acknowl- edgements expected, status acknowledgements, etc.) is very much dependent on infrastructure (hardware, links, TCP/IP stack, iSCSI driver). As a guidance the implementer may use an average Nop-Out/ Nop-In turnaround delay multiplied by a "safety factor" (2-3) as a good estimate for the basic delay of the iSCSI stack for a given con- nection. "(2-3}" strikes me as low and providing little resilience to "stupid network tricks". I'd change that to something like "(e.g., 5x, with a minimum of several milliseconds)". This is complicated by the fact that both initiators and targets will likely want to insert delays (in the hope that something useful turns up that has to be sent) before sending a NOP to update the various acknowledgement counters (e.g., ExpCmdSN). [T.22] 9.1 iSCSI PDU Length and Padding iSCSI PDUs are padded to the closest integer number of four byte words. The padding bytes SHOULD be 0. "MUST be transmitted as zero and ignored by the receiver, except for calculation of the digest CRCs if present" would be better. Same comment applies to padding bytes for header segments. [T.23] 9.4.1 Flags (byte 1) Bits O and U and bits o and u are mutually exclusive. Need to put some teeth into that. Say that in each pair, it is a protocol error if both bits are set to 1, and refer to the section for handling that error. [T.24] 9.4.4 Residual Count and 9.4.5 Bidirectional Read Residual Count These should be treated as reserved when they're not valid (O and U are zero for Residual Count, o and u are zero for Bidi Read Residual Count) - MUST be set to zero by sender, MUST be ignored by receiver. [T.25] 9.5.2 LUN This field is required for functions that address a specific LU (ABORT TASK, CLEAR TASK SET, ABORT TASK SET, CLEAR ACA, LOGICAL UNIT RESET) and is reserved in all others. Should TASK REASSIGN be added to the list in parentheses? [T.26] 9.6 Task Management Function Response For the functions ABORT TASK, ABORT TASK SET, CLEAR ACA, CLEAR TASK SET, LOGICAL UNIT RESET, TARGET COLD RESET and TARGET WARM RESET, the target performs the requested Task Management function and sends a Task Management Response back to the initiator. TASK REASSIGN does not get a response. Was this intended? [T.27] 9.12.10 ExpStatSN This is ExpStatSN for the old connection. This field is valid only if the Login request restarts a connection (see Section 4.3.4 Connection reinstatement). This is for the Login Request PDU. For the initial Login Request PDU in a login phase, the description is correct, but thereafter, Login Request should be using ExpStatSN to acknowledge the Login Responses via their increasing StatSN values (see 9.13.4). This raises the related question of why StatSN/ ExpStatSN is being used during login while CmdSN/ExpCmdSN is not being used. [T.28] 9.14 Logout Request A successful completion of a logout request with the reason code of "close the connection" or "remove the connection for recovery" results in the discarding of all tasks waiting in the command reor- dering queue that are allegiant to the connection being logged out. "discarding" is not what hapapens in the "remove the connection for recovery case according to the following text from Section 6.5: b) Logout the connection for recovery and continue the tasks on a different connection instance as described in Section 6.1 Retry and Reassign in Recovery. [OR] A "discarded" task cannot be "continue"-d. I suspect the text should say that "close the connection" terminates the tasks, anad "remove the connection for recovery" suspends the tasks with the following CmdSN side effects ... [T.29] 9.14 Logout Request The entire logout discussion in this section is completely applica- ble also for an implicit Logout effected by way of a connection rein- statement or session reinstatement. The Logout reason codes for implicit Logout are specified as below: How are these codes used and why are they specified here? Logout Request is an explicit logout, not implicit. [T.30] 9.16 SNACK Request If the initiator MaxRecvDataSegmenTLength changed Data-In PDUs requested with RunLength 0 (meaning all PDUs after this number) may be different from the ones originally sent, in order to reflect changes in MaxRecvDataSegmentLength. Their DataSN starts with the requested number and is increased by 1 for each resent Data-In PDU. If DataSN numbers change and a SCSI-Reponse PDU was sent reflecting the DataSN before retransmission it MUST be resent to reflect the new numbers. This was discussed on the list, but there are still some problems here: (1) If the MaxRecvDataSegmentLength has changed, the only valid Data SNACK is BegRun=0, RunLength=0 (i.e., resend everything). Attempts to be more clever than this are an invitation to miscount Data-In PDUs and cause problems in the initiator. Targets MUST reject all other Data SNACK requests in this situation. (2) The new SCSI-Response PDU needs a new StatSN to avoid the initiator discarding it as a duplicate. Section 2.2.2.2 is silent on duplicate detection for StatSN, but discarding duplicates would be a reasonable thing for an initiator to do. (3) The initiator needs some way to know that a new response is coming, and specifically whether to expect one or two responses. If it only expects one and two show up, the initiator could reuse the Task Tag once all the data arrives causing a race in which the new response could incorrectly complete an unrelated command (unlikely, but potentially nasty). This suggests calling out the <BegRun=0, RunLength=0> Data SNACK as having special behavior: - It may resegment Data-In PDUs to deal with MaxRecvDataSegmentLength. All other Data SNACK requests MUST NOT resegment. - It *always* generates a new SCSI Response due to the possibility of resegmentation. That's not a great solution, because if one ever sets <BegRun=0, RunLength=0> in a Data SNACK, the resulting behavior change is dramatic and unexpected. This leads to the final proposal: - Specify a new SNACK type code (3) for Resegmenting Data SNACK. SNACK Data-In resegmentation is allowed only when this is used. If resegmentation would be necessary for a Data SNACK (type 1), that SNACK MUST be rejected. - Both BegRun and RunLength MUST be zero for a Resegmenting Data SNACK, and (unlike reserved fields) these MUST be checked by the receiver (target). - A new SCSI Response is always generated as a result of a Resegmenting Data-In SNACK, and it has its own StatSN number to deal with the fact that the number of Data-In PDUs may have changed, causing a change to the ExpDataSN value. This new response also needs to be marked to distinguish it from a response that may have been generated earlier (so the initiator knows to wait for the new response) - using a bit in the flags field for this seems wrong, so specifying a new Response code value (0x02 - see 9.4.3) seems like a reasonable way to accomplish this. - Data SNACK (type 1) now has consistent behavior - it MUST NOT resegment and MUST NOT generate a new SCSI response, ever. This approach also has the potentially useful property of making it easy to yank out the Resegmenting Data SNACK wart if we ever put restrictions on the interaction of MaxRecvDatasegmentLength and Data SNACKs (yes, that's a hint ... this has gotten messy enough that forbidding Data SNACKs when MaxRecvDataSegmentLength has changed needs to be considered as a possible alternative). This issue also affects some text in 9.16.3. [T.31] 9.16.1 Type An iSCSI target that does not support recovery within connection MAY reject the status SNACK with a Reject PDU. If the target supports recovery within connection, it MAY reject the SNACK after which it MUST issue an Asynchronous Message PDU with an iSCSI event that indi- cates "Request Logout". This should be conditioned on the operational ErrorRecoveryLevel of the session, not whether the target supports recovery within connection. [T.32] 10. iSCSI Security Keys and Authentication Methods Only the following keys can be used during the SecurityNegotiation stage of the Login Phase: [ ... snip ...] AuthMethod and all keys listed under AuthMethod along with all of their associated keys. In practice, this forbids vendor-unique authentication methhods, as they would have to define their own text keys (reusing keys for an existing AuthMethod is a *bad* idea). OTOH, Section 10.1 appears to allow vendor-unique authentication methods. The authentication methods that can be used (appear in the list-of- options) are either those listed in the following table or are ven- dor-unique methods: One of these two needs to change, and see also editorial comment [E.136] against the above text from Section 10. Forbidding vendor-unique authentication methods would enhance interoperability. [T.33] 10.5 Challenge Handshake Authentication Protocol (CHAP) For the Algorithm, as stated in [RFC1994], one value is required to be implemented: 5 (CHAP with MD5) To guarantee interoperability, initiators SHOULD always offer it as one of the proposed algorithms. Change that "SHOULD" to a "MUST". [T.34] 11.1 HeaderDigest and DataDigest Proprietary algorithms MAY also be negotiated for digests. Whenever a proprietary algorithm is negotiated, "None" or "CRC32C" should be listed as an option in order to guarantee interoperability. Change "negotiatiated" to "offered" and "should" to "MUST". [T.35] 11.4 TargetName and 11.5 InitiatorName The descriptions of these keys need to have restrictions on changing them added - changing a name after it's been used for initial authentication/ authorization can cause security problems. These restrictions may already exist in the prohibitions on re-negotiating keys, but need to be (re)stated here. [T.36] 11.6 TargetAlias and 11.7 InitiatorAlias Add text here or in Section 10 to say that these keys MUST NOT be used to make authentication, or authorization (including access control) decisions. [T.37] 11.12 ImmediateData If ImmediateData is set to No and InitialR2T is set to No, then the initiator MUST NOT send unsolicited immediate data, but MAY send one unsolicited burst of Data-OUT PDUs. If ImmediateData is set to Yes and InitialR2T is set to No, then the initiator MAY send unsolicited immediate data and/or one unsolicited burst of Data-OUT PDUs. Both of the above uses of "MAY" are problems - my recollection of this is that if InitialR2T=No and there is data to be sent that falls under the implicit Initial R2T, then it MUST be sent as unsolicited Data-Out PDUs. [T.38] 12. IANA Considerations The temporary (user) well-known port number for iSCSI connections assigned by IANA is 3260. Delete "temporary (user)" insert "TCP" before "port number" or add instructions for IANA to allocate a system port when this draft is approved to become an RFC - I think just sticking with 3260 is better. [T.39] 12. IANA Considerations If vendor additions of values to existing keys is to be allowed (e.g., AuthMethod, HeaderDigest, DataDigest) an IANA registry is needed for each set of values - see [T.32] and [T.34], or the reversed DNS convention needs to be extended from keys to values - the latter doesn't sound like a good idea. -- Editorial Global Editorial comment: Both the login (4) and error handling (6) sections feel like one of those old Adventure mazes of twisty little passages - all the details seem to be there, for the most part they're correct, but it's very hard to get the proverbial "big picture" of how everything fits together, in terms of how the various mechanisms work with each other and how they accomplish the overall functionality. Both of these could use overview sections describing how the functionality breaks down into the pieces described in the subsections and how they fit together. Swapping the order of Sections 5 and 6 would also be a good idea so that the discussion of Error Handling and Recovery comes before the state machine descriptions that contain a lot of the details of how errors are handled. For error handling, moving Section 6.13 to the front of Section 6 would help organize the Section better, and care should be taken to define or replace terms like "restart login" and "recovery R2T" that are currently used without definitions. Also the following two editorial comments recommend serious restructuring of a portion of the draft: [E.80] 5.1 Standard Connection State Diagrams I'm probably going to take grief for this comment because it's a lot of work, but I think this section would be significantly clearer if the state and transition descriptions were split into separate initiator and target portions, so the order of the subsections would be: - Initiator diagram - Initiator state descriptions - Initiator transition descriptions - Target diagram - Target state descriptions - Target transition descriptions I guess it's ok to leave the descriptions merged in Section 5.2. [E.81] 5.3 Session State Diagrams This should be restructured in a fashion similar to Section 5.1 (see [E.80]). If you made it this far, thanks for taking the time to read everything, --David --------------------------------------------------- David L. Black, Senior Technologist EMC Corporation, 42 South St., Hopkinton, MA 01748 +1 (508) 249-6449 FAX: +1 (508) 497-8018 black_david@emc.com Mobile: +1 (978) 394-7754 ---------------------------------------------------
Home Last updated: Sun Jul 07 13:18:48 2002 11157 messages in chronological order |