RDMA over TCP (Was Re: VI (Was: Avoiding deadlock in iSCSI))

Stephen Byan wrote:

>Michael Krause [mailto:krause@cup.hp.com] wrote:
>
>> [snip]
>> RDMA != VI though VI does use RDMA technologies.
>
>Agreed. It is an example
>
>> Prefer to see discussion focused on what RDMA operations are
>> required, what are the error and ordering requirements, etc.
> [snip]
>> [snip]
>> As such, a general
>> purpose RDMA solution which operates over TCP/IP is the optimal
>> solution to pursue since it will lead to the broadest industry
>> and customer adoption rate.
>
>I think we are in complete agreement.
>
>Regards,
>-Steve

In response to this I would offer the following proposal, with the
caution that it is very preliminary and has not been analyzed or
reviewed. I thought it might be worth posting in order to see what
the general response to this approach is.

##############################################################

RDMA / TCP

1. Abstract

This document describes a format for encapsulating RDMA (remote
direct memory access) information within a TCP data stream. No
changes or modifications to TCP of any sort are required. This is not
intended to be a protocol, but rather a common format that may be
shared by multiple client protocols, for instance VI/TCP and iSCSI.
By using a common format it is hoped that the design of NICs
supporting these multiple protocols can be simplified.

Sufficient information is included in the RDMA message format to
allow determination of the protocol message units, as well as the
ability to process an incoming RDMA request even if previous packets
are missing and awaiting retransmission. In addition, a CRC-32 is
included in each segment to enhance the checksum coverage included in
TCP.

2. Overview

Data transfers consist of a sequence of messages. Each message is of
one of four types: Send, RDMA_Write, RDMA_Read_Request, and
RDMA_Read_Response.
The maximum size of a message is approximately 2^32 bytes. Each
message is divided into one or more segments. It is RECOMMENDED that
each TCP segment contain exactly one RDMA segment. The receive end of
the connection cannot assume any alignment between the RDMA segments
and TCP segments; however, a receiver SHOULD optimize performance for
the case where each TCP segment contains exactly one RDMA segment.

3. RDMA Segment Format

The format of an RDMA segment depends on the message type. Shown
below are the formats for the four different types of messages. All
multibyte fields are represented in network byte order (i.e.,
big-endian).

3.1 Send and RDMA_Read_Response Message Type:

|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|    Version    |  res  |B|E|typ|        Segment Length         |
+---------------+---------------+---------------+---------------+
|                                                               |
+                         Connection ID                         +
|                                                               |
+---------------+---------------+---------------+---------------+
|                        Message Number                         |
+---------------+---------------+---------------+---------------+
|             order             |     res.      |     CLEN      |
+---------------+---------------+---------------+---------------+
|                          Data Offset                          |
+---------------+---------------+---------------+---------------+
|                                                               |
|                         Control Data                          |
|                                                               |
+---------------+---------------+---------------+---------------+
|                                                               |
|                                                               |
|                         Payload Data                          |
|                                                               |
|                                                               |
+                               +---------------+---------------+
|                               |            Padding            |
+---------------+---------------+---------------+---------------+
|                            CRC-32                             |
+---------------+---------------+---------------+---------------+

3.2 RDMA_Write Message Type:

|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|    Version    |  res  |B|E|typ|        Segment Length         |
+---------------+---------------+---------------+---------------+
|                                                               |
+                         Connection ID                         +
|                                                               |
+---------------+---------------+---------------+---------------+
|                        Message Number                         |
+---------------+---------------+---------------+---------------+
|             order             |     res.      |     CLEN      |
+---------------+---------------+---------------+---------------+
|                        RDMA Buffer ID                         |
+---------------+---------------+---------------+---------------+
|                      RDMA Buffer offset                       |
+---------------+---------------+---------------+---------------+
|                          RDMA Length                          |
+---------------+---------------+---------------+---------------+
|                                                               |
|                         Control Data                          |
|                                                               |
+---------------+---------------+---------------+---------------+
|                                                               |
|                         Payload Data                          |
|                                                               |
|                                                               |
+                               +---------------+---------------+
|                               |            Padding            |
+---------------+---------------+---------------+---------------+
|                            CRC-32                             |
+---------------+---------------+---------------+---------------+

3.3 RDMA_Read_Request Message Type:

|    Byte 0     |    Byte 1     |    Byte 2     |    Byte 3     |
|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|7 6 5 4 3 2 1 0|
+---------------+---------------+---------------+---------------+
|    Version    |  res  |B|E|typ|        Segment Length         |
+---------------+---------------+---------------+---------------+
|                                                               |
+                         Connection ID                         +
|                                                               |
+---------------+---------------+---------------+---------------+
|                        Message Number                         |
+---------------+---------------+---------------+---------------+
|             order             |     res.      |     CLEN      |
+---------------+---------------+---------------+---------------+
|                        RDMA Buffer ID                         |
+---------------+---------------+---------------+---------------+
|                      RDMA Buffer offset                       |
+---------------+---------------+---------------+---------------+
|                          RDMA Length                          |
+---------------+---------------+---------------+---------------+
|                                                               |
|                         Control Data                          |
|                                                               |
+---------------+---------------+---------------+---------------+
|                            CRC-32                             |
+---------------+---------------+---------------+---------------+

Note that RDMA_Read_Request messages always consist of exactly one
segment and contain no payload data.

3.4 Segment Field Definitions

Version:
   The version number refers to the version of the RDMA format, not
   to that of the client protocol. This document defines version 1,
   so this field should contain 0x1.

Res:
   These four bits are reserved and not used by the RDMA mechanism.
   They may be used by the client protocol.

B, E:
   These are the begin and end bits. They indicate that this segment
   is the beginning or end, respectively, of the message to which it
   belongs. Either, both, or neither of these bits may be set for a
   given segment.

Type:
   Four types of messages are supported. These are 0 = Send,
   1 = RDMA_Write, 2 = RDMA_Read_Request, and 3 = RDMA_Read_Response.

Segment Length:
   This field contains the length of the RDMA segment in bytes. This
   length includes the RDMA segment header and payload up to, but not
   including, the padding and CRC.

Connection ID:
   The Connection ID is a 64 bit value selected at random. This value
   is selected by the client side of the connection and included in
   the first message segment sent over the connection. The same value
   is then used for all subsequent segments sent in either direction.
   It is RECOMMENDED that a secure, unguessable random number
   generator be used to generate these values.
   The Connection ID serves two purposes. It allows framing to be
   recovered after a dropped segment, and it provides security
   against blind attacks.

Message Number:
   For Send, RDMA_Write, and RDMA_Read_Request messages, the message
   number is assigned sequentially for each message, wrapping from
   2^32-1 to 0. All three types of messages are part of a single
   sequence. For RDMA_Read_Response messages, the message number
   should be set equal to that of the originating RDMA_Read_Request.
   The initial message number for the first (non RDMA_Read_Response)
   message sent in each direction SHOULD be selected at random.

Order:
   This 16 bit field defines the ordering requirements on a message.
   A value of "N" in the order field indicates that the payload data
   may not be read from or written to its ultimate source or
   destination until all messages EXCEPT the N immediately preceding
   ones have been processed. The value 0xFFFF (all one bits) is
   reserved to indicate that the operation may be performed
   unconditionally as soon as the message is received.

   As an example of the above, if a sequence of RDMA_Write and
   RDMA_Read_Request messages is received with an order field
   containing zero, then the operations must be done in the order
   received; however, the individual segments of a given RDMA_Write
   message may be written to the target buffer in arbitrary order. If
   an RDMA_Write or RDMA_Read_Request message is received with an
   order field containing 5 and a message number of 97, then message
   number 91 and all previous messages must have been processed
   before the operation is performed (messages 92 through 96 may
   still be outstanding).

   For a Send message, out of order processing implies that the
   client protocol actually receives these messages out of order. If,
   for instance, the Send messages contain commands, the value in the
   order field of these messages should not allow reordering unless
   the client protocol is allowed to process the contained commands
   out of order.

CLEN, control data:
   The 8 bit CLEN field defines the length of the control data
   included with a message.
   The value of CLEN is the number of 32 bit words of control data
   included. The meaning of the control data is determined by the
   client protocol, but the significance is that it is not part of
   the RDMA transfer and should not be written to the RDMA target or
   read response buffer. For Send messages, the control data
   designation is only for the convenience of the client protocol,
   and the only difference between control and payload data is that
   control data is not counted towards the computation of the data
   offset. Typically the control data will contain header information
   for the client protocol in addition to that provided by the RDMA
   segment format.

Data Offset:
   This field is contained in Send and RDMA_Read_Response messages.
   The first segment of a message must have a data offset of zero. In
   each subsequent segment of the message, the offset will be equal
   to the number of payload bytes sent in all previous segments of
   the message.

RDMA Buffer ID, RDMA Buffer Offset:
   These values determine the target address of an RDMA read or
   write. The value of the RDMA Buffer ID must be constant across all
   segments of an RDMA_Write. The value of the RDMA Buffer Offset can
   be anything in the first segment of an RDMA_Write, but must be
   incremented in each subsequent segment by the number of bytes
   transferred.

   The exact interpretation of these values is determined by the
   client protocol; however, it is expected that the RDMA Buffer ID,
   possibly combined with some bits from the RDMA Buffer Offset, will
   be used as an index into a table of buffers. The actual data
   transfer will occur to or from this buffer starting at an offset
   determined by the RDMA Buffer Offset, or some bits extracted from
   the RDMA Buffer Offset.

   Unfortunately, because of the differing addressing models used by
   different client protocols, it is not possible to specify exactly
   how buffer ID and offset are resolved to a physical address in the
   NIC.
   It is hoped, however, that even with this protocol dependent
   feature, the commonality in the RDMA format should allow more
   efficient implementation of protocol accelerating NICs that
   support multiple protocols requiring RDMA.

RDMA Length:
   Indicates the number of bytes to be transferred in an RDMA
   operation. In the case of an RDMA_Write, this is the total number
   of bytes in the entire message, and the same value must be
   repeated in each segment of the message.

Padding:
   Between 0 and 3 bytes of padding are used to make the segment a
   multiple of 4 bytes in length. The padding MUST be set to zero by
   the sender and ignored by the receiver.

CRC-32:
   The CRC-32 is calculated across the entire segment (but does not
   cover other segments of the same message, or lower level protocol
   headers such as TCP). The algorithm used to calculate the CRC is
   exactly that used for the Ethernet CRC except that a different
   generator polynomial is used. The generator polynomial for the
   RDMA CRC is

      x^32 + x^31 + x^30 + x^28 + x^27 + x^25 + x^24 + x^22 + x^21 +
      x^20 + x^16 + x^10 + x^9 + x^6 + 1.

   This polynomial is the standard Ethernet polynomial with a
   left-right reversal. (Or mathematically, substitute y = x^-1 and
   multiply by y^32.) In hex format with the x^32 term removed, this
   is 0xDB710641. It is desirable to use a different polynomial than
   Ethernet so that when an RDMA segment is carried in an Ethernet
   packet, the combined protection of two different polynomials is
   achieved, rather than checking twice with the same polynomial.

   [ Add reference for Ethernet CRC and detailed computation
   algorithm. ]

4. Segments and Messages

The four types of messages are divided into two groups. The first
group consists of Send, RDMA_Write, and RDMA_Read_Request messages,
and the second group consists of RDMA_Read_Response messages. Within
each of these two groups, all messages must be sent in order.
Each message is divided into one or more segments, and all the
segments of a particular message are sent in order. All segments of
one message must be sent before the first segment of the next message
in the same group is sent. Between the two groups, however, segments
may be interleaved arbitrarily.

[ Show example of a series of segments following these rules. ]

5. Determination of Framing

The first segment on a TCP connection begins, of course, with the
first data byte sent. Given the start of a segment, the start of the
next segment can be determined by taking the length field in the
header of the segment, rounding up to the next multiple of four (to
account for padding), adding four (for the CRC), and moving forward
that many bytes in the TCP data stream. In this manner, the beginning
of each segment can be determined from the last.

When a packet is dropped, however, it is desirable to recover framing
on subsequent segments so that they can be processed by the NIC and
their payload data placed directly in its ultimate destination. The
recommended method for doing this is to assume that the RDMA segment
is aligned with a TCP segment, and verify the correctness of the
header fields of the RDMA segment. If these header fields are not
correct, then the NIC should fall back to buffering the packet until
it can be processed in order.

In particular, the 64 bit Connection ID field was selected at random,
so the only way it could match payload data is by pure chance. If the
bytes at that position in a misaligned packet are effectively random,
the chance of a false match is 2^-64 per packet, so even if a
misaligned packet arrives every 2 us, the MTBF of mistakenly
identifying one as an aligned packet is about 2^64 x 2 us, roughly
3.7 x 10^13 seconds, which is greater than one million years.
Checking the message number, CRC, and other fields only enhances the
confidence in this determination.
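
The two framing steps described in section 5 can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions, not part of the proposal: the function names and the
COMMON_HEADER constant are hypothetical, the bit positions in byte 1
follow my reading of the diagrams in section 3, and the CRC-32,
message number, and type-specific fields are deliberately not
checked.

```python
import struct

RDMA_VERSION = 1  # "This document defines version 1"

# Fields common to every segment type, in network byte order:
# Version (1 byte), res/B/E/typ (1 byte), Segment Length (2 bytes),
# Connection ID (8 bytes), Message Number (4 bytes).
# COMMON_HEADER is a name invented for this sketch.
COMMON_HEADER = struct.Struct("!BBHQI")


def next_segment_offset(stream: bytes, start: int) -> int:
    """Offset of the segment that follows the one beginning at `start`."""
    (segment_length,) = struct.unpack_from("!H", stream, start + 2)
    padded = (segment_length + 3) & ~3  # round up to a multiple of four
    return start + padded + 4           # skip the 4-byte CRC-32


def looks_like_segment_start(tcp_payload: bytes, connection_id: int) -> bool:
    """Plausibility check for a TCP payload assumed to begin an RDMA segment.

    Only the version, Connection ID, and a minimum length are tested here;
    a real receiver would also verify the CRC-32 and message number.
    """
    if len(tcp_payload) < COMMON_HEADER.size:
        return False
    version, _flags, segment_length, conn_id, _msg_no = (
        COMMON_HEADER.unpack_from(tcp_payload, 0)
    )
    return (
        version == RDMA_VERSION
        and conn_id == connection_id
        and segment_length >= COMMON_HEADER.size
    )


# Example: a 26-byte segment needs 2 bytes of padding plus a 4-byte CRC,
# so the next segment starts 32 bytes later. The trailing bytes here are
# zero placeholders, not a real CRC.
header = COMMON_HEADER.pack(RDMA_VERSION, 0x0C, 26, 0x1122334455667788, 7)
stream = header + bytes(16)
print(next_segment_offset(stream, 0))  # 32
```

If looks_like_segment_start returns False for a packet that arrives
after a gap, the receiver would fall back to buffering it until the
missing data has been retransmitted, as described above.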
Last updated: Tue Sep 04 01:07:06 2001