Ordering Issues for VI over TCP

This proposal pertains to the layering of VI on top of TCP and is being sent to the VIDF and IPS mailing lists. For the proposed VI/TCP spec see:

http://www.ietf.org/internet-drafts/draft-dicecco-vitcp-00.txt

TCP provides a reliable, in-order transport service. So when VI is layered on top of TCP, the VI layer should see all data from the remote end arrive in order. However, it has been proposed that some optimized implementations may want to merge the TCP and VI layers in such a way that incoming packets, which sometimes arrive out of order, can be fully processed as they arrive, with the data written to its ultimate destination without being buffered until the intervening packets needed for full in-order processing arrive.

To fully exploit this out-of-order processing, it would be necessary to modify the VI API definition to optionally allow relaxed ordering. To this end I would offer the following proposal. This proposal is NOT offered with respect to the 1.1 revision currently under discussion, and would only be considered in the 2.0 time frame. I am sending this now in hopes of getting some very preliminary feedback as to whether it makes sense to proceed in this direction.

Two additional bits are defined in the control field of all transmit descriptors. These are "un-ordered" and "half-ordered". These bits are hints: an implementation is free to ignore them and would still be fully compliant and interoperable. The meaning of these bits is as follows.

If neither bit is set, then the corresponding operation (send, RDMA read, or RDMA write) will be fully ordered, as is currently required by VI. This is true even with respect to other operations which may be un-ordered.

If the half-ordered bit is set, then the operation will not be completed on the remote host until after all preceding operations are complete.
However, if a half-ordered operation is followed by an un-ordered operation, an implementation is free to reorder these two. Half-ordered operations are useful for sending completion messages which guarantee that previous operations have completed.

If two un-ordered operations are done (with no fully ordered operations between them), then an implementation is free to reorder these.

The following chart indicates whether ordering is required between two operations.

                          Second op
First op        ordered    half-ordered    un-ordered
------------------------------------------------------------
ordered         yes        yes             yes
half-ordered    yes        yes             no
un-ordered      yes        yes             no

The only exception to the above is already provided for in the VI spec, in that an RDMA read (without the fence bit set) is not guaranteed to be ordered with respect to a subsequent send or RDMA write.

At most one of the fence bit, un-ordered bit, and half-ordered bit should ever be set.

Note that if an application does two un-ordered sends, R followed by S, and the remote end posts two receive descriptors, X and Y, then message R may end up in the buffer designated by Y, and S in that designated by X. There may be cases where the application wants to put sequence numbers in the application-level messages, and put them back in order after receiving them out of order. The advantage of doing this at the application layer rather than the TCP layer is that zero-copy receives can be done directly to the application and only pointer reordering is needed.

I would further propose that there be no ordering requirements on the order in which RDMA read responses are written to memory on the requesting host. I am not sure that this is currently spelled out one way or the other in the VI spec. The only restriction here would be that the result of posting multiple RDMA reads pointing to overlapping local receive buffers would be unpredictable. But this is not something that makes any sense to do anyway. (Does anyone disagree with this?)
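
To make the ordering chart concrete, here is a minimal sketch of it as a predicate. The names `Ordering` and `ordering_required` are made up for illustration and are not part of the VI API or the VI/TCP draft. The sketch covers only the chart itself; it does not model the existing RDMA read exception or the fence bit.

```python
from enum import Enum


class Ordering(Enum):
    """Hypothetical names for the three ordering modes of a descriptor."""
    ORDERED = "ordered"            # neither hint bit set
    HALF_ORDERED = "half-ordered"  # half-ordered bit set
    UN_ORDERED = "un-ordered"      # un-ordered bit set


def ordering_required(first: Ordering, second: Ordering) -> bool:
    """Return True if `second` must not complete before `first`.

    Per the chart, the only pairs that may be reordered are those where
    the second operation is un-ordered and the first operation is not
    fully ordered.
    """
    may_reorder = (second is Ordering.UN_ORDERED
                   and first is not Ordering.ORDERED)
    return not may_reorder


if __name__ == "__main__":
    # Print the chart row by row (rows: first op, columns: second op).
    for first in Ordering:
        row = ["yes" if ordering_required(first, second) else "no"
               for second in Ordering]
        print(f"{first.value:<14}" + "  ".join(row))
```

Expressed this way, the rule reduces to a single condition: an operation can only overtake its predecessor if it is itself un-ordered and the predecessor is half-ordered or un-ordered.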

The above proposed semantics are reflected in the VI/TCP protocol as follows. In addition to the currently defined message types of SEND, RDMA_WRITE, and RDMA_READ_REQUEST, three new types are defined: SEND_UNORDERED, RDMA_WRITE_UNORDERED, and RDMA_READ_REQUEST_UNORDERED. An implementation may (but is not required to) use an un-ordered message type when the following two conditions are met:

1. The un-ordered bit was set in the corresponding transmit descriptor.
2. There are no ordered messages sent for which a TCP ACK has not yet been received.

The significance of the half-ordered bit is that it allows subsequent un-ordered messages to be sent with the un-ordered message type without having to wait for the associated TCP ACK.

On the receiving end, with an RDMA_WRITE_UNORDERED message type, the contained data may immediately be written directly to the buffer even if previous VI segments are missing. With an RDMA_READ_REQUEST_UNORDERED, the RDMA read may likewise be done immediately. On receiving a SEND_UNORDERED, the send may be delivered, but the implementation must behave consistently in the case of segmented sends (i.e. if a pair of sends are reordered at the receiver, all segments of each send must be consistently reordered). Reordering of sends will most likely make sense in the presence of short sends which fit in a single packet.

Provided the above paragraph on RDMA read responses is acceptable, then any received VI segment with message type RDMA_READ_RESPONSE may have its data written directly to the receive buffer without waiting for previous packets to arrive.

This proposal assumes that VI segments and TCP segments are aligned, in that there is exactly one complete VI segment contained in each TCP segment. The mechanism for doing that is not discussed here.
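
As a rough illustration of the sender-side rule, the sketch below picks a wire message type from the descriptor bits. Everything beyond the six message type names is an assumption made for the example: the class `VitcpSender`, its methods, and the per-message ACK counter are hypothetical (real TCP acknowledges bytes, not messages). It also assumes one reading of the half-ordered rule, namely that a half-ordered message goes out with the ordered message type but is not counted among the outstanding ordered messages.

```python
# Mapping from the existing message types to the proposed un-ordered ones.
UNORDERED_TYPE = {
    "SEND": "SEND_UNORDERED",
    "RDMA_WRITE": "RDMA_WRITE_UNORDERED",
    "RDMA_READ_REQUEST": "RDMA_READ_REQUEST_UNORDERED",
}


class VitcpSender:
    """Hypothetical sender state; not taken from the VI/TCP draft."""

    def __init__(self):
        # Fully ordered messages sent whose TCP ACK has not yet arrived.
        self.unacked_ordered = 0

    def choose_type(self, base_type, un_ordered=False, half_ordered=False):
        """Return the wire message type for one transmit descriptor."""
        if un_ordered and half_ordered:
            raise ValueError("at most one ordering bit may be set")
        # Conditions 1 and 2: bit set, and no ordered message awaiting ACK.
        if un_ordered and self.unacked_ordered == 0:
            return UNORDERED_TYPE[base_type]
        # Only fully ordered messages hold back later un-ordered ones
        # (assumed reading of the half-ordered rule).
        if not un_ordered and not half_ordered:
            self.unacked_ordered += 1
        return base_type

    def on_tcp_ack(self, count=1):
        """Record that `count` fully ordered messages were acknowledged."""
        self.unacked_ordered = max(0, self.unacked_ordered - count)


if __name__ == "__main__":
    sender = VitcpSender()
    print(sender.choose_type("RDMA_WRITE", un_ordered=True))
    print(sender.choose_type("SEND"))                   # fully ordered
    print(sender.choose_type("SEND", un_ordered=True))  # must wait for ACK
```

Since the bits are only hints, a sender that always returns `base_type` would also be a valid implementation under the proposal.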