RE: An IPS Transport Protocol (was A Transport Protocol Without ACK)
> From: stewrtrs@stewart.chicago.il.us
>
> Any transport protocol proposal is ok. As long as it can be seen and
> reviewed. So far I have seen only two TCP and SCTP.
>
> Oh, a little side note, any transport protocol proposed MUST be able to
> show TCP like behavior in the face of congestion. And I think, IMHO, that
> this means that if it is NOT using RFC2581 procedures it MUST show that
> it does backoff and share with TCP. It also has a HEAVY burden of proof to
> show this facility at least in my mind and I would think in the
> IESG's mind
> as well...
I will try to describe a transport protocol for iSCSI herein. This proposal
addresses RFC2581 congestion management as well as queuing and resource
management for iSCSI initiator and target devices. I will call this the IPS
(IP Storage) Protocol, which is a hybrid between the FCP of fibre channel
and the TCP of IP. The way this email is written, it is not a formal
proposal by any stretch of the imagination. I am a career adapter designer
and I don't do RFCs or windows and floors. Therefore, in describing this
IPS Protocol, if I misuse any words that have specific meanings in RFCs, my
sincere apology to this working group. Herein I assume the iSCSI IETF
effort can be broken into two parts: one for mapping a SCSI request and
response to one or more iSCSI PDUs, and another for accommodating a
transport protocol such as TCP, SCTP, or this proposed protocol, IPS. This
proposal addresses the second effort of the IETF. If this assumption is
wrong, hit the delete key now so you won't waste any more time.
1. The Needs
A signal travels about 5 us per kilometer, or 8 us per mile, over fiber or
copper. With 3000 miles between New York and Los Angeles, the Round Trip
Time (RTT) is 3000 x 8 x 2, or 48 msec, not counting the queuing and delays
in the switches and routers. Compared to the latency of just a few
microseconds on locally attached devices, to make an iSCSI device a
meaningful alternative, it must have an appropriate transport protocol that
deals with the long latency. Furthermore, the congestion of the Internet,
which drops and duplicates datagrams, demands efficient and reliable error
detection and retransmission. Finally, given that TCP/IP is a well accepted
and proven transport protocol, iSCSI must support TCP/IP.
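As a back-of-the-envelope check of that figure, here is a tiny C program
using the same numbers as above; nothing in it is part of the proposal
itself:

  #include <stdio.h>

  int main(void)
  {
      const double us_per_mile = 8.0;     /* propagation delay per mile   */
      const double miles       = 3000.0;  /* New York to Los Angeles      */
      double rtt_us = miles * us_per_mile * 2.0;
      printf("RTT = %.0f usec (about %.0f msec)\n", rtt_us, rtt_us / 1000.0);
      return 0;
  }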
2. Executive Summary
For those who do not have time to read this long posting, this IPS proposal
describes the processing -- both creating and parsing -- of an iSCSI PDU
encapsulated within an Internet TCP/IP datagram. Hence the proposal
complements the current IETF effort that defines the iSCSI PDUs. An iSCSI
PDU starts with a media header such as Ethernet or Fibre Channel, followed
by an IP header, a TCP header, an iSCSI header, and, finally, the data
payload with CRC. An iSCSI service provider, either an iSCSI driver running
on top of a plain old-fashioned NIC adapter or a sophisticated
fibre-channel-like iSCSI adapter with a large amount of microcode and local
memory, will perform the protocol processing. This proposal describes the
processing -- the semantics -- that solves the iSCSI needs above. Since the
iSCSI PDU has a TCP/IP header, this proposal does not preclude the use of
the TCP/IP protocol for iSCSI. This IPS protocol addresses congestion
management like that in RFC2581, which describes the "good citizenship
behavior" of a protocol -- how to start and how to retransmit data segments
on a busy network. This protocol modifies RFC2581 to deal with the long
Internet latency of datagram delivery and ensures efficient and yet
reliable delivery. By stealing some ideas from fibre channel adapters,
which are now targeted at 50,000 IOs per second, this protocol also
describes the creation of an exchange table that deals with thousands of
concurrent iSCSI requests and responses without the problems of deadlock
and resource allocation.
3. Terms
A segment -- a term used in the RFC2581, same as an iSCSI PDU
ACK and ACK-0 -- acknowledgement PDUs; refer to the FC-PH spec for ACK-0.
An Exchange -- roughly like a session defined by the working group
except it is executed on a single TCP connection
An iSCSI Request/Response Message -- an APL to an iSCSI Provider describing
sending/receiving an iSCSI request/response.
BB-Credit -- refer to the FC-PH spec.
cwnd and rwnd -- Congestion and Receive Windows, terms used in the RFC2581.
They have the same value in this protocol
SOCKET, CONNECT, BIND System Calls -- same meaning as in the TCP/IP
implementation
Delay Constant -- the time delay between transfers of consecutive sequences
Data Descriptors -- in the form of a memory handle or a scatter/gather list
in an iSCSI request/response for sending/receiving segments
DMA -- Direct Memory Access to transfer iSCSI data payloads to/from iSCSI
application software using the data descriptors inside an iSCSI
request/response message
EE-Credit -- refer to the FC-PH spec.
Exchange ID -- OX_ID and RX_ID, please refer to the FC-PH spec.
iSCSI Provider -- an iSCSI driver together with an old-fashioned NIC adapter
or a modern superfast iSCSI adapter
iSCSI PDU -- as defined by this working group
Sequences -- an exchange has many sequences, each of which has many segments
Tag Queuing -- refer to the SCSI SAM spec.
TCP Connection -- A pair of IP-Address and TCP port that uniquely identifies
an application process that transmits/receives an iSCSI PDU.
Retransmission -- A part of error recovery to retransmit a lost sequence
4. Congestion Management
RFC2581 is not specific to TCP. It should be used by every transport
protocol sharing the network, although the authors of the RFC based their
experiments and conclusions on TCP. The RFC covers four specific topics:
slow start, congestion avoidance, fast retransmit, and fast recovery. If
other protocols on the network do not follow the same rules, then while a
TCP client/server using slow start waits patiently on a congested network,
the other protocols will continue to flood the network with new data
segments, defeating the congestion management.

RFC2581 is definitely not the best thing for a network with extremely long
latency. Let me use an example to describe the problem before describing
the solution. Assume the latency delay, or round-trip time, between two
iSCSI devices in N.Y. and L.A. is 50 msec. In addition, assume each data
segment is 2K. Using the slow start algorithm of RFC2581, a sender only
sends two segments at the beginning and waits for the ACKs before
increasing its cwnd. After waiting 50 msec, the sender increases its cwnd
to 3, sends 3 segments, and waits again. On a not-so-busy network, to send
one MB of data, or 500 segments, the sender, being a good citizen on the
network, will repeat the wait about 32 times to send all 500 2K segments.
The total time for delivering one MB of data is 50 msec times 32, or about
1.6 seconds. One may argue that given enough time, the cwnd can be
increased to 500 and the whole one MB of data can be transferred at once.
However, on any lost packet or out-of-order delivery -- which we assume
happens often and is the reason for having slow start in the first place --
the sender, seeing the duplicate ACKs, slows down immediately by reducing
cwnd quickly. Furthermore, the RFC also calls for a slow start after some
idle time, because the network congestion status is no longer known after
the idle period. In this super fast Internet era, when we are designing
adapters that process each fibre channel request in 20 microseconds and do
50,000 IOs per second, the 50 msec wait and 1.6 sec for moving one MB of
data using slow start simply sounds awful. The problem becomes much worse
when the MTU is not 2K but reduced to 512 bytes; in that case, there are
2000 segments for a one MB transfer. I don't need to challenge your
imagination when iSCSI is used to back up one TB of data.
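The arithmetic behind the 1.6 second figure can be checked with a short C
sketch. It assumes the simplified growth model of the example above -- cwnd
starts at two segments, grows by one segment per round trip, and every
burst costs a full 50 msec RTT; the constants are illustrative only:

  #include <stdio.h>

  int main(void)
  {
      const double rtt_ms = 50.0;         /* N.Y. to L.A. round-trip time */
      int segments        = 500;          /* one MB in 2K segments        */
      int cwnd = 2, round_trips = 0;

      while (segments > 0) {              /* one burst-and-wait per RTT   */
          segments -= cwnd;
          cwnd += 1;                      /* grow by one segment per RTT  */
          round_trips += 1;
      }
      /* about 31 round trips, roughly the 1.6 seconds cited above */
      printf("round trips: %d, total time: %.2f sec\n",
             round_trips, round_trips * rtt_ms / 1000.0);
      return 0;
  }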
Now the solution. The IPS protocol breaks the one MB of data down into 25
20K sequences. Each sequence has ten 2K segments, and each sequence is
acknowledged individually. We define a Delay Constant between the transfers
of two consecutive sequences. On a not-so-busy network, the delay should be
zero; hence, the sender sends all 25 sequences, or 500 segments, without
delay. Using a 1 Gb adapter, the whole 1 MB of data goes out in 10 msec.
25 msec later the segments arrive at the destination, and each sequence is
acknowledged individually. Another 25 msec later, all 25 ACKs are back at
the sender. The whole 1 MB is transferred in 60 msec, not 1.6 sec. Compared
to the 10 msec transfer on a local network, 60 msec is not so great, but it
is the best we can do because the 50 msec delay is contributed by the speed
of light. A thousand TCP connections will not get rid of the 50 msec delay.

If we decide not to keep this IPS Protocol simple and stupid, we can make
the ACK a little more specific by specifying which particular segment is
missing. Only the missing segments are then retransmitted. We can even
bundle the missing segments from different sequences by defining a
retransmission sequence that contains only retransmitted segments. As an
adapter designer, I prefer keeping it simple and stupid by retransmitting
the whole sequence; instead, we fine-tune the protocol by changing the size
of a sequence. When a retransmit is necessary, the sender acts as a good
citizen by increasing the delay constant between sequences. On a successful
transmit, the sender decreases the delay constant. Exactly how aggressively
we should back away from a congested network -- by a large jump of the
delay constant -- will be left for simulation. I do believe the result will
depend on the segment sizes and the latency values. Note that the
performance of this protocol does not depend on the MTU size because it is
designed to stream the segments.
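A minimal, self-contained C sketch of such a sender is below. The provider
primitives are stubs that only count calls, and every name and constant
(ips_send_segment, DELAY_STEP_US, the back-off factor of four) is an
assumption made for illustration, not something defined by this proposal:

  #include <stdio.h>

  #define SEGS_PER_SEQ  10        /* ten 2K segments per 20K sequence     */
  #define NUM_SEQS      25        /* 25 sequences = one MB                */
  #define DELAY_STEP_US 100       /* back-off / speed-up step (assumed)   */

  static unsigned segments_sent;

  static void ips_send_segment(int seq, int seg)
  {
      (void)seq; (void)seg;
      segments_sent++;            /* a real provider would hit the wire   */
  }

  static int ips_seq_acked(int seq)
  {
      return seq != 7;            /* pretend the ACK for sequence 7 is lost */
  }

  int main(void)
  {
      unsigned delay_const_us = 0;        /* zero on a not-so-busy network */

      /* stream every sequence; a real sender would pause delay_const_us
         between consecutive sequences                                     */
      for (int seq = 0; seq < NUM_SEQS; seq++)
          for (int seg = 0; seg < SEGS_PER_SEQ; seg++)
              ips_send_segment(seq, seg);

      /* one bulk ACK per sequence: back off on a miss, speed up on success */
      for (int seq = 0; seq < NUM_SEQS; seq++) {
          if (ips_seq_acked(seq)) {
              if (delay_const_us >= DELAY_STEP_US)
                  delay_const_us -= DELAY_STEP_US;  /* decrease the delay  */
          } else {
              delay_const_us += 4 * DELAY_STEP_US;  /* good citizen: back off */
              for (int seg = 0; seg < SEGS_PER_SEQ; seg++)
                  ips_send_segment(seq, seg);       /* resend whole sequence */
          }
      }

      printf("sent %u segments, final delay constant %u us\n",
             segments_sent, delay_const_us);
      return 0;
  }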
Notice that this IPS protocol takes an optimistic view of the Internet
traffic, i.e., it assumes the traffic is light. If that is not true, it
backs off quickly. I believe this is necessary for a network with a long
latency delay because we can't afford the slow start. A second thing about
this IPS Protocol is that one ACK is generated for each sequence instead of
each segment. Using a bulk ACK on a busy network with long latency reduces
the ACK traffic. The third thing about the IPS is that it assumes the
receiver is intelligent enough to generate the bulk ACK. Of course, if an
ACK is missing, the missing sequence is detected by timeout and must be
retransmitted. We should also use the ACK-0 of fibre channel to signal the
sender that everything is OK even when some ACKs are not received by the
sender. ACK-0 will greatly reduce the retransmissions caused by a missing
ACK.
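For the receiver side, a similarly small sketch (again with made-up names
such as on_segment and ips_send_ack0) shows the bulk ACK per completed
sequence and the final ACK-0 for the whole exchange:

  #include <stdio.h>

  #define SEGS_PER_SEQ 10
  #define NUM_SEQS     25

  static int received[NUM_SEQS];            /* segments seen per sequence */

  static void ips_send_ack(int seq) { printf("ACK   for sequence %d\n", seq); }
  static void ips_send_ack0(void)   { printf("ACK-0: whole exchange OK\n"); }

  /* called for every incoming data segment of an exchange */
  static void on_segment(int seq, int seg)
  {
      (void)seg;
      if (++received[seq] == SEGS_PER_SEQ)  /* whole sequence has arrived */
          ips_send_ack(seq);                /* one bulk ACK, not ten      */
  }

  int main(void)
  {
      for (int seq = 0; seq < NUM_SEQS; seq++)
          for (int seg = 0; seg < SEGS_PER_SEQ; seg++)
              on_segment(seq, seg);

      ips_send_ack0();   /* tells the sender everything arrived, even if
                            some individual ACKs were lost on the way back */
      return 0;
  }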
5. Queuing Management
An IPS request/response message is transaction-oriented, i.e., the whole
"iSCSI session" is described in a single request/response message to the
IPS provider. Within a request, a SCSI command, one or more endpoints
(i.e., IP address and TCP port pairs), data descriptors in the form of a
memory handle or a scatter/gather list, and other needed variables are
provided. The IPS request/response message is sent to an iSCSI provider
that is responsible for creating outgoing PDUs and receiving incoming PDUs.
To the provider, each message is an exchange between two endpoints. The
initiator gives it an OX_ID and the target gives it an RX_ID. Each exchange
is executed atomically, i.e., the IPS provider is responsible for
sequencing the SCSI command, data, and status. There are no command queuing
or head-of-queue deadlock problems, because the IPS provider creates a
giant exchange table. Whenever a data PDU is received, the IPS uses the
OX_ID or RX_ID to find the exchange and refers to the exchange table to
determine what to do. Data PDUs are served on demand; hence, there is no
head-of-queue blocking problem. Outgoing data PDUs are broken down into
sequences. After the transfer of each sequence, the IPS provider can switch
to another exchange to avoid a long delay behind a large exchange. For
those who are familiar with a fibre channel adapter, executing an IPS
request is like executing an FCP request, except for the congestion
management described earlier. If more than one endpoint is in the iSCSI
request/response message, the IPS provider can take the liberty of
selecting another endpoint to transmit or retransmit. However, when a
different endpoint is used, the whole message, or session, is repeated. A
Task Management PDU like ABORT may be needed to avoid confusion on the
receiver side.
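The shape of such an exchange table might look like the following C sketch.
The structure, the field names, the table size, and the lookup by OX_ID are
all my own illustrative assumptions, not part of this proposal:

  #include <stdint.h>
  #include <stdio.h>

  #define MAX_EXCHANGES 4096        /* "thousands of concurrent requests" */

  struct sg_entry { void *addr; uint32_t len; }; /* scatter/gather element */

  struct ips_exchange {
      uint16_t ox_id;               /* assigned by the initiator          */
      uint16_t rx_id;               /* assigned by the target             */
      struct sg_entry *sgl;         /* data descriptors from the message  */
      uint32_t sgl_len;
      uint32_t bytes_moved;         /* progress of this exchange          */
      int      in_use;
  };

  static struct ips_exchange table[MAX_EXCHANGES];

  /* find the exchange an incoming data PDU belongs to; NULL = unsolicited */
  static struct ips_exchange *ips_lookup_exchange(uint16_t ox_id)
  {
      struct ips_exchange *x = &table[ox_id % MAX_EXCHANGES];
      return (x->in_use && x->ox_id == ox_id) ? x : (struct ips_exchange *)0;
  }

  int main(void)
  {
      /* the application's request message sets up the entry first */
      table[5].ox_id = 5;
      table[5].in_use = 1;

      printf("OX_ID 5 %s, OX_ID 9 %s\n",
             ips_lookup_exchange(5) ? "found" : "unsolicited",
             ips_lookup_exchange(9) ? "found" : "unsolicited");
      return 0;
  }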
I do appreciate that some people will implement the iSCSI provider on the
old-fashioned stream-oriented TCP protocol instead of this IPS protocol. I
don't have any problem with the working group trying to solve their
problems. Personally, I will never implement an iSCSI provider using the
TCP stream-oriented protocol. I will implement the aforementioned
congestion management in a fibre channel adapter today as an IPS provider.
As long as an IPS provider deals with the PDUs correctly, it should always
interoperate with another node that uses the TCP stream-oriented protocol.
Of course, how the two endpoints generate the ACKs must be uniform. When
dealing with an IPS provider that uses TCP, the concept of a transfer
sequence disappears; each sequence is a single segment which is ACKed
individually. By the way, I will never consider multiple TCP paths to
reduce the latency time, because an IPS provider, like a fibre channel
adapter, is targeted to deliver 50,000 IOs per second, going to 100,000 IOs
in the near future. The context switch time between multiple TCP paths will
make the 100,000 IOs impossible. Keeping the segments streaming on the same
connection path is the only good solution for long latency delay.
6. Resource Management
There are three layers of resource management. First, the BB credit takes
care of two nodes connected point-to-point or on the same arbitrated loop.
Using BB credit, one node can never overrun the incoming buffer of another
node. This does not apply to an iSCSI device connected to Ethernet because
of the collision avoidance protocol, i.e., one has no control over the
sender of the incoming segments. Second, the EE credit is equivalent to the
rwnd variable of RFC2581. It manages how many segments a receiver is
willing to receive. The EE credit concept is impractical on a network with
long latency. Using the example of the one MB transfer earlier, if the EE
credit is small, the sender must wait after its EE credit is exhausted, and
only ACKs can replenish the EE credit. The wait is 50 msec each time. In
fact, it is imperative for an IPS provider to use DMA to empty incoming
segments from its buffer in lieu of EE-credit management; using EE-credit
to slow down the sender on a network with long latency makes the
performance impractical. Finally, the third layer: the number of SCSI
commands that can be sent to a target device is governed by the SCSI tag
queuing concept. The initiator is always aware of the number of SCSI
commands that can be sent to a target. It simply does not make sense to
send ten commands to a target that can only accept five. After command #6
is rejected with a queue-busy status, there is no guarantee that command #7
will also be rejected, because command #1 could be completed before #7
arrives. If #7 is not rejected, then #6 and #7 will be executed out of
order, which is not acceptable.

With the exception of SCSI tag queuing, an IPS provider cannot use either
BB or EE credits. It must use DMA to empty the incoming segments quickly.
For those who implement the IPS provider on TCP, EE credit can be used;
then one must pay the price of a network with long latency delay. Last, but
not least, in the IPS protocol an IPS provider never needs to allocate
cache memory to receive PDUs, because it uses the memory supplied by the
application software with the data descriptor in the request/response
message. Each message sets up one exchange table entry which saves the data
descriptor. When a PDU is received without an exchange table entry, the
segment is unsolicited and thrown away. In other words, the IPS provider is
not responsible for an incoming segment when there is no application
program waiting for it. This is like TCP receiving an incoming segment that
has an invalid port number. Like setting up a TCP port, an application
program must always instruct the IPS provider to create an exchange table
entry to receive incoming iSCSI segments.
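A small, self-contained C sketch of this receive path is below; the memcpy
stands in for the adapter's DMA engine, and the names and structures are
illustrative assumptions only:

  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>

  #define MAX_XCHG 16

  struct exchange {
      int      valid;
      uint8_t *buf;          /* memory supplied by the application        */
      uint32_t buf_len;
      uint32_t received;     /* bytes landed so far                       */
  };

  static struct exchange xtab[MAX_XCHG];

  /* deliver one incoming data segment for exchange ox_id */
  static void on_data_pdu(uint16_t ox_id, const uint8_t *payload, uint32_t len)
  {
      struct exchange *x = (ox_id < MAX_XCHG) ? &xtab[ox_id] : 0;
      if (!x || !x->valid || x->received + len > x->buf_len)
          return;                            /* unsolicited: thrown away  */
      memcpy(x->buf + x->received, payload, len);   /* stands in for DMA  */
      x->received += len;
  }

  int main(void)
  {
      static uint8_t app_buffer[64];         /* application's own memory  */
      xtab[3].valid   = 1;
      xtab[3].buf     = app_buffer;
      xtab[3].buf_len = sizeof app_buffer;

      uint8_t seg[16] = {0};
      on_data_pdu(3, seg, sizeof seg);       /* lands in app_buffer       */
      on_data_pdu(9, seg, sizeof seg);       /* no exchange entry, dropped */

      printf("exchange 3 received %u bytes\n", (unsigned)xtab[3].received);
      return 0;
  }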
It is OK to send data to a target right after a SCSI command, without
waiting for the Ready-To-Transfer (R2T) from the target. This is known as a
streaming transfer. When a target uses an IPS message to receive a SCSI
command, it can also have the option to provide data descriptors to receive
the streamed data without the need of returning an R2T first. The streaming
transfer is agreed upon when a connection is made.
7. Multiple NICs
We certainly do not exclude multiple IPS providers. I believe a wedge
driver that sits on top of the IPS providers may choose a different one for
load balancing, as long as they can reach the same destination. Note that
since each IPS request/response message is executed atomically by one IPS
provider, there is no synchronization between them. On the receiving end,
the application software can set up multiple IPS providers to receive
incoming requests. I don't know enough about this area to make meaningful
comments.
8. Multiple Paths to Same Destination
The IPS protocol uses the SOCKET, CONNECT, and BIND system calls to make a
TCP connection. It is assumed that when there are multiple IP addresses to
reach the same destination, the SOCKET data structure will provide such
information, which in turn will be given to the IPS provider for
retransmission consideration.
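For completeness, here is a minimal BSD sockets sketch of the SOCKET and
CONNECT calls referred to above. The address is a placeholder, 3260 is the
well-known iSCSI port, and handing the descriptor to an "IPS provider" is
of course only illustrative:

  #include <stdio.h>
  #include <unistd.h>
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  int main(void)
  {
      int fd = socket(AF_INET, SOCK_STREAM, 0);       /* the SOCKET call  */
      if (fd < 0) { perror("socket"); return 1; }

      struct sockaddr_in target = {0};
      target.sin_family = AF_INET;
      target.sin_port   = htons(3260);                /* iSCSI port        */
      inet_pton(AF_INET, "192.0.2.10", &target.sin_addr); /* placeholder IP */

      if (connect(fd, (struct sockaddr *)&target, sizeof target) < 0) {
          perror("connect");                          /* the CONNECT call  */
          close(fd);
          return 1;
      }

      printf("TCP connection made; hand fd %d to the IPS provider\n", fd);
      close(fd);
      return 0;
  }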
Y.P. Cheng, CTO, ConnectCom Solutions Corp.