Re: is 1 Gbps a MUST?
- To: "'ips@ece.cmu.edu'" <ips@ece.cmu.edu>
- Subject: Re: is 1 Gbps a MUST?
- From: "CAVANNA,VICENTE V (A-Roseville,ex1)" <vince_cavanna@agilent.com>
- Date: Thu, 21 Feb 2002 20:14:06 -0700
- Cc: "'Martin, Nick'" <Nick.Martin@compaq.com>, "'Bernard Aboba'" <bernard_aboba@hotmail.com>, "SHEEHY,DAVE (A-Americas,unix1)" <dave_sheehy@agilent.com>, "THALER,PAT (A-Roseville,ex1)" <pat_thaler@agilent.com>, "CAVANNA,VICENTE V (A-Roseville,ex1)" <vince_cavanna@agilent.com>
- Sender: owner-ips@ece.cmu.edu
Thanks for the clarification. Something still bothers me, however.
If IPSec is a bottleneck (because the policy lookup is done in software),
then the receiver may be forced to drop packets quite frequently. Such
behavior could have a dramatic effect on performance, as explained in a
memo that Jonathan Stone posted on 2/5/02 (attached) and in my
interpretation of 2/6/02, which I did not post (attached). Comments? Thanks.
Vince Cavanna
Agilent Technologies
<<Re: iSCSI: No Framing >> <<RE: iSCSI: No Framing >>
In message <ED8EDD517E0AA84FA2C36C8D6D205C1301CBF2C5@alfred.xiotech.com>,
"Peglar, Robert" writes:
>The original thread began with a question (paraphrased) about '...what
>applications could consume a 10G pipe for long periods of time'. I answered
>that question - disk-disk backup and subsystem replication.
Even disk-to-disk applications or backup applications really want
approximately BW*RTT worth of buffering. Hugh Holbrook's recent
Stanford PhD thesis traces the conventional wisdom back to an email
from Van Jacobson to the e2e list in 1990.
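To make the arithmetic concrete, here is a back-of-the-envelope sketch
in Python (the link speeds and RTTs are illustrative assumptions, not
figures from the memo or from any particular deployment):

    def bw_rtt_bytes(bandwidth_bps, rtt_s):
        # bytes in flight on the path: bandwidth (bits/s) times RTT (s)
        return bandwidth_bps * rtt_s / 8

    for gbps, rtt_ms in [(1, 0.2), (1, 5.0), (10, 5.0)]:
        buf = bw_rtt_bytes(gbps * 1e9, rtt_ms * 1e-3)
        print("%2d Gbps x %4.1f ms RTT -> %7.0f KB" % (gbps, rtt_ms, buf / 1e3))

At LAN-like RTTs the product fits comfortably on-chip; at MAN RTTs and
10 Gbps it plainly does not.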
It's reasonably well-known in the TCP community that TCP slow-start
generates spiky traffic. It leads to bursts of high buffer occupancy
(e.g., at the point where the exponential ramp-up switches to
congestion avoidance). Indeed, that was the motivation behind
TCP-Vegas, and the recent work on TCP pacing.
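A toy trace makes the spikiness visible (the MSS and ssthresh values
are arbitrary assumptions for the illustration):

    MSS = 1460
    ssthresh = 64 * MSS            # arbitrary threshold for the example
    cwnd = 2 * MSS
    for rtt in range(10):
        phase = "slow start" if cwnd < ssthresh else "congestion avoidance"
        print("RTT %2d: cwnd = %3d segments (%s)" % (rtt, cwnd // MSS, phase))
        if cwnd < ssthresh:
            cwnd *= 2              # exponential ramp-up: bursty arrivals
        else:
            cwnd += MSS            # additive increase

The peak buffer occupancy lands right around the switch from doubling
to additive increase.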
The whole debate over framing/marking only makes sense if one views
outboard NIC buffering of BW*RTT as very expensive (e.g., forcing a
design from on-chip RAM to external SRAM). Adding framing of iSCSI PDUs
allows the NIC to continue doing direct data placement into host
buffers, accommodating the BW*RTT of TCP buffering in "cheap" host RAM
rather than "expensive" NIC RAM.
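As a sketch of the mechanism (my shorthand, not the iSCSI wire format:
the 4-byte length prefix and the function names are assumptions), a NIC
that can find PDU boundaries can steer each payload straight into its
final host buffer:

    import struct

    def place_pdus(stream, deliver):
        # walk length-prefixed PDUs; deliver() stands in for direct
        # placement into the right host buffer
        off = 0
        while off + 4 <= len(stream):
            (length,) = struct.unpack_from(">I", stream, off)
            if off + 4 + length > len(stream):
                break              # partial PDU: wait for more bytes
            deliver(stream[off + 4 : off + 4 + length])
            off += 4 + length      # in sync only while no bytes are missing
        return off                 # resume point for the next arrival

The catch is that a single lost segment leaves the parser unable to find
the next header, which is exactly where framing/markers, or the fallback
choices below, come in.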
But you can't get away from providing the buffers. Not unless you are
also willing to artificially restrict throughput. If iSCSI doesn't
provide some form of framing, then what can a NIC on a MAN with medium
BW*RTT do, if it sees a drop? It has only a few choices:
1. Start buffering data outboard, hoping that TCP fast retransmit will
send the missing segment(s) before the outboard buffers are exhausted
(a toy sketch of this choice follows the list);
2. Give up on direct data placement, and start delivering packets to
host memory any old how -- at the cost of SW reassembly and alignment
problems, and a software CRC, once the missing segment is recovered;
3. Start dropping packets, and pay a huge performance cost.
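A toy model of choice 1 (the class name, interface, and capacity figure
are illustrative assumptions, not any particular NIC design):

    class OutboardBuffer:
        def __init__(self, capacity):
            self.capacity = capacity    # bytes of "expensive" NIC RAM
            self.held = {}              # seq -> length, segments past the hole
            self.next_expected = 0      # left edge of the hole

        def rx(self, seq, length):
            if seq == self.next_expected:      # e.g., the fast retransmit
                self.next_expected += length
                while self.next_expected in self.held:   # drain the backlog
                    self.next_expected += self.held.pop(self.next_expected)
                return "placed directly"
            if sum(self.held.values()) + length > self.capacity:
                return "outboard RAM exhausted"  # pushed into choice 2 or 3
            self.held[seq] = length              # park the segment on the NIC
            return "buffered outboard"

If the retransmit arrives before "outboard RAM exhausted", direct
placement resumes; if not, the NIC faces choices 2 and 3 after all.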
There are some important caveats around the BW*RTT: if we can
*guarantee* that the iSCSI NICs are never the bottleneck point, or
that TCP never tries to reach the true link BW*RTT (due to undersized
windows), then one can get away with less. (See Hugh Holbrook's thesis
for more concrete details).
But the lesson to take away is that even in relatively well-behaved
LANs, TCP *by design* is always oscillating around overloading the
available buffers, causing a drop, then backing off. See, for
example, Figure 2 of the paper by Janey Hoe which introduced "New
Reno"; or Fig. 2 and 3 of the paper by Floyd and Fall. New Reno avoids
the long-timeouts between each drop, but the drops themselves still
occur.
Moral: TCP can require significant buffering even on quite modest
networks. It __may__ be worth keeping framing, so that host NICs can
do more of that buffering in host memory rather than outboard; and so
they can continue performing DDP rather than software reassembly and
software CRC checking. Storage devices are another issue again.
References:
Van Jacobson, modified TCP congestion avoidance algorithm.
Email to end2end@isi.edu, April 1990.
L. Brakmo, S. O'Malley, and L. Peterson, TCP Vegas: new techniques for
congestion detection and avoidance, SIGCOMM 1994.
J. Kulik, R. Coulter, D. Rockwell, and C. Partridge, A simulation study
of paced TCP. BBN Technical Memorandum 1218, BBN, August 1999.
J. Hoe, Improving the start-up behavior of a congestion control scheme
for TCP, ACM SIGCOMM 1996.
S. Floyd and K. Fall, Simulation-based comparisons of Tahoe, Reno, and
SACK TCP, Computer Communication Review, vol. 26, no. 3, July 1996.
H. Holbrook, A Channel Model for Multicast. PhD dissertation,
Department of Computer Science, Stanford University, August 2001.
http://dsg.stanford.edu/~holbrook/thesis.ps{,.gz}. (See Chapter 5.)
(Holbrook cites Aggarwal, Savage, and Anderson, INFOCOM 2000, on the
downsides of TCP pacing; but I haven't read that. The PILC draft on
link designs touches the same issue, but the throughput equations cited
there factor out buffer size.)
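(For reference, the throughput equations alluded to are presumably of
the familiar macroscopic form, e.g. Mathis et al., which indeed has no
buffer-size term:

    \[
      \mathrm{throughput} \;\approx\; \frac{MSS}{RTT} \cdot \frac{C}{\sqrt{p}}
    \]

where p is the loss rate and C is a constant of order 1.)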
>FC is not sufficient. Storage-to-storage needs all the advantages as well
>as that which iSCSI has to offer the host-storage model.
But it will still need approximately BW*RTT of buffering, even for
low-delay LANs. Otherwise performance will fall off a cliff under
"congestion" -- e.g., each time some other iSCSI flow starts up,
begins competing for the same TCP endpoint buffers on the same iSCSI
device, and triggers a burst of TCP loss events for the
storage-to-storage flow.
Hello Jonathan,
Interesting and useful points!
I would appreciate your opinion on the following observation.
It seems to me that the cliff-like drop in performance that follows from
dropping packets is likely to result, as well, from any other bottleneck
in the path to the buffers, such as an IPSec engine that is not capable
of link-speed throughput, or a large, even if occasional, latency in any
internal shared medium that lies in that path. It is easy for the IPSec
engine to become a bottleneck (even without considering the crypto
algorithms), since it has to perform a complex security policy database
lookup on every received packet, secured or not, to confirm that the
packet was afforded the appropriate protection under the configured
security policy. The moral I took from your memo is that link-speed
throughput is necessary all the way to the buffers.
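A sketch of why that lookup is costly (the rule set, field names, and
port number are invented for illustration; a real SPD is more involved):
every received packet, protected or not, is matched against an ordered
policy list before it can be accepted:

    # ordered security policy database: first match wins
    SPD = [
        (lambda p: p["dst_port"] == 3260, "PROTECT"),  # iSCSI must use ESP
        (lambda p: p["proto"] == "icmp",  "BYPASS"),
        (lambda p: True,                  "DISCARD"),  # default deny
    ]

    def check_policy(pkt):
        # linear scan, O(rules), repeated for every packet at link rate;
        # done in software, this is the bottleneck I have in mind
        for match, action in SPD:
            if match(pkt):
                return action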
Thanks,
Vince Cavanna
Agilent Technologies