RE: IETF mailing list question on Storage over Ethernet/IP

FYI - Cisco demonstrated 10 GE at N+I a few weeks ago.

Michael Krause <krause@cup.hp.com> on 05/31/2000 06:25:19 AM
Sent by: Michael Krause <krause@cup.hp.com>
To: wayland@troikanetworks.com, ips@ece.cmu.edu
cc: (Dave Lee/HQ/3Com)
Subject: RE: IETF mailing list question on Storage over Ethernet/IP

At 06:21 PM 5/30/00 -0700, Wayland Jeong wrote:

>This problem is not true of any network. Since FC supports both
>link-level and end-to-end flow-control, the likelihood of packets being
>dropped under congestion situations is mitigated.

Packet drops are bad for many protocols. However, just as bad is having
fabric efficiency drop due to head-of-line blocking with long congestion
timeouts within routing elements. This limits the effective throughput of
the fabric and results in higher-level application timeouts or additional
resources within the endnodes to deal with application resource shortages.

>Packets can still expire within congested switches but only after long
>time-outs occur (R_A_TOV). Momentary bursts of congestion can be handled
>by applying backpressure all the way to the producers. Cisco's WRED
>(Weighted Random Early Detection), as I understand it, is a mechanism to
>start dropping packets before congestion becomes critical. The dropping
>of packets will trigger back-off by TCP.

WRED provides a number of advantages. It can be invoked when a threshold
is hit, so that one does not simply start dropping the moment congestion
occurs. When invoked, it can apply a filter mechanism based on packet
attributes, e.g. class of service. In general, WRED has been shown to
improve the overall efficiency of the fabric while reducing oscillations
for applications, thus delivering smoother operation overall.
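As a concrete illustration, here is a minimal sketch of the RED/WRED
idea in Python - not Cisco's implementation; the per-class profiles,
thresholds, and names are invented for this example. The queue tracks a
weighted moving average of its depth, and between a minimum and a
maximum threshold the drop probability ramps up linearly, with
lower-priority classes of service dropping earlier and harder:

    import random

    # Hypothetical per-class profiles: (min_thresh, max_thresh, max_drop_p).
    PROFILES = {
        "bulk":        (20, 40, 0.20),  # drops earlier and more aggressively
        "interactive": (30, 50, 0.05),
    }

    class WredQueue:
        def __init__(self, weight=0.02):
            self.avg = 0.0        # weighted moving average of queue depth
            self.weight = weight  # smoothing factor for the average
            self.queue = []

        def should_drop(self, cos):
            min_t, max_t, max_p = PROFILES[cos]
            # Move the average toward the instantaneous queue depth.
            self.avg += self.weight * (len(self.queue) - self.avg)
            if self.avg < min_t:
                return False      # no congestion: never drop
            if self.avg >= max_t:
                return True       # severe congestion: always drop
            # Between the thresholds, ramp drop probability linearly.
            p = max_p * (self.avg - min_t) / (max_t - min_t)
            return random.random() < p

        def enqueue(self, pkt, cos):
            if self.should_drop(cos):
                return False      # early drop tells TCP senders to back off
            self.queue.append(pkt)
            return True

Because the drops are probabilistic and begin before the queue is
actually full, individual TCP flows back off at different times, which
is what damps the oscillations mentioned above.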
>My point here is that in a streaming storage environment, packet loss is
>very bad. We want to prevent, at all costs, packet loss since losing a
>packet means triggering an I/O level retry of a very large chunk of data.

This is true for nearly any application I can think of which uses a
network.

>I agree, bandwidth is always a solution, but are we going to see a
>ubiquitous deployment of 10 Gb/s Ethernet any time soon?

I believe 10 GbE will be available before 10 Gb FC, and that will be at
about the same time as this spec becomes sufficiently solid to begin
building product. So, the answer is yes. Also, most workloads that
operate with FC today can operate quite well with GbE, and one can always
aggregate GbE links (802.3ad) to provide a fatter pipe while waiting for
10 GbE.

>Okay. I am not familiar enough with today's LAN products to comment
>on their ability to provide near guaranteed QoS (i.e. fractional
>bandwidth). So, QoS applied correctly or proper configuration of
>networks (i.e. matching bandwidth requirements) could alleviate most of
>the problems.

Yes.

>I am still curious though how windowing works in the SIP model. As I
>understand it, the proposal calls for a single command connection
>(a target implements a well-known TCP server port). After authentication
>of the client (host), a data channel is allocated for that connection.
>Windowing applies to that channel, which gives the target the ability
>to manage buffers with that host. Thus, if a data channel communicates
>with server A, the target will advertise its window size, which would
>correspond to its available buffer space for that host. But, typically,
>many hosts will login with one target. Thus, each host will have its
>own data channel and hence its own advertised window size. If, say, 10
>hosts connect to a given target and each is allocated a 64KB window
>size, then the total buffer space available at the target must be
>640KB. Now, a host will have no problem allocating this space, but the
>congestion point of interest is not host memory, but the target adapter
>itself (in fact, these may be one and the same on a low-cost drive).

1 MB of DRAM costs about $1. I can buy very low-cost adapters today with
1 MB of DRAM without much problem - a variety of GbE adapters ship with
at least this much, and they are much cheaper than FC adapters. There are
a number of adapters out there that support 8, 16, or 32 MB of memory for
a slight cost delta, and I've seen a few adapters that can support 256 MB
of memory.

>Now, RTT is one way to coordinate access to the local buffers in
>the target interface, which may be acceptable. But, the equivalent
>in FCP is XFER_READY, and the intention of this protocol is to
>both pace FCP_WRITEs and also give the producer an indication
>of what data is the best data to send. It really has no mapping
>to physical buffers, only cache state.

End-to-end flow control is implemented in both protocols, though in two
slightly different manners. It is possible to implement a similar credit
scheme on top of TCP with little difficulty. I have some ideas on how
this could be implemented within this spec but am still bouncing them
around within HP to see if they are in alignment with the overall
architecture.
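As a rough illustration of what a credit scheme layered above TCP could
look like - this is a toy sketch, not the spec's mechanism and not HP's
design, and every name in it is invented - the target grants credits
against its real buffer pool, much as XFER_READY does in FCP, and the
host sends a burst on its data channel only while holding a credit:

    from collections import deque

    class CreditTarget:
        """Target side: grants credits as buffers free up."""
        def __init__(self, buffers):
            self.free = buffers            # buffer slots actually available

        def grant(self, wanted):
            # Never hand out more credits than there are real buffers.
            n = min(wanted, self.free)
            self.free -= n
            return n

        def complete(self, n):
            self.free += n                 # buffers drained to disk/cache

    class CreditInitiator:
        """Host side: queues writes and sends only while holding credits."""
        def __init__(self, target):
            self.target = target
            self.credits = 0
            self.pending = deque()

        def write(self, burst):
            self.pending.append(burst)
            self.pump()

        def pump(self):
            while self.pending:
                if self.credits == 0:
                    self.credits = self.target.grant(len(self.pending))
                    if self.credits == 0:
                        break              # target has no buffers; wait
                burst = self.pending.popleft()
                self.credits -= 1
                # A real implementation would write the burst to the TCP
                # socket for this host's data channel here.
                print("sending", burst)

    tgt = CreditTarget(buffers=4)
    ini = CreditInitiator(tgt)
    for i in range(6):
        ini.write("burst-%d" % i)  # bursts 4 and 5 queue: no credits left
    tgt.complete(2)                # target frees two buffers...
    ini.pump()                     # ...and the queued bursts drain

The appeal of this layering is that TCP's window continues to handle
byte-level transport flow control, while the credits pace I/O bursts
against the target adapter's actual buffer pool.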
>It seems to me that protocol-level mechanisms for handling
>flow control, like windowing and RTT, are better suited for gross-level
>congestion management. In my opinion, providing near zero packet loss
>is not best handled at the protocol level.

I'd be interested in any data or modeling that suggests one mechanism
over the other if you have it available.

>Yes, I think I understand how RDMA works. I was only making
>a comment that the thrust of the IETF work is geared towards
>creating an architecture which can yield acceptable performance.
>I think the mapping is fairly straightforward. Making implementations
>which achieve good performance and are cost effective is the real
>challenge.
>
>Now, I saw a comment in the RDMA proposal which said that the
>MSS size should be no more than 8KB to avoid fragmentation. How
>does an MSS of 8KB avoid fragmentation on a 1.5KB MTU Ethernet
>network? I'm sure I'm just missing something here.

The RDMA proposal has some problems in its current form. Again, RDMA is
about packet placement, not fragmentation avoidance, and thus the
proposal needs to be fixed.
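The arithmetic behind the question is worth spelling out (assuming
standard 20-byte IPv4 and TCP headers with no options): an 8KB MSS
cannot avoid fragmentation on a 1500-byte MTU link, since each IP
fragment can carry at most 1480 bytes of payload.

    import math

    MTU = 1500                    # standard Ethernet MTU in bytes
    IP_HDR, TCP_HDR = 20, 20      # assumes no IP or TCP options
    mss = 8192                    # the 8KB MSS suggested in the proposal

    per_fragment = MTU - IP_HDR   # 1480 bytes of IP payload per fragment
    segment = mss + TCP_HDR       # 8212 bytes handed to the IP layer
    print(math.ceil(segment / per_fragment))  # -> 6 IP fragments

    # To truly avoid fragmentation, the MSS must fit a single MTU:
    print(MTU - IP_HDR - TCP_HDR)             # -> 1460 bytes

So an 8KB MSS merely limits how many fragments each segment produces;
it does not eliminate fragmentation on standard Ethernet, which is
consistent with the point that RDMA is about placement rather than
fragmentation avoidance.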
>I would be interested to see some information on this implementation.
>Is there some public whitepapers or such on this product? I would
>assume that it was on HP-UX.

It is on HP-UX. There is a whitepaper on the performance that will be
coming out quite soon - it should be out in June, if I recall.

>I'm not that familiar with GSN. Is that a HIPPI-based network?

GSN is another name for HIPPI-6400.

>Yeah, I guess the bottom-line is that I'm not arguing about limitations
>in the architecture. I'm more interested in actual implementations. How
>will one implement this architecture and what kind of performance might
>one expect?

I don't think the implementation is all that difficult. Most people
might leverage some of the SGL algorithms and driver work that is used
with, say, a Tachyon-TL implementation and merge this into a good GbE
implementation. HP has been looking into how this would be done and has
not seen anything insurmountable yet.

Mike