I think COBS/COWS has more potential than the markers proposal, or than ULP
framing without COBS. There is one case where it does add more overhead, but
the question is how prevalent that scenario is - when outbound zero copy is
enabled/possible and the NIC does checksum offload and cannot be changed to
do COBS. Of course changes are required :-).
I also hope that data centers will use accelerated iSCSI/Clustering HBAs/NICs
rather than the current solutions. The current solutions MAY be useful for
desktops/laptops where hopefully there are plenty of spare cycles to do COBS
in software.
COBS also has alignment benefits - the header could be aligned with the ULP
PDU, the ULP PDU can be aligned with the TCP header, and there are no false
positives. The alignment with the TCP header may not always happen (the
mythical box in the middle that does TCP resegmentation), but that can be
detected - in the presence of such a box, the performance could reduce to the
levels encountered when IP fragmentation happens.
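For readers unfamiliar with the technique: COBS (Consistent Overhead Byte
Stuffing) rewrites a buffer so it contains no zero bytes, at a bounded cost of
at most one extra byte per 254 bytes of payload; a zero byte then becomes an
unambiguous frame delimiter, which is why there are no false positives. A
minimal software sketch of the classic byte-granularity algorithm (illustrative
only, not tied to any particular iSCSI framing draft):

```python
def cobs_encode(data: bytes) -> bytes:
    """Encode `data` so the result contains no zero bytes.

    Output is a sequence of groups; each group starts with a code byte
    giving the offset to the next zero (or to the group boundary).
    Overhead is at most 1 byte per 254 bytes of input.
    """
    out = bytearray([0])      # placeholder for the first code byte
    code_idx, code = 0, 1
    for b in data:
        if b == 0:            # zero byte: close the current group
            out[code_idx] = code
            code_idx = len(out)
            out.append(0)
            code = 1
        else:
            out.append(b)
            code += 1
            if code == 0xFF:  # max group length: close and start a new group
                out[code_idx] = code
                code_idx = len(out)
                out.append(0)
                code = 1
    out[code_idx] = code      # close the final group
    return bytes(out)

def cobs_decode(enc: bytes) -> bytes:
    """Reverse of cobs_encode."""
    out = bytearray()
    i = 0
    while i < len(enc):
        code = enc[i]
        out.extend(enc[i + 1:i + code])
        i += code
        if code < 0xFF and i < len(enc):
            out.append(0)     # a short group implies a zero byte followed
    return bytes(out)
```

Because the encoded stream is zero-free, a receiver that loses PDU alignment
can resynchronize by scanning for the next zero delimiter - which is the
detection property mentioned above for the resegmenting middlebox case.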
I think it is better to have a couple of interoperable implementations that
demonstrate the benefit of any of the alternate proposals (especially markers
vs. COBS) before selecting one.
Jim,

There are some things attractive about COWS -
1. The hard work - touching every data word - has to be done only by the
sender (on the normal path) and can be easily included in NICs with
accelerator cards, which seem to do a good job on the send side.
2. If you are doing CRC or IPsec on a client in software there is no
additional penalty (provided you can include the code in the right layer of
software), as no data gets moved.
3. It does not have to be associated with TCP packet alignment - and can
work in the face of TCP segmentation.
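As I understand the proposal (details may differ from the actual draft), COWS
applies the same stuffing idea at 32-bit-word granularity, which is what makes
point 1 work: only the sender edits data words, and the receiver merely scans
for zero words. A rough, assumption-laden sketch of the sender-side transform,
assuming a word-aligned payload and big-endian words (the function name and
framing details here are illustrative, not from the draft):

```python
import struct

def cows_encode(payload: bytes) -> bytes:
    """Word-granularity stuffing sketch: replace each zero 32-bit word
    with a count word so the output contains no zero words.

    Assumes len(payload) is a multiple of 4. With 32-bit code words the
    maximum-group-length limit of byte COBS is unreachable for any
    realistic PDU size, so it is omitted here.
    """
    assert len(payload) % 4 == 0
    words = struct.unpack(f">{len(payload) // 4}I", payload)
    out = [0]                 # placeholder for the first code word
    code_idx, code = 0, 1
    for w in words:
        if w == 0:            # zero word: close the group, open a new one
            out[code_idx] = code
            code_idx = len(out)
            out.append(0)
            code = 1
        else:
            out.append(w)
            code += 1
    out[code_idx] = code      # close the final group
    return struct.pack(f">{len(out)}I", *out)
```

Note that the loop writes each data word exactly once and edits only the code
words, which is consistent with the send-RAM touch counts Steph gives below.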
Julo
"Jim Pinkerton"
<jpink@microsoft.com> wrote on 17-12-2001 17:32:04:
> My main concern with this approach is that we could kill the product but
> win the spec wars. Specifically, this approach means that an end-customer
> has one of two choices in deploying the technology:
>
> 1) Upgrade both ends, and they'll see the full benefit
> 2) Upgrade only the server side, and see roughly 2-4 times the CPU
>    utilization on the client if their current implementation is optimized
>    on the client side (a mere 2x if they are doing significant receives
>    that already require a copy, more like 4x if they are primarily doing
>    sends, which currently have no bcopy in many OS implementations).
>
> This means that if they pick option 2) and their machines are CPU bound,
> the data center capacity to handle requests will actually *decrease* if
> they deploy the technology. If the front end has enough idle CPU cycles,
> then they probably could select option 2).
>
> In my experience, we need to make sure we have a volume solution to
> enable NIC vendors to make enough profit to fund the next generation
> (otherwise RDMA/TOE is a one-shot deal and won't keep up with the CPU
> architecture). This means we need a path to the front-end boxes in the
> data center. My concern is that there is no half-measure in the above
> scenario - the IT manager must upgrade the thousand front-end boxes at
> the same time as they upgrade the back end, or deploy asymmetric configs
> where some front-end boxes are upgraded. I'm not sure how attractive
> this deployment scenario is.
>
> Howard and Uri, can you comment on this issue?
>
> Jim
>
> > -----Original Message-----
> > From: Stephen Bailey [mailto:steph@cs.uchicago.edu]
> > Sent: Monday, December 17, 2001 7:13 AM
> > To: uri@broadcom.com; howard.c.herbert@intel.com
> > Cc: csapuntz@cisco.com; Jim Pinkerton; Julian_Satran@il.ibm.com;
> > allyn@cisco.com
> > Subject: Wot I `know' about COWS in hardware
> >
> > Hi,
> >
> > I haven't gotten a chance to do a full implementation yet, but here are
> > some architectural properties I believe to be true of a hardware COWS
> > implementation:
> >
> > 1) can be implemented `in line' on receive
> > 2) requires an MTU-sized RAM on send
> > 3) expected touches to send RAM is 2 per data word (just the `fifo'
> >    write and read ops, no editing), assuming the headers are merged
> >    on the outbound side.
> > 4) worst case touches to send RAM is 3 per data word (assuming every
> >    word must be edited)
> > 5) eliminates the need for the funky `make sure you don't send
> >    anything that's a false positive under non-nominal conditions'
> >    behavior of the key/length proposal (I kinda doubted hardware
> >    impls were going to do this anyway, since it was a SHOULD).
> >
> > Basically, it looks OK to me. Slowing the sender is much better than
> > slowing the receiver. Theoretically, we could reverse the pointer
> > chain and allow in-line send, but RAM on receive, but that seems
> > stupid to me.
> >
> > It's clearly a design tradeoff whether you choose to use the COWS
> > send-side RAM for other purposes, or not.
> >
> > I'm hoping you guys can tell me whether you think this blows your
> > budget, or other noteworthy (unfortunate) properties. As you can
> > tell, I have the utmost enthusiasm for the mission (Dr. Chandra...).
> >
> > Thanks,
> > Steph