I think the COBS/COWS approach has more potential than the markers proposal,
or ULP framing without COBS. There is one case where it does add more
overhead, but the question is how prevalent that scenario is - when outbound
zero copy is enabled/possible and the NIC does checksum offload and cannot be
changed to do COBS. Of course, changes are required :-).
     
I also hope that data centers will use accelerated iSCSI/clustering HBAs/NICs
rather than the current solutions. The current solutions MAY be useful for
desktops/laptops, where hopefully there are plenty of spare cycles to do COBS
in software.
     
COBS also has alignment benefits - the header could be aligned with the ULP
PDU, the ULP PDU can be aligned with the TCP header, and there are no false
positives. The alignment with the TCP header may not always happen (the
mythical box in the middle that does TCP resegmentation), but that can be
detected - in the presence of such a box, performance could drop to the
levels encountered when IP fragmentation happens.
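For readers who have not run into the technique, below is a minimal sketch of
a plain byte-oriented COBS encoder in C (the COWS variant discussed in this
thread applies the same idea at data-word granularity). The function name and
buffer handling are illustrative only and are not taken from any draft. The
point it demonstrates is why there are no false positives: the encoded output
contains no zero bytes, so a zero byte can serve as an unambiguous frame
delimiter, and the worst-case expansion is bounded (about one byte per 254
bytes of payload, plus one).

  /* Minimal byte-oriented COBS encoder (illustrative sketch).
   * Output contains no zero bytes; the caller appends a single 0x00
   * as the frame delimiter, which a receiver can find unambiguously.
   */
  #include <stddef.h>

  size_t cobs_encode(const unsigned char *in, size_t len, unsigned char *out)
  {
      size_t read = 0, write = 1, code_pos = 0;
      unsigned char code = 1;              /* distance to next stuffed zero */

      while (read < len) {
          if (in[read] == 0) {
              out[code_pos] = code;        /* close the current block */
              code_pos = write++;          /* reserve the next code byte */
              code = 1;
          } else {
              out[write++] = in[read];
              if (++code == 0xFF) {        /* maximum block length reached */
                  out[code_pos] = code;
                  code_pos = write++;
                  code = 1;
              }
          }
          read++;
      }
      out[code_pos] = code;
      return write;                        /* caller appends the 0x00 delimiter */
  }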
     
I think it is better to have a couple of interoperable implementations that
demonstrate the benefit of any of the alternate proposals (especially markers
vs. COBS) before selecting one.
    
      
Jim, 
      
There are some things attractive about COWS -
1. The hard work - touching every data word - has to be done only by the
   sender (on the normal path) and can easily be included in NICs with
   accelerator cards that seem to do a good job on the send side (see the
   sketch below).
2. If you are doing CRC or IPsec on a client in software, there is no
   additional penalty (provided you can include the code in the right layer
   of software), as no data gets moved.
3. It does not have to be associated with TCP packet alignment - and can work
   in the face of TCP segmentation.
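To make the sender-only property concrete, here is one plausible software
rendering of the word stuffing idea (a sketch only - the MARKER value, the
in-place frame layout, and the function name are assumptions made for
illustration and are not taken from any draft):

  /* Hypothetical COWS-style sender-side word stuffing (illustrative).
   * Assumption: a reserved 32-bit MARKER word delimits frames on the wire.
   * Any payload word equal to MARKER is replaced by the offset (in words)
   * of the next such word, forming a forward chain; buf[0] holds the head
   * of the chain and the payload sits in buf[1..nwords].
   */
  #include <stdint.h>
  #include <stddef.h>

  #define MARKER 0xDEADBEEFu              /* illustrative value only */

  void cows_encode(uint32_t *buf, size_t nwords)
  {
      size_t prev = 0;                    /* slot holding the chain head */

      for (size_t i = 1; i <= nwords; i++) {
          if (buf[i] == MARKER) {
              buf[prev] = (uint32_t)i;    /* link previous slot to this word */
              prev = i;
          }
      }
      buf[prev] = (uint32_t)(nwords + 1); /* terminate the chain past the end */
  }

The receiver walks the chain from buf[0], writing MARKER back at each chained
position until the offset points past the payload, so it only touches the
(normally rare) stuffed words - which is the sender/receiver asymmetry point 1
refers to.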
Julo 
"Jim Pinkerton" <jpink@microsoft.com> wrote on 17-12-2001 
      17:32:04:
> 
> My main concern with this approach is that we could kill the product but
> win the spec wars. Specifically, this approach means that an end-customer
> has one of two choices in deploying the technology:
> 
>    1) Upgrade both ends, and they'll see the full benefit
>    2) Upgrade only the server side, and see roughly 2-4 times the CPU
>       utilization on the client if their current implementation is
>       optimized on the client side (a mere 2x if they are doing
>       significant receives that already require a copy, more like 4x if
>       they are primarily doing sends, which currently has no bcopy in
>       many OS implementations).
> 
> This means that if they pick option 2) and their machines are CPU bound,
> the data center capacity to handle requests will actually *decrease* if
> they deploy the technology. If the front end has enough idle CPU cycles,
> then they probably could select option 2).
> 
> In my experience, we need to make sure we have a volume solution to
> enable NIC vendors to make enough profit to fund the next generation
> (otherwise RDMA/TOE is a one-shot deal and won't keep up with the CPU
> architecture). This means we need a path to the front-end boxes in the
> data center. My concern is that there is no half-measure in the above
> scenario - the IT manager must upgrade the thousand front-end boxes at
> the same time as they upgrade the back end, or deploy asymmetric configs
> where some front-end boxes are upgraded. I'm not sure how attractive
> this deployment scenario is.
> 
> Howard and Uri, can you comment on this issue?
> 
> 
> Jim
> 
> > -----Original Message-----
> > From: Stephen Bailey [mailto:steph@cs.uchicago.edu]
> > Sent: Monday, December 17, 2001 7:13 AM
> > To: uri@broadcom.com; howard.c.herbert@intel.com
> > Cc: csapuntz@cisco.com; Jim Pinkerton; Julian_Satran@il.ibm.com;
> > allyn@cisco.com
> > Subject: Wot I `know' about COWS in hardware
> > 
> > Hi,
> > 
> > I haven't gotten a chance to do a full implementation yet, but here are
> > some architectural properties I believe to be true of a hardware COWS
> > implementation:
> > 
> >   1) can be implemented `in line' on receive
> >   2) requires an MTU-sized RAM on send
> >   3) expected touches to send RAM is 2 per data word (just the `fifo'
> >      write and read ops, no editing), assuming the headers are merged
> >      on the outbound side.
> >   4) worst case touches to send RAM is 3 per data word (assuming every
> >      word must be edited)
> >   5) eliminates the need for the funky `make sure you don't send
> >      anything that's false positive under non-nominal conditions'
> >      behavior of the key/length proposal (I kinda doubted hardware
> >      impls were going to do this anyway, since it was a SHOULD).
> > 
> > Basically, it looks OK to me.  Slowing the sender is much better than
> > slowing the receiver.  Theoretically, we could reverse the pointer
> > chain and allow in-line send but RAM on receive, but that seems
> > stupid to me.
> > 
> > It's clearly a design tradeoff whether you choose to use the COWS
> > send-side RAM for other purposes, or not.
> > 
> > I'm hoping you guys can tell me whether you think this blows your
> > budget, or has other noteworthy (unfortunate) properties.  As you can
> > tell, I have the utmost enthusiasm for the mission (Dr. Chandra...).
> > 
> > Thanks,
> >   Steph