Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.

To: ips@ece.cmu.edu, tcp-impl@grc.nasa.gov
Subject: Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.
From: Vernon Schryver <vjs@calcite.rhyolite.com>
Date: Fri, 25 Feb 2000 09:35:30 -0700 (MST)
Delivery-Date: Fri Feb 25 11:36:02 2000
Sender: owner-ips@ece.cmu.edu
] From: Charles Esson <charlese@cvs.com.au>

] I must have missed something.
]
] If we don't have this, you can take the destination port, convert to a
] table address, use the sequence number,
] do some calculations and come up with a buffer address and an offset. If
] you want to mess up the layering
] of your stack, they are all things you can do now.

Standards committees don't like hashing.  It looks complicated and
insufficiently deterministic on an overhead projector.
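
A minimal sketch of the lookup Charles describes, for readers who want to
see it written down. Every name here (conn_slot, conn_register, conn_place,
the table size, the multiplicative hash) is hypothetical and not taken from
any real stack, and a real implementation would key on the full connection
4-tuple rather than the destination port alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256u   /* hypothetical number of hash slots */

/* Hypothetical per-connection record: where the stream's bytes should land. */
struct conn_slot {
    uint16_t port;      /* destination port bound to this slot; 0 = empty */
    uint32_t base_seq;  /* sequence number that maps to buf[0] */
    uint8_t *buf;       /* receive buffer registered for the connection */
    size_t   buf_len;
};

static struct conn_slot table[TABLE_SIZE];

/* Convert a port to a table address with a multiplicative hash. */
static unsigned slot_index(uint16_t port)
{
    return (unsigned)((((uint32_t)port * 2654435761u) >> 24) % TABLE_SIZE);
}

/* Register a buffer for a port; returns 0 on success, -1 if the table is full. */
int conn_register(uint16_t port, uint32_t base_seq, uint8_t *buf, size_t len)
{
    assert(port != 0 && buf != NULL);
    unsigned i = slot_index(port);
    for (unsigned n = 0; n < TABLE_SIZE; n++, i = (i + 1) % TABLE_SIZE) {
        if (table[i].port == 0 || table[i].port == port) {
            table[i] = (struct conn_slot){ port, base_seq, buf, len };
            return 0;
        }
    }
    return -1;
}

/* Map (port, seq, payload_len) to a destination address.
 * Returns NULL if the port is unknown or the data would not fit. */
uint8_t *conn_place(uint16_t port, uint32_t seq, size_t payload_len)
{
    unsigned i = slot_index(port);
    for (unsigned n = 0; n < TABLE_SIZE; n++, i = (i + 1) % TABLE_SIZE) {
        if (table[i].port == 0)
            return NULL;                         /* empty slot ends the probe */
        if (table[i].port == port) {
            uint32_t off = seq - table[i].base_seq;  /* modular: survives wraparound */
            if (off > table[i].buf_len || payload_len > table[i].buf_len - off)
                return NULL;
            return table[i].buf + off;
        }
    }
    return NULL;
}
```

The offset is computed in unsigned 32-bit arithmetic so that a sequence
number that has wrapped past zero still lands at the right place.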

] or using RDMA
] ...

] --->An attacker plays with the data so returned.
] ...

That's a very good point.  The tagging in RDMA cannot be used until
after it has been validated by the receiver.  The validating consists
of looking at sequence numbers, RPC/XDR headers, etc. to figure
out where the data can and should go, and then checking that the
sender guessed right.  Why not skip the last part and ignore the
RDMA tag?  Then why send the RDMA tag?

 .......


> From: Pete Zaitcev <zaitcev@metabyte.com>

> Very well, but what about its companion document (SCOT)?
>  http://search.ietf.org/internet-drafts/draft-satran-scot-00.txt
> It is published, isn't it? It was somewhat disturbing to see the
> notice, but on the other hand it was honest. IBM could just as
> easily come up silently with a silly software patent for RDMA option
> or for SCSI over TCP idea as such.

The IETF's protections against patent games are well intended, but nothing
to worry about if you want to play them and nothing to rely upon if you
don't.  The history of IETF patent games demonstrates that the IETF is
powerless to limit them (or worse), and that they're harder to play than
the players hope.  (E.g. PPP CCP and PPP 48-bit FCS, respectively)

  ......

} From: "Justin T. Gibbs" <gibbs@FreeBSD.org>

} ...
} >Can you elaborate on this?  Suppose TCP "blindly" does zero copy everything to
} >an app's buffer (for example, to a web browser's receive buffer) without
} >RDMA.  Then the browser app looks at the data and displays it.  What is the
} >difference RDMA makes in this case?  Yes, RDMA can separate different messages
} >in the buffer.  But this can also be done by the browser app, not by TCP.
}
} You seem to be saying that in the common case zero copy is achievable.
} Most implementations I've seen require the network driver to make
} a guess about where the payload will be in an incoming packet so the header
} can be stripped off and the payload dmaed to an aligned area.   A page
} flip is then performed to get the data where the user wants it,
} imposing the restriction that your  payload be page sized so you don't
} leave gaps in the user's destination buffer.

That is required only if you stick to the current API.  Obvious, 
minor changes in the direction of some operating systems that existed
before UNIX are sufficient to relax the page boundary requirement.
To use RDMA, you have to change the API.

}                                               Certainly, with a more
} intelligent network adapter that knows every protocol you can determine
} exactly where the data is in each packet.  If you add connection tracking
} and sequence number sniffing to the nic with a mechanism to register user
} buffers to connections, you can get zero copy every time*.  Unfortunately
} this is not very general purpose solution.

Only standards committees and some academics care about "every protocol"
or optimizing absolutely every application.  The rest of us (including
academics) only care about optimizing the important stuff.

Also as you say, looking at sequence numbers in the interface and relaxing
the sockets API rules about not touching any bytes in the buffer except
those that are actually received lets you avoid copies all of the time.
I don't see why that is not a general purpose solution, if you want one.

}                                             The point of RDMA seems to be
} to allow nic manufacturers to add support for a single tcp option that, at
} the very least, allows the nic to align the payload for you.  Add RID
} registration with the nic and you get the payload exactly where you want it
} too.  All without too much state information kept by the nic.

I've been hearing since the mid-1980's proposals to do TSP lookups in the
network interface instead of software because it is so incredibly difficult
to find the right TSP quickly in software.  I think those ideas are similar
to the RDMA idea.  They assume facts not in evidence, that there is a
problem that needs to be solved, and that the solution is not worse than
the nominal problem.  There are reasons why such proposals appear in
standards committees before implementations.

 ......

] From: Lloyd Wood <l.wood@eim.surrey.ac.uk>

] Note the mentions of SCSI and SCSI/TCP and the tie-in with the
] proposed IP Storage efforts (recent ietf general list discussion).
]
] I'd still like to know _why_.

] ...
] SCSI DMA over TCP? What _is_ all this aiming for - trying to build
] distributed RAID arrays with really poor performance that are subject
] to WAN outages and DoS attacks?

Why put SCSI over a protocol that measures RTTs, worries about
congestion in routers, and that expects the error rates that come
with 5000 miles of wire and 20 routers in the path?  Does anyone
really think that TCP/IP or even IP with its 64K byte packet limit
is remotely close to the right protocol, particularly given the
existing and commercially available alternatives?

A standards committee is the venue of first and last resort for
such ideas, especially a committee that is related to currently
trendy things like the SuperInfoHypeWay.

 ....

) From: julian_satran@il.ibm.com

) That is not completely accurate. You will need appreciably more silicon to
) do what you suggest.   And you can do it only with information that "passes
) through the protocol" .

Significantly more silicon than what, to do what?  Since the comment
was addressed to me, I'll assume one 'what' was looking at sequence
numbers, port numbers, and so forth to page flip.  Clearly it takes more
silicon to support page flipping in hardware than to not support page
flipping in hardware.  I will not agree that the required silicon is a
big deal, not because I have a clue about floor plans and so forth (I
don't), but because at a previous employer I fought to keep the hardware
guys from throwing in gates to do it.  They had the silicon to spare and
had heard so much about the wonderfulness of page flipping that they wanted
to get in on the fun.

Doing things in hardware is ok only if you absolutely must.  Software is
always better when it is good enough, because it is soft.

) The good thing about the  proposal is that it can TAG whatever the
) application wants (and that can be several layers away from the protocol).
) You can't "page-flip" to buffers that you are not aware of. And page
) flipping wherever is applicable assumes  also page boundaries for buffers.

That's important only if you stick close to the sockets or UNIX read()
API.  If you are not ultra-conservative, and if you know a little of the
history of file and device I/O API's, or if you think about such things
for 10 seconds, then RDMA tagging becomes less interesting.
To use RDMA tagging, you must abandon the UNIX read() API.  If you change
the API, then you may as well think about the whole problem instead of
only a corner.  If you let the operating system tell the application where
the incoming data arrived, then you don't need elaborate hints from the
sender to the receiver's hardware to say where the receiving software will
want the data.
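
As a rough sketch of what "the operating system tells the application
where the data arrived" could look like: rx_extent, rx_next and sim_arrive
are hypothetical names, the static pool stands in for memory shared
between kernel and application, and the memcpy() in sim_arrive stands in
for the interface's DMA. No existing API is being described.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor handed back by the OS: where the data landed. */
struct rx_extent {
    const unsigned char *addr;
    size_t len;
};

#define POOL_SIZE 8192u

static unsigned char pool[POOL_SIZE]; /* stands in for memory shared with the app */
static size_t pool_head;              /* next free byte for arriving data */
static size_t pool_tail;              /* first byte not yet reported to the app */

/* Stand-in for the driver: put arriving payload wherever is convenient.
 * The memcpy models the interface's DMA, not a host copy. */
int sim_arrive(const void *payload, size_t len)
{
    if (len > POOL_SIZE - pool_head)
        return -1;
    memcpy(pool + pool_head, payload, len);
    pool_head += len;
    return 0;
}

/* Hypothetical receive call: reports the location of newly arrived bytes
 * instead of copying them into a buffer the caller chose in advance. */
struct rx_extent rx_next(void)
{
    assert(pool_tail <= pool_head);
    struct rx_extent e = { pool + pool_tail, pool_head - pool_tail };
    pool_tail = pool_head;
    return e;
}
```

Unlike read(), the caller supplies no destination, so there is no page
alignment for the sender or the hardware to guess at.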


) Vernon Schryver <vjs@calcite.rhyolite.com> on 25/02/2000 04:23:47
)
) Please respond to Vernon Schryver <vjs@calcite.rhyolite.com>

I did not write that!

 .....

) From: Alan Cox <alan@lxorguk.ukuu.org.uk>
)
) > flip is then performed to get the data where the user wants it,
) > imposing the restriction that your  payload be page sized so you don't
) > leave gaps in the user's destination buffer.  Certainly, with a more
)
) Perhaps its about time the world put together an official, sane, ring buffer
) style mmap socket api. A lot of the requirement to align data is coming
) from the existing socket API. 

The IETF should not get involved in API's.  There are plenty of other
standards committees in that arena, as well as big commercial outfits
including one in the U.S. Pacific Northwest.  In other words, do you think
the IETF would be more successful arguing with Microsoft about winsock
than the IETF has been in dealing with Microsoft's obviously completely
stupid and wrong PPP ideas?

If you do get involved in standardizing such things, then *PLEASE* don't
limit yourself to #$%$#@! ring buffers!  The ancient Excelan and preceding
(I've a mental block against the name starting with 'I') ring buffer notion
was ok as an initial hack, but WRONG for something to go fast.  To start,
you don't need pointers or indices that must be written by both the
interface and the host.


Vernon Schryver    vjs@rhyolite.com