Re: NFS Header/data parsing and RDMA

> From: Costa Sapuntzakis <csapuntz@cisco.com>

> Ok, so doing NFSv2/v3 header/data splitting is easy on an in-order
> TCP stream because NFS has fixed-length trailers. Here's a little
> technique:
> ...
> Note, to do this with NFS/TCP, your NIC has to do some primitive
> level of TCP processing (at least keep track of flows). It also
> needs to understand RPC/TCP message boundaries.

Do I understand correctly that you're applying the familiar NFS/UDP
page flipping tactic to NFS/TCP?

> Are there significantly simpler approaches than this?

1. How about using NFS/UDP instead of NFS/TCP? It's well known in
   the NFS community that NFSv2-3/TCP is no faster or otherwise
   better than NFSv2-3/UDP except over very narrow or at least rather
   long pipes. (Recall also the congestion control and avoidance
   mechanisms in some NFSv2-3/UDP implementations.)

2. Use NFS/TCP, but send every RPC/XDR transaction in a single TCP
   segment, and use IP fragmentation to fit the MTU. This tactic was
   used 10+ years ago in the FDDI adapters of some supercomputers.
   It does have the problems of IP fragmentation, but those problems
   are rarely encountered where NFS is used.

> NFSv4 doesn't seem to have fixed length trailers and neither
> does CIFS in all cases. And it looks like it will be costly to parse
> NFSv4 headers.

I've not been paying attention to NFSv4. A quick skim of the draft
suggests that it will not displace NFSv2/3 in the environments where
NFS is currently popular. NFSv4 certainly has nothing to do with
anything like SCSI over IP. I'm also far from convinced that NFSv4 has
got some of the extensions close enough to the underlying real
filesystems to be popular. Even if I'm wrong, it will be years before
NFSv4 is widely used.

While I think there are ways to page flip NFSv4 without special
hardware, I don't think they are worth talking about yet.
Even if I'm also wrong about that, it is years early to be modifying
TCP/IP to support NFSv4. No one can see what NFSv4 will be like when
it is popular enough to justify modifying TCP today, if NFSv4 ever is
popular.

> RDMA still has the following features:
>
> - Per-packet (Works with arbitrary out-of-order reception of TCP
>   segments)
> - Fixed header that's generic across all protocols (NFSv4, v5, AFS,
>   DFS, CIFS, etc..)
> - No page flipping necessary on solicited transfers
> - Message boundary bit (which is admittedly orthogonal to RDMA) allows
>   out-of-order processing on TCP receive buffer. Decreases parsing latency,
>   esp. in the face of packet drops.
> ...

Knowing to which buffer an out-of-order TCP segment belongs is
something that I don't see how to do without something like RDMA.
However, out-of-order TCP segments are both very rare and very bad for
TCP performance, regardless of whether RDMA is present. Out-of-order
TCP segments must be even more rare in storage networks.

Talk about NFSv5 or even AFS/DFS does the opposite of making me think
there might be something good in RDMA. And as I've said, it's years
too early to justify RDMA with NFSv4.

With existing techniques, if you don't want to page flip, you don't
need to. If you are able to provide enough distinct application buffer
streams to the NIC for RDMA, then you could do the same for other
techniques.

What's that about "parsing latency" and what does it have to do with
lost segments? Are you proposing to deliver TCP data to applications
out of order? I trust not!

....

] From: Zachary Amsden <zamsden@cthulhu.engr.sgi.com>

] ...
] No, that situation doesn't require any hardware support. However, a zero-copy
] receive path is not the only element of RDMA - RDMA was designed (I suppose
] from the discussion here) specifically to address header/payload issues for
] storage protocols. Clearly one can do zero-copy receive with changes to the
] API and no hardware/firmware modifications.
] But with no special hardware
] support, flipping the payload into some page with alignment constraints will
] require another copy.

What about the many systems that have been page flipping NFS in and
out of buffer caches for more than 10 years, with no changes to APIs
or special silicon?

] There is one exception to my last statement that I know of: If you pre-adjust
] the hardware receive buffers to make the payload align on a page boundary, you
] can flip the page into the buffer cache for (hopefully) the common case.
] However, this requires the ability to tune these header offsets and will only
] work for one protocol at a time (mostly).

The page flipping systems I've worked on did not tune header offsets
and worked on more than one protocol. (Given your email address, it
might be interesting to check the old IRIX source trees. Besides the
NFS kernel code and the HIPPI, ATM, and FDDI drivers and firmware,
check cmd/rcp and cmd/rsh.)

UDP page flipping is trivial on protocols that have no trailers. It
requires trivial smarts in the NIC and much simpler buffer allocation
by the NIC than RDMA requires. (I suspect RDMA needs pools of buffers
for every stream, while the classic tactic needs only two pools,
"little" and "pages"....well, for tiny improvements I've also done it
with "little", "medium" and "pages".)

] Realistically, who is going to be running a storage system that requires so
] much bandwidth that avoiding receive copies is necessary, and runs on generic
] NICs with no firmware/ASIC modifications possible? So I think using modified
] hardware is completely reasonable in those circumstances.
] ...

Even more reasonable than special hardware are modified APIs and
protocols and other steps, including ensuring that out-of-order
packets are very rare, and that header offsets are few, fixed, known,
and friendly.
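
To make the two-pool ("little" and "pages") tactic described above a bit
more concrete, here is a minimal sketch in Python. It is only an
illustration under stated assumptions, not how any of the drivers or
firmware named above worked: PAGE_SIZE, LITTLE_SIZE, plan_receive() and
flippable_pages() are all hypothetical names and values, and the policy
(headers into a little buffer, payload laid out in page buffers) is one
plausible reading of the description.

```python
PAGE_SIZE = 4096      # assumed host page size
LITTLE_SIZE = 256     # assumed size of one "little" buffer


def plan_receive(frame_len, header_len):
    """Return a list of (pool, nbytes) placements for one received datagram.

    Hypothetical policy: a datagram that fits in a little buffer goes
    there whole. Otherwise the headers go to a little buffer and the
    payload is laid out in page buffers, so that full pages could be
    remapped ("flipped") into the buffer cache instead of copied.
    """
    if frame_len <= LITTLE_SIZE:
        return [("little", frame_len)]
    if header_len > LITTLE_SIZE:
        raise ValueError("headers do not fit in a little buffer")
    plan = [("little", header_len)]
    payload = frame_len - header_len
    while payload > 0:
        chunk = min(payload, PAGE_SIZE)
        plan.append(("pages", chunk))
        payload -= chunk
    return plan


def flippable_pages(plan):
    """Count page buffers that are completely full and so could be flipped."""
    return sum(1 for pool, n in plan if pool == "pages" and n == PAGE_SIZE)
```

For example, an 8192-byte write carried behind 128 bytes of headers
would use one little buffer and two full page buffers, both of which
could be flipped. A small request or reply never touches the page pool
at all.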
How would you have out-of-order arrival on a storage network, other
than due to bit rot in the wires, and what storage network is going to
have significant bit rot?


Vernon Schryver    vjs@rhyolite.com
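
For reference, the "RPC/TCP message boundaries" mentioned in the quoted
text come from ONC RPC record marking: on TCP each record fragment is
preceded by a 4-byte big-endian word whose top bit flags the last
fragment of a record and whose low 31 bits give the fragment length.
The following is a minimal sketch of finding those boundaries on an
in-order byte stream. The function name split_rpc_records() is
hypothetical, and the sketch says nothing about how a NIC would do
this in firmware.

```python
import struct


def split_rpc_records(stream):
    """Split an in-order byte stream into complete RPC records.

    Returns (records, leftover), where leftover holds the bytes of the
    first record that is not yet complete, starting at its first
    record mark, so the caller can retry once more data has arrived.
    """
    records = []
    fragments = []
    pos = 0
    record_start = 0
    while pos + 4 <= len(stream):
        (mark,) = struct.unpack_from(">I", stream, pos)
        last = bool(mark & 0x80000000)      # top bit: last fragment
        length = mark & 0x7FFFFFFF          # low 31 bits: fragment length
        if pos + 4 + length > len(stream):
            break                           # fragment not fully received
        fragments.append(stream[pos + 4:pos + 4 + length])
        pos += 4 + length
        if last:
            records.append(b"".join(fragments))
            fragments = []
            record_start = pos
    return records, stream[record_start:]
```

Note that this only works on in-order data: a receiver looking at an
out-of-order segment has no way to tell where the next record mark is.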