Re: TCP RDMA option to accelerate NFS, CIFS, SCSI, etc.

> From: Costa Sapuntzakis <csapuntz@cisco.com>
> ...
> Today, you have specialized silicon that for simple bus protocols
> (SCSI parallel interface and ATA) will directly take transfer blocks
> between the device and the buffer cache. This is not currently done
> with TCP, to the best of my knowledge.
> ...

It might be good to investigate the history of Protocol Engines Inc.,
including its goals, the reasons for its failure as a business, and what
it achieved technically. A skewed history might be:

 1. Founded to make silicon for XTP, a nominally faster protocol than
    TCP.

 2. When the XTP protocol and the XTP chips got bogged down, shifted to
    making chips to help TCP go wire speed over FDDI.

 3. Other people made TCP go wire speed over FDDI without any special
    silicon or new protocols. That took some wind out of XTP's sails,
    and tore the sails driving PEI's TCP accelerator chips.

 4. Standard standards-committee problems with XTP didn't help PEI's
    other sails.

If you ask me, SCSI/IP and RDMA have striking parallels to #1 and #2. I
bet you'll meet parallels to #3 before any real deployment. You've
started to see #4 in some of the suggested improvements to RDMA today.
It's not that the suggestions are not good ideas. The problem is that
committees cannot say no to good ideas, while the one thing that matters
above all in any design task is saying no to almost everything.

Protocol Engines and XTP were based on the unexamined assumption that
TCP is very difficult to implement and an unavoidably slow protocol.
Most people just knew those "facts" 15 years ago. I think RDMA suffers a
similar problem. Instead of starting by assuming that a new protocol is
needed for a new goal, if you actually look within the existing
boundaries, you'll often find a solution. Often the inside solution is
better than any possible extension of the protocol.
Protocol extensions require more bandwidth and more processing on both
sender and receiver. They also have problems gaining enough market share
to survive.

Please don't misunderstand me. Greg didn't include my name among the
authors on one of the XTP specs because I said XTP was a stupid idea. I
still like lots of XTP. I also think that many of the XTP ideas can be
*and have been* applied to TCP implementations.

> However, in the case of most storage protocols, you don't want
> the data in the receive buffer. You want it in the buffer cache, so
> there is a copy to the buffer cache.

Which NFS implementation written in the last 10 or at least 5 years and
intended to be fast doesn't move data between the buffer cache near the
disk and the buffer cache near the application with zero (0) copies?
Page flipping to and from buffer caches is especially easy, because
buffer caches tend to be page aligned, and file systems like to move
data in page-sized or larger chunks.

> So, NFS has a CPU overhead hit as compared to optimized storage host
> bus adapters. The goal was to eliminate part of this hit, by getting
> rid of an extra copy.

How can you have fewer than zero copies?

> Now, this proposal doesn't fix the interrupt overhead problem.
> Optimized FC/SCSI NICs have one interrupt/transfer or less.

Interrupts are killers, and so for the last 5 or 10 years, a competitive
NFS system has had about 0.1 interrupts per packet. The trick is not
reducing the ratio of interrupts/packet, but reducing it only so far
that things don't slow down, and increasing the ratio when the total
system (client & server) moves into a regime that requires more
interrupts.

] From: Michael Krause <krause@cup.hp.com>
] It ain't free and there are plenty of reasons to avoid copying data
] since ...
] touching the buffers themselves.
] Also, one could use this technology with
] storage devices to bypass the server and send data to one or more NICs
] for remote access - RDMA is still quite good for this type of
] operation and does not involve touching the data.

There are other, much easier ways to separate data and control
information in the receiver than being forced to parse optional new bits
in TCP or IP headers. For 10 years, network interfaces in commercial
UNIX systems have been putting the headers (including RPC/XDR) of
incoming NFS traffic in one place (a "small mbuf") and the data in
another place (the buffer cache) without extra copies, and without
parsing any headers, not to mention new header bits with the nasty
problems of TCP or IP options. And this despite the fact that the
RPC/XDR stuff is of variable (recall the NFS group list) and
hard-to-predict length.

Vernon Schryver    vjs@rhyolite.com
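As a rough illustration of the header/data separation described above, here is a minimal sketch. Everything in it is hypothetical — the split hint, the page-flip test, and all names are invented for the sketch, not taken from any real driver; a real interface places the payload in page-aligned storage so it can be flipped into the buffer cache rather than copied.

```python
# Hypothetical sketch of "small mbuf + buffer cache" receive placement.
# PAGE_SIZE and SPLIT_HINT are invented driver parameters.

PAGE_SIZE = 4096
SPLIT_HINT = 128   # driver's guess at where RPC/XDR headers end


def rx_place(segment: bytes):
    """Place one incoming segment without parsing it: bytes up to the
    split hint go to a small header buffer, the remainder to a separate
    data buffer (page-aligned in a real driver)."""
    hdr = segment[:SPLIT_HINT]
    data = segment[SPLIT_HINT:]
    return hdr, data


def flippable(data: bytes) -> bool:
    # Page flipping works when the payload fills whole pages; a real
    # driver falls back to copying any unaligned tail.
    return len(data) > 0 and len(data) % PAGE_SIZE == 0


hdr, data = rx_place(b"H" * SPLIT_HINT + b"D" * PAGE_SIZE)
print(len(hdr), len(data), flippable(data))  # 128 4096 True
```

Note the key point from the text: the split needs no parsing of the variable-length RPC/XDR headers, and no new TCP or IP option bits.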
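The interrupt-rate point made above — hold interrupts/packet around 0.1, but raise the rate when the total system enters a regime that needs it — can be sketched as follows. This is a hypothetical model with invented thresholds, not any real NIC's mitigation scheme.

```python
# Hypothetical sketch of adaptive interrupt mitigation: batch packets
# per interrupt, but interrupt at once in a latency-sensitive regime.

class Coalescer:
    def __init__(self, batch=10):   # ~0.1 interrupts per packet
        self.batch = batch
        self.pending = 0
        self.interrupts = 0
        self.packets = 0

    def on_packet(self, latency_sensitive=False):
        self.packets += 1
        self.pending += 1
        # Fire when the batch fills, or immediately when the workload
        # has shifted into a regime that requires more interrupts.
        if self.pending >= self.batch or latency_sensitive:
            self.interrupts += 1
            self.pending = 0


c = Coalescer()
for _ in range(1000):
    c.on_packet()
print(c.interrupts / c.packets)   # 0.1
```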