Maxwell Reid put forth on 8/8/2010 2:43 PM:
Hi Stan,
Hey Maxwell,
If you expand your definition of NFS server to include high-end systems (NetApp being a common example), the placement of the NFS server isn't necessarily limited to user space. Some vendors, like BlueArc, use FPGAs to handle the protocols. ORT in many cases is less than 1 msec on some of these boxes.
That is a valid point. I don't know whether Data ONTAP runs the NFS/CIFS code in kernel or user space, or whether NetApp offloads the code of these protocols to an FPGA. Given the gate count on Xilinx's latest Virtex FPGAs, I'm not so sure code as complex as NFS could fit on the die. Obviously, if it made performance and cost sense to do so, they could split the subroutines across multiple FPGA chips.
One thing is certain: the driver code for a low-level protocol such as Fiber Channel, or even iSCSI, is at least a factor of 10 smaller than the NFS code. I've not actually counted the lines of code in either code set, but that's a well-educated guess.
The code stack for a clustered filesystem such as GFS2, which the second cluster architecture in this discussion requires, is probably slightly larger than the XFS code stack, which again is much smaller than NFS, and it runs in kernel space by default because it is a filesystem driver. If you add up the most frequently executed critical-path machine instructions of NFS + the TCP-UDP/IP stack + the filesystem on the host (say, NetApp), and do the same for GFS2 + Fiber Channel, the latter is a much, much shorter and faster execution path, period.
The second of these architectures has one less layer in the stack than the first, and the layer it drops is the fat NFS layer itself. The transport layer protocol of the second architecture has less than 1/10th the complexity of the first.
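Roughly speaking, the two critical paths stack up like this (a simplified sketch; the exact layering obviously varies by implementation):

  fat NFS:  app -> VFS -> NFS client -> RPC/XDR -> TCP/IP -> NIC -> wire
            -> NIC -> TCP/IP -> RPC/XDR -> NFS server -> local filesystem
            -> block layer -> disk

  GFS2/FC:  app -> VFS -> GFS2 -> block layer -> FC HBA -> fabric
            -> array controller -> disk

(GFS2 does add DLM lock traffic over IP between the cluster nodes, but that's coordination chatter, not a layer every data byte has to traverse.)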
NFS/NAS box solutions can be made extremely fast, as in the case of NetApp, but they'll never be as fast as a good cluster filesystem atop a Fiber Channel SAN. As someone else pointed out (Noel, IIRC), under light to medium load you'll likely never notice a throughput/latency difference. Under heavy to extreme load, you'll notice performance degradation much sooner, and the cliff will be much steeper, with an NFS solution than with a good cluster filesystem and Fiber Channel SAN.
Dovecot clusters may be simpler to implement using NFS storage servers,
Simpler and more cost effective. The price/performance (per watt, if you want to go that far, seeing as you don't need 2 fabrics) generally favors NAS or some other kind of distributed-filesystem-based approach. The gains that come from parallelization are worth the cost of slightly less performance on an individual node basis, especially if you're dealing with N+2 or greater availability schemes.
To equal the performance of 4G/8G Fiber Channel strictly at the transport layer, one must use an end-to-end 10GbE infrastructure. At just about any port count required for a cluster of any size, the cost difference between FC and 10GbE switches and HBAs is negligible; in fact, 10GbE equipment is usually priced a bit higher than FC gear. And for the kind of redundancy you're talking about, the 10GbE network will require the same redundant topology as a twin-fabric FC network. You'll also need dual HBAs or dual-port HBAs, just as in the FC network. FC networks are just as scalable as Ethernet, and at this performance level, again, the cost is nearly the same.
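For anyone who wants the rough math behind that claim, here's a quick sketch (raw line rate times line coding efficiency only; it ignores FC framing and the TCP/IP/RPC overhead NFS piles on top, which only widens the gap):

# Back-of-the-envelope payload bandwidth per link.  FC through 8G uses
# 8b/10b line coding, 10GbE uses 64b/66b; frame and protocol overhead
# are deliberately ignored here, so real-world numbers come in lower.

def usable_mb_per_s(line_rate_gbaud, coding_efficiency):
    bits_per_s = line_rate_gbaud * 1e9 * coding_efficiency
    return bits_per_s / 8 / 1e6

for name, gbaud, eff in [("1GbE",  1.25,    0.8),
                         ("4G FC", 4.25,    0.8),
                         ("8G FC", 8.5,     0.8),
                         ("10GbE", 10.3125, 64.0 / 66.0)]:
    print("%-6s ~ %4.0f MB/s" % (name, usable_mb_per_s(gbaud, eff)))

# prints roughly 125, 425, 850, and 1250 MB/s respectively

Knock the usual TCP/IP and RPC overhead off that 1250 MB/s and 10GbE lands right around 8G FC territory, while plain GbE isn't even in the same league as 4G FC.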
These Distributed File Systems and specialized RPC mechanisms have higher overhead than even NFS, but they make up for it by increasing parallelization and using some very creative optimizations that you can use when you have that many machines, along with some other things that don't have useful analogs outside of Google.
I wish there were more information available on Google's home-grown distributed database architecture, and whether or not they are indeed storing the Gmail user data (emails, address books, etc.) in this database, or if they had to implement something else for the back end. If they're using the distributed db, then they had to have written their own custom POP/IMAP/etc. server, or _heavily_ modified someone else's. As you say, this is neat, but probably has little applicability outside of the Googles of the world.
|In a nutshell, you divide the aggregate application data equally across a
|number of nodes with local storage, and each node is responsible for handling
|only a specific subset of the total data.
You can do the same thing with NFS nodes, with the added benefit of using the automounter (on the low end) to "virtualize" the namespace, similar to what they do with render farms.
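Something like this, I assume (a minimal indirect automount map sketch; the server names and export paths are made up):

/etc/auto.master:
    /mail    /etc/auto.mail

/etc/auto.mail:
    shard00    -fstype=nfs,rw,hard    nfs01:/export/mail/shard00
    shard01    -fstype=nfs,rw,hard    nfs02:/export/mail/shard01

The clients all see one /mail tree, and which physical box actually serves a given shard is the map's problem, not the application's.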
Many years ago, when I first read about this, IIRC there were reliability and data ordering/arrival issues that prevented early adoption. In essence, when a host requested data via NFS, its NFS client would sometimes corrupt the file data because it didn't reassemble the parallel fragments properly. Have those bugs been quashed? I assume they have; as I said, it was a long time ago that I read about this. I don't use NFS, so I'm not current WRT its status.
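Getting back to the division of data I described: at its simplest it's nothing more than a deterministic hash of the mailbox name onto a node, along these lines (a toy sketch only, not what Google or anyone else actually runs):

import hashlib

NODE_COUNT = 256  # matches the 256-node example below

def node_for(mailbox):
    # Deterministically map a mailbox name to one of NODE_COUNT nodes.
    digest = hashlib.md5(mailbox.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODE_COUNT

# Every frontend computes the same answer independently, so no central
# lookup service sits in the hot path.
print(node_for("someuser@example.com"))

The ugly part is re-dividing the data when the node count changes, which is where the fancier consistent-hashing or directory-based schemes earn their keep.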
|The cluster host numbers I'm using are merely examples. Google for example
|probably has a larger IMAP cluster server count per datacenter than the 256
|nodes in my example--that's only about 6 racks packed with 42 x 1U servers.
|Given the number of gmail accounts in the US, and the fact they have less than
|2 dozen datacenters here, we're probably looking at thousands of 1U IMAP
|servers per datacenter.
The architecture you describe is very similar to WebEx, but they cap the number of accounts per node at some ridiculously small level, like 10,000 or something, and use SAS drives.
Interesting. I don't think I've heard of WebEx. I'll have to read up on them.
-- Stan