[Dovecot] Maildir over NFS

Stan Hoeppner stan at hardwarefreak.com
Mon Aug 9 14:54:18 EEST 2010


Maxwell Reid put forth on 8/8/2010 2:43 PM:
> Hi Stan,

Hey Maxwell,

> If you expand your definition of NFS server to include high end systems
> (NetApp being a common example), the placement of the NFS server isn't
> necessarily limited to user space.  Some vendors like BlueArc, use FPGAs to
> handle the protocols.   ORT in many cases is less than 1 msec on some of
> these boxes.

That is a valid point.  I don't know whether Data ONTAP runs its NFS/CIFS code
in kernel or user space, or whether NetApp offloads those protocols to an FPGA.
Given the gate count on Xilinx's latest Virtex FPGAs, I'm not so sure code as
complex as NFS could fit on a single die.  Obviously, if it made performance
and cost sense to do so, they could split the subroutines across multiple FPGA
chips.

One thing is certain: the driver code for a low-level protocol such as Fibre
Channel, or even iSCSI, is at least a factor of ten smaller than the NFS code.
I haven't actually counted the lines of either code base, but that is a
well-educated guess.
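
For what it's worth, anyone who wants to turn that guess into an actual number
can do a crude count against a Linux source tree.  The sketch below totals the
C source lines for the NFS client/server plus SUNRPC versus a couple of common
FC HBA drivers; the KERNEL_SRC path and the choice of driver directories are
assumptions to adjust for your own tree.

import os

KERNEL_SRC = "/usr/src/linux"   # assumption: an unpacked kernel source tree

NFS_DIRS = ["fs/nfs", "fs/nfsd", "net/sunrpc"]            # NFS + RPC transport
FC_DIRS  = ["drivers/scsi/qla2xxx", "drivers/scsi/lpfc"]  # two common FC HBAs

def count_lines(reldirs):
    # Total the lines in all .c/.h files under the given kernel subdirectories.
    total = 0
    for rel in reldirs:
        for root, _dirs, files in os.walk(os.path.join(KERNEL_SRC, rel)):
            for name in files:
                if name.endswith((".c", ".h")):
                    with open(os.path.join(root, name), errors="ignore") as f:
                        total += sum(1 for _ in f)
    return total

print("NFS + SUNRPC lines: %d" % count_lines(NFS_DIRS))
print("FC HBA driver lines: %d" % count_lines(FC_DIRS))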

The code stack for a clustered filesystem such as GFS2, which the second
cluster architecture in this discussion requires, is probably slightly larger
than the XFS code stack, which, again, is much smaller than NFS, and it runs
in kernel space by default because it is a filesystem driver.  If you add up
the most-used critical-path machine instructions of NFS + the TCP/UDP/IP stack
+ the filesystem on the host (say, NetApp's) and do the same for GFS2 + Fibre
Channel, the latter is a much, much shorter and faster execution path, period.

The second of these architectures has one less layer in the stack than the
first, and the layer it drops is the fat one: NFS.  The transport layer
protocol of the second architecture has less than 1/10th the complexity of
that of the first.

NFS/NAS box solutions can be made extremely fast, as NetApp demonstrates, but
they'll never be as fast as a good cluster filesystem atop a Fibre Channel SAN.
As someone else pointed out (Noel, IIRC), under light to medium load you'll
likely never notice a throughput/latency difference.  Under heavy to extreme
load, you'll see performance degrade much sooner with an NFS solution, and the
cliff will be much steeper, than with a good cluster filesystem and Fibre
Channel SAN.
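
Anyone who wants to see where that cliff starts on their own gear can get a
long way with a crude probe: time a batch of small fsync'd file creations,
which is roughly what a Maildir delivery does, against each mount.  The mount
points below are placeholders, not a recommendation:

import os, time

MOUNTS = {
    "nfs":  "/mnt/nfs_test",    # placeholder: an NFS mount
    "gfs2": "/mnt/gfs2_test",   # placeholder: a GFS2-over-FC mount
}
N = 200
PAYLOAD = b"x" * 4096           # one small message-sized write

for label, path in MOUNTS.items():
    start = time.time()
    for i in range(N):
        fname = os.path.join(path, "probe.%d" % i)
        fd = os.open(fname, os.O_CREAT | os.O_WRONLY, 0o600)
        os.write(fd, PAYLOAD)
        os.fsync(fd)            # force to stable storage, as an MDA would
        os.close(fd)
        os.unlink(fname)
    elapsed = time.time() - start
    print("%s: %.2f ms per create+fsync+unlink" % (label, 1000.0 * elapsed / N))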

>> Dovecot clusters may be simpler to implement using NFS storage servers,
> 
> Simpler and more cost effective.  The price/performance (per watt, if you
> want to go that far, seeing as you don't need 2 fabrics) generally favors NAS
> or some other kind of distributed file system based approach.  The gains
> that come from parallelization are worth it at the cost of slightly less
> performance on an individual node basis, especially if you're dealing with
> N+2 or greater availability schemes.

To equal the performance of 4G/8G Fibre Channel strictly at the transport
layer, one must use an end-to-end 10GbE infrastructure.  At just about any
port count required for a cluster of any size, the cost difference between FC
and 10GbE switches and HBAs is negligible; in fact, 10GbE equipment is usually
priced a bit higher than FC gear.  And for the kind of redundancy you're
talking about, the 10GbE network will require the same redundant topology as a
twin-fabric FC network.  You'll also need dual HBAs or dual-port HBAs, just as
in the FC network.  FC networks are just as scalable as Ethernet, and at this
performance level, again, the cost is nearly the same.
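
The raw transport numbers behind that are easy to work out from the published
signalling rates and line encodings.  The quick calculation below ignores
higher-layer framing (FCP versus TCP/IP + NFS), which is where the rest of
this comparison lives:

# name, signalling rate (Gbaud), encoding efficiency (payload bits/line bits)
links = [
    ("4G FC",   4.25,    8.0 / 10.0),   # 8b/10b encoding
    ("8G FC",   8.5,     8.0 / 10.0),   # 8b/10b encoding
    ("10GbE",  10.3125, 64.0 / 66.0),   # 64b/66b encoding
]
for name, gbaud, eff in links:
    gbit = gbaud * eff
    print("%-6s ~%.1f Gbit/s usable, ~%d MB/s" % (name, gbit, gbit * 1000 / 8))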

>   These distributed file systems and specialized RPC mechanisms have
> higher overhead than even NFS, but they make up for it by increasing
> parallelization and using some very creative optimizations that you can
> only use when you have many machines, plus some other things that don't
> have useful analogs outside of Google.

I wish there were more information available on Google's home-grown
distributed database architecture, and whether they are indeed storing the
Gmail user data (emails, address books, etc.) in this database, or whether
they had to implement something else for the back end.  If they're using the
distributed db, then they had to have written their own custom POP/IMAP/etc.
server, or _heavily_ modified someone else's.  As you say, this is neat, but
it probably has little applicability outside of the Googles of the world.

>  |In a nutshell, you divide the aggregate application data equally across a
>  |number of nodes with local storage, and each node is responsible for
> handling
>  |only a specific subset of the total data.
>
> You can do the same thing with NFS nodes, with the added benefit of using
> the automounter (on the low end) to "virtualize" the namespace, similar to
> what they do with render farms.

Many years ago, when I first read about this, IIRC there were reliability and
data ordering/arrival issues that prevented early adoption.  In essence, when
a host requested data via NFS, its NFS client would sometimes corrupt the file
data because it didn't reassemble the parallel fragments properly.  Have those
bugs been quashed?  I assume they have; as I said, it was a long time ago that
I read about this.  I don't use NFS, so I'm not current WRT its status.
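
For clarity, the "divide the aggregate data equally across nodes" scheme
quoted above needs nothing exotic: a static hash of the account name is enough
to pin each user's mail store to one backend (or one NFS export / automount
key).  The node names and the 16-node count below are made up purely for
illustration:

import hashlib

NODES = ["mailstore%02d" % i for i in range(16)]   # 16 hypothetical backends

def node_for(user):
    # Map a username to one backend, stable across runs and hosts.
    digest = hashlib.md5(user.lower().encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for user in ("alice", "bob", "maxwell"):
    print("%s -> %s" % (user, node_for(user)))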

> |The cluster host numbers I'm using are merely examples.  Google for example
> |probably has a larger IMAP cluster server count per datacenter than the 256
> |nodes in my example--that's only about 6 racks packed with 42 x 1U servers.
> |Given the number of gmail accounts in the US, and the fact they have less
> than
> |2 dozen datacenters here, we're probably looking at thousands of 1U IMAP
> |servers per datacenter.
> 
> The architecture you describe is very similar to WebEx, but they cap the
> number of accounts per node at some ridiculously small level, like 10,000 or
> something, and use SAS drives.

Interesting.  I don't think I've heard of WebEx.  I'll have to read up on them.

-- 
Stan

