[Dovecot] Maildir over NFS
I'm working on bringing up a new mail server to replace our current one.
Our current mail server runs Dovecot 1.1.16 with Postfix, using mbox
format. User inboxes are stored locally on the mail server, and all
other mail folders live in each user's home directory under mail, which
is NFS mounted on the mail server.
For our new mail server I'm looking to switch to the Maildir format.
Some years ago I remember reading that it was not recommended to run
Maildir format over NFS. Now I'm looking at several posts that seem to
indicate that Maildir should run fine over NFS. I'm a little concerned
about running Maildir over NFS, especially since the how-to conversion
pages I read would move all messages, including the inbox, into each
user's ~/Maildir folder. Having every single mail transaction go over
NFS doesn't seem like the smart thing to do.
So the question I have for the Dovecot team: does running Maildir over NFS work well? Or would you recommend that all user mail folders be stored locally on the mail server when using Maildir?
We have about 3400 users, with about 30k mail deliveries daily. Some users have tens of thousands of mail messages in hundreds of mail folders.
Thanks...
-- C. J. Keist Email: cj.keist@colostate.edu UNIX/Network Manager Phone: 970-491-0630 Engineering Network Services Fax: 970-491-5569 College of Engineering, CSU Ft. Collins, CO 80523-1301
All I want is a chance to prove 'Money can't buy happiness'
On 8/6/2010 12:31, CJ Keist wrote:
So question I have for the dovecot team, does running Maildir over NFS work well? Or would you recommend that all user mail folders be stored locally on the mail server when using Maildir?
We have about 3400 users, doing about 30k mail deliveries daily. Some users have 10's of thousands of mail messages in hundreds of mail folders.
Maildir (and Dovecot in general) will work fine with NFS.
~Seth
Quoting CJ Keist <cj.keist@colostate.edu>:
I'm working on bring up a new mail server to replace our current
one. Our current mail server is running dovecot 1.1.16, with
postfix using mbox format. User inboxes are stored locally on the
mail server and all other mail folders in users home directory under
mail which is NFS mounted on the mail server.
For our new mail server I'm looking to switch to the Maildir format.
Some years ago I remember reading that it was not recommended to
run Maildir format over NFS. Now I'm looking at several posts that
seem to indicate that Maildir should run fine over NFS. I'm a
little concerned about running Maildir over NFS, especially from the
howto conversion pages I read would move all messages over to the
User ~/Maildir folder including the inbox. So having every single
mail transaction going over NFS doesn't seem the smart thing to do.
So question I have for the dovecot team, does running Maildir over
NFS work well? Or would you recommend that all user mail folders be
stored locally on the mail server when using Maildir?
We have about 3400 users, doing about 30k mail deliveries daily.
Some users have 10's of thousands of mail messages in hundreds of
mail folders.
I'm happy with Maildir on NFS (NetApp), receiving 3 million messages per day.
I'm looking at the new methods for v2.0, but I'm still on the fence about them.
On Fri, 2010-08-06 at 13:31 -0600, CJ Keist wrote:
For our new mail server I'm looking to switch to the Maildir format.
Some years ago I remember reading that it was not recommended to run Maildir format over NFS. Now I'm looking at several posts that seem to indicate that Maildir should run fine over NFS. I'm a little concerned
Maildir was designed with NFS in mind. It is, always has been, and always will be perfectly suited to any mail-based system. It was, is, and likely always will be mbox that is dangerous to use over NFS.
about running Maildir over NFS, especially from the howto conversion pages I read would move all messages over to the User ~/Maildir folder including the inbox. So having every single mail transaction going over NFS doesn't seem the smart thing to do.
Actually you will not notice any difference. How do you think all the big boys do it now :) Granted, some opted for the SAN approach over NAS, but for mail, NAS is the better way to go IMHO, and plenty of large services (ISPs, corporations, universities, etc.) all use NAS.
One thing I strongly suggest however, for added security, is to use an internal private RFC1918 based LAN for NFS using your second ethernet port.
So question I have for the dovecot team, does running Maildir over NFS
For years, safely and happily.
We have about 3400 users, doing about 30k mail deliveries daily. Some users have 10's of thousands of mail messages in hundreds of mail folders.
I run systems where multiple front-end MTAs each process 1.1 million messages a day (that's accepted messages each, not counting rejected connections), along with several POP3 and webmail servers, all talking to the same mail system. It is best to use a dedicated device like a NetApp filer as the NAS rather than a garden-variety server, but even with the latter, so long as the hardware is good, I can't see how you could go wrong.
Oh, and I trust you are using virtual users and not system users. If you're still using system users, this might be a good time to at least "think" about switching, since with virtual users you can expand to multiple servers easily, in fact with only 2 minutes' effort.
Cheers
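The claim above that Maildir was designed with NFS in mind rests on how delivery works: no file locking is needed, because each message is written to a uniquely named file under tmp/ and only then moved into new/. The Python below is a minimal sketch of that delivery step, not Dovecot's or Postfix's actual code. The function name deliver and the exact file-name recipe (timestamp, PID, hostname) are assumptions made for illustration.

```python
import os
import socket
import tempfile
import time


def deliver(maildir: str, message: bytes) -> str:
    """Deliver one message into a Maildir without taking any lock."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)

    # Unique name built from timestamp, PID and hostname, so that two
    # hosts sharing one NFS export are unlikely to pick the same name.
    # "/" and ":" are escaped because they are not allowed in the name.
    host = socket.gethostname().replace("/", "\\057").replace(":", "\\072")
    unique = f"{time.time_ns()}.{os.getpid()}.{host}"
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)

    # Write the whole message under tmp/, where mail readers never look.
    # O_EXCL makes the open fail instead of overwriting an existing file.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(message)
        handle.flush()
        os.fsync(handle.fileno())

    # Publish it: only a complete message ever becomes visible in new/.
    os.rename(tmp_path, new_path)
    return new_path


with tempfile.TemporaryDirectory() as root:
    path = deliver(os.path.join(root, "Maildir"), b"Subject: test\n\nhello\n")
    print("delivered into:", os.path.basename(os.path.dirname(path)))
```

A reader scanning new/ therefore never sees a half-written message, which is the property mbox lacks when several hosts append to one shared file.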
Noel Butler put forth on 8/6/2010 4:29 PM:
Actually you will not notice any difference. How do you think all the big boys do it now :) Granted some opted for the SAN approach over NAS, but for mail, NAS is better way to go IMHO and plenty of large services, ISP, corporations, and universities etc, all use NAS.
The protocol overhead of the NFS stack is such that one way latency is in the 1-50 millisecond range, depending on specific implementations and server load. The one way latency of a fibre channel packet is in the sub 100 microsecond range and is fairly immune to system load. The performance of fibre channel is equal to local disk plus approximately one millisecond of additional effective head seek time due to switch latency, SAN array controller latency, and latency due to cable length. A filesystem block served out of SAN array controller cache returns to the kernel quicker than a block read from local disk that is not in cache because the former suffers no mechanical latency. Due to the complexity of the stack, NFS is far slower than either.
Those who would recommend NFS/NAS over fibre channel SAN have no experience with fibre channel SANs. I'm no fan of iSCSI SANs due to the reliance on TCP/IP for transport, and the low performance due to stack processing. However, using the same ethernet switches for both, iSCSI SAN arrays will also outperform NFS/NAS boxen by a decent margin.
Regarding the OP's case, given the low cost of new hardware, specifically locally attached RAID and the massive size and low cost of modern disks, I'd recommend storing user mail on the new mail host. It's faster and more cost effective than both NFS/SAN. Unless his current backup solution "requires" user mail dirs to be on that NFS server for nightly backup, local disk is definitely the way to go. Four 300GB 15k SAS drives on a good PCIe RAID card w/256-512MB cache in a RAID 10 configuration would yield ~350-400MB/s of real filesystem bandwidth, seek throughput equivalent to a 2 disk stripe--about 600 random seeks/s, 600GB of usable space, ability to sustain two simultaneous disk failures (assuming 1 failure per mirror pair), and cost effectiveness.
-- Stan
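To put the latency figures quoted above into perspective, here is a rough back-of-the-envelope sketch in Python. It takes the one-way latencies from the message (1 to 50 ms for NFS, about 100 microseconds for fibre channel) and assumes, purely for illustration, a mailbox operation that issues 10,000 strictly serial requests, one per message in a large folder. The function name and the request count are assumptions, and the model ignores caching, request pipelining, and disk time entirely.

```python
def serial_io_seconds(operations: int, one_way_latency_s: float) -> float:
    """Seconds spent waiting on transport latency for serial requests."""
    round_trip = 2 * one_way_latency_s  # request out, reply back
    return operations * round_trip


OPS = 10_000  # assumed: one request per message in a 10,000-message folder

cases = {
    "NFS, 1 ms one way": 1e-3,
    "NFS, 50 ms one way": 50e-3,
    "Fibre channel, 100 us one way": 100e-6,
}

for label, latency in cases.items():
    print(f"{label}: {serial_io_seconds(OPS, latency):.0f} s")
```

Under these assumptions the wait is roughly 20 s and 1000 s for the two NFS figures versus about 2 s for fibre channel. Real workloads overlap requests and hit caches, so the gap seen by users is usually much smaller than this worst case.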
All, Thanks for all the information. I think I'm leaning towards a locally attached fiber disk array. I see a couple of advantages: first, it will be faster than NFS; second, it will allow us to separate user home directory disk quotas from email disk quotas, something we have been wanting to do for a while.
Again thanks for all the view points and experiences with Maildir over NFS.
On Sat, 2010-08-07 at 09:17 -0600, CJ Keist wrote:
All, Thanks for all the information. I think I'm leaning towards locally attached fiber disk array. Couple of advantages I see, one it will be faster than NFS, second it will allow us to separate user home directory disk quotas and email disk quotas. Something we have been wanting to do for awhile.
You truly won't notice the difference between NAS and SAN in the "real world". We used to use SAN for mail at an old employer's; it adds slightly more complexity, and with large volumes of mail you want things as simple as possible. We found NAS much more reliable. The cost of the units is the same, as NetApps do both NAS and SAN. But if you do not intend to expand beyond the single server (with 3K users you have a long way to go unless you introduce redundancy), then an attached fiber disk array will be the cheaper option. Even a low-end FAS 2k series that we use for web was about $30K (to get in this country anyway); it is obviously much cheaper in the U.S.
What you could do is talk to vendors and explain what you're considering. Most often they will loan you devices for a few weeks, so you can configure your scenarios, run the tests, and then evaluate which is the best bang-for-buck way to go. But be wary of their sales pushes: you know what you want, they don't, and they may try to upsell you what you'll never need.
That said, the one important thing you need to remember, plan for the future.
All the best in your ventures Cheers
/usr/sbin/dovecot --exec-mail ext /usr/libexec/dovecot/expire-tool
Error: dlopen(/usr/lib64/dovecot/imap/lib11_imap_quota_plugin.so) failed: /usr/lib64/dovecot/imap/lib11_imap_quota_plugin.so: undefined symbol: capability_string Fatal: Couldn't load required plugins
On Sat, 2010-08-07 at 18:38 -0400, Jerrale G wrote:
/usr/sbin/dovecot --exec-mail ext /usr/libexec/dovecot/expire-tool
Error: dlopen(/usr/lib64/dovecot/imap/lib11_imap_quota_plugin.so)
expire-tool can't load IMAP plugins. You need the ugly script wrapper shown in http://wiki.dovecot.org/Plugins/Expire
CJ Keist put forth on 8/7/2010 10:17 AM:
All, Thanks for all the information. I think I'm leaning towards locally attached fiber disk array. Couple of advantages I see, one it will be faster than NFS, second it will allow us to separate user home directory disk quotas and email disk quotas. Something we have been wanting to do for awhile.
If you're going to do locally attached storage, why spend the substantial additional treasure required for a fiber channel array and HBA solution? You're looking at a minimum of $10k-$20k USD for a 'low end' FC array solution. Don't get me wrong, I'm a huge fan of FC SANs, but only when it makes sense. And it only makes sense if you have multiple hosts and you're slicing capacity (and performance) to each host. So at least get an entry level Qlogic FC switch so you can attach other hosts in the future, or even right away, once you realize what you can do with this technology. If you're set on the FC path, I recommend these components, all of which I've used and all of which are fantastic products with great performance and support:
http://www.qlogic.com/Products/SANandDataNetworking/FibreChannelSwitches/Pag... http://www.sandirect.com/product_info.php?products_id=1366
http://www.qlogic.com/Products/SANandDataNetworking/FibreChannelAdapters/Pag... http://www.sandirect.com/product_info.php?cPath=257_260_268&products_id=291
http://www.nexsan.com/sataboy.php http://www.sandirect.com/product_info.php?cPath=171_208_363&products_id=1434
Configure the first 12 of the 14 drives in the Nexsan as a RAID 1+0 array, the last two drives as hot spares. This will give you 6TB of usable array space sliceable to hosts as you see fit, 600 MB/s of sequential read throughput, ~1000 random seeks/sec to disk and 35k/s to cache, ultra fast rebuilds after drive failure (10x faster than a RAID 5 or 6 rebuild). RAID 1+0 does not suffer the mandatory RAID 5/6 read-modify-write cycle and thus the write throughput of RAID 1+0 has a 4:1 advantage over RAID 5 and an 8:1 advantage over RAID 6, if my math is correct.
The only downside of RAID 1+0 compared to RAID 5/6 is usable space after redundancy overhead. With our Nexsan unit above, RAID 5 with two hot spares will give us 11TB of usable space and RAID 6 will give us 10TB. Most people avoid RAID 5 these days because of the "write hole" silent data corruption issue, and go with RAID 6 instead, because they want to maximize their usable array space.
One of the nice things about the Nexsan is that you can mix & match RAID levels within the same chassis. Let's say you're wanting to consolidate the Postfix queues, INBOX and user maildir files, _and_ user home directories onto your new Nexsan array. You want faster performance for files that are often changing but you don't need as much total storage for these. You want more space for user home dirs but you don't need the fastest access times.
In this case you can create a RAID 1+0 array of the first 6 disks in the chassis, giving you 3TB of fast usable space with highest redundancy for the mail queue, user INBOX and maildir files. Take the next 7 disks and create a RAID 5 array yielding 6TB of usable space, double that of the "fast" array. We now have one disk left for a hot spare, and this is fine as long as you have another spare disk or 2 on the shelf. Speaking of spares, with one 14 drive RAID 1+0, you could actually gain 1TB of usable storage by using no hot spares and keeping spares on the shelf. A double drive failure is rare, and with RAID 1+0 it would be extremely rare for two failures to occur in the same mirror pair.
-- Stan
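The usable-space figures in the message above can be checked with a few lines of Python. This is only a sketch of the standard capacity rules (RAID 1+0 keeps half the drives, RAID 5 loses one drive to parity, RAID 6 loses two). The 1TB drive size is an assumption inferred from "12 drives in RAID 1+0 gives 6TB", and the function name is made up for illustration.

```python
def usable_drives(level: str, drives: int) -> int:
    """Number of drives' worth of usable capacity for a RAID level."""
    if level == "raid10":
        if drives % 2:
            raise ValueError("RAID 1+0 needs an even number of drives")
        return drives // 2   # every drive is mirrored once
    if level == "raid5":
        return drives - 1    # one drive's worth of parity
    if level == "raid6":
        return drives - 2    # two drives' worth of parity
    raise ValueError(f"unknown RAID level: {level}")


DRIVE_TB = 1  # assumed drive size, inferred from the figures in the text

layouts = [
    ("raid10", 12),  # 12 drives + 2 hot spares
    ("raid5", 12),
    ("raid6", 12),
    ("raid10", 6),   # "fast" array in the mixed layout
    ("raid5", 7),    # "big" array in the mixed layout
    ("raid10", 14),  # no hot spares, spares kept on the shelf
]

for level, count in layouts:
    print(f"{level} x{count}: {usable_drives(level, count) * DRIVE_TB} TB usable")
```

The results (6, 11, 10, 3, 6 and 7 TB) match the numbers quoted in the message, including the extra 1TB gained by giving up the two hot spares.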
On Sat, Aug 7, 2010 at 3:06 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
Noel Butler put forth on 8/6/2010 4:29 PM:
Actually you will not notice any difference. How do you think all the big boys do it now :) Granted some opted for the SAN approach over NAS, but for mail, NAS is better way to go IMHO and plenty of large services, ISP, corporations, and universities etc, all use NAS.
The protocol overhead of the NFS stack is such that one way latency is in the 1-50 millisecond range, depending on specific implementations and server load.
Yes, I would say NFS has greater overhead, but it allows for multi-system access where fiber channel does not, unless you're using clustered filesystems, which have their own issues with latency and lock management. It's also worth noting that the latency between the storage and the mail processing nodes is an insignificant bottleneck compared to the usual latency between the client and the mail processing nodes.
Those who would recommend NFS/NAS over fibre channel SAN have no experience with fibre channel SANs.
Bold statement there sir :-) From a price/performance standpoint, I'd argue NAS is far superior and more scalable, and generally there is far less management overhead involved with NAS than with SANs. And if you have a commercial high-end NAS, you don't have to deal with the idiosyncrasies of the host file system.
In my previous lives running large-scale mail systems handling up to 500k accounts (I now work with a team that manages an infrastructure much larger than that), the price of latency for a single node using NAS flattens out as the number of nodes increases. If you're handling a smaller system with one or two nodes and don't plan on growing significantly, DAS or SAN should be fine.
~Max
On Sat, 2010-08-07 at 15:18 -0700, Maxwell Reid wrote:
On Sat, Aug 7, 2010 at 3:06 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
Noel Butler put forth on 8/6/2010 4:29 PM:
Actually you will not notice any difference. How do you think all the big boys do it now :) Granted some opted for the SAN approach over NAS, but for mail, NAS is better way to go IMHO and plenty of large services, ISP, corporations, and universities etc, all use NAS.
The protocol overhead of the NFS stack is such that one way latency is in the 1-50 millisecond range, depending on specific implementations and server load.
Yes, I would say NFS has greater overhead, but it allows for multi system access where fiber channel does not unless you're using clustered filesystems which have their own issues with latency and lock management.... it's also worth noting that the latencies between the storage and mail processing nodes is an insignificant bottle neck compared to the usual latencies between the client and mail processing nodes.
*nods*
That's why my very first line said you will not "notice" any difference.
Those who would recommend NFS/NAS over fibre channel SAN have no experience with fibre channel SANs.
Bold statement there sir :-) From a price performance ratio, I'd argue NAS is far superior and scalable, and generally there is far less management
and with large mail systems, scalability is what it is all about
Cheers
Noel Butler put forth on 8/7/2010 5:34 PM:
Bold statement there sir :-) From a price performance ratio, I'd argue NAS is far superior and scalable, and generally there is far less management
and with large mail systems, scalability is what it is all about
True large mailbox count scalability requires a "shared nothing" storage architecture and an ultra cheap hardware footprint. The big 3 commercial database vendors all adopted this shared nothing storage strategy a decade ago for scaling OLAP, and then for OLTP. This shared nothing architecture actually works very well for almost any scalable small data transaction application, which includes email.
In a nutshell, you divide the aggregate application data equally across a number of nodes with local storage, and each node is responsible for handling only a specific subset of the total data. I'm guessing this is exactly what Google has done with Gmail, but I've yet to see a white paper detailing the hardware design of gmail, hotmail, or yahoo mail. I'd make a very educated guess that not one of them uses globally shared storage for user mailboxes, like the shared storage we've been discussing.
I would venture to guess that due to performance scalability needs into the tens of millions of mailboxen and, as importantly, geographically distributed scalability, and just as importantly, cost reasons, they probably do something like this mostly shared nothing model.
web server 1                              imap server cluster 1
web server 2                               ------------------
web server 3   \               /   host 1 | 2 disks mirrored |
web server 4    \             /            ------------------  \ DRBD + GFS
...              \   smart   /             ------------------  /
...                director        host 2 | 2 disks mirrored |
...              /    IMAP   \             ------------------
web server 509  /    proxy    \            ------------------
web server 510 /               \   host 1 | 2 disks mirrored |
web server 511                             ------------------  \ DRBD + GFS
web server 512                             ------------------  /
                                   host 2 | 2 disks mirrored |
                                           ------------------
                                          imap server cluster 128
An http balancer (not shown) would route requests to any free web server. The smart director behind the web servers contains a database with many metrics and routes new account creation to the proper IMAP server cluster. After the account is established, the user can log into any web server, but that user's mailbox data transactions are now forever routed to that particular cluster. Each cluster has 1 level of host redundancy and 2 levels of storage redundancy. Each IMAP cluster member would have a relatively low end, low power dual core processor, 4GB RAM, 2 x 7.2k RPM disks, and dual GigE ports--a pretty standard base configuration 1U server--and cheap. The target service level would be 100-400 concurrent logged in users per IMAP server cluster, for around 50,000 concurrent users across 256 IMAP servers.
This is not a truly shared nothing architecture, as we have an IMAP service based on a 2 node cluster. However, given the total size of these organizations' user bases, in the multiple 10s of millions of mailboxen, in practical terms, this is a shared nothing design, as only a few dozen to a hundred user mailboxen exist on each server. One host in each cluster pair resides in a different physical datacenter close to the user, so a catastrophic network or facility failure doesn't prevent the user from accessing his/her mailbox.
Depending on how much redundancy, and thus money, the provider wishes to pony up, each two node cluster above could be expanded to a node count sufficient to put one member of each cluster in each and every datacenter the provider has. The upside to this is massive redundancy and an enhanced user experience when an outage at one center occurs, or a backbone segment goes down. The downside is data synchronization across WAN links, with an n+1 increase in synchronization overhead for each cluster member added.
Having central shared mailbox storage for this size user count is impossible due to the geographically distributed datacenters these outfits operate. The shared nothing 2 node cluster approach I've suggested is probably pretty close to what these guys are using. If a mailbox server goes down, its cluster partner carries the load for both until the failed node is repaired/replaced. If both nodes go down, a very limited subset of the user base is affected.
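The routing behaviour described above (new accounts are placed on a cluster once, every later request for that user goes to the same cluster, and the surviving partner takes over when one host fails) can be sketched in a few lines of Python. This is a toy model, not any provider's real design. The class names are invented, and "fewest accounts" stands in for the "many metrics" the director would actually consult.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A two-host IMAP cluster; hosts_up[i] is True while host i+1 is alive."""
    name: str
    hosts_up: list = field(default_factory=lambda: [True, True])
    accounts: int = 0


class Director:
    """Pins each user to one cluster and routes to a live host in it."""

    def __init__(self, clusters):
        self.clusters = {c.name: c for c in clusters}
        self.home = {}  # user -> name of the cluster holding the mailbox

    def create_account(self, user: str) -> str:
        # Assumed placement rule: pick the cluster with the fewest accounts.
        cluster = min(self.clusters.values(), key=lambda c: c.accounts)
        cluster.accounts += 1
        self.home[user] = cluster.name
        return cluster.name

    def route(self, user: str) -> str:
        # The mailbox never moves: always the same cluster, any live host.
        cluster = self.clusters[self.home[user]]
        for index, up in enumerate(cluster.hosts_up):
            if up:
                return f"{cluster.name}/host{index + 1}"
        raise RuntimeError(f"both hosts of {cluster.name} are down")


director = Director([Cluster("cluster1"), Cluster("cluster2")])
director.create_account("alice")
print(director.route("alice"))

# Simulate losing host 1: the cluster partner carries the load.
director.clusters[director.home["alice"]].hosts_up[0] = False
print(director.route("alice"))
```

Only the users pinned to a cluster are affected when both of its hosts fail, which is the "very limited subset" mentioned above.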
If one centralized FC SAN or NFS/NAS array was used per datacenter in place of the local disks in these cheap clusters, costs would go through the roof. To duplicate the performance of the 256 x 7.2k local SATA disks (512 total but mirrors don't add to performance), you'd need an array controller with big cache (8-32GB), 40k random IO/s at the spindle level and 7.6GB/s of random IO spindle throughput. This would require an array controller with a minimum of 10 x 8Gb FC ports, or 8 x 10GbE NAS ports, and 128 x 15k SAS disks. Depending on whose unit meeting these specs that you buy, you're looking at somewhere in the neighborhood of $250-500k.
And given the cost of the switch and HBA infrastructure required in this central storage scenario, those 256 single socket cheap IMAP cluster machines are going to rapidly turn into 8 rather expensive dual socket 12 core processor nodes (24 cores per node, 192 total cores) with 128GB RAM each, 1TB total, same as the 256 el cheapo node aggregate. Each node will have an 8Gb FC HBA or 10GbE HBA, and a single connection to the SAN/NAS array controller, eliminating the need/cost for a dedicated switch. As configured, each of these servers will run ~$20k USD due to the 128GB of RAM, the ~$1,000 HBA, and due to the fact that vendors selling such boxen gouge customers on big memory configurations. Base price for the box with 2 x 12 core Opteron and 16GB RAM is ~$6k USD. Anyway, figure 8 x $20k = ~$160,000 for the IMAP cluster nodes. Add in $250-$500k for the SAN/NAS array, and you're looking at ~$410k to ~$660k.
A quantity buy of 256 of the aforementioned cheap single socket boxen will get the price down to well less than $1,000 each, probably more like $800, yielding a total cluster cost of about $200k USD for 256 cluster hosts--less than half that of the big smp SAN/NAS solution.
The cluster host numbers I'm using are merely examples. Google for example probably has a larger IMAP cluster server count per datacenter than the 256 nodes in my example--that's only about 6 racks packed with 42 x 1U servers. Given the number of gmail accounts in the US, and the fact they have less than 2 dozen datacenters here, we're probably looking at thousands of 1U IMAP servers per datacenter.
-- Stan
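The cost comparison in the message above is simple arithmetic, reproduced here as a small Python sketch so the totals are easy to re-run with different prices. All dollar figures are the estimates given in the message; the function names are invented for illustration.

```python
def central_storage_cost(array_usd: int, nodes: int = 8,
                         node_usd: int = 20_000) -> int:
    """Big SMP nodes plus one shared SAN/NAS array."""
    return nodes * node_usd + array_usd


def shared_nothing_cost(hosts: int = 256, host_usd: int = 800) -> int:
    """Cheap 1U hosts with local mirrored disks, no shared array."""
    return hosts * host_usd


low = central_storage_cost(250_000)
high = central_storage_cost(500_000)
cheap = shared_nothing_cost()

print(f"central storage: ${low:,} to ${high:,}")
print(f"shared nothing:  ${cheap:,}")
print(f"ratio vs. low estimate: {cheap / low:.2f}")
```

With the figures from the message this gives $410,000 to $660,000 for the central design against $204,800 for the 256 cheap hosts, just under half of the lower estimate.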
Maxwell Reid put forth on 8/7/2010 5:18 PM:
On Sat, Aug 7, 2010 at 3:06 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
Noel Butler put forth on 8/6/2010 4:29 PM:
Actually you will not notice any difference. How do you think all the big boys do it now :) Granted some opted for the SAN approach over NAS, but for mail, NAS is better way to go IMHO and plenty of large services, ISP, corporations, and universities etc, all use NAS.
The protocol overhead of the NFS stack is such that one way latency is in the 1-50 millisecond range, depending on specific implementations and server load.
Yes, I would say NFS has greater overhead, but it allows for multi system access where fiber channel does not unless you're using clustered filesystems which have their own issues with latency and lock management....
Care to elaborate on this point? The NFS server sits in user space. All cluster filesystem operations take place in kernel space.
Using a FC SAN array with a dovecot farm, disk blocks are read/written at local disk speeds and latencies. The only network communication is between the nodes via a dedicated switch or VLAN with QOS for lock management, which takes place in the sub 1 millisecond range, still much faster than NFS stack processing.
Using NFS the dovecot member server file request must traverse the local user space NFS client to the TCP/IP stack where it is then sent to the user space NFS server on the remote machine which grabs the file blocks and then ships them back through the multiple network stack layers.
Again, with FC SAN, it's a direct read/write to, for all practical purposes, local disk--an FC packet encapsulating SCSI commands over a longer cable, if you will, maybe through an FC switch hop or two, which are in the microsecond range.
Dovecot clusters may be simpler to implement using NFS storage servers, but they are far more performant and scalable using SAN storage and a cluster FS, assuming the raw performance of the overall SAN/NAS systems is equal, i.e. electronics complex, #disks and spindle speed, etc.
-- Stan
Hi Stan,
On Sat, Aug 7, 2010 at 10:27 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
Care to elaborate on this point? The NFS server sits in user space. All cluster filesystem operations take place in kernel space.
If you expand your definition of NFS server to include high end systems (NetApp being a common example), the placement of the NFS server isn't necessarily limited to user space. Some vendors like BlueArc, use FPGAs to handle the protocols. ORT in many cases is less than 1 msec on some of these boxes.
Dovecot clusters may be simpler to implement using NFS storage servers,
Simpler and more cost effective. The price/performance (per watt, if you want to go that far, seeing as you don't need 2 fabrics) generally favors NAS or some other kind of distributed file system based approach. The gains that come from parallelization are worth it at the cost of slightly less performance on an individual node basis, especially if you're dealing with N+2 or greater availability schemes.
These Distributed File Systems and specialized RPC mechanisms have higher overhead than even NFS, but they make up for it by increasing parallelization and by using some very creative optimizations that become possible when you have many machines, plus some other things that don't have useful analogs outside of Google.
|In a nutshell, you divide the aggregate application data equally across a |number of nodes with local storage, and each node is responsible for handling |only a specific subset of the total data.
You can do the same thing with NFS nodes, with the added benefit of using the automounter (on the low end) to "virtualize" the name space, similar to what they do with render farms.
|The cluster host numbers I'm using are merely examples. Google for example |probably has a larger IMAP cluster server count per datacenter than the 256 |nodes in my example--that's only about 6 racks packed with 42 x 1U servers. |Given the number of gmail accounts in the US, and the fact they have less than |2 dozen datacenters here, we're probably looking at thousands of 1U IMAP |servers per datacenter.
The architecture you describe is very similar to webex, but they cap the number of accounts per node at some ridiculously small level, like 10,000 or something and use SAS drives.
~Max
Maxwell Reid put forth on 8/8/2010 2:43 PM:
Hi Stan,
Hay Maxwell,
If you expand your definition of NFS server to include high end systems (NetApp being a common example), the placement of the NFS server isn't necessarily limited to user space. Some vendors like BlueArc, use FPGAs to handle the protocols. ORT in many cases is less than 1 msec on some of these boxes.
That is a valid point. I don't know if Data ONTAP runs NFS/CIFS code in kernel or user space, or if NetApp offloads the code of these protocols to an FPGA. Given the gate count on Xilinx's latest Virtex FPGAs, I'm not so sure code as complex as NFS could fit on the die. Obviously, if it made performance and cost sense to do so, they could split the subroutines across multiple FPGA chips.
One thing is for certain, and that is that the driver code for a low level protocol such as Fiber Channel, or even iSCSI, is at least a factor of 10 or more smaller than the NFS code. I've not actually counted the lines of code of either code set, but am making a well educated guess.
The code stack size for a clustered filesystem such as GFS2, which is a requirement in the second cluster architecture of this discussion, is probably slightly larger than the XFS code stack which, again, is going to be much smaller than NFS, and it runs in kernel space by default as it is a filesystem driver. If you add up the most used critical path machine instructions of the NFS + TCP-UDP/IP stack and filesystem on the host (say NetApp) and do the same for GFS2 + Fiber Channel, the latter is going to be a much much shorter and faster execution path, period.
The 2nd of these architectures has one less layer in the stack than the first, which is fat NFS. The transport layer protocol of the 2nd architecture has less than 1/10th the complexity of the first.
NFS/NAS box solutions can be made to be extremely fast, as in the case of NetApp, but they'll never be as fast as a good cluster filesystem atop a Fiber Channel SAN solution. As someone else pointed out, Noel IIRC, under light to medium load, you'll likely never notice a throughput/latency difference. Under heavy to extreme load, you'll notice performance degradation much sooner with an NFS solution, and the cliff will be much steeper, than with a good cluster filesystem and Fiber Channel SAN.
Dovecot clusters may be simpler to implement using NFS storage servers,
Simpler and more cost effective. The price/performance (per watt, if you want to go that far, seeing as you don't need 2 fabrics) generally favors NAS or some other kind of distributed file system based approach. The gains that come from parallelization are worth it at the cost of slightly less performance on an individual node basis, especially if you're dealing with N+2 or greater availability schemes.
To equal the performance of 4G/8G Fiber Channel strictly at the transport layer, one must use an end-to-end 10GbE infrastructure. At just about any port count required for a cluster of any size, the cost difference between FC and 10GbE switches and HBAs is negligible, and in fact 10GbE equipment is usually a bit higher priced than FC gear. And for the kind of redundancy you're talking about, the 10GbE network will require the same redundant topology as a twin fabric FC network. You'll also need dual HBAs or dual port HBAs just as in the FC network. FC networks are just as scalable as ethernet, and at this performance level, again, the cost is nearly the same.
These distributed file systems and specialized RPC mechanisms have higher overhead than even NFS, but they make up for it by increasing parallelization and using some very creative optimizations that you can use when you have many machines, plus some other things that don't have useful analogs outside of Google.
I wish there was more information available on Google's home grown distributed database architecture, and whether or not they are indeed storing the Gmail user data (emails, address books, etc) in this database, or if they had to implement something else for the back end. If they're using the distributed db, then they had to have written their own custom POP/IMAP/etc server, or _heavily_ modified someone else's. As you say, this is neat, but probably has little applicability outside of the Googles of the world.
|In a nutshell, you divide the aggregate application data equally across a |number of nodes with local storage, and each node is responsible for handling |only a specific subset of the total data.
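As a rough illustration of the partitioning described in the quoted text, here is a minimal sketch in Python. It assumes a fixed pool of 256 nodes (the example figure used in this thread) and a hash of the user name as the placement rule; the function name and the hashing choice are hypothetical and do not come from any poster's actual setup.

```python
import hashlib

NODE_COUNT = 256  # example figure from the thread; a real cluster will differ


def node_for_user(username: str, node_count: int = NODE_COUNT) -> int:
    """Return the index of the single node assumed to hold this user's mail."""
    # Use a stable hash rather than Python's built-in hash(), which is
    # randomized per process, so every front end computes the same placement.
    digest = hashlib.sha256(username.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


if __name__ == "__main__":
    for user in ("alice@example.com", "bob@example.com"):
        print(user, "-> node", node_for_user(user))
```

Plain modulo placement like this moves most users to a different node whenever the node count changes, so a real deployment would more likely keep an explicit user-to-node table or use consistent hashing.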
You can do the same thing with NFS nodes, with the added benefit of using the automounter (on the low end) to "virtualize" the name space, similar to what they do with render farms.
Many years ago when I first read about this, IIRC there were reliability and data ordering/arrival issues that prevented early adoption. In essence, when a host requested data via NFS, its NFS would sometimes corrupt the file data because it didn't reassemble the parallel fragments properly. Have those bugs been quashed? I assume they have, as I said, it was a long time ago I read about this. I don't use NFS so I'm not current WRT its status.
|The cluster host numbers I'm using are merely examples. Google for example |probably has a larger IMAP cluster server count per datacenter than the 256 |nodes in my example--that's only about 6 racks packed with 42 x 1U servers. |Given the number of gmail accounts in the US, and the fact they have less than |2 dozen datacenters here, we're probably looking at thousands of 1U IMAP |servers per datacenter.
The architecture you describe is very similar to webex, but they cap the number of accounts per node at some ridiculously small level, like 10,000 or something and use SAS drives.
Interesting. I don't think I've heard of webex. I'll have to read up on them.
-- Stan
On Mon, 2010-08-09 at 12:18 +0200, Edgar Fuß wrote:
The NFS server sits in user space. Oops? I don't know what Linux does, but with BSD, it has always been in-kernel.
Historically there was a user-space NFS-daemon (and can very probably be found via Google today). Actually there are stories about people using it because if you export filesystems via a user-space NFS daemon, you can change the mounting below without affecting the clients for NFS-v3.
But the kernel has had an NFS server for years, and all (somewhat common) distributions use it by default.
Bernd
-- Bernd Petrovitsch Email : bernd@petrovitsch.priv.at LUGA : http://www.luga.at
On 8/9/10 5:31 AM, Bernd Petrovitsch wrote:
On Mon, 2010-08-09 at 12:18 +0200, Edgar Fuß wrote:
The NFS server sits in user space. Oops? I don't know what Linux does, but with BSD, it has always been in-kernel.
Historically there was a user-space NFS-daemon (and can very probably be found via Google today). Actually there are stories about people using it because if you export filesystems via a user-space NFS daemon, you can change the mounting below without affecting the clients for NFS-v3.
But the kernel has had an NFS server for years, and all (somewhat common) distributions use it by default.
Debian still gives you a choice with packages nfs-user-server and nfs-kernel-server.
~Seth
On Mon, 2010-08-09 at 08:02 -0700, Seth Mattinen wrote:
On 8/9/10 5:31 AM, Bernd Petrovitsch wrote:
On Mon, 2010-08-09 at 12:18 +0200, Edgar Fuß wrote:
The NFS server sits in user space. Oops? I don't know what Linux does, but with BSD, it has always been in-kernel.
Historically there was a user-space NFS-daemon (and can very probably be found via Google today). Actually there are stories about people using it because if you export filesystems via a user-space NFS daemon, you can change the mounting below without affecting the clients for NFS-v3.
But the kernel has had an NFS server for years, and all (somewhat common) distributions use it by default.
Debian still gives you a choice with packages nfs-user-server and nfs-kernel-server.
Ah, I wasn't aware of that. Thanks for adding that.
Bernd
-- Bernd Petrovitsch Email : bernd@petrovitsch.priv.at LUGA : http://www.luga.at
On Sun, 2010-08-08 at 00:27 -0500, Stan Hoeppner wrote:
Care to elaborate on this point? The NFS server sits in user space. All cluster filesystem operations take place in kernel space.
hellloooooooo 2001 is calling you, NFS has been in kernel for many years now.
it seems I'm missing a lot of your posts, looks likely our SA anti_troll rules
Noel Butler wrote:
it seems I'm missing a lot of your posts, looks likely our SA anti_troll rules
Simply being wrong does not a troll make, however...
It was pointed out to you a couple of months ago on the spam-l list that you are making some really bad decisions wrt your SA/Mailscanner rules, some of which make you worse than a backscatter source - in fact, one of your rules caused you to actually spam the spam-l list, and then to compound the error you explicitly said you made no apologies for it (which got you moderated) - not very smart...
You also said earlier that you were missing some of my posts - and now some of Stan's - which makes it on the order of 100% certain that you are losing other legitimate mail, so obviously that rule is not the only ill-advised one you implement.
To each his own, just please don't start spamming the dovecot list.
--
Best regards,
Charles
On Tue, 2010-08-10 at 06:21 -0400, Charles Marcus wrote:
Simply being wrong does not a troll make, however...
No, what makes one a troll is the amount of constant crap emitted from a person over many mediums.
It was pointed out to you a couple of months ago on the spam-l list that
spam-l? They are IMHO mostly just chest beating wankers trying to justify their own self importance. The only person there I have actually heard of before in my 20 odd years of doing this, and respect, is Al, and he rarely posts.
you are making some really bad decisions wrt your SA/Mailscanner rules,
in your opinion, and yours alone
some of which make you worse than a backscatter source - in fact, one of your rules caused you to actually spam the spam-l list, and then to
Stop spreading lies. It wasn't spam at all, as you know. The content of the message was direct and legit, identical to many others' posts; the only difference is that I do not whitelist outbound mail, so it was caught and a warning was issued about the included content. I'll never change that and make no apologies, ever. I don't care that you disagree with that.
compound the error you explicitly said you made no apologies for it (which got you moderated) - not very smart...
Correct, and it goes to show what childish morons exist in such positions. I'll gladly put up on the web a copy of the archives for anyone (their copyright BS is invalid and legally unenforceable in this country because it is a "general mailing list that anyone is free to join"), and if you only saw the immature private mail terranson sent me :)
You also said earlier that you were missing some of my posts - and now some of Stan's - which makes it on the order of 100% certain that you are losing other legitimate mail, so obviously that rule is not the only ill-advised one you implement.
Negative. We have rules in place for ranters and trolls. We used to run an nntp-email gateway, so the rules were perfected over time. The rules ARE working EXACTLY as designed when they catch Stan's 10 page rants, but when he behaves himself, like the 5 liner he did a week ago, they have a chance of getting through.
In relation to you, the email service you use has a poor reputation here. That, along with the troll rules, will certainly ensure I don't see your posts unless I am so bored I go looking in archives (I only went looking for this after someone sent me a weird email a few days back and I was curious what he was going on about). In accordance with that, I luckily won't see any self justification rant you try to reply with, so don't bother. When a netblock or host is listed for miscreant activities, it tends to mean they are out forever; this includes high score additions in SA as well.
To each his own, just please don't start spamming the dovecot list.
Oh grow up, little child. I have been on this list for many many many years, so keep your high and mighty self importance to the spam-l (does L stand for "lamers"... starting to think so) list.
My apologies to anyone else who was bored enough to read this, but experience shows sometimes you have to lower yourself to a child's level to communicate with that child. (And I bet he still replies to me, because he still just won't get the fact that I won't see it *sigh*)
On 6.8.2010, at 20.31, CJ Keist wrote:
So question I have for the dovecot team, does running Maildir over NFS work well? Or would you recommend that all user mail folders be stored locally on the mail server when using Maildir?
As long as you have only a single Dovecot server accessing mails at the same time (that includes Dovecot LDA, but not non-Dovecot LDA), there won't be any reliability problems. For better performance you could then also put index files on local disk.
If you have multiple servers accessing mails at the same time, I recommend v2.0 with director enabled.
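As a rough sketch of the single-server case described above, the settings might look something like the following. The local index path is an assumption for illustration only; the setting names are the ones listed in Dovecot's NFS documentation, and the right values depend on the Dovecot version and the rest of the setup.

```
# Maildir in the NFS-mounted home directory, index files on local disk.
# /var/dovecot/indexes is a hypothetical local path; %u expands to the user name.
mail_location = maildir:~/Maildir:INDEX=/var/dovecot/indexes/%u

# With only one Dovecot server touching the mail, the NFS cache-flushing
# workarounds can stay at their default of "no". They are meant for setups
# where several servers access the same mailbox at the same time.
mail_nfs_storage = no
mail_nfs_index = no
```

Keeping the indexes local only makes sense while a given user's mail is always handled by the same server, which is the condition stated above.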
participants (11)
- Bernd Petrovitsch
- Charles Marcus
- CJ Keist
- Edgar Fuß
- Jerrale G
- Maxwell Reid
- Noel Butler
- Patrick Domack
- Seth Mattinen
- Stan Hoeppner
- Timo Sirainen