On 4/5/2012 3:02 PM, Emmanuel Noobadmin wrote:
Hi Emmanuel,
> I'm trying to improve the setup of our Dovecot/Exim mail servers to handle increasingly huge accounts (everybody treats their mailbox like Gmail's infinitely growing storage and keeps everything forever) by changing from Maildir to mdbox, and to take advantage of offloading older emails to alternative networked storage nodes.
I'll assume "networked storage nodes" means NFS rather than FC/iSCSI SAN--if it were a SAN you'd have said "SAN".
> The question now is whether a single large server or a number of 1U servers with the same total capacity would be better?
Less complexity and cost is always better. CPU throughput isn't a factor in mail workloads--it's all about IO latency. A 1U NFS server with a 12 drive JBOD is faster, cheaper, easier to set up and manage, sucks less juice and dissipates less heat than 4 1U servers each w/ 4 drives.

I don't recall seeing your user load or IOPS requirements, so I'm making some educated guesses WRT your required performance and total storage. I came up with the following system that should be close to suitable, for ~$10k USD. The 4 node system runs ~$12k USD, so at a difference of only ~$2k it isn't substantially more expensive at this size. But when we double the storage of each architecture we're at ~$19k for the single node vs ~$26k for an 8 node cluster, a difference of ~$7k. That's $1k shy of another 12 disk JBOD. Since CPU is nearly irrelevant for a mail workload, you can see it's much cheaper to scale capacity and IOPS with a single node w/ fat storage than with skinny nodes w/ thin storage.

Ok, so here's the baseline config I threw together:
http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/15351-15351-3328412-241644-332...
  8 cores is plenty, 2 boot drives mirrored on B110i, 16GB (4x4GB)

http://www.lsi.com/products/storagecomponents/Pages/LSISAS9205-8e.aspx

http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/12169-304616-3930445-3930445-3...
  w/ 12 2TB 7.2K SATA drives, configured as md concat+RAID1 pairs with 12
  allocation groups, 12TB usable.

Format the md device with the defaults:
$ mkfs.xfs /dev/md0
Mount with inode64. No XFS stripe alignment to monkey with. No md chunk size or anything else to worry about. XFS' allocation group design is pure elegance here.
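In case it helps, here's a rough sketch of that mdraid layering from the shell, assuming the 12 data drives show up as /dev/sdb through /dev/sdm and the mailstore mounts at /srv/mail (device names and mount point are placeholders, not part of the HP config):

# Build 6 RAID1 pairs from the 12 data drives
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdf /dev/sdg
mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdh /dev/sdi
mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sdj /dev/sdk
mdadm --create /dev/md6 --level=1 --raid-devices=2 /dev/sdl /dev/sdm
# Concatenate the 6 mirrors into a single linear (concat) array
mdadm --create /dev/md0 --level=linear --raid-devices=6 \
    /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6
# After mkfs.xfs /dev/md0 as above, mount with inode64
mount -o inode64 /dev/md0 /srv/mail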
If 12 TB isn't sufficient, or if you need more space later, you can daisy chain up to 3 additional D2600 JBODs for ~$8500 USD each, just add cables. This quadruples IOPS, throughput, and capacity--96TB total, 48TB net. Simply create 6 more mdraid1 devices and grow the linear array with them. Then do an xfs_growfs to bring the extra 12TB of free space into the filesystem.
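Roughly, that expansion would look like this, assuming the new JBOD's drives appear as /dev/sdn onward and the filesystem is mounted at /srv/mail (again, placeholder names--adjust to suit):

# Create 6 more RAID1 pairs from the new JBOD's 12 drives
mdadm --create /dev/md7 --level=1 --raid-devices=2 /dev/sdn /dev/sdo
# ...repeat for /dev/md8 through /dev/md12 with the remaining drive pairs...
# Append each new mirror to the existing linear array
mdadm --grow /dev/md0 --add /dev/md7
# ...repeat for /dev/md8 through /dev/md12...
# Grow the mounted filesystem into the new space
xfs_growfs /srv/mail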
If you're budget conscious and/or simply prefer quality, inexpensive whitebox/DIY type gear, as I do, you can get 24 x 2TB drives in one JBOD chassis for $7400 USD. That's twice the drives, capacity, and IOPS for ~$2500 less than the HP JBOD. And unlike the HP 'enterprise SATA' drives, the 2TB WD Black series have a 5 year warranty and work great with mdraid. Chassis and drives at Newegg:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047
http://www.newegg.com/Product/Product.aspx?Item=N82E16822136792
You can daisy chain 3 of these off one HBA SFF8088 port, 6 total on our LSI 9205-8e above, for a total of 144 2TB drives, 72 effective spindles in our concat+RAID1 setup, 144TB net space.
> Will be using RAID 1 pairs, likely XFS, based on reading Hoeppner's recommendation on this and the mdadm list.
To be clear, the XFS configuration I recommend/promote for mailbox storage is very specific and layered. The layers must all be used together to get the performance. These layers consist of using multiple hardware or software RAID1 pairs and concatenating them with an md linear array. You then format that md device with the XFS defaults, or a specific agcount if you know how to precisely tune AG layout based on disk size and your anticipated concurrency level of writers.
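For example, if you did want to pin the AG count yourself rather than take the mkfs-computed default, it's a single option at format time (12 here is just the figure from the baseline config above, not a universal value):

# Explicitly request 12 allocation groups instead of the computed default
mkfs.xfs -d agcount=12 /dev/md0
# Check the resulting geometry once mounted
xfs_info /srv/mail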
Putting XFS on a single RAID1 pair, as you seem to be describing above for the multiple "thin" node case, and hitting one node with parallel writes to multiple user mail dirs, you'll get less performance than EXT3/4 on that mirror pair--possibly less than half, depending on the size of the disks and thus the number of AGs created. The 'secret' to XFS performance with this workload is concatenation of spindles. Without it you can't spread the AGs--and thus the directories, and thus the parallel file writes--horizontally across the spindles, and that is the key. By spreading AGs 'horizontally' across the disks in a concat, instead of 'vertically' down a striped array, you accomplish two important things:
1. You dramatically reduce disk head seeking by using the concat array. With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs evenly spaced vertically down each disk in the array, following the stripe pattern. Each user mailbox is stored in a different directory, and XFS spreads those directories across the AGs. So if you have 96 users writing their dovecot index concurrently, you have, worst case, a minimum of 192 head movements occurring back and forth across the entire platter of each disk, and likely not well optimized by TCQ/NCQ. Why 192 instead of 96? The modification time in the directory metadata must be updated for each index file, among other things.
2. Because we decrease seeks dramatically, we also decrease response latency significantly. With the RAID1+concat+XFS we have 12 effective spindles (mirror pairs), each holding only 2 AGs spaced evenly down the platters. We have the same 4 user mail dirs in each AG, but in this case only 8 user mail dirs are contained on each spindle instead of portions of all 96. With the same 96 concurrent index writes, we end up with only 16 seeks per drive--again, one to update each index file and one to update the metadata.
Assuming these drives have a maximum seek rate of ~150 seeks per second, about average for 7.2k drives, these operations take 192/150 = 1.28 seconds on the RAID10 array but only 16/150 = 0.11 seconds on the concat array. Extrapolating, the concat array can handle 1.28/0.11 = 11.6 times as many, i.e. roughly 11.6 x 96 = ~1,114 concurrent user index updates in the time the RAID10 array handles 96--just over 10 times more users. Granted, these are rough theoretical numbers--an index plus metadata update isn't always going to cause a seek on every chunk in a stripe, etc. But this does paint a very accurate picture of the differences in mailbox workload disk seek patterns between XFS on concat and XFS on RAID10 with the same hardware. In production one should be able to handle at minimum 2x more users, probably many more, with the RAID1+concat+XFS setup vs the RAID10+XFS setup on the same hardware.
> Currently, I'm leaning towards multiple small servers because I think it should be better in terms of performance.
This usually isn't the case with mail. It's impossible to split up the user files across the storage nodes in a way that balances both block usage on each node and user access to those blocks. Hotspots are inevitable in both categories. You may achieve the same total performance as a single server, maybe slightly surpass it depending on user load, but you end up spending extra money building out resources that sit idle most of the time, in the case of CPU and NICs, or that are under/over utilized, in the case of disk capacity in each node. Switch ports aren't horribly expensive today, but you're still wasting some with the farm setup.
> At the very least, even if one node gets jammed up, the rest should still be able to serve up the emails for other accounts--that is, unless Dovecot gets locked up by that jammed transaction.
Some host failure redundancy is about all you'd gain from the farm setup. Dovecot shouldn't barf due to one NFS node being down, only hiccup. I.e. only the imap processes accessing files on the downed node would have trouble.
> Also, I could possibly arrange them in a sort of network RAID 1 to gain redundancy against single machine failure.
Now you're sounding like Charles Marcus, but worse. ;) Stay where you are, and brush your hair away from your forehead. I'm coming over with my branding iron that says "K.I.S.S"
> Would I be correct in these assumptions, or do actual experiences say otherwise?
Oracles on Mount Interweb profess that 2^5 nodes wide scale out is the holy grail. IBM's mainframe evangelists tell us to put 5 million mail users on a SystemZ with hundreds of Linux VMs.
I think bliss for most of us is found somewhere in the middle.
-- Stan