[Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

Stan Hoeppner stan at hardwarefreak.com
Sat Apr 7 13:19:46 EEST 2012


On 4/5/2012 3:02 PM, Emmanuel Noobadmin wrote:

Hi Emmanuel,

> I'm trying to improve the setup of our Dovecot/Exim mail servers to
> handle the increasingly huge accounts (everybody thinks it's like
> infinitely growing storage like gmail and stores everything forever in
> their email accounts) by changing from Maildir to mdbox, and to take
> advantage of offloading older emails to alternative networked storage
> nodes.

I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in
which case you'd have said "SAN".
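As for the Maildir to mdbox conversion and pushing older mail off to the
NFS boxes, that part should be straightforward with Dovecot 2.x--roughly
something like the sketch below.  The paths, the NFS mount point, and
the 30 day cutoff are placeholders, not recommendations; check the wiki
for your version.

# dovecot.conf: mdbox with its ALT storage area on the NFS mount
# (paths here are placeholders--adjust for your layout)
mail_location = mdbox:~/mdbox:ALT=/mnt/nfs-archive/%u/mdbox

# then a nightly cron job pushes mail older than 30 days to ALT storage
$ doveadm altmove -A savedbefore 30d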

> The question now is whether having a single large server or will a
> number of 1U servers with the same total capacity be better? 

Less complexity and cost is always better.  CPU throughput isn't a
factor in mail workloads--it's all about IO latency.  A 1U NFS server
with a 12-drive JBOD is faster, cheaper, easier to set up and manage,
sucks less juice and dissipates less heat than four 1U servers each
w/ 4 drives.  I don't recall seeing your user load or IOPS
requirements, so I'm making some educated guesses WRT your required
performance and total storage.  I came up with the following system
that should be close to suitable, for ~$10k USD.  The 4 node system
runs ~$12k USD--only ~$2k more, which isn't substantially higher.  But
when we double the storage of each
architecture we're at ~$19k, vs ~$26k for an 8 node cluster, a
difference of ~$7k.  That's $1k shy of another 12 disk JBOD.  Since CPU
is nearly irrelevant for a mail workload, you can see it's much cheaper
to scale capacity and IOPS with a single node w/fat storage than with
skinny nodes w/thin storage.  Ok, so here's the baseline config I threw
together:

http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/15351-15351-3328412-241644-3328421-4091396-4158470-4158440.html?dnr=1
8 cores is plenty, 2 boot drives mirrored on B110i, 16GB (4x4GB)
http://www.lsi.com/products/storagecomponents/Pages/LSISAS9205-8e.aspx
http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/12169-304616-3930445-3930445-3930445-3954787-4021626-4021628.html?dnr=1
w/ 12 2TB 7.2K SATA drives, configured as six md RAID1 pairs
concatenated into a linear array--12 XFS allocation groups, 12TB usable.
Format the md device with the defaults:

$ mkfs.xfs /dev/md0

Mount with inode64.  No XFS stripe alignment to monkey with.  No md
chunk size or anything else to worry about.  XFS' allocation group
design is pure elegance here.
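
For reference, the whole assembly is only a handful of commands.  A
rough sketch, assuming the 12 JBOD disks show up as sdb through sdm and
/srv/mail is your mount point (both placeholders):

# six RAID1 pairs from the 12 JBOD drives (device names are examples)
$ mdadm -C /dev/md1 -l 1 -n 2 /dev/sdb /dev/sdc
    ... md2 through md5 likewise ...
$ mdadm -C /dev/md6 -l 1 -n 2 /dev/sdl /dev/sdm

# concatenate the six pairs into one linear device
$ mdadm -C /dev/md0 -l linear -n 6 /dev/md1 /dev/md2 /dev/md3 \
    /dev/md4 /dev/md5 /dev/md6

# format with the defaults, mount with inode64 so inodes can be
# allocated in every AG rather than only within the first 1TB
$ mkfs.xfs /dev/md0
$ mount -o inode64 /dev/md0 /srv/mail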

If 12 TB isn't sufficient, or if you need more space later, you can
daisy chain up to 3 additional D2600 JBODs for ~$8500 USD each, just add
cables.  This quadruples IOPS, throughput, and capacity--96TB total,
48TB net.  For each JBOD you add, simply create 6 more mdraid1 devices
and grow the linear array with them, then do an xfs_growfs to bring the
extra 12TB of free space into the filesystem.
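
The grow itself is a two minute job.  Roughly, assuming the new JBOD's
six mirrors become md7 through md12 (example names again):

# build the six new RAID1 pairs as before, then grow the linear
# array one mirror at a time
$ mdadm --grow /dev/md0 --add /dev/md7
    ... repeat for md8 through md12 ...

# grow the filesystem into the new space (run against the mount point)
$ xfs_growfs /srv/mail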

If you're budget conscious and/or simply prefer quality inexpensive
whitebox/DIY type gear, as I do, you can get 24 x 2TB drives in one JBOD
chassis for $7400 USD.  That's twice the drives, capacity, and IOPS, for
~$2500 less than the HP JBOD.  And unlike the HP 'enterprise SATA'
drives, the 2TB WD Black series have a 5 year warranty, and work great
with mdraid.  Chassis and drives at Newegg:

http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047
http://www.newegg.com/Product/Product.aspx?Item=N82E16822136792

You can daisy chain 3 of these off one HBA SFF8088 port, 6 total on our
LSI 9205-8e above, for a total of 144 2TB drives, 72 effective spindles
in our concat+RAID1 setup, 144TB net space.

> Will be
> using RAID 1 pairs, likely XFS based on reading Hoeppner's
> recommendation on this and the mdadm list.

To be clear, the XFS configuration I recommend/promote for mailbox
storage is very specific and layered.  The layers must all be used
together to get the performance.  These layers consist of using multiple
hardware or software RAID1 pairs and concatenating them with an md
linear array.  You then format that md device with the XFS defaults, or
a specific agcount if you know how to precisely tune AG layout based on
disk size and your anticipated concurrency level of writers.
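
If you do go the explicit route, it's a single mkfs.xfs knob, and
xfs_info shows what you ended up with.  The 24 below is only an example
for a 12 pair concat (2 AGs per effective spindle), not a blanket
number:

# explicit allocation group count (example value only)
$ mkfs.xfs -d agcount=24 /dev/md0

# verify the AG layout once the filesystem is mounted
$ xfs_info /srv/mail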

Putting XFS on a single RAID1 pair, as you seem to be describing above
for the multiple "thin" node case, and hitting one node with parallel
writes to multiple user mail dirs, you'll get less performance than
EXT3/4 on that mirror pair--possibly less than half, depending on the
size of the disks and thus the number of AGs created.  The 'secret' to
XFS performance with this workload is concatenation of spindles.
Without it you can't spread the AGs--and thus directories, and thus
parallel file writes--horizontally across the spindles, and that is the
key.  By
spreading AGs 'horizontally' across the disks in a concat, instead of
'vertically' down a striped array, you accomplish two important things:

1.  You dramatically reduce disk head seeking by using the concat array.
 With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
evenly spaced vertically down each disk in the array, following the
stripe pattern.  Each user mailbox is stored in a different directory.
Each directory was created in a different AG.  So if you have 96 users
writing their Dovecot index concurrently, you have, in the worst case, a
minimum of 192 head movements occurring back and forth across the entire
platter of each disk, and likely not well optimized by TCQ/NCQ.  Why 192
instead of 96?  The modification time in the directory metadata must be
updated for each index file, among other things.

2.  Because we decrease seeks dramatically we also decrease response
latency significantly.  With the RAID1+concat+XFS setup we have 12
effective spindles (mirrored pairs), each with only 2 AGs spaced evenly
down the platters.  We have the same 4 user mail dirs in each AG, but in
this case only 8 user mail dirs are contained on each spindle instead of
portions of all 96.  With the same 96 concurrent writes to indexes, we
end up with only 16 seeks per drive--again, one to update each index
file and one to update the directory metadata.

Assuming these drives have a max seek rate of 150 seeks/second, which is
about average for 7.2K drives, it will take 192/150 = 1.28 seconds for
these operations on the RAID10 array.  With the concat array it will
only take 16/150 = 0.11 seconds.  Extrapolating from that, the concat
array can handle 1.28/0.11 ≈ 11.6 times the load, or roughly 11.6 * 96 ≈
1,114 concurrent user index updates in the same time the RAID10 array
handles 96--just over 10 times more users.  Granted, these are rough
theoretical numbers--an index plus
metadata update isn't always going to cause a seek on every chunk in a
stripe, etc.  But this does paint a very accurate picture of the
differences in mailbox workload disk seek patterns between XFS on concat
and RAID10 with the same hardware.  In production one should be able to
handle at minimum 2x more users, probably many more, with the
RAID1+concat+XFS vs RAID10+XFS setup on the same hardware.
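
If you want to plug in your own user count and spindle count, the back
of envelope math fits in one awk invocation (150 seeks/sec assumed, per
above):

$ awk -v users=96 -v pairs=12 -v sps=150 'BEGIN {
      raid10 = users * 2 / sps;          # model: every disk seeks for every write
      concat = users * 2 / pairs / sps;  # writes split across the mirror pairs
      printf "RAID10: %.2fs   concat: %.2fs\n", raid10, concat }'
RAID10: 1.28s   concat: 0.11s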

> Currently, I'm leaning towards multiple small servers because I think
> it should be better in terms of performance. 

This usually isn't the case with mail.  It's impossible to split up the
user files across the storage nodes in a way that balances block usage
on each node and user access to those blocks.  Hotspots are inevitable
in both categories.  You may achieve the same total performance as a
single server, maybe slightly surpass it depending on user load, but you
end up spending extra money on building resources that are idle most of
the time, in the case of CPU and NICs, or under/over utilized, in the
case of disk capacity in each node.  Switch ports aren't horribly
expensive today, but you're still wasting some with the farm setup.

> At the very least even if
> one node gets jammed up, the rest should still be able serve up the
> emails for other accounts that is unless Dovecot will get locked up by
> that jammed transaction. 

Some host failure redundancy is about all you'd gain from the farm
setup.  Dovecot shouldn't barf due to one NFS node being down, only
hiccup.  I.e. only the imap processes accessing files on the downed
node would have trouble.

> Also, I could possibly arrange them in a sort
> of network raid 1 to gain redundancy over single machine failure.

Now you're sounding like Charles Marcus, but worse. ;)  Stay where you
are, and brush your hair away from your forehead.  I'm coming over with
my branding iron that says "K.I.S.S"

> Would I be correct in these or do actual experiences say otherwise?

Oracles on Mount Interweb profess that 2^5 nodes wide scale out is the
holy grail.  IBM's mainframe evangelists tell us to put 5 million mail
users on a SystemZ with hundreds of Linux VMs.

I think bliss for most of us is found somewhere in the middle.

-- 
Stan

