If you put XFS on a single RAID1 pair, as you seem to be describing above for the multiple "thin" node case, and hit one node with parallel writes to multiple user mail dirs, you'll get less performance than EXT3/4 on that mirror pair--possibly less than half, depending on the size of the disks and thus the number of AGs created. The 'secret' to XFS performance with this workload is concatenation of spindles. Without it you can't spread the AGs--thus directories, thus parallel file writes--horizontally across the spindles, and this is the key. By spreading AGs 'horizontally' across the disks in a concat, instead of 'vertically' down a striped array, you accomplish two important things:
- You dramatically reduce disk head seeking by using the concat array. With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs evenly spaced vertically down each disk in the array, following the stripe pattern. Each user mailbox is stored in a different directory, and each directory is created in a different AG. So if you have 96 users writing their dovecot index concurrently, in the worst case you have a minimum of 192 head movements occurring back and forth across the entire platter of each disk, and they're likely not well optimized by TCQ/NCQ. Why 192 instead of 96? Because the modification time in the directory metadata must be updated for each index file, among other things.
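To make that concrete, here's roughly what such a layout looks like in md terms (device names, disk counts, and the agcount below are only illustrative, not a recipe):

  # four RAID1 pairs out of eight disks (device names are examples)
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sde /dev/sdf
  mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdg /dev/sdh
  # concatenate the pairs (md linear) instead of striping them
  mdadm --create /dev/md5 --level=linear --raid-devices=4 \
      /dev/md1 /dev/md2 /dev/md3 /dev/md4
  # an agcount that's a multiple of the number of pairs keeps each AG
  # wholly on one mirror pair; mkfs prints the resulting agcount/agsize
  mkfs.xfs -d agcount=8 /dev/md5

And to see which AG (and therefore which mirror pair in a concat) a given mailbox's index file actually landed in, xfs_bmap's verbose output includes an AG column for each extent (the path here is just an example):

  xfs_bmap -v /srv/mail/user0042/dovecot.index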
Does the XFS allocator automatically distribute new directories across AGs in this way even when disk usage is extremely light, i.e., a freshly formatted filesystem where the user directories are created first and the actual mailbox contents are copied into them afterwards?
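One crude way to check this without reading the allocator code, as far as I understand it: XFS packs the AG number into the high bits of the inode number, so if freshly created top-level directories really do rotate across AGs, their inode numbers should come out widely separated rather than near-consecutive. A quick test on a throwaway mount point (paths are arbitrary):

  # create a batch of empty "user" directories on a fresh filesystem
  for u in $(seq -w 0 15); do mkdir /mnt/test/user$u; done
  # widely spaced inode numbers suggest the directories landed in
  # different AGs; near-consecutive numbers suggest they did not
  ls -di /mnt/test/user*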
If this is indeed the case, then what you describe is a wondrous revelation, since it means you're scaling out the number of simultaneous metadata reads+writes per second as you add RAID1 pairs. I'm assuming, of course--though I should look at the code--that the metadata locks imposed by the filesystem "distribute" as the number of pairs increases; if it's all just one Big Lock, then that wouldn't be the case.
Forgive my laziness--I could just experiment and take a look at the on-disk structures myself, but I don't have four empty drives handy.
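(Though, thinking about it, spare drives probably aren't needed: sparse files behind loop devices should be enough to build the arrays and poke at the on-disk structures. File and loop device names here are arbitrary:)

  # sparse backing files cost almost no real disk space
  for i in 0 1 2 3; do truncate -s 100G /var/tmp/xfs-test$i.img; done
  for i in 0 1 2 3; do losetup /dev/loop$i /var/tmp/xfs-test$i.img; done
  # then build the RAID1 pairs and the linear concat over /dev/loop0..3
  # as in the earlier sketch (adding --assume-clean to the RAID1 creates
  # skips the pointless resync), mkfs.xfs it, and inspect the result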
The bandwidth improvements due to striping (RAID0/5/6 style) are no help for metadata-intensive IO loads, and I suspect they're of little value even for mdbox loads, unless the mdbox maximum file size is set to something pretty large, no?
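For reference, the knob I have in mind there is mdbox_rotate_size (assuming I'm remembering the setting name right); unless it's raised well above its default of a few megabytes, sequential stripe bandwidth would rarely come into play:

  # show the current mdbox file size cap
  doveconf mdbox_rotate_size
  # raising it in dovecot.conf is what would make stripe bandwidth matter, e.g.
  #   mdbox_rotate_size = 32M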
Have you tried other filesystems and seen if they distribute metadata in a similarly efficient and scalable manner across concatenated drive sets?
Is there ANY point to using striping at all, a la "RAID10", in this scenario? I'd have thought just making as many RAID1 pairs out of your drives as possible would be the ideal strategy--is this not the case?
=R=