[Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

Robin dovecot at r.paypc.com
Sat Apr 7 23:45:08 EEST 2012


> Putting XFS on a single RAID1 pair, as you seem to be describing above
> for the multiple "thin" node case, and hitting one node with parallel
> writes to multiple user mail dirs, you'll get less performance than
> EXT3/4 on that mirror pair--possibly less than half, depending on the
> size of the disks and thus the number of AGs created.  The 'secret' to
> XFS performance with this workload is concatenation of spindles.
> Without it you can't spread the AGs--thus directories, thus parallel
> file writes--horizontally across the spindles--and this is the key.  By
> spreading AGs 'horizontally' across the disks in a concat, instead of
> 'vertically' down a striped array, you accomplish two important things:
>
> 1.  You dramatically reduce disk head seeking by using the concat array.
>   With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
> evenly spaced vertically down each disk in the array, following the
> stripe pattern.  Each user mailbox is stored in a different directory.
> Each directory was created in a different AG.  So if you have 96 users
> writing their dovecot index concurrently, you have at worst case a
> minimum 192 head movements occurring back and forth across the entire
> platter of each disk, and likely not well optimized by TCQ/NCQ.  Why 192
> instead of 96?  The modification time in the directory metadata must be
> updated for each index file, among other things.

Does the XFS allocator automatically distribute AGs in this way even 
when disk usage is extremely light, i.e., on a freshly formatted 
system with the user directories created first and the actual mailbox 
contents copied in afterwards?
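
I suppose one could check empirically without reading the code: XFS 
encodes the AG number in the high bits of an inode number (at least 
with inode64-style behaviour), so the spread of st_ino across freshly 
created directories should hint at whether they're being rotored 
across AGs.  A rough sketch - /mnt/xfs-test is a placeholder for a 
fresh XFS mount, and the "big jump = new AG" reading is my own 
assumption:

    import os

    MNT = "/mnt/xfs-test"   # hypothetical freshly formatted XFS mount
    inos = []
    for i in range(32):
        d = os.path.join(MNT, "user%02d" % i)
        os.mkdir(d)                      # one directory per "mailbox"
        inos.append(os.stat(d).st_ino)
    inos.sort()
    # Large gaps between consecutive inode numbers suggest the
    # directories landed in different allocation groups.
    for a, b in zip(inos, inos[1:]):
        print(b - a)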

If this is indeed the case, then what you describe is a wondrous 
revelation, since you'd be scaling out the number of simultaneous 
metadata reads+writes per second as you add RAID1 pairs, if my 
understanding is correct.  I'm assuming, of course (I really should 
look at the code), that the metadata locks imposed by the filesystem 
"distribute" as the number of pairs increases - if it's all just one 
Big Lock, that wouldn't be the case.
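
To make that concrete for myself, here's a toy model - my own 
sketch, nothing from the XFS code - of distinct seek targets per 
spindle, assuming 24 disks, 24 AGs, 96 writers, and directories 
rotored across AGs as you describe:

    DISKS = 24
    PAIRS = DISKS // 2   # concat case: 12 RAID1 pairs, 2 AGs per pair
    AGS = 24
    WRITERS = 96

    active_ags = set(w % AGS for w in range(WRITERS))  # 24 hot AGs

    # RAID10 stripe: every AG is a horizontal band across EVERY disk,
    # so each head visits one band per hot AG, full-platter seeks.
    stripe_targets = len(active_ags)

    # Concat of mirror pairs: AGs 2p and 2p+1 live wholly on pair p,
    # so each head only visits the (at most) two AGs on its own pair.
    concat_targets = max(
        len([ag for ag in active_ags if ag // (AGS // PAIRS) == p])
        for p in range(PAIRS))

    print("stripe: %2d seek bands per disk, whole platter" %
          stripe_targets)
    print("concat: %2d seek bands per disk, local to the pair" %
          concat_targets)

Crude, but it shows the horizontal-vs-vertical point: the same 24 hot 
AGs cost every striped spindle 24 seek bands, while each concat 
spindle only ever services its own two.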

Forgive my laziness - I could just experiment and look at the on-disk 
structures myself, but I don't have four empty drives handy to 
experiment with.
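
Then again, sparse files behind loop devices would do for poking at 
the on-disk structures, if not the seek behaviour.  Something like 
the following, assuming root plus LVM2 and xfsprogs, with all the 
paths made up:

    import subprocess

    def run(cmd):
        print("+ " + " ".join(cmd))
        return subprocess.check_output(cmd).decode().strip()

    loops = []
    for i in range(4):
        img = "/tmp/xfsdisk%d.img" % i
        with open(img, "wb") as f:
            f.truncate(2 * 1024 ** 3)    # 2 GB sparse "drive"
        loops.append(run(["losetup", "-f", "--show", img]))

    run(["pvcreate"] + loops)
    run(["vgcreate", "vgconcat"] + loops)
    # LVM's default allocation policy is linear, i.e. a plain concat:
    run(["lvcreate", "-l", "100%FREE", "-n", "test", "vgconcat"])
    run(["mkfs.xfs", "/dev/vgconcat/test"])
    print(run(["xfs_info", "/dev/vgconcat/test"]))  # agcount, agsize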

The bandwidth improvements from striping (RAID0/5/6 style) are no 
help for metadata-intensive I/O loads, and I suspect of little value 
even for mdbox loads, unless the mdbox maximum size is set to 
something fairly large - no?
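
Back-of-envelope: a single file only spans multiple spindles once it 
outgrows the per-disk chunk.  The chunk size and disk count below are 
arbitrary examples, and the default mdbox_rotate_size is 2M if I 
remember right:

    CHUNK = 512 * 1024        # assumed 512 KiB per-disk chunk
    DATA_DISKS = 12           # assumed data spindles in the stripe

    for size in (32 * 1024, 2 * 1024 ** 2, 64 * 1024 ** 2):
        spindles = min(DATA_DISKS, max(1, (size + CHUNK - 1) // CHUNK))
        print("%6d KiB file touches ~%2d spindle(s)" %
              (size // 1024, spindles))

So a default-sized 2 MB mdbox file touches maybe four spindles of 
bandwidth at best, while still paying the seek costs everywhere.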

Have you tried other filesystems and seen if they distribute metadata in 
a similarly efficient and scalable manner across concatenated drive sets?

Is there ANY point to using striping at all, a la "RAID10", here?  
I'd have thought that making as many RAID1 pairs out of your drives 
as possible would be the ideal strategy - is this not the case?

=R=

