If you put XFS on a single RAID1 pair, as you seem to be describing above for the multiple "thin" node case, and hit one node with parallel writes to multiple user mail dirs, you'll get less performance than EXT3/4 on that mirror pair--possibly less than half, depending on the size of the disks and thus the number of AGs created. The 'secret' to XFS performance with this workload is concatenation of spindles. Without it you can't spread the AGs--thus directories, thus parallel file writes--horizontally across the spindles, and this is the key. By spreading AGs 'horizontally' across the disks in a concat, instead of 'vertically' down a striped array, you accomplish two important things:
- You dramatically reduce disk head seeking by using the concat array. With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs evenly spaced vertically down each disk in the array, following the stripe pattern. Each user mailbox is stored in a different directory, and each directory is created in a different AG. So if you have 96 users writing their dovecot index concurrently, in the worst case you have a minimum of 192 head movements occurring back and forth across the entire platter of each disk, and they're likely not well optimized by TCQ/NCQ. Why 192 instead of 96? Because the modification time in the directory metadata must be updated for each index file, among other things.
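To make that concrete, here's roughly what such a layout looks like in md terms (device names, disk counts, and the agcount below are only illustrative, not a recipe):

  # four RAID1 pairs out of eight disks (device names are examples)
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sde /dev/sdf
  mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdg /dev/sdh
  # concatenate the pairs (md linear) instead of striping them
  mdadm --create /dev/md5 --level=linear --raid-devices=4 \
      /dev/md1 /dev/md2 /dev/md3 /dev/md4
  # an agcount that's a multiple of the number of pairs keeps each AG
  # wholly on one mirror pair; mkfs prints the resulting agcount/agsize
  mkfs.xfs -d agcount=8 /dev/md5

And to see which AG (and therefore which mirror pair in a concat) a given mailbox's index file actually landed in, xfs_bmap's verbose output includes an AG column for each extent (the path here is just an example):

  xfs_bmap -v /srv/mail/user0042/dovecot.index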
Does the XFS allocator automatically distribute new directories across AGs in this way even when disk usage is extremely light, i.e., a freshly formatted filesystem where the user directories are created first and the actual mailbox contents are copied into them afterwards?
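One crude way to check this without reading the allocator code, as far as I understand it: XFS packs the AG number into the high bits of the inode number, so if freshly created top-level directories really do rotate across AGs, their inode numbers should come out widely separated rather than near-consecutive. A quick test on a throwaway mount point (paths are arbitrary):

  # create a batch of empty "user" directories on a fresh filesystem
  for u in $(seq -w 0 15); do mkdir /mnt/test/user$u; done
  # widely spaced inode numbers suggest the directories landed in
  # different AGs; near-consecutive numbers suggest they did not
  ls -di /mnt/test/user*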
If this is indeed the case, then what you describe is a wondrous revelation, since it means you're scaling out the number of simultaneous metadata reads+writes per second as you add RAID1 pairs. I'm assuming, of course--though I should look at the code--that the metadata locks imposed by the filesystem "distribute" as the number of pairs increases; if it's all just one Big Lock, then that wouldn't be the case.
Forgive my laziness--I could just experiment and take a look at the on-disk structures myself, but I don't have four empty drives handy.
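(Though, thinking about it, spare drives probably aren't needed: sparse files behind loop devices should be enough to build the arrays and poke at the on-disk structures. File and loop device names here are arbitrary:)

  # sparse backing files cost almost no real disk space
  for i in 0 1 2 3; do truncate -s 100G /var/tmp/xfs-test$i.img; done
  for i in 0 1 2 3; do losetup /dev/loop$i /var/tmp/xfs-test$i.img; done
  # then build the RAID1 pairs and the linear concat over /dev/loop0..3
  # as in the earlier sketch (adding --assume-clean to the RAID1 creates
  # skips the pointless resync), mkfs.xfs it, and inspect the result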
The bandwidth improvements due to striping (RAID0/5/6 style) are no help for metadata-intensive IO loads, and I suspect they're of little value even for mdbox loads, unless the mdbox maximum file size is set to something pretty large, no?
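For reference, the knob I have in mind there is mdbox_rotate_size (assuming I'm remembering the setting name right); unless it's raised well above its default of a few megabytes, sequential stripe bandwidth would rarely come into play:

  # show the current mdbox file size cap
  doveconf mdbox_rotate_size
  # raising it in dovecot.conf is what would make stripe bandwidth matter, e.g.
  #   mdbox_rotate_size = 32M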
Have you tried other filesystems and seen if they distribute metadata in a similarly efficient and scalable manner across concatenated drive sets?
Is there ANY point to using striping at all, a la "RAID10", in this scenario? I'd have thought just making as many RAID1 pairs out of your drives as possible would be the ideal strategy--is this not the case?
=R=