[Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

Stan Hoeppner stan at hardwarefreak.com
Sun Apr 8 03:46:20 EEST 2012


On 4/7/2012 3:45 PM, Robin wrote:
> 
>> Putting XFS on a single RAID1 pair, as you seem to be describing above
>> for the multiple "thin" node case, and hitting one node with parallel
>> writes to multiple user mail dirs, you'll get less performance than
>> EXT3/4 on that mirror pair--possibly less than half, depending on the
>> size of the disks and thus the number of AGs created.  The 'secret' to
>> XFS performance with this workload is concatenation of spindles.
>> Without it you can't spread the AGs--thus directories, thus parallel
>> file writes--horizontally across the spindles--and this is the key.  By
>> spreading AGs 'horizontally' across the disks in a concat, instead of
>> 'vertically' down a striped array, you accomplish two important things:
>>
>> 1.  You dramatically reduce disk head seeking by using the concat array.
>>   With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs
>> evenly spaced vertically down each disk in the array, following the
>> stripe pattern.  Each user mailbox is stored in a different directory.
>> Each directory was created in a different AG.  So if you have 96 users
>> writing their dovecot index concurrently, you have at worst case a
>> minimum 192 head movements occurring back and forth across the entire
>> platter of each disk, and likely not well optimized by TCQ/NCQ.  Why 192
>> instead of 96?  The modification time in the directory metadata must be
>> updated for each index file, among other things.
> 
> Does the XFS allocator automatically distribute AGs in this way even
> when disk usage is extremely light, i.e, a freshly formatted system with
> user directories initially created, and then the actual mailbox contents
> copied into them?

It doesn't distribute AGs.  A static number of AGs is created during
mkfs.xfs.  The inode64 allocator round robins new directory creation
across the AGs; files created in those directories follow, landing in
the same AG as their parent directory.  Having the directory metadata
and file extents in the same AG decreases head movement and thus seek
latency for mixed metadata/extent high IOPS workloads.
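
To make that a bit more concrete: the AG count is fixed at mkfs time,
and on kernels of this era inode64 is an explicit mount option.
Untested sketch, with /dev/md4 and /srv/mail as placeholder names:

  # see how many AGs the filesystem was created with
  xfs_info /srv/mail

  # mount with the inode64 allocator so new directories are round
  # robined across all the AGs (files follow their parent directory)
  mount -o inode64 /dev/md4 /srv/mail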

> If this is indeed the case, then what you describe is a wondrous
> revelation, since you're scaling out the number of simultaneous metadata
> reads+writes/second as you add RAID1 pairs, if my understanding of this
> is correct.  

Correct.  And adding more space and IOPS is uncomplicated.  No chunk
calculations, no restriping of the array.  You simply grow the md linear
array by adding the new disk device, then grow XFS to take in the new
free space.  AFAIK this can be done indefinitely, theoretically.  I'm
guessing md has a device count limit somewhere.  If not, your bash line
buffer might. ;)
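
The whole dance is roughly this (untested; all names are placeholders:
/dev/md4 is the linear array, /dev/md5 a freshly built RAID1 pair,
/srv/mail the mount point):

  # append the new mirror pair to the end of the linear array
  mdadm --grow /dev/md4 --add /dev/md5

  # grow the mounted filesystem into the new space
  xfs_growfs /srv/mail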

> I'm assuming of course, but should look at the code, that
> metadata locks imposed by the filesystem "distribute" as the number of
> pairs increase - if it's all just one Big Lock, then that wouldn't be
> the case.

XFS locking is done as minimally as possible and is insanely fast.  I've
not come across any reported performance issues relating to it.  And
yes, any single metadata lock will occur in a single AG on one mirror
pair using the concat setup.

> Forgive my laziness, as I could just experiment and take a look at the
> on-disk structures myself, but I don't have four empty drives handy to
> experiment.

Don't sweat it.  All of this stuff is covered in the XFS Filesystem
Structure Guide, exciting reading if you enjoy a root canal while
watching snails race:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html

> The bandwidth improvements due to striping (RAID0/5/6 style) are no help
> for metadata-intensive IO loads, and probably of little value for even
> mdbox loads too, I suspect, unless the mdbox max size is set to
> something pretty large, no?

The problem with striped parity RAID is not allocation, which takes
place in free space and is pretty fast.  The problem is the extra read
seeks and bandwidth of the RMW cycle when you modify an existing stripe.
Updating a single flag in a Dovecot index causes md or the hardware
RAID controller to read the entire stripe into buffer space or RAID
cache, modify the flag byte, recalculate parity, then write the whole
stripe and parity block back out across all the disks.
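
To put rough, purely illustrative numbers on it: on a 12-disk RAID6
with a 64KB chunk, a stripe is 10 x 64KB = 640KB of data plus two 64KB
parity chunks.  In the full stripe case described above, flipping one
flag in a 4KB index block turns into reading back on the order of
640KB and writing out 768KB across all 12 spindles.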

With a linear concat of RAID1 pairs we're simply rewriting a single 4KB
filesystem block, maybe only a single 512B sector.  I'm at the edge of
my knowledge here.  I don't know exactly how Timo does the index
updates.  Regardless of the method, the index update is light years
faster with the concat setup, as there is no RMW and full-stripe
writeback as in the RAID5/6 case.

> Have you tried other filesystems and seen if they distribute metadata in
> a similarly efficient and scalable manner across concatenated drive sets?

EXT, any version, does not.  ReiserFS does not.  Both require disk
striping to achieve any parallelism.  With concat they both simply start
writing at the beginning sectors of the first RAID1 pair and 4 years
later maybe reach the last pair as they fill up the volume. ;)  JFS has
a more advanced allocation strategy than EXT or ReiserFS, though not as
advanced as XFS.  I've never read of a concat example with JFS and I've
never tested it.  It's all but a dead filesystem at this point anyway,
less than 2 dozen commits in 8 years last I checked, and these were
simple bug fixes and changes to keep it building on new kernels.  If
it's not suffering bit rot now I'm sure it will be in the near future.

> Is there ANY point to using striping at all, a la "RAID10" in this?  I'd
> have thought just making as many RAID1 pairs out of your drives as
> possible would be the ideal strategy - is this not the case?

If you're using XFS, and your workload is overwhelmingly mail,
RAID1+concat is the only way to fly, and it flies.  If the workload is
not mail, say large file streaming writes, then you're limited to
100-200MB/s, a single drive's worth of throughput, as each file is
written to a single directory in a single AG on a single disk.  For
streaming write
performance you'll need striping.  If you have many concurrent large
streaming writes, you'll want to concat multiple striped arrays.
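
For anyone who wants to try the mail setup, the basic recipe looks
something like this (untested sketch; device names, the 4-pair layout,
the agcount and the mount point are all placeholders to adjust for
your hardware, and it assumes equal sized pairs):

  # four RAID1 pairs
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sde /dev/sdf
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdg /dev/sdh

  # concatenate the pairs into one linear device
  mdadm --create /dev/md4 --level=linear --raid-devices=4 \
        /dev/md0 /dev/md1 /dev/md2 /dev/md3

  # XFS with an AG count that is a multiple of the pair count, so the
  # AGs land evenly across the pairs
  mkfs.xfs -d agcount=16 /dev/md4
  mount -o inode64 /dev/md4 /srv/mail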

-- 
Stan


