On 4/7/2012 3:45 PM, Robin wrote:
Putting XFS on a single RAID1 pair, as you seem to be describing above for the multiple "thin" node case, and hitting one node with parallel writes to multiple user mail dirs, you'll get less performance than EXT3/4 on that mirror pair--possibly less than half, depending on the size of the disks and thus the number of AGs created. The 'secret' to XFS performance with this workload is concatenation of spindles. Without it you can't spread the AGs--and thus directories, and thus parallel file writes--horizontally across the spindles, and this is the key. By spreading AGs 'horizontally' across the disks in a concat, instead of 'vertically' down a striped array, you accomplish two important things:
- You dramatically reduce disk head seeking by using the concat array. With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs evenly spaced vertically down each disk in the array, following the stripe pattern. Each user mailbox is stored in a different directory. Each directory was created in a different AG. So if you have 96 users writing their dovecot index concurrently, in the worst case you have a minimum of 192 head movements occurring back and forth across the entire platter of each disk, and likely not well optimized by TCQ/NCQ. Why 192 instead of 96? The modification time in the directory metadata must be updated for each index file, among other things.
Does the XFS allocator automatically distribute AGs in this way even when disk usage is extremely light, i.e., a freshly formatted system with user directories initially created, and then the actual mailbox contents copied into them?
It doesn't distribute AGs. There is a static number of them, created during mkfs.xfs. The inode64 allocator round-robins new directory creation across the AGs, and does the same with files created in those directories. Having the directory metadata and file extents in the same AG decreases head movement, and thus seek latency, for mixed metadata/extent high-IOPS workloads.
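To make that concrete, here's a rough sketch (untested; device names and the agcount are illustrative placeholders, not a tuned recommendation) of how such a volume could be assembled with md and mkfs.xfs:

  # Four RAID1 pairs from eight disks, with a linear (concat) array on top.
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sd[ab]
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sd[cd]
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sd[ef]
  mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sd[gh]
  mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/md[1-4]

  # agcount=4 here puts roughly one AG on each mirror pair; with the inode64
  # allocator, new directories then get spread across the pairs.
  mkfs.xfs -d agcount=4 /dev/md0
  mount -o inode64 /dev/md0 /srv/mail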
If this is indeed the case, then what you describe is a wondrous revelation, since you're scaling out the number of simultaneous metadata reads+writes/second as you add RAID1 pairs, if my understanding of this is correct.
Correct. And adding more space and IOPS is uncomplicated. No chunk calculations, no restriping of the array. You simply grow the md linear array by adding the new device, then grow XFS to add the new free space to the filesystem. AFAIK this can theoretically be done infinitely. I'm guessing md has a device count limit somewhere. If not, your bash line buffer might. ;)
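A growth step might look something like this (sketch only; /dev/md5 and the mount point are placeholders):

  # Build the new mirror pair, append it to the linear array,
  # then grow XFS into the new space -- all while mounted.
  mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sd[ij]
  mdadm --grow /dev/md0 --add /dev/md5
  xfs_growfs /srv/mail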
I'm assuming, of course--though I should look at the code--that metadata locks imposed by the filesystem "distribute" as the number of pairs increases; if it's all just one Big Lock, then that wouldn't be the case.
XFS locking is done as minimally as possible and is insanely fast. I've not come across any reported performance issues relating to it. And yes, any single metadata lock will occur in a single AG on one mirror pair using the concat setup.
Forgive my laziness, as I could just experiment and take a look at the on-disk structures myself, but I don't have four empty drives handy to experiment with.
Don't sweat it. All of this stuff is covered in the XFS Filesystem Structure Guide, exciting reading if you enjoy a root canal while watching snails race: http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html...
The bandwidth improvements due to striping (RAID0/5/6 style) are no help for metadata-intensive IO loads, and probably of little value even for mdbox loads, I suspect, unless the mdbox max size is set to something pretty large, no?
The problem with striped parity RAID is not allocation, which takes place in free space and is pretty fast. The problem is the extra read seeks and bandwidth of the RMW cycle when you modify an existing stripe. Updating a single flag in a Dovecot index causes md or the hardware RAID controller to read the entire stripe into buffer space or RAID cache, modify the flag byte, recalculate parity, then write the whole stripe and parity block back out across all the disks.
With a linear concat of RAID1 pairs we're simply rewriting a single 4KB filesystem block, maybe only a single 512B sector. I'm at the edge of my knowledge here. I don't know exactly how Timo does the index updates. Regardless of the method, the index update is light years faster with the concat setup, as there is no RMW and full-stripe writeback as in the RAID5/6 case.
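To put rough numbers on it (illustrative only, assuming a 12-drive RAID6 with a 64KB chunk): a full stripe is 10 data chunks plus 2 parity chunks, so per the description above a one-byte flag update can drag roughly 640KB of data plus 128KB of parity through the read/recalculate/write cycle across all 12 spindles, versus a single small write (and its mirror copy) on the concat of RAID1 pairs.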
Have you tried other filesystems and seen if they distribute metadata in a similarly efficient and scalable manner across concatenated drive sets?
EXT, any version, does not. ReiserFS does not. Both require disk striping to achieve any parallelism. With concat they both simply start writing at the beginning sectors of the first RAID1 pair and 4 years later maybe reach the last pair as they fill up the volume. ;) JFS has a more advanced allocation strategy than EXT or ReiserFS, though not as advanced as XFS. I've never read of a concat example with JFS and I've never tested it. It's all but a dead filesystem at this point anyway--less than 2 dozen commits in 8 years last I checked, and those were simple bug fixes and changes to keep it building on new kernels. If it's not suffering bit rot now I'm sure it will be in the near future.
Is there ANY point to using striping at all, a la "RAID10" in this? I'd have thought just making as many RAID1 pairs out of your drives as possible would be the ideal strategy - is this not the case?
If you're using XFS, and your workload is overwhelmingly mail, RAID1+concat is the only way to fly, and it flies. If the workload is not mail, say large streaming file writes, then you're limited to 100-200MB/s--a single drive's worth of throughput--as each file is written to a single directory in a single AG on a single disk. For streaming write performance you'll need striping. If you have many concurrent large streaming writes, you'll want to concat multiple striped arrays.
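Something along these lines, purely as an illustration (device names are placeholders; chunk sizes and disk counts would need tuning for the actual workload):

  # Two 8-disk RAID10 arrays for streaming throughput, concatenated so the
  # volume can still be grown later by appending another striped array.
  mdadm --create /dev/md10 --level=10 --raid-devices=8 /dev/sd[a-h]
  mdadm --create /dev/md11 --level=10 --raid-devices=8 /dev/sd[i-p]
  mdadm --create /dev/md12 --level=linear --raid-devices=2 /dev/md10 /dev/md11
  mkfs.xfs /dev/md12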
-- Stan