[Dovecot] Better to use a single large storage server or multiple smaller for mdbox?
I'm trying to improve the setup of our Dovecot/Exim mail servers to handle increasingly huge accounts (everybody thinks email is infinitely growing storage, like Gmail, and stores everything forever in their accounts) by changing from Maildir to mdbox, and to take advantage of offloading older emails to alternative networked storage nodes.
The question now is whether a single large server or a number of 1U servers with the same total capacity would be better? Will be using RAID 1 pairs, likely XFS, based on reading Hoeppner's recommendations on this and the mdadm list.
Currently, I'm leaning towards multiple small servers because I think it should be better in terms of performance. At the very least, even if one node gets jammed up, the rest should still be able to serve up the emails for other accounts; that is, unless Dovecot gets locked up by that jammed transaction. Also, I could possibly arrange them in a sort of network raid 1 to gain redundancy against single machine failure.
Would I be correct in these or do actual experiences say otherwise?
On 4/5/2012 3:02 PM, Emmanuel Noobadmin wrote:
Hi Emmanuel,
I'm trying to improve the setup of our Dovecot/Exim mail servers to handle increasingly huge accounts (everybody thinks email is infinitely growing storage, like Gmail, and stores everything forever in their accounts) by changing from Maildir to mdbox, and to take advantage of offloading older emails to alternative networked storage nodes.
I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in which case you'd have said "SAN".
The question now is whether a single large server or a number of 1U servers with the same total capacity would be better?
Less complexity and cost is always better. CPU throughput isn't a factor in mail workloads--it's all about IO latency. A 1U NFS server with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks less juice and dissipates less heat than 4 1U servers each w/ 4 drives. I don't recall seeing your user load or IOPS requirements so I'm making some educated guesses WRT your required performance and total storage. I came up with the following system that should be close to suitable, for ~$10k USD. The 4 node system runs ~$12k USD; at $2k more, it isn't substantially higher. But when we double the storage of each architecture we're at ~$19k vs ~$26k for an 8 node cluster, a difference of ~$7k. That's $1k shy of another 12 disk JBOD. Since CPU is nearly irrelevant for a mail workload, you can see it's much cheaper to scale capacity and IOPS with a single node w/fat storage than with skinny nodes w/thin storage.

Ok, so here's the baseline config I threw together:
http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/15351-15351-3328412-241644-332...
8 cores is plenty, 2 boot drives mirrored on B110i, 16GB (4x4GB)

http://www.lsi.com/products/storagecomponents/Pages/LSISAS9205-8e.aspx

http://h10010.www1.hp.com/wwpc/us/en/sm/WF06b/12169-304616-3930445-3930445-3...
w/ 12 2TB 7.2K SATA drives, configured as an md concat of RAID1 pairs, 12 allocation groups, 12TB usable.

Format the md device with the defaults:
$ mkfs.xfs /dev/md0
Mount with inode64. No XFS stripe alignment to monkey with. No md chunk size or anything else to worry about. XFS' allocation group design is pure elegance here.
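For illustration, the whole stack from bare drives to mounted filesystem would look something like this (device names and mount point are assumptions, not part of the recipe above):

$ mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
  ...repeat for /dev/md2 through /dev/md6 with the remaining five drive pairs...
$ mdadm --create /dev/md0 --level=linear --raid-devices=6 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6
$ mkfs.xfs /dev/md0                    # defaults; ~1TB AGs -> 12 AGs on 12TB
$ mount -o inode64 /dev/md0 /srv/mail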
If 12 TB isn't sufficient, or if you need more space later, you can daisy chain up to 3 additional D2600 JBODs for ~$8500 USD each, just add cables. This quadruples IOPS, throughput, and capacity--96TB total, 48TB net. Simply create 6 more mdraid1 devices and grow the linear array with them. Then do an xfs_growfs to bring the extra 12TB of free space into the filesystem.
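Adding a second shelf of 12 drives later would be roughly this sequence (again, device names are illustrative):

$ mdadm --create /dev/md7 --level=1 --raid-devices=2 /dev/sdn /dev/sdo
  ...repeat for the other five new pairs...
$ mdadm --grow /dev/md0 --add /dev/md7    # repeat for each new mirror
$ xfs_growfs /srv/mail                    # grows into the new space online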
If you're budget conscious and/or simply prefer quality inexpensive whitebox/DIY type gear, as I do, you can get 24 x 2TB drives in one JBOD chassis for $7400 USD. That's twice the drives, capacity, and IOPS, for ~$2500 less than the HP JBOD. And unlike the HP 'enterprise SATA' drives, the 2TB WD Black series have a 5 year warranty, and work great with mdraid. Chassis and drives at Newegg:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047 http://www.newegg.com/Product/Product.aspx?Item=N82E16822136792
You can daisy chain 3 of these off one HBA SFF8088 port, 6 total on our LSI 9205-8e above, for a total of 144 2TB drives, 72 effective spindles in our concat+RAID1 setup, 144TB net space.
Will be using RAID 1 pairs, likely XFS, based on reading Hoeppner's recommendations on this and the mdadm list.
To be clear, the XFS configuration I recommend/promote for mailbox storage is very specific and layered. The layers must all be used together to get the performance. These layers consist of using multiple hardware or software RAID1 pairs and concatenating them with an md linear array. You then format that md device with the XFS defaults, or a specific agcount if you know how to precisely tune AG layout based on disk size and your anticipated concurrency level of writers.
Putting XFS on a single RAID1 pair, as you seem to be describing above for the multiple "thin" node case, and hitting one node with parallel writes to multiple user mail dirs, you'll get less performance than EXT3/4 on that mirror pair--possibly less than half, depending on the size of the disks and thus the number of AGs created. The 'secret' to XFS performance with this workload is concatenation of spindles. Without it you can't spread the AGs--thus directories, thus parallel file writes--horizontally across the spindles--and this is the key. By spreading AGs 'horizontally' across the disks in a concat, instead of 'vertically' down a striped array, you accomplish two important things:
You dramatically reduce disk head seeking by using the concat array. With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs evenly spaced vertically down each disk in the array, following the stripe pattern. Each user mailbox is stored in a different directory. Each directory was created in a different AG. So if you have 96 users writing their dovecot index concurrently, you have in the worst case a minimum of 192 head movements back and forth across the entire platter of each disk, and likely not well optimized by TCQ/NCQ. Why 192 instead of 96? The modification time in the directory metadata must be updated for each index file, among other things.
Because we decrease seeks dramatically, we also decrease response latency significantly. With RAID1+concat+XFS we have 12 disks, each with only 2 AGs spaced evenly down the platter. We have the same 4 user mail dirs in each AG, but in this case only 8 user mail dirs are contained on each disk instead of portions of all 96. With the same 96 concurrent index writes, we end up with only 16 seeks per drive--again, one to update each index file and one to update the metadata.
Assuming these drives have a maximum seek rate of 150 seeks/second, about average for 7.2k drives, it will take 192/150 = 1.28 seconds for these operations on the RAID10 array. With the concat array it will only take 16/150 = 0.11 seconds. Extrapolating, the concat array can handle 1.28/0.11 = 11.6 times as many, i.e. 11.6 * 96 ≈ 1,111 concurrent user index updates in the same time as the RAID10 array--just over 10 times more users. Granted, these are rough theoretical numbers--an index plus metadata update isn't always going to cause a seek on every chunk in a stripe, etc. But this does paint a very accurate picture of the differences in mailbox workload disk seek patterns between XFS on concat and RAID10 with the same hardware. In production one should be able to handle at minimum 2x more users, probably many more, with the RAID1+concat+XFS vs RAID10+XFS setup on the same hardware.
Currently, I'm leaning towards multiple small servers because I think it should be better in terms of performance.
This usually isn't the case with mail. It's impossible to split up the user files across the storage nodes in a way that balances block usage on each node and user access to those blocks. Hotspots are inevitable in both categories. You may achieve the same total performance as a single server, maybe slightly surpass it depending on user load, but you end up spending extra money on resources that are idle most of the time (CPU and NICs) or under/over utilized (disk capacity in each node). Switch ports aren't horribly expensive today, but you're still wasting some with the farm setup.
At the very least, even if one node gets jammed up, the rest should still be able to serve up the emails for other accounts; that is, unless Dovecot gets locked up by that jammed transaction.
Some host failure redundancy is about all you'd gain from the farm setup. Dovecot shouldn't barf due to one NFS node being down, only hiccup. I.e. only the imap processes accessing files on the downed node would have trouble.
Also, I could possibly arrange them in a sort of network raid 1 to gain redundancy against single machine failure.
Now you're sounding like Charles Marcus, but worse. ;) Stay where you are, and brush your hair away from your forehead. I'm coming over with my branding iron that says "K.I.S.S"
Would I be correct in these or do actual experiences say otherwise?
Oracles on Mount Interweb profess that 2^5 nodes wide scale out is the holy grail. IBM's mainframe evangelists tell us to put 5 million mail users on a SystemZ with hundreds of Linux VMs.
I think bliss for most of us is found somewhere in the middle.
-- Stan
On 4/7/12, Stan Hoeppner stan@hardwarefreak.com wrote:
Firstly, thanks for the comprehensive reply. :)
I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in which case you'd have said "SAN".
I haven't decided on that but it would either be NFS or iSCSI over Gigabit. I don't exactly get a big budget for this. iSCSI because I planned to do md/mpath over two separate switches so that if one switch explodes, the email service would still work.
Less complexity and cost is always better. CPU throughput isn't a factor in mail workloads--it's all about IO latency. A 1U NFS server with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
My worry is that if that one server dies, everything is dead. With at least a pair of servers, I could keep things running, or if necessary restore the accounts on the dead server from backup, make some config changes, and have everything running again while waiting for replacement hardware.
I don't recall seeing your user load or IOPS requirements so I'm making some educated guesses WRT your required performance and total storage.
I'm embarrassed to admit I don't have hard numbers on the user load, except the rapidly dwindling disk space count and the fact that when the web-based mail application tries to list and check disk quota, it can bring the servers to a crawl. My lame excuse is that I'm just the web dev who got caught holding the server admin potato.
Since CPU is nearly irrelevant for a mail workload, you can see it's much cheaper to scale capacity and IOPS with a single node w/fat storage than with skinny nodes w/thin storage. Ok, so here's the baseline config I threw together:
One of my concerns is that heavy IO on the same server slows overall performance, even though the theoretical IOPS of the total drives is the same on 1 server as on X servers. Right now, the servers are usually screeching to a halt, to the point of even locking out SSH access, due to IOWait sending the load in top to triple digits.
Some host failure redundancy is about all you'd gain from the farm setup. Dovecot shouldn't barf due to one NFS node being down, only hiccup. I.e. only the imap processes accessing files on the downed node would have trouble.
But if I only have one big storage node and it went down, Dovecot would barf, wouldn't it? Or would the mdbox format mean Dovecot would still use the local storage, just that users can't access the offloaded messages?
Also, I could possibly arrange them in a sort of network raid 1 to gain redundancy against single machine failure.
Now you're sounding like Charles Marcus, but worse. ;) Stay where you are, and brush your hair away from your forehead. I'm coming over with my branding iron that says "K.I.S.S"
Lol, I have no idea who Charles is, but I always feel safer if there is some kind of backup. Especially since I don't have the time to dedicate myself to server administration; by the time I notice something is bad, it might be too late for anything but the backup.
Of course management and clients don't agree with me since backup/redundancy costs money. :)
On 4/7/2012 9:43 AM, Emmanuel Noobadmin wrote:
On 4/7/12, Stan Hoeppner stan@hardwarefreak.com wrote:
Firstly, thanks for the comprehensive reply. :)
I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in which case you'd have said "SAN".
I haven't decided on that but it would either be NFS or iSCSI over Gigabit. I don't exactly get a big budget for this. iSCSI because I planned to do md/mpath over two separate switches so that if one switch explodes, the email service would still work.
So it seems you have two courses of action:
1. Identify individual current choke points and add individual systems and storage to eliminate those choke points.
2. Analyze your entire workflow and all systems, identifying all choke points, then design a completely new, well integrated storage architecture that solves all current problems and addresses future needs.
Adding an NFS server and moving infrequently accessed (old) files to alternate storage will alleviate your space problems. But it will probably not fix some of the other problems you mention, such as servers bogging down and becoming unresponsive, as that's not a space issue. The cause of that would more likely be an IOPS issue, meaning you don't have enough storage spindles to service requests in a timely manner.
Less complexity and cost is always better. CPU throughput isn't a factor in mail workloads--it's all about IO latency. A 1U NFS server with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
My worry is that if that one server dies, everything is dead. With at least a pair of servers, I could keep things running, or if necessary restore the accounts on the dead server from backup, make some config changes, and have everything running again while waiting for replacement hardware.
You are a perfect candidate for VMware ESX. The HA feature will do exactly what you want. If one physical node in the cluster dies, HA automatically restarts the dead VMs on other nodes, transparently. Clients will have to reestablish connections, but everything else will pretty much be intact. Worst case scenario will possibly be a few corrupted mailboxes that were being written when the hardware crashed.
A SAN is required for such a setup. I had extensive experience with ESX and HA about 5 years ago and it works as advertised. After 5 years it can only have improved. It's not "cheap" but usually pays for itself due to being able to consolidate the workload of dozens of physical servers into just 2 or 3 boxes.
I don't recall seeing your user load or IOPS requirements so I'm making some educated guesses WRT your required performance and total storage.
I'm embarrassed to admit I don't have hard numbers on the user load, except the rapidly dwindling disk space count and the fact that when the web-based mail application tries to list and check disk quota, it can bring the servers to a crawl.
Maybe a description of your current hardware setup and total number of users/mailboxes would be a good starting point. How many servers do you have, what storage is connected to each, what percentage of MUA POP/IMAP connections come from user PCs versus webmail applications, etc, etc.
Probably the single most important piece of information would be the hardware specs of your current Dovecot server, CPUs/RAM/disk array, etc, and what version of Dovecot you're running.
The focus of your email is building a storage server strictly to offload old mail and free up space on the Dovecot server. From the sound of things, this may not be sufficient to solve all your problems.
My lame excuse is that I'm just the web dev who got caught holding the server admin potato.
Baptism by fire. Ouch. What doesn't kill you makes you stronger. ;)
Since CPU is nearly irrelevant for a mail workload, you can see it's much cheaper to scale capacity and IOPS with a single node w/fat storage than with skinny nodes w/thin storage. Ok, so here's the baseline config I threw together:
One of my concerns is that heavy IO on the same server slows overall performance, even though the theoretical IOPS of the total drives is the same on 1 server as on X servers. Right now, the servers are usually screeching to a halt, to the point of even locking out SSH access, due to IOWait sending the load in top to triple digits.
If multiple servers are screeching to a halt due to iowait, either all of your servers' individual disks are overloaded, or you already have shared storage. We really need more info on your current architecture. Right now we don't know if we're talking about 4 servers or 40, 100 users or 10,000.
Some host failure redundancy is about all you'd gain from the farm setup. Dovecot shouldn't barf due to one NFS node being down, only hiccup. I.e. only the imap processes accessing files on the downed node would have trouble.
But if I only have one big storage node and it went down, Dovecot would barf, wouldn't it? Or would the mdbox format mean Dovecot would still use the local storage, just that users can't access the offloaded messages?
If the big storage node is strictly alt storage, and it dies, Dovecot will still access its main mdbox storage just fine. It simply wouldn't be able to access the alt storage and would log errors for those requests.
If you design a whole new architecture from scratch, going with ESX and an iSCSI SAN, this whole line of thinking is moot as there is no SPOF. This can be done with as little as two physical servers and one iSCSI SAN array, which has dual redundant controllers in the base config. Depending on your actual IOPS needs, you could possibly consolidate everything you have now into two physical servers and one iSCSI SAN array, for between $30-40K USD in hardware and $8-10K in ESX licenses. That's just a guess on ESX as I don't know the current pricing. Even if it's that "high", it's far more than worth the price due to the capability.
Such a setup allows you to run all of your Exim, webmail, Dovecot, etc servers on two machines, and you usually get much better performance than with individual boxes, especially if you manually place the VMs on the nodes for lowest network latency. For instance, if you place your webmail server VM on the same host as the Dovecot VM, TCP packet latency drops from the high micro/low millisecond range into the mid nanosecond range--a 1000x decrease in latency. Why? The packet transfer is now a memory-to-memory copy through the hypervisor. The packets never reach a physical network interface. You can do some of these things with free Linux hypervisors, but AFAIK the poor management interfaces for them make the price of ESX seem like a bargain.
Also, I could possibly arrange them in a sort of network raid 1 to gain redundancy against single machine failure.
Now you're sounding like Charles Marcus, but worse. ;) Stay where you are, and brush your hair away from your forehead. I'm coming over with my branding iron that says "K.I.S.S"
Lol, I have no idea who Charles is, but I always feel safer if there is some kind of backup. Especially since I don't have the time to dedicate myself to server administration; by the time I notice something is bad, it might be too late for anything but the backup.
Search the list archives for Charles' thread about bringing up a 2nd office site. His desire was/is to duplicate machines at the 2nd site for redundancy, when the proper thing to do is duplicate them at the primary site, and simply duplicate the network links between sites. My point to you and Charles is that you never add complexity for the sake of adding complexity.
Of course management and clients don't agree with me since backup/redundancy costs money. :)
So does gasoline, but even as the price has more than doubled in 3 years in the States, people haven't stopped buying it. Why? They have to have it. The case is the same for certain levels of redundancy. You simply have to have it. Your job is properly explaining that need. Ask the CEO/CFO how much money the company will lose in productivity if nobody has email for 1 workday, as that is how long it will take to rebuild everything from scratch and restore all the data when it fails. Then ask what the cost is if all the email is completely lost because they were too cheap to pay for a backup solution.
To executives, money in the bank is like the family jewels in their trousers. Kicking the family jewels and generating that level of pain seriously gets their attention. Likewise, if a failed server plus rebuild/restore costs $50K in lost productivity, spending $20K on a solution to prevent that from happening is a good investment. Explain it in terms execs understand. Have industry data to back your position. There's plenty of it available.
-- Stan
On 4/9/12, Stan Hoeppner stan@hardwarefreak.com wrote:
So it seems you have two courses of action:
1. Identify individual current choke points and add individual systems and storage to eliminate those choke points.
2. Analyze your entire workflow and all systems, identifying all choke points, then design a completely new, well integrated storage architecture that solves all current problems and addresses future needs.
I started to do this and realized I have a serious mess on hand that makes delving into other people's uncommented source code seem like a joy :D
Management added to this by deciding that if we're going to offload the email storage to network storage, we might as well consolidate everything into that shared storage system so we don't have TBs of un-utilized space. So I might not even be able to use your tested XFS + concat solution, since it may not be optimal for VM images and databases.
As the requirements have changed, I'll stop asking here as it's no longer really relevant just for Dovecot purposes.
You are a perfect candidate for VMware ESX. The HA feature will do exactly what you want. If one physical node in the cluster dies, HA automatically restarts the dead VMs on other nodes, transparently. Clients will have to reestablish connections, but everything else will pretty much be intact. Worst case scenario will possibly be a few corrupted mailboxes that were being written when the hardware crashed.
A SAN is required for such a setup.
Thanks for the suggestion, I will need to find some time to look into this, although we've mostly been using KVM for virtualization so far. The "SAN" part will probably prevent this from being accepted due to cost, though.
My lame excuse is that I'm just the web dev who got caught holding the server admin potato.
Baptism by fire. Ouch. What doesn't kill you makes you stronger. ;)
True, but I'd hate to be the customer who gets to pick up the pieces when things explode due to unintended negligence by a dev trying to level up by multi-classing as an admin.
physical network interface. You can do some of these things with free Linux hypervisors, but AFAIK the poor management interfaces for them make the price of ESX seem like a bargain.
Unfortunately, for the usual kind of customers we have here, spending that kind of budget isn't justifiable. The only reason we're providing email services is because customers wanted freebies and felt there was no reason we couldn't give them email on our servers; they are all "servers" after all.
So I have to make do with OTS commodity parts and free software for the most part.
On 4/9/2012 2:15 PM, Emmanuel Noobadmin wrote:
Unfortunately, for the usual kind of customers we have here, spending that kind of budget isn't justifiable. The only reason we're providing email services is because customers wanted freebies and felt there was no reason we couldn't give them email on our servers; they are all "servers" after all.
So I have to make do with OTS commodity parts and free software for the most part.
OTS meaning you build your own systems from components? Too few in the business realm do so today. :(
It sounds like budget overrides redundancy then. You can do an NFS cluster with SAN and GFS2, or two servers with their own storage and DRBD mirroring. Here's how to do the latter: http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat
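For reference, a minimal DRBD resource definition for such a two-server mirror might look like the sketch below; hostnames, devices, and addresses are illustrative, and the heartbeat/failover half is covered in the howto above.

resource r0 {
    protocol C;                  # synchronous replication
    on mail1 {                   # must match `uname -n` on node 1
        device    /dev/drbd0;
        disk      /dev/md0;      # local backing device
        address   10.0.0.1:7788;
        meta-disk internal;
    }
    on mail2 {
        device    /dev/drbd0;
        disk      /dev/md0;
        address   10.0.0.2:7788;
        meta-disk internal;
    }
}

You'd then make the filesystem on /dev/drbd0 and export it via NFS from whichever node is currently primary.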
The total cost is about the same for each solution, as an iSCSI SAN array of drive count X is about the same cost as two JBOD disk arrays of count X*2. Redundancy in this case is expensive no matter the method. Given how infrequent host failures are, and the fact that your storage is redundant, it may make more sense to simply keep spare components on hand and swap whatever fails--PSU, mobo, etc.
Interestingly, I designed a COTS server back in January to handle at least 5k concurrent IMAP users, using best of breed components. If you or someone there has the necessary hardware skills, you could assemble this system and simply use it for NFS instead of Dovecot. The parts list: secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069985
In case the link doesn't work, the core components are:
SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
32GB Kingston REG ECC DDR3
LSI 9280-4i4e, Intel 24 port SAS expander
20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
NORCO RPC-4220 4U 20 Hot-Swap Bays
SuperMicro 865W PSU

All other required parts are in the Wish List. I've not written assembly instructions. I figure anyone who would build this knows what s/he is doing.
Price today: $5,376.62
Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give you a 10TB net Linux device and 10 stripe spindles of IOPS and bandwidth. Using RAID6 would yield 18TB net and 18 spindles of read throughput, however parallel write throughput will be at least 3-6x slower than RAID10, which is why nobody uses RAID6 for transactional workloads.
If you need more transactional throughput you could use 20 WD6000HLHX 600GB 10K RPM WD Raptor drives. You'll get 40% more throughput and 6TB net space with RAID10. They'll cost you $1200 more, or $6,576.62 total. Well worth the $1200 for 40% more throughput, if 6TB is enough.
Both of the drives I've mentioned here are enterprise class drives, feature TLER, and are on the LSI MegaRAID SAS hardware compatibility list. The price of the 600GB Raptor has come down considerably since I designed this system, or I'd have used them instead.
Anyway, lots of options out there. But $6,500 is pretty damn cheap for a quality box with 32GB RAM, an enterprise RAID card, and 20 x 10K RPM 600GB drives.
The MegaRAID 9280-4i4e has an external SFF8088 port. For an additional $6,410 you could add an external Norco SAS expander JBOD chassis and 24 more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space and 22 10k spindles of IOPS performance from 44 total drives. That's $13K for a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution--$1000/TB, $2.60/IOPS. Significantly cheaper than an HP, Dell, or IBM solution of similar specs, each of which will set you back at least 20 large.
Note the chassis I've spec'd have single PSUs, not the dual or triple redundant supplies you'll see on branded hardware. With a relatively stable climate controlled environment and a good UPS with filtering, quality single supplies are fine. In fact, in the 4U form factor single supplies are usually more reliable due to superior IC packaging and airflow through the heatsinks, not to mention much quieter.
-- Stan
On 4/10/12, Stan Hoeppner stan@hardwarefreak.com wrote:
So I have to make do with OTS commodity parts and free software for the most part.
OTS meaning you build your own systems from components? Too few in the business realm do so today. :(
For the inhouse stuff and budget customers, yes; in fact both the email servers are on seconded hardware that started life as something else. I spec HP servers for our app servers to customers who are willing to pay for their own colocated or onsite servers, but there are still customers who balk at the cost and so go OTS or virtualized.
SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
32GB Kingston REG ECC DDR3
LSI 9280-4i4e, Intel 24 port SAS expander
20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
NORCO RPC-4220 4U 20 Hot-Swap Bays
SuperMicro 865W PSU

All other required parts are in the Wish List. I've not written assembly instructions. I figure anyone who would build this knows what s/he is doing.
Price today: $5,376.62
This price looks like something I might be able to push through although I'll probably have to go SATA instead of SAS due to cost of keeping spares.
Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give you a 10TB net Linux device and 10 stripe spindles of IOPS and bandwidth. Using RAID6 would yield 18TB net and 18 spindles of read throughput, however parallel write throughput will be at least 3-6x slower than RAID10, which is why nobody uses RAID6 for transactional workloads.
Not likely to go with RAID 5 or 6 due to concerns about uncorrectable read error risks on rebuild with large arrays. Is the MegaRAID being used as the actual RAID controller or just as an HBA?
I have been avoiding hardware RAID because of a really bad experience with RAID 5 on an obsolete controller that eventually died without a replacement and couldn't be recovered. Since then it's always been RAID 1 and, after I discovered mdraid, using controllers purely as HBAs with mdraid, for the flexibility of being able to just pull the drives into a new system if necessary without having to worry about the controller.
Both of the drives I've mentioned here are enterprise class drives, feature TLER, and are on the LSI MegaRAID SAS hardware compatibility list. The price of the 600GB Raptor has come down considerably since I designed this system, or I'd have used them instead.
Anyway, lots of options out there. But $6,500 is pretty damn cheap for a quality box with 32GB RAM, an enterprise RAID card, and 20 x 10K RPM 600GB drives.
The MegaRAID 9280-4i4e has an external SFF8088 port. For an additional $6,410 you could add an external Norco SAS expander JBOD chassis and 24 more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space and 22 10k spindles of IOPS performance from 44 total drives. That's $13K for a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution--$1000/TB, $2.60/IOPS. Significantly cheaper than an HP, Dell, or IBM solution of similar specs, each of which will set you back at least 20 large.
Would this setup work well for serving up VM images too? I've been trying to find a solution for the virtualized app server images as well, but the distributed FSes currently all seem bad with random reads/writes. XFS seems to be good with large files like db and vm images with random internal writes/reads, so given my time constraints it would be nice to have a single configuration that works generally well for all the needs I have to oversee.
Note the chassis I've spec'd have single PSUs, not the dual or triple redundant supplies you'll see on branded hardware. With a relatively stable climate controlled environment and a good UPS with filtering, quality single supplies are fine. In fact, in the 4U form factor single supplies are usually more reliable due to superior IC packaging and airflow through the heatsinks, not to mention much quieter.
Same reason I do my best to avoid 1U servers; the space/heat issues worry me. Yes, I'm guilty of worrying too much, but that has saved me on several occasions.
On 4/10/2012 1:09 AM, Emmanuel Noobadmin wrote:
On 4/10/12, Stan Hoeppner stan@hardwarefreak.com wrote:
SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
32GB Kingston REG ECC DDR3
LSI 9280-4i4e, Intel 24 port SAS expander
20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
NORCO RPC-4220 4U 20 Hot-Swap Bays
SuperMicro 865W PSU

All other required parts are in the Wish List. I've not written assembly instructions. I figure anyone who would build this knows what s/he is doing.
Price today: $5,376.62
This price looks like something I might be able to push through
It's pretty phenomenally low considering what all you get, especially 20 enterprise class drives.
although I'll probably have to go SATA instead of SAS due to cost of keeping spares.
The 10K drives I mentioned are SATA not SAS. WD's 7.2k RE and 10k Raptor series drives are both SATA but have RAID specific firmware, better reliability, longer warranties, etc. The RAID specific firmware is why both are tested and certified by LSI with their RAID cards.
Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give you a 10TB net Linux device and 10 stripe spindles of IOPS and bandwidth. Using RAID6 would yield 18TB net and 18 spindles of read throughput, however parallel write throughput will be at least 3-6x slower than RAID10, which is why nobody uses RAID6 for transactional workloads.
Not likely to go with RAID 5 or 6 due to concerns about uncorrectable read error risks on rebuild with large arrays. Is the
Not to mention rebuild times for large width RAID5/6.
MegaRAID being used as the actual RAID controller or just as an HBA?
It's a top shelf RAID controller, 512MB cache, up to 240 drives, SSD support, the works. It's an LSI "Feature Line" card: http://www.lsi.com/products/storagecomponents/Pages/6GBSATA_SASRAIDCards.asp...
The specs: http://www.lsi.com/products/storagecomponents/Pages/MegaRAIDSAS9280-4i4e.asp...
You'll need the cache battery module for safe write caching, which I forgot in the wish list (now added), $160: http://www.newegg.com/Product/Product.aspx?Item=N82E16816118163&Tpk=LSIiBBU08
With your workload and RAID10 you should run with all 512MB configured as write cache. Linux caches all reads so using any controller cache for reads is a waste. Using all 512MB for write cache will increase random write IOPS.
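Assuming LSI's MegaCli utility (the exact tool depends on your management package), setting that policy would look something like:

$ MegaCli -LDSetProp WB -LAll -aAll      # write-back; requires a healthy BBU
$ MegaCli -LDSetProp NORA -LAll -aAll    # no read-ahead; Linux caches reads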
Note the 9280 allows up to 64 LUNs, so you can do tiered storage within this 20 bay chassis. For spares management you'd probably not want to bother with two different sized drives.
I didn't mention the 300GB 10K Raptors previously due to their limited capacity. Note they're only $15 more apiece than the 1TB RE4 drives in the original parts list. For a total of $300 more you get the same 40% increase in IOPs of the 600GB model, but you'll only have 3TB net space after RAID10. If 3TB is sufficient space for your needs, that extra 40% IOPS makes this config a no brainer. The decreased latency of the 10K drives will give a nice boost to VM read performance, especially when using NFS. Write performance probably won't be much different due to the generous 512MB write cache on the controller. I also forgot to mention that with BBWC enabled you can turn off XFS barriers, which will dramatically speed up Exim queues and Dovecot writes, all writes actually.
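With the BBU in place, the fstab entry would carry nobarrier alongside inode64, something like this (device and mount point are assumptions):

/dev/sda1   /srv/mail   xfs   inode64,nobarrier   0 0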
Again, you probably don't want the spares management overhead of two different disk types on the shelf, but you could stick these 10K 300s in the first 16 slots and put the 2TB RE4 drives in the last 4 slots: RAID10 on the 10K drives, RAID5 on the 2TB drives. This yields an 8 spindle high IOPS RAID10 of 2.4TB and a lower performance RAID5 of 6TB for near line storage such as your Dovecot alt storage, VM templates, etc--8.4TB net, 1.6TB less than the original 10TB setup. Total additional cost is $920 for this setup. You'd have two XFS filesystems (with quite different mkfs parameters).
I have been avoiding hardware RAID because of a really bad experience with RAID 5 on an obsolete controller that eventually died without a replacement and couldn't be recovered. Since then it's always been RAID 1 and, after I discovered mdraid, using controllers purely as HBAs with mdraid, for the flexibility of being able to just pull the drives into a new system if necessary without having to worry about the controller.
Assuming you have the right connector configuration for your drive/enclosure on the replacement card, you can usually swap out one LSI RAID card with any other LSI RAID card in the same, or newer, generation. It'll read the configuration metadata from the disks and be up and running in minutes. This feature has been around all the way back to the AMI/Mylex cards of the late 1990s. LSI acquired both companies, who were #1 and #2 in RAID, which is why LSI is so successful today. Back in those days LSI simply supplied the ASICs to AMI and Mylex. I have an AMI MegaRAID 428, top of the line in 1998, lying around somewhere. Still working when I retired it many years ago.
FYI, LSI is the OEM provider of RAID and SAS/SATA HBA ASIC silicon for the tier 1 HBA and mobo down markets. Dell, HP, IBM, Intel, Oracle (Sun), Siemens/Fujitsu, all use LSI silicon and firmware. Some simply rebadge OEM LSI cards with their own model and part numbers. IBM and Dell specifically have been doing this rebadging for well over a decade, long before LSI acquired Mylex and AMI. The Dell PERC/2 is a rebadged AMI MegaRAID 428.
Software and hardware RAID each have their pros and cons. I prefer hardware RAID for write cache performance and many administrative reasons, including SAF-TE enclosure management (fault LEDs, alarms, etc) so you know at a glance which drive has failed and needs replacing, email and SNMP notification of events, automatic rebuild, configurable rebuild priority, etc, etc, and good performance with striping and mirroring. Parity RAID performance often lags behind md with heavy workloads but not with light/medium. FWIW I rarely use parity RAID, due to the myriad performance downsides.
For ultra high random IOPS workloads, or when I need a single filesystem space larger than the drive limit or practical limit for one RAID HBA, I'll stitch hardware RAID1 or small stripe width RAID 10 arrays (4-8 drives, 2-4 spindles) together with md RAID 0 or 1.
Both of the drives I've mentioned here are enterprise class drives, feature TLER, and are on the LSI MegaRAID SAS hardware compatibility list. The price of the 600GB Raptor has come down considerably since I designed this system, or I'd have used them instead.
Anyway, lots of options out there. But $6,500 is pretty damn cheap for a quality box with 32GB RAM, an enterprise RAID card, and 20 x 10K RPM 600GB drives.
The MegaRAID 9280-4i4e has an external SFF8088 port. For an additional $6,410 you could add an external Norco SAS expander JBOD chassis and 24 more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space and 22 10k spindles of IOPS performance from 44 total drives. That's $13K for a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution--$1000/TB, $2.60/IOPS. Significantly cheaper than an HP, Dell, or IBM solution of similar specs, each of which will set you back at least 20 large.
Would this setup work well for serving up VM images too? I've been trying to find a solution for the virtualized app server images as well, but the distributed FSes currently all seem bad with random reads/writes. XFS seems to be good with large files like db and vm images with random internal writes/reads, so given my time constraints it would be nice to have a single configuration that works generally well for all the needs I have to oversee.
Absolutely. If you set up these 20 drives as a single RAID10 (soft, hard, or hybrid) with the LSI cache set to 100% write-back, and a single XFS filesystem with 10 allocation groups and proper stripe alignment, you'll get maximum performance for pretty much any conceivable workload.
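For example, assuming the controller exports the RAID10 with a 64KB stripe unit (check your controller's actual value) as /dev/sdb, the mkfs invocation would be along these lines:

$ mkfs.xfs -d agcount=10,su=64k,sw=10 /dev/sdb   # sw = 10 effective spindles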
Your only limitations will be possible NFS or TCP tuning issues, and maybe having only two GbE ports. For small random IOPS such as Exim queues, Dovecot store, VM image IO, etc, the two GbE ports are plenty. But if you add any large NFS file copies into the mix, such as copying new VM templates or ISO images over, etc, or do backups over NFS instead of directly on the host machine at the XFS level, then two bonded GbE ports might prove a bottleneck.
The mobo has 2 PCIe x8 slots and one x4 slot. One of the x8 slots is an x16 physical connector. You'll put the LSI card in the x16 slot. If you mount the Intel SAS expander to the chassis as I do instead of in a slot, you have one free x8 and one free x4 slot. Given the $250 price, I'd simply add an Intel quad port GbE NIC to the order. Link aggregate all 4 ports on day one and use one IP address for the NFS traffic. Use the two on board ports for management etc. This should give you a theoretical 400MB/s of peak NFS throughput, which should be plenty no matter what workload you throw at it.
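On a RHEL/CentOS-style box the aggregation might be configured like this (interface names, IP, and mode are assumptions; 802.3ad needs switch support, otherwise use balance-alb):

/etc/sysconfig/network-scripts/ifcfg-bond0:
    DEVICE=bond0
    IPADDR=192.168.10.5
    NETMASK=255.255.255.0
    BOOTPROTO=none
    ONBOOT=yes
    BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"

/etc/sysconfig/network-scripts/ifcfg-eth2 (likewise for eth3-eth5):
    DEVICE=eth2
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes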
Note the chassis I've spec'd have single PSUs, not the dual or triple redundant supplies you'll see on branded hardware. With a relatively stable climate controlled environment and a good UPS with filtering, quality single supplies are fine. In fact, in the 4U form factor single supplies are usually more reliable due to superior IC packaging and airflow through the heatsinks, not to mention much quieter.
Same reason I do my best to avoid 1U servers, the space/heat issues worries me. Yes, I'm guilty of worrying too much but that had saved me on several occasions.
Just about every 1U server I've seen that's been racked for 3 or more years has warped under its own weight. I even saw an HPQ 2U that was warped this way, badly warped. In this instance the slide rail bolts had never been tightened down to the rack--could spin them by hand. Since the chassis side panels weren't secured, and there was lateral play, the weight of the 6 drives caused the side walls of the case to fold into a mild trapezoid, which allowed the bottom and top panels to bow. Let this be a lesson boys and girls: always tighten your rack bolts. :)
-- Stan
Re XFS. Have you been watching BTRFS recently?
I will concede that despite the authors considering it production ready, I won't be using it for my servers just yet. However, on single disk benchmarks it performs fairly similarly to XFS, and in certain cases (multi-threaded performance) can be somewhat better. I haven't yet seen any benchmarks on larger disk arrays, e.g. 6+ disks, so I have no idea how it scales up. Basically, what I have seen seems "competitive".
I don't have such hardware spare to benchmark, but I would be interested to hear from someone who benchmarks your RAID1+linear+XFS suggestion, especially if they have compared a cutting edge btrfs kernel on the same array?
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the event of bad blocks. (I'm not sure what actually happens when md scrubbing finds a bad sector with raid1..?). For low performance requirements I have become paranoid and been using RAID6 vs RAID10, filesystems with sector checksums seem attractive...
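(The scrub machinery itself is at least easy to poke at, assuming /dev/md0:

$ echo check > /sys/block/md0/md/sync_action    # read-only scrub pass
$ cat /sys/block/md0/md/mismatch_cnt            # non-zero: the mirrors disagree

...but which copy a subsequent "repair" pass then treats as the good one is exactly the open question.)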
Regards
Ed W
On 04/11/12 19:50, Ed W wrote:
...
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the event of bad blocks. (I'm not sure what actually happens when md scrubbing finds a bad sector with raid1..?). For low performance requirements I have become paranoid and been using RAID6 vs RAID10, filesystems with sector checksums seem attractive...
RAID6 is very slow for write operations. That's why it's the worst choice for maildir.
On 2012-04-11 4:48 PM, Adrian Minta adrian.minta@gmail.com wrote:
On 04/11/12 19:50, Ed W wrote:
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the event of bad blocks. (I'm not sure what actually happens when md scrubbing finds a bad sector with raid1..?). For low performance requirements I have become paranoid and been using RAID6 vs RAID10, filesystems with sector checksums seem attractive...
RAID6 is very slow for write operations. That's why it's the worst choice for maildir.
He did say '"For *low* *performance* requirements..." ... ;)
--
Best regards,
Charles
On 4/11/2012 11:50 AM, Ed W wrote:
Re XFS. Have you been watching BTRFS recently?
I will concede that despite the authors considering it production ready, I won't be using it for my servers just yet. However, on single disk benchmarks it performs fairly similarly to XFS, and in certain cases (multi-threaded performance) can be somewhat better. I haven't yet seen any benchmarks on larger disk arrays, e.g. 6+ disks, so I have no idea how it scales up. Basically, what I have seen seems "competitive".
Links?
I don't have such hardware spare to benchmark, but I would be interested to hear from someone who benchmarks your RAID1+linear+XFS suggestion, especially if they have compared a cutting edge btrfs kernel on the same array?
http://btrfs.boxacle.net/repository/raid/history/History_Mail_server_simulat...
This is with an 8-wide LVM stripe over eight 17-drive hardware RAID0 arrays. If the disks had been set up as a concat of 68 RAID1 pairs, XFS would have turned in numbers significantly higher, anywhere from a 100% increase to 500%. It's hard to say because the Boxacle folks didn't show the XFS AG config they used. The concat+RAID1 setup can decrease disk seeks by many orders of magnitude vs striping. Everyone knows as seeks go down, IOPS go up. Even with this very suboptimal disk setup, XFS still trounces everything but JFS, which is a close 2nd. BTRFS is way down in the pack. It would be nice to see these folks update these results with a 3.2.6 kernel, as both BTRFS and XFS have improved significantly since 2.6.35. EXT4 and JFS have seen little performance work since. In fact JFS has seen no commits but bug fixes and changes to allow compiling with recent kernels.
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the event of bad blocks. (I'm not sure what actually happens when md scrubbing finds a bad sector with raid1..?). For low performance requirements I have become paranoid and been using RAID6 vs RAID10, filesystems with sector checksums seem attractive...
Except we're using hardware RAID1 here and mdraid linear. Thus the controller takes care of sector integrity. RAID6 yields nothing over RAID10, except lower performance, and more usable space if more than 4 drives are used.
-- Stan
On 4/12/12, Stan Hoeppner stan@hardwarefreak.com wrote:
On 4/11/2012 11:50 AM, Ed W wrote:
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the event of bad blocks. (I'm not sure what actually happens when md scrubbing finds a bad sector with raid1..?). For low performance requirements I have become paranoid and been using RAID6 vs RAID10, filesystems with sector checksums seem attractive...
Except we're using hardware RAID1 here and mdraid linear. Thus the controller takes care of sector integrity. RAID6 yields nothing over RAID10, except lower performance, and more usable space if more than 4 drives are used.
How would the controller ensure sector integrity unless it is writing additional checksum information to disk? I thought only a few filesystems like ZFS do sector checksums to detect whether any data corruption has occurred. I suppose the controller could throw an error if the two drives returned data that didn't agree, but it wouldn't know which is the accurate copy, so that wouldn't protect the integrity of the data, at least not directly without additional human intervention, I would think.
On 4/11/2012 9:23 PM, Emmanuel Noobadmin wrote:
On 4/12/12, Stan Hoeppner stan@hardwarefreak.com wrote:
On 4/11/2012 11:50 AM, Ed W wrote:
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the event of bad blocks. (I'm not sure what actually happens when md scrubbing finds a bad sector with raid1..?). For low performance requirements I have become paranoid and been using RAID6 vs RAID10, filesystems with sector checksums seem attractive...
Except we're using hardware RAID1 here and mdraid linear. Thus the controller takes care of sector integrity. RAID6 yields nothing over RAID10, except lower performance, and more usable space if more than 4 drives are used.
How would the controller ensure sector integrity unless it is writing additional checksum information to disk? I thought only a few filesystems like ZFS do sector checksums to detect whether any data corruption has occurred. I suppose the controller could throw an error if the two drives returned data that didn't agree, but it wouldn't know which is the accurate copy, so that wouldn't protect the integrity of the data, at least not directly without additional human intervention, I would think.
When a drive starts throwing uncorrectable read errors, the controller faults the drive and tells you to replace it. Good hardware RAID controllers are notorious for their penchant to kick drives that would continue to work just fine in mdraid or as a single drive for many more years. The mindset here is that anyone would rather spend $150-$2500 on a replacement drive than take a chance with his/her valuable data.
Yes I typed $2500. EMC charges over $2000 for a single Seagate disk drive with an EMC label and serial# on it. The serial number is what prevents one from taking the same off the shelf Seagate drive at $300 and mounting it in a $250,000 EMC array chassis. The controller firmware reads the S/N from each connected drive and will not allow foreign drives to be used. HP, IBM, Oracle/Sun, etc do this as well. Which is why they make lots of profit, and why I prefer open storage systems.
-- Stan
On 12/04/2012 11:20, Stan Hoeppner wrote:
On 4/11/2012 9:23 PM, Emmanuel Noobadmin wrote:
On 4/12/12, Stan Hoeppner stan@hardwarefreak.com wrote:
On 4/11/2012 11:50 AM, Ed W wrote:
One of the snags of md RAID1 vs RAID6 is the lack of checksumming in the event of bad blocks. (I'm not sure what actually happens when md scrubbing finds a bad sector with raid1..?) For low performance requirements I have become paranoid and been using RAID6 vs RAID10, filesystems with sector checksums seem attractive...
Except we're using hardware RAID1 here and mdraid linear. Thus the controller takes care of sector integrity. RAID6 yields nothing over RAID10, except lower performance, and more usable space if more than 4 drives are used.
How would the controller ensure sector integrity unless it is writing additional checksum information to disk? I thought only a few filesystems like ZFS do sector checksums to detect whether any data corruption has occurred. I suppose the controller could throw an error if the two drives returned data that didn't agree, but it wouldn't know which is the accurate copy, so that wouldn't protect the integrity of the data, at least not directly without additional human intervention, I would think.
When a drive starts throwing uncorrectable read errors, the controller faults the drive and tells you to replace it. Good hardware RAID controllers are notorious for their penchant to kick drives that would continue to work just fine in mdraid or as a single drive for many more years. The mindset here is that anyone would rather spend $150-$2500 on a replacement drive than take a chance with his/her valuable data.
I'm asking a subtly different question.
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own. The claim is therefore that you can have a raid1 pair where neither drive reports a hardware failure, but each gives you different data? I can't personally claim to have observed this, so it remains someone else's theory... (for background, my experience is simply: RAID10 for high performance arrays and RAID6 for all my personal data; I intend to investigate your linear raid idea in the future though)
I do agree that if one drive reports a read error, then it's quite easy to guess which member of the pair is wrong...
Just as an aside, I don't have a lot of failure experience. However, in the few events I have had (perhaps 6-8 now) there has been a massive correlation in failure time with RAID1; e.g. one pair I had lasted perhaps 2 years and then both drives failed within 6 hours of each other. I also had a bad experience with a RAID 5 that wasn't being scrubbed regularly: when one drive started reporting errors (i.e. lack of monitoring meant it had been bad for a while), the rest of the array turned out to be a patchwork of read errors. Linux raid then turns out to be quite fragile in the presence of a small number of read failures, and it's extremely difficult to salvage the 99% of the array which is OK because the disks get kicked out... (of course regular scrubs would have prevented getting so deep into that situation; it was a small cheap NAS box without such features)
Ed W
On 12.4.2012, at 13.58, Ed W wrote:
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own. The claim is therefore that you can have a raid1 pair where neither drive reports a hardware failure, but each gives you different data?
That's one reason why I planned on adding a checksum to each message in dbox. But I forgot to actually do that. I guess I could add it for new messages in some upcoming version. Then Dovecot could optionally verify the checksum before returning the message to client, and if it detects corruption perhaps automatically read it from some alternative location (e.g. if dsync replication is enabled ask from another replica). And Dovecot index files really should have had some small (8/16/32bit) checksums of stuff as well..
On 12/04/2012 12:09, Timo Sirainen wrote:
On 12.4.2012, at 13.58, Ed W wrote:
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own. The claim is therefore that you can have a raid1 pair where neither drive reports a hardware failure, but each gives you different data?

That's one reason why I planned on adding a checksum to each message in dbox. But I forgot to actually do that. I guess I could add it for new messages in some upcoming version. Then Dovecot could optionally verify the checksum before returning the message to client, and if it detects corruption perhaps automatically read it from some alternative location (e.g. if dsync replication is enabled ask from another replica). And Dovecot index files really should have had some small (8/16/32bit) checksums of stuff as well..
I have to say - I haven't actually seen this happen... Do any of your big mailstore contacts observe this, eg rackspace, etc?
I think it's worth thinking about the failure cases before implementing something to be honest? Just sticking in a checksum possibly doesn't help anyone unless it's on the right stuff and in the right place?
Off the top of my head:
- Someone butchers the file on disk (disk error or someone edits it with vi)
- Restore of some files goes subtly wrong, eg tool tries to be clever and fails, snapshot taken mid-write, etc?
- Filesystem crash (sudden power loss), how to deal with partial writes?
Things I might like to do *if* there were some suitable "checksums" available:
- Use the checksum as some kind of guid either for the whole message, the message minus the headers, or individual mime sections
- Use the checksums to assist with replication speed/efficiency (dsync or custom imap commands)
- File RFCs for new imap features along the "lemonade" lines which allow clients to have faster recovery from corrupted offline states...
- Single instance storage (presumably already done, and of course this has some subtleties in the face of deliberate attack)
- Possibly duplicate email suppression (but really this is an LDA problem...)
- Storage backends where emails are redundantly stored and might not ALL be on a single server (find me the closest copy of email X) - derivations of this might be interesting for compliance archiving of messages?
- Fancy key-value storage backends might use checksums as part of the key value (either for the whole or parts of the message)
The mail server has always looked like a kind of key-value store to my eye. However, a traditional key-value store isn't usually optimised for "streaming reads"; hence dovecot seems like a "key value store, optimised for sequential high speed streaming access to the key values"... Whilst it seems increasingly unlikely that a traditional key-value store will work well to replace, say, mdbox, I wonder if it's not worth looking at the replication strategies of key-value stores to see if those ideas couldn't lead to new features for mdbox?
Cheers
Ed W
Hi there,
I have to say - I haven't actually seen this happen... Do any of your big mailstore contacts observe this, eg rackspace, etc?
Just to throw into the discussion: with (silent) data corruption, not only "the disk" is involved but many other parts of your systems. So perhaps you would like to have a look at
The documents are from 2007 but the principles are still the same.
Kind regards Dirk
On 12.4.2012, at 15.10, Ed W wrote:
On 12/04/2012 12:09, Timo Sirainen wrote:
On 12.4.2012, at 13.58, Ed W wrote:
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own. The claim is therefore that you can have a raid1 pair where neither drive reports a hardware failure, but each gives you different data?
That's one reason why I planned on adding a checksum to each message in dbox. But I forgot to actually do that. I guess I could add it for new messages in some upcoming version. Then Dovecot could optionally verify the checksum before returning the message to the client, and if it detects corruption, perhaps automatically read it from some alternative location (e.g. if dsync replication is enabled, ask another replica). And Dovecot index files really should have had some small (8/16/32-bit) checksums of stuff as well.
I have to say - I haven't actually seen this happen... Do any of your big mailstore contacts observe this, eg rackspace, etc?
I haven't heard. But then again people don't necessarily notice if it has.
Things I might like to do *if* there were some suitable "checksums" available:
- Use the checksum as some kind of guid either for the whole message, the message minus the headers, or individual mime sections
Messages already have a GUID. And the rest of that is kind of done with the single instance storage stuff.. I was thinking of using SHA1 of the entire message with headers as the checksum, and save it into dbox metadata field. I also thought about checksumming the metadata fields as well, but that would need another checksum as the first one can have other uses as well besides verifying the message integrity.
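To illustrate the idea only - this is not Dovecot code, and the file names are made up - a whole-message SHA1 computed at save time and re-checked at read time could look roughly like:

$ sha1=$(sha1sum msg.eml | cut -d' ' -f1)   # hash the entire message, headers included
$ echo "$sha1  msg.eml" > msg.eml.sha1      # stand-in for the dbox metadata field
$ sha1sum -c msg.eml.sha1                   # verify before serving; non-zero exit = corruption

A mismatch would then be the trigger for fetching the message from a dsync replica instead.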
- Use the checksums to assist with replication speed/efficiency (dsync or custom imap commands)
It would be of some use with dbox index rebuilding. I don't think it would help with dsync.
- File RFCs for new imap features along the "Lemonade" lines which allow clients to have faster recovery from corrupted offline states...
Too much trouble, no one would implement it :)
- Storage backends where emails are redundantly stored and might not ALL be on a single server (find me the closest copy of email X) - derivations of this might be interesting for compliance archiving of messages?
- Fancy key-value storage backends might use checksums as part of the key value (either for the whole or parts of the message)
GUID would work for these as well, without the possibility of a hash collision.
On 13/04/2012 12:51, Timo Sirainen wrote:
- Use the checksums to assist with replication speed/efficiency (dsync or custom imap commands)
It would be of some use with dbox index rebuilding. I don't think it would help with dsync. ..
- File RFCs for new imap features along the "Lemonade" lines which allow clients to have faster recovery from corrupted offline states...
Too much trouble, no one would implement it :)
I presume you have seen that cyrus is working on various distributed options? Standardising this through imap might work if they also buy into it?
- Storage backends where emails are redundantly stored and might not ALL be on a single server (find me the closest copy of email X) - derivations of this might be interesting for compliance archiving of messages?
- Fancy key-value storage backends might use checksums as part of the key value (either for the whole or parts of the message) GUID would work for these as well, without the possibility of a hash collision.
I was thinking that the win for key-value store as a backend is if you can reduce the storage requirements or do better placement of the data (mail text replicated widely, attachments stored on higher latency storage?). Hence whilst I don't see this being a win with current options, if it were done then it would almost certainly be "per mime part", eg storing all large attachments in one place and the rest of the message somewhere else, perhaps with different redundancy levels per type
OK, this is all completely pie in the sky. Please don't build it! All I meant was that these are the kind of things that someone might one day desire to do and hence they would have competing requirements for what to checksum...
Cheers
Ed W
On 13.4.2012, at 15.17, Ed W wrote:
On 13/04/2012 12:51, Timo Sirainen wrote:
- Use the checksums to assist with replication speed/efficiency (dsync or custom imap commands)
It would be of some use with dbox index rebuilding. I don't think it would help with dsync. ..
- File RFCs for new imap features along the "Lemonade" lines which allow clients to have faster recovery from corrupted offline states...
Too much trouble, no one would implement it :)
I presume you have seen that cyrus is working on various distributed options? Standardising this through imap might work if they also buy into it?
Probably more trouble than worth. I doubt anyone would want to run a cross-Dovecot/Cyrus cluster.
- Storage backends where emails are redundantly stored and might not ALL be on a single server (find me the closest copy of email X) - derivations of this might be interesting for compliance archiving of messages?
- Fancy key-value storage backends might use checksums as part of the key value (either for the whole or parts of the message) GUID would work for these as well, without the possibility of a hash collision.
I was thinking that the win for key-value store as a backend is if you can reduce the storage requirements or do better placement of the data (mail text replicated widely, attachments stored on higher latency storage?). Hence whilst I don't see this being a win with current options, if it were done then it would almost certainly be "per mime part", eg storing all large attachments in one place and the rest of the message somewhere else, perhaps with different redundancy levels per type
OK, this is all completely pie in the sky. Please don't build it! All I meant was that these are the kind of things that someone might one day desire to do and hence they would have competing requirements for what to checksum...
That can almost be done already .. the attachments are saved and accessed via a lib-fs API. It wouldn't be difficult to write a backend for some key-value databases. So with about one day's coding you could already have Dovecot save all message attachments to a key-value db, and you can configure redundancy in the db's configs.
On 13/04/2012 13:21, Timo Sirainen wrote:
On 13.4.2012, at 15.17, Ed W wrote:
On 13/04/2012 12:51, Timo Sirainen wrote:
- Use the checksums to assist with replication speed/efficiency (dsync or custom imap commands)
It would be of some use with dbox index rebuilding. I don't think it would help with dsync. ..
- File RFCs for new imap features along the "Lemonade" lines which allow clients to have faster recovery from corrupted offline states...
Too much trouble, no one would implement it :)
I presume you have seen that cyrus is working on various distributed options? Standardising this through imap might work if they also buy into it?
Probably more trouble than worth. I doubt anyone would want to run a cross-Dovecot/Cyrus cluster.
No, definitely not. Sorry, I just meant that you are both working on similar things. Standardising the basics that each use might be useful in the future
That can almost be done already .. the attachments are saved and accessed via a lib-fs API. It wouldn't be difficult to write a backend for some key-value databases. So with about one day's coding you could already have Dovecot save all message attachments to a key-value db, and you can configure redundancy in the db's configs.
Hmm, super.
Ed W
On 4/12/2012 5:58 AM, Ed W wrote:
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own. The claim is therefore that you can have a raid1 pair where neither drive reports a hardware failure, but each gives you different data?
You need to read those articles again very carefully. If you don't understand what they mean by "1 in 10^15 bits non-recoverable read error rate" and combined probability, let me know.
And this has zero bearing on RAID1. And RAID1 reads don't work the way you describe above. I explained this in some detail recently.
I do agree that if one drive reports a read error, then it's quite easy to guess which pair of the array is wrong...
Been working that way for more than 2 decades Ed. :) Note that "RAID1" has that "1" for a reason. It was the first RAID level. It was in production for many many years before parity RAID hit the market. It is the most well understood of all RAID levels, and the simplest.
-- Stan
On 13/04/2012 06:29, Stan Hoeppner wrote:
On 4/12/2012 5:58 AM, Ed W wrote:
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own. The claim is therefore that you can have a raid1 pair where neither drive reports a hardware failure, but each gives you different data?
You need to read those articles again very carefully. If you don't understand what they mean by "1 in 10^15 bits non-recoverable read error rate" and combined probability, let me know.
OK, I'll bite. I only have an honours degree in mathematics from a well known university, so grateful if you could dumb it down appropriately?
Let's start with: what "those articles" are you referring to? I don't see any articles if I go literally up the chain from this email, but you might be talking about any of the many other emails in this thread or even some other email thread?
Wikipedia has its faults, but it dumbs the "silent corruption" claim down to: http://en.wikipedia.org/wiki/ZFS "an undetected error for every 67TB"
And a CERN study apparently claims "far higher than one in every 10^16 bits"
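For scale, converting those rates (my arithmetic, not from either source):

10^15 bits / 8 = 1.25 x 10^14 bytes, i.e. roughly one unrecoverable read error per 125 TB read
10^16 bits / 8 = 1.25 x 10^15 bytes, i.e. roughly one per 1.25 PB read

So the "undetected error for every 67TB" figure is within the same order of magnitude as the drives' own quoted specs.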
Now, I'm NOT professing any experience or axe to grind here. I'm simply asking: by what feature do you believe either software or hardware RAID1 is capable of detecting which copy is correct when the two halves of a RAID1 pair return different results and there is no hardware failure to clue us in that one half suffered a read error? Please don't respond with a maths pissing competition; it's an innocent question about what levels of data checking are done on each piece of the hardware chain. My (probably flawed) understanding is that popular RAID1 implementations don't add any additional sector checksums over and above what the drives/filesystem/etc already offer - is this the case?
And this has zero bearing on RAID1. And RAID1 reads don't work the way you describe above. I explained this in some detail recently.
Where?
Been working that way for more than 2 decades Ed. :) Note that "RAID1" has that "1" for a reason. It was the first RAID level.
What should I make of RAID0 then?
Incidentally do you disagree with the history of RAID evolution on Wikipedia? http://en.wikipedia.org/wiki/RAID
Regards
Ed W
On 4/12/12, Stan Hoeppner stan@hardwarefreak.com wrote:
On 4/11/2012 9:23 PM, Emmanuel Noobadmin wrote:
I suppose the controller could throw an error if the two drives returned data that didn't agree with each other, but it wouldn't know which is the accurate copy, so that wouldn't protect the integrity of the data, at least not directly without additional human intervention, I would think.
When a drive starts throwing uncorrectable read errors, the controller faults the drive and tells you to replace it. Good hardware RAID controllers are notorious for their penchant to kick drives that would continue to work just fine in mdraid or as a single drive for many more years.
What I meant wasn't the drive throwing uncorrectable read errors but the drives are returning different data that each think is correct or both may have sent the correct data but one of the set got corrupted on the fly. After reading the articles posted, maybe the correct term would be the controller receiving silently corrupted data, say due to bad cable on one.
If the controller simply returns the fastest result, it could be the bad sector, and that doesn't protect the integrity of the data, right?
If the controller gets the 1st half from one drive and the 2nd half from the other drive to speed up performance, we could still get the corrupted half, and the controller itself still can't tell whether the sector it got was corrupted, can it?
If the controller compares the two sectors from the drives, it may be able to tell us something is wrong, but there isn't any way for it to know which one of the sectors was a good read and which wasn't, or is there?
On 4/13/2012 1:12 AM, Emmanuel Noobadmin wrote:
On 4/12/12, Stan Hoeppner stan@hardwarefreak.com wrote:
On 4/11/2012 9:23 PM, Emmanuel Noobadmin wrote:
I suppose the controller could throw an error if the two drives returned data that didn't agree with each other, but it wouldn't know which is the accurate copy, so that wouldn't protect the integrity of the data, at least not directly without additional human intervention, I would think.
When a drive starts throwing uncorrectable read errors, the controller faults the drive and tells you to replace it. Good hardware RAID controllers are notorious for their penchant to kick drives that would continue to work just fine in mdraid or as a single drive for many more years.
What I meant wasn't the drive throwing uncorrectable read errors but the drives are returning different data that each think is correct or both may have sent the correct data but one of the set got corrupted on the fly. After reading the articles posted, maybe the correct term would be the controller receiving silently corrupted data, say due to bad cable on one.
This simply can't happen. What articles are you referring to? If the author is stating what you say above, he simply doesn't know what he's talking about.
If the controller simply returns the fastest result, it could be the bad sector and that doesn't protect the integrity of the data right?
I already answered this in a previous post.
if the controller gets 1st half from one drive and 2nd half from the other drive to speed up performance, we could still get the corrupted half and the controller itself still can't tell if the sector it got was corrupted isn't it?
No, this is not correct.
If the controller compares the two sectors from the drives, it may be able to tell us something is wrong but there isn't any way for it to know which one of the sectors was a good read and which wasn't, or is there?
Yes it can, and it does.
Emmanuel, Ed, we're at a point where I simply don't have the time nor inclination to continue answering these basic questions about the base level functions of storage hardware. You both have serious misconceptions about how many things work. To answer the questions you're asking will require me to teach you the basics of hardware signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet transmission error detection protocols, disk drive firmware error recovery routines, etc, etc, etc.
I don't mind, and actually enjoy, passing knowledge. But the amount that seems to be required here to bring you up to speed is about 2^15 times above and beyond the scope of mailing list conversation.
In closing, I'll simply say this: If hardware, whether a mobo-down SATA chip, or a $100K SGI SAN RAID controller, allowed silent data corruption or transmission to occur, there would be no storage industry, and we'd all still be using pen and paper. The questions you're asking were solved by hardware and software engineers decades ago. You're fretting and asking about things that were solved decades ago.
-- Stan
On 04/13/2012 08:33 AM, Stan Hoeppner wrote:
What I meant wasn't the drive throwing uncorrectable read errors but the drives are returning different data that each think is correct or both may have sent the correct data but one of the set got corrupted on the fly. After reading the articles posted, maybe the correct term would be the controller receiving silently corrupted data, say due to bad cable on one.
This simply can't happen. What articles are you referring to? If the author is stating what you say above, he simply doesn't know what he's talking about.
?! Stan, are you really saying that silent data corruption "simply can't happen"? People who have been studying this have been talking about it for years now. It can happen in the same way that Emmanuel describes.
USENIX FAST08:
http://static.usenix.org/event/fast08/tech/bairavasundaram.html
CERN:
http://storagemojo.com/2007/09/19/cerns-data-corruption-research/
http://fuji.web.cern.ch/fuji/talk/2007/kelemen-2007-C5-Silent_Corruptions.pd...
LANL:
http://institute.lanl.gov/resilience/conferences/2009/HPCResilience09_Michal...
There are others if you search for it. This problem has been well-known in large (petabyte+) data storage systems for some time.
Jim
On 4/13/2012 8:12 AM, Jim Lawson wrote:
On 04/13/2012 08:33 AM, Stan Hoeppner wrote:
What I meant wasn't the drive throwing uncorrectable read errors but the drives are returning different data that each think is correct or both may have sent the correct data but one of the set got corrupted on the fly. After reading the articles posted, maybe the correct term would be the controller receiving silently corrupted data, say due to bad cable on one.
This simply can't happen. What articles are you referring to? If the author is stating what you say above, he simply doesn't know what he's talking about.
?! Stan, are you really saying that silent data corruption "simply can't happen"?
Yes, I did. Did you read the context in which I made that statement?
People who have been studying this have been talking about it for years now.
Yes, they have. Did you miss the paragraph where I stated exactly that? Did you also miss the part about the probability of such being dictated by total storage system size and access rate?
It can happen in the same way that Emmanuel describes.
No, it can't. Not in the way Emmanuel described. I already stated the reason, and all of this research backs my statement. You won't see this with a 2 drive mirror, or a 20 drive RAID10. Not until each drive has a capacity in the 15TB+ range, if not more, and again, depending on the total system size. This doesn't address the "RAID5", better known as "parity RAID" write hole, which is a separate issue. Which is also one of the reasons I don't use it.
In the absence of an actual controller firmware bug, or mdraid or lvm bug, you'll never see this on small scale systems.
USENIX FAST08:
http://static.usenix.org/event/fast08/tech/bairavasundaram.html
CERN:
http://storagemojo.com/2007/09/19/cerns-data-corruption-research/
http://fuji.web.cern.ch/fuji/talk/2007/kelemen-2007-C5-Silent_Corruptions.pd...
LANL:
http://institute.lanl.gov/resilience/conferences/2009/HPCResilience09_Michal...
There are others if you search for it. This problem has been well-known in large (petabyte+) data storage systems for some time.
And again, this is the crux of it. One doesn't see this problem until one hits extreme scale, which I spent at least a paragraph or two explaining, referencing the same research. Please re-read my post at least twice, critically. Then tell me if I've stated anything substantively different from what any of these researchers have.
The statements "shouldn't", "wouldn't" and "can't" are based on probabilities. "Can't" or "won't" does not need to equal probability 0. The probability of this type of silent data corruption occurring on a 2 disk or 20 disk array of today's drives is not zero over 10 years, but it is so low that the effective statement is "can't" or "won't" see this corruption. As I said, when we reach 15-30TB+ disk drives, this may change for small count arrays.
-- Stan
On 13/04/2012 13:33, Stan Hoeppner wrote:
What I meant wasn't the drive throwing uncorrectable read errors but the drives are returning different data that each think is correct or both may have sent the correct data but one of the set got corrupted on the fly. After reading the articles posted, maybe the correct term would be the controller receiving silently corrupted data, say due to bad cable on one.
This simply can't happen. What articles are you referring to? If the author is stating what you say above, he simply doesn't know what he's talking about.
It quite clearly can??!
Just grab your drive, lever the connector off a little bit until it's a bit flaky and off you go? *THIS* type of problem I have heard of and you can find easy examples with a quick google search of any hobbyist storage board. Other very common examples are problems due to failing PSUs and other interference causing explicit disk errors (and once the error rate goes up, some errors will make it past the checksum)
Note this is NOT what I was originally asking about. My interest is more about when the hardware is working reliably and as you agree, the error levels are vastly lower. However, it would be incredibly foolish to claim that it's not trivial to construct a scenario where bad hardware causes plenty of silent corruption?
If the controller simply returns the fastest result, it could be the bad sector and that doesn't protect the integrity of the data right?
I already answered this in a previous post.
Not obviously?!
I will also add my understanding that Linux software RAID 1, 5 & 6 *DO NOT* read all disks and hence will not be aware when disks have different data. In fact, with software raid you need to run a regular "scrub" job to check this consistency.
I also believe that most commodity hardware raid implementations work exactly the same way and a background scrub is needed to detect inconsistent arrays. However, feel free to correct that understanding?
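For concreteness, this is how such a scrub is kicked off with Linux mdraid (array name assumed; this is the md sysfs interface):

$ echo check > /sys/block/md0/md/sync_action   # read all mirrors/parity and compare
$ cat /sys/block/md0/md/mismatch_cnt           # non-zero after the pass = the copies disagreed

Debian-derived mdadm packages ship a cron job (checkarray) that runs exactly this on a schedule.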
if the controller gets 1st half from one drive and 2nd half from the other drive to speed up performance, we could still get the corrupted half and the controller itself still can't tell if the sector it got was corrupted isn't it?
No, this is not correct.
I definitely think you are wrong and Emmanuel is right?
If the controller gets a good read from the disk then it will trust that read and will NOT check the result with the other disk (or parity in the case of RAID5/6). If that read was incorrect for some reason then the data will be passed as good.
If the controller compares the two sectors from the drives, it may be able to tell us something is wrong but there isn't any way for it to know which one of the sectors was a good read and which wasn't, or is there?
Yes it can, and it does.
No it definitely does not!! At least not with linux software raid and I don't believe on commodity hardware controllers either! (You would be able to tell because the disk IO would be doubled)
Linux software raid 1 isn't that smart: it reads only one disk and trusts the answer if the read did not trigger an error. It does not check the other disk except during an explicit disk scrub.
Emmanuel, Ed, we're at a point where I simply don't have the time nor inclination to continue answering these basic questions about the base level functions of storage hardware.
You mean those "answers" like: "I answered that in another thread" or "you need to read 'those' articles again"
Referring to some unknown and hard to find previous emails is not the same as answering?
Also you are wandering off on extreme tangents. The question is simple:
- Disk 1 Read good, checksum = A
- Disk 2 Read good, checksum = B
Disks are a raid 1 pair. How do we know which disk is correct? Please specify the raid 1 implementation and mechanism used with any answer.
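My understanding of what md actually does if you force the question (a sketch, Linux md RAID1 assumed):

$ echo repair > /sys/block/md0/md/sync_action
# On a mismatch, md RAID1 picks one copy (effectively arbitrarily) and
# rewrites the other leg from it - with no per-sector checksum it has no
# way to decide which copy was the good read.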
To answer the questions you're asking will require me to teach you the basics of hardware signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet transmission error detection protocols, disk drive firmware error recovery routines, etc, etc, etc.
I really think not... A simple statement of:
- Each sector on disk has a certain sized checksum
- Controller checks checksum on read
- Sent back over SATA connection, with a certain sized checksum
- After that you are on your own vs corruption
...Should cover it I think?
In closing, I'll simply say this: If hardware, whether a mobo-down SATA chip, or a $100K SGI SAN RAID controller, allowed silent data corruption or transmission to occur, there would be no storage industry, and we'll all still be using pen and paper. The questions you're asking were solved by hardware and software engineers decades ago. You're fretting and asking about things that were solved decades ago.
So why are so many people getting excited about it now?
Note, there have been plenty of shoddy disk controller implementations before today - ie there exists hardware on sale with *known* defects. Despite that the industry continues without collapse. Now you claim that if corruption is silent and people only tend to notice it much later and under certain edge conditions, that this can't be possible because it should cause the industry to collapse..???
...Not buying your logic...
Ed W
On Fri, 13 Apr 2012, Ed W wrote:
On 13/04/2012 13:33, Stan Hoeppner wrote:
What I meant wasn't the drive throwing uncorrectable read errors but the drives are returning different data that each think is correct or both may have sent the correct data but one of the set got corrupted on the fly. After reading the articles posted, maybe the correct term would be the controller receiving silently corrupted data, say due to bad cable on one.
This simply can't happen. What articles are you referring to? If the author is stating what you say above, he simply doesn't know what he's talking about.
It quite clearly can??!
I totally agree with Ed here. Drives sure can and sometimes really do return different data, without reporting errors. Also, data can get corrupted on any of the busses or chips it passes through.
The math about 10^15 or 10^16 and all that stuff is not only about array sizes. It's also about data transfer.
I've seen silent corruption on a few systems myself. (Luckily, only 3 times in a couple years.) Those systems were only in the 2TB-5TB size category, which is substantially lower than the 67TB claimed elsewhere. Yet, statistically, it's well within normal probability levels.
Linux mdraid only reads one mirror as long as the drives don't return an error. Easy to check, the read speeds are way beyond a single drive's read speed. When the kernel would have to read all (possibly more than two) mirrors, and compare them, and make a decision based on this comparison, things would be horribly slow. Hardware raid typically uses this exact same approach. This goes for Areca, 3ware, LSI, which cover most of the regular (i.e. non-SAN) professional hardware raid setups.
If you don't believe it, just don't take my word for it but test it for yourself. Cleanly power down a raid1 array, take the individual drives, put them into a simple desktop machine, and write different data to both, using some raw disk writing tool like dd. Then put the drives back into the raid1 array, power it up, and re-read the information. You'll see data from both drives will be intermixed as parts of the reads come from one disk, and parts come from the other. Only when you order the raid array to do a verification pass, it'll start screaming and yelling. At least, I hope it will...
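A sketch of that experiment (device names and offsets assumed; this destroys data, so scratch disks only):

$ mdadm --stop /dev/md0                          # cleanly stop the RAID1 array
$ dd if=/dev/urandom of=/dev/sdb1 bs=512 seek=40960 count=8   # clobber a few sectors on ONE leg (offset illustrative - it must land inside the md data area, past the superblock)
$ mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1  # reassemble; md reports nothing wrong
$ dd if=/dev/md0 of=/dev/null bs=1M              # reads succeed, served from either leg
$ echo check > /sys/block/md0/md/sync_action     # only a verification pass...
$ cat /sys/block/md0/md/mismatch_cnt             # ...notices and counts the disagreement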
But as explained elsewhere, silent corruption can occur at numerous places. If you don't have an explicit checksumming/checking mechanism, there are indeed cases that will haunt you if you don't do regular scrubbing or at least do regular verification runs. Heck, that's why Linux mdadm comes with cron jobs to do just that, and hardware raid controllers have similar scheduling capabilities.
Of course, scrubbing/verification is not going to magically protect you from all problems. But you would at least get notifications if it detects problems.
If the controller compares the two sectors from the drives, it may be able to tell us something is wrong but there isn't any way for it to know which one of the sectors was a good read and which wasn't, or is there?
Yes it can, and it does.
No it definitely does not!! At least not with linux software raid and I don't believe on commodity hardware controllers either! (You would be able to tell because the disk IO would be doubled)
Obviously there is no way to tell which version of a story is correct if you are not biased to believe one of the storytellers and distrust the other. You would have to add a checksum layer for that. (And hope the checksum isn't the part that got corrupted!)
To answer the questions you're asking will require me to teach you the basics of hardware signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet transmission error detection protocols, disk drive firmware error recovery routines, etc, etc, etc.
I'm quite familiar with the basics of these protocols. I'm also quite familiar with the flaws in several implementations of "seemingly straightforward protocols". More often than not, there's a pressing need to get new devices onto the market before the competition has something similar and you lose your advantage. All too often, this results in suboptimal implementations of all those fine protocols and algorithms. And let's face it: flaws in error recovery routines often don't surface until someone actually needs those routines. As long as drives (or any other device) are functioning as expected, everything is all right. But as soon as something starts to get flaky, error recovery has to kick in but may just as well fail to do the right thing.
Just consider the real-world analogy of politicians. They do or say something stupid every once in a while, and error recovery (a.k.a. damage control) has to kick in. But even those well-trained professionals, with decades of experience in the political arena, sometimes simply fail to do the right thing. They may have overlooked some pesky details, or they may take actions that don't have the expected outcome because... indeed, things work differently in damage control mode, and the only law you can trust is physics: you always go down when you can't stay on your feet.
With hard drives, raid controllers, mainboards, data buses, it's exactly the same. If _something_ isn't working as it should, how should we know which part of it we _can_ trust?
In closing, I'll simply say this: If hardware, whether a mobo-down SATA chip, or a $100K SGI SAN RAID controller, allowed silent data corruption or transmission to occur, there would be no storage industry, and we'll all still be using pen and paper. The questions you're asking were solved by hardware and software engineers decades ago. You're fretting and asking about things that were solved decades ago.
Isn't it just "worked around" by adding more layers of checksumming and adding more redundancy into the mix? Don't believe this "storage industry" because they tell you it's OK. It simply is not OK. You might want to talk to people in the data and computing cluster business about their opinion on "storage industry professionals"...
Timo's suggestion to add checksums to mailboxes/metadata could help to (at least) report these types of failures. Re-reading from different storage when available could also recover the data that got corrupted, but I'm not sure what would be the best way to handle these situations. If you know there is a corruption problem on one of your storage locations, you might want to switch that to read-only asap. Automagically trying to recover might not be the best thing to do. Given all kinds of different use cases, I think that should at least be configurable :-P
-- Maarten
On 4/13/2012 10:31 AM, Ed W wrote:
On 13/04/2012 13:33, Stan Hoeppner wrote:
In closing, I'll simply say this: If hardware, whether a mobo-down SATA chip, or a $100K SGI SAN RAID controller, allowed silent data corruption or transmission to occur, there would be no storage industry, and we'll all still be using pen and paper. The questions you're asking were solved by hardware and software engineers decades ago. You're fretting and asking about things that were solved decades ago.
So why are so many people getting excited about it now?
"So many"? I know of one person "getting excited" about it.
Data densities and overall storage sizes and complexity at the top end of the spectrum are increasing at a faster rate than the consistency/validation mechanisms. That's the entire point of the various academic studies on the issue. Note that the one study required a sample set of 1.5 million disk drives. If the phenomenon were a regular occurrence as you would have everyone here believe, they could have used a much smaller sample set.
Ed, this is an academic exercise. Academia leads industry. Almost always has. Academia blows the whistle and waves hands, prompting industry to take action.
There is nothing normal users need to do to address this problem. The hardware and software communities will make the necessary adjustments to address this issue before it filters down to the general user community in a half decade or more--when normal users have a 10-20 drive array of 500TB to 1PB or more.
Having the prestigious degree that you do, you should already understand the relationship between academic research and industry, and the considerable lead times involved.
-- Stan
On 14/04/2012 04:31, Stan Hoeppner wrote:
On 4/13/2012 10:31 AM, Ed W wrote:
On 13/04/2012 13:33, Stan Hoeppner wrote:
In closing, I'll simply say this: If hardware, whether a mobo-down SATA chip, or a $100K SGI SAN RAID controller, allowed silent data corruption or transmission to occur, there would be no storage industry, and we'd all still be using pen and paper. The questions you're asking were solved by hardware and software engineers decades ago. You're fretting and asking about things that were solved decades ago.
So why are so many people getting excited about it now?
"So many"? I know of one person "getting excited" about it.
You love being vague don't you? Go on, I'll bite again, do you mean yourself?
:-)
Data densities and overall storage sizes and complexity at the top end of the spectrum are increasing at a faster rate than the consistency/validation mechanisms. That's the entire point of the various academic studies on the issue.
Again, you love being vague. By your dismissive "academic studies" phrase, do you mean studies done on a major industrial player, ie NetApp in this case? Or do you mean that it's rubbish because they asked someone with some background in statistics to do the work, rather than asking someone sitting nearby in the office to do it?
I don't think the researcher broke into NetApp to do this research, so we have to conclude that the industrial partner was onboard. NetApp seem to do a bunch of engineering of their own (got enough patents..) that I think we can safely assume they very much do their own research on this and it's not just "academic"... I doubt they publish all their own internal research, be thankful you got to see some of the results this way...
Note that the one study required a sample set of 1.5 million disk drives. If the phenomenon were a regular occurrence as you would have everyone here believe, they could have used a much smaller sample set.
Sigh... You could have criticised the study as under-representative if it had a small number of drives, and now you criticise a large study for having too many observations...
You cannot have "too many" observations when measuring a small and unpredictable phenomenon...
Where does it say that they could NOT have reproduced this study with just 10 drives? If you have 1.5 million available, why not use all the results??
Ed, this is an academic exercise. Academia leads industry. Almost always has. Academia blows the whistle and waves hands, prompting industry to take action.
Sigh... We are back to the start of the email thread again... Gosh you seem to love arguing and muddying the water for zero reason but to have the last word?
It's *trivial* to do a google search and hit *lots* of reports of corruption in various parts of the system, from corrupting drivers, to hardware which writes incorrectly, to operating system flaws. I just found a bunch more in the Redhat database today while looking for something else. You yourself are very vocal on avoiding certain brands of HD controller which have been rumoured to cause corrupted data... (and thank you for revealing that kind of thing - it's very helpful)
Don't veer off at a tangent now: The *original* email this has spawned is about a VERY specific point. RAID1 appears to offer less protection against a class of error conditions than does RAID6. Nothing more, nothing less. Don't veer off and talk about the minutiae of testing studies at universities, this is a straightforward claim that you have been jumping around and avoiding answering with claims of needing to educate me on SCSI protocols and other fatuous responses. Nor deviate and discuss that RAID6 is inappropriate for many situations - we all get that...
There is nothing normal users need to do to address this problem.
...except sit tight and hope they don't lose anything important!
:-)
Having the prestigious degree that you do, you should already understand the relationship between academic research and industry, and the considerable lead times involved.
I'm guessing you haven't attended higher education then? You are confusing graduate and post-graduate systems...
Byee
Ed W
On 4/13/2012 10:31 AM, Ed W wrote:
You mean those "answers" like:
"you need to read 'those' articles again"
Referring to some unknown and hard to find previous emails is not the same as answering?
No, referring to this:
On 4/12/2012 5:58 AM, Ed W wrote:
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own.
Is it not a correct assumption that you read this in articles? If you read this in books, scrolls, or chiseled tablets, my apologies for assuming it was articles.
-- Stan
On 14/04/2012 04:48, Stan Hoeppner wrote:
On 4/13/2012 10:31 AM, Ed W wrote:
You mean those "answers" like: "you need to read 'those' articles again"
Referring to some unknown and hard to find previous emails is not the same as answering?
No, referring to this:
On 4/12/2012 5:58 AM, Ed W wrote:
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own.
Is it not a correct assumption that you read this in articles? If you read this in books, scrolls, or chiseled tablets, my apologies for assuming it was articles.
WHAT?!! The original context was that you wanted me to learn some very specific thing that you accused me of misunderstanding, and then it turns out that the thing I'm supposed to learn comes from re-reading every email, every blog post, every video, every slashdot post, every wiki, every ... that mentions ZFS's reason for including end to end checksumming?!!
Please stop wasting our time and get specific
You have taken my email which contained a specific question, been asked of you multiple times now and yet you insist on only answering irrelevant details with a pointed and personal dig on each answer. The rudeness is unnecessary, and your evasiveness of answers does not fill me with confidence that you actually know the answer...
For the benefit of anyone reading this via email archives or whatever, I think the conclusion we have reached is that modern systems are now: a) a complex sum of pieces, any of which can cause an error to be injected; b) the level of error correction which was originally specified as being sufficient is now starting to be reached in real systems, possibly even consumer systems. There is no "solution"; however, the first step is to enhance "detection". Various solutions have been proposed; all increase cost or computation or have some other disadvantage. However, one of the more promising detection mechanisms is an end-to-end checksum, which will then have the effect of augmenting ALL the steps in the chain, not just one specific step. As of today, only a few filesystems offer this - roll on more adopting it.
Regards
Ed W
On 4/14/2012 5:00 AM, Ed W wrote:
On 14/04/2012 04:48, Stan Hoeppner wrote:
On 4/13/2012 10:31 AM, Ed W wrote:
You mean those "answers" like: "you need to read 'those' articles again"
Referring to some unknown and hard to find previous emails is not the same as answering?
No, referring to this:
On 4/12/2012 5:58 AM, Ed W wrote:
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own.
Is it not a correct assumption that you read this in articles? If you read this in books, scrolls, or chiseled tablets, my apologies for assuming it was articles.
WHAT?!! The original context was that you wanted me to learn some very specific thing that you accused me of misunderstanding, and then it turns out that the thing I'm supposed to learn comes from re-reading every email, every blog post, every video, every slashdot post, every wiki, every ... that mentions ZFS's reason for including end to end checksumming?!!
No, the original context was your town crier statement that the sky is falling due to silent data corruption. I pointed out that this is not the case, currently, that most wouldn't see this until quite a few years down the road. I provided facts to back my statement, which you didn't seem to grasp or comprehend. I pointed this out and your top popped with a cloud of steam.
Please stop wasting our time and get specific
Whose time am I wasting, Ed? You're the primary person on this list who wastes everyone's time with these drawn out threads, usually unrelated to Dovecot. I have been plenty specific. The problem is you lack the knowledge and understanding of hardware communication. You're upset because I'm not pointing out the knowledge you seem to lack? Is that not a waste of everyone's time? Is that not even "more insulting"? Causing even more excited/heated emails from you?
You have taken my email which contained a specific question, been asked of you multiple times now and yet you insist on only answering irrelevant details with a pointed and personal dig on each answer. The rudeness is unnecessary, and your evasiveness of answers does not fill me with confidence that you actually know the answer...
Ed, I have not been rude. I've been attempting to prevent you dragging us into the mud, which you've done, as you often do. How specific would you like me to get? This is what you seem to be missing:
Drives perform per sector CRC before transmitting data to the HBA. ATA, SATA, SCSI, SAS, fiber channel devices and HBAs all perform CRC on wire data. The PCI/PCI-X/PCIe buses/channels and Southbridge all perform CRC on wire data. HyperTransport, and Intel's proprietary links also perform CRC on wire transmissions. Server memory is protected by ECC, some by ChipKill which can tolerate double bit errors.
With today's systems and storage densities, with error correcting code on all data paths within the system, and on the drives themselves, "silent data corruption" is not an issue--in absence of defective hardware or a bug, which are not relevant to the discussion.
For the benefit of anyone reading this via email archives or whatever, I think the conclusion we have reached is that: modern systems are now a) a complex sum of pieces, any of which can cause an error to be injected,
Errors occur all the time. And they're corrected nearly all of the time, on modern complex systems. Silent errors do not occur frequently, usually not at all, on most modern systems.
b) the level of error correction which was originally specified as being sufficient is now starting to be reached in real systems,
FSVO 'real systems'. The few occurrences of "silent data corruption" I'm aware of have been documented in academic papers published by researchers working at taxpayer-funded institutions. In the case of CERN, the problem was a firmware bug in the Western Digital drives that caused an issue with the 3Ware controllers. This kind of thing happens when using COTS DIY hardware in the absence of proper load validation testing. So this case doesn't really fit the Henny-penny silent data corruption scenario, as a firmware bug caused it. One that should have been caught and corrected during testing.
In the other cases I'm aware of, all were HPC systems which generated SDC under extended high loads, and these SDCs nearly all occurred somewhere other than the storage systems--CPUs, RAM, interconnect, etc. HPC apps tend to run the CPUs, interconnects, storage, etc, at full bandwidth for hours at a time, across tens of thousands of nodes, so the probability of SDC is much higher simply due to scale.
possibly even consumer systems.
Possibly? If you're going to post pure conjecture why not say "possibly even iPhones or Androids"? There's no data to back either claim. Stick to the facts.
There is no "solution", however, the first step is to enhance "detection". Various solutions have been proposed, all increase cost, computation or have some disadvantage - however, one of the more promising detection mechanisms is an end to end checksum, which will then have the effect of augmenting ALL the steps in the chain, not just one specific step. As of today, only a few filesystems offer this, roll on more adopting it
So after all the steam blowing, we're back to where we started. I disagree with your assertion that this is an issue that we--meaning "average" users not possessing 1PB storage systems or massive clusters--need to be worried about TODAY. I gave sound reasons as to why this is the case. You've given us 'a couple of academic papers say the sky is falling so I'm repeating the sky is falling'. Without apparently truly understanding the issue.
The data available and the experience of the vast majority of IT folks backs my position--which is why that's my position. There is little to no data supporting your position.
I say this isn't going to be an issue for average users, if at all, for a few years to come. You say it's here now. That's a fairly minor point of disagreement to cause such a heated (on your part) lengthy exchange.
BTW, if you see anything I've stated as rude you've apparently not been on the Interwebs long. ;)
-- Stan
On Fri, Apr 13, 2012 at 07:33:19AM -0500, Stan Hoeppner wrote:
What I meant wasn't the drive throwing uncorrectable read errors but the drives are returning different data that each think is correct or both may have sent the correct data but one of the set got corrupted on the fly. After reading the articles posted, maybe the correct term would be the controller receiving silently corrupted data, say due to bad cable on one.
This simply can't happen. What articles are you referring to? If the author is stating what you say above, he simply doesn't know what he's talking about.
It has happened to me, with RAID5 not RAID1. It was a firmware bug in the raid controller that caused the RAID array to go silently corrupted. The HW reported everything green -- but the filesystem was reporting lots of strange errors. This LUN was part of a larger filesystem striped over multiple LUNs, so parts of the fs were OK, while other parts were corrupt.
It was this bug:
http://delivery04.dhe.ibm.com/sar/CMA/SDA/02igj/7/ibm_fw1_ds4kfc_07605200_an...
- Fix 432525 - CR139339 Data corruption found on drive after reconstruct from GHSP (Global Hot Spare)
<snip>
In closing, I'll simply say this: If hardware, whether a mobo-down SATA chip, or a $100K SGI SAN RAID controller, allowed silent data corruption or transmission to occur, there would be no storage industry, and we'll all still be using pen and paper. The questions you're asking were solved by hardware and software engineers decades ago. You're fretting and asking about things that were solved decades ago.
Look at the plans for your favorite fs:
http://www.youtube.com/watch?v=FegjLbCnoBw
They're planning on doing metadata checksumming to be sure they don't receive corrupted metadata from the backend storage, and say that data validation is a storage subsystem *or* application problem.
Hardly a solved problem..
-jf
On 4/14/2012 5:04 AM, Jan-Frode Myklebust wrote:
On Fri, Apr 13, 2012 at 07:33:19AM -0500, Stan Hoeppner wrote:
What I meant wasn't the drive throwing uncorrectable read errors but the drives are returning different data that each think is correct or both may have sent the correct data but one of the set got corrupted on the fly. After reading the articles posted, maybe the correct term would be the controller receiving silently corrupted data, say due to bad cable on one.
This simply can't happen. What articles are you referring to? If the author is stating what you say above, he simply doesn't know what he's talking about.
It has happened to me, with RAID5 not RAID1. It was a firmware bug in the raid controller that caused the RAID array to go silently corrupted. The HW reported everything green -- but the filesystem was reporting lots of strange errors. This LUN was part of a larger filesystem striped over multiple LUNs, so parts of the fs were OK, while other parts were corrupt.
It was this bug:
http://delivery04.dhe.ibm.com/sar/CMA/SDA/02igj/7/ibm_fw1_ds4kfc_07605200_an...
- Fix 432525 - CR139339 Data corruption found on drive after reconstruct from GHSP (Global Hot Spare)
Note my comments were specific to the RAID1 case, or a concatenated set of RAID1 devices. And note the discussion was framed around silent corruption in the absence of bugs and hardware failure, or should I say, where no bugs or hardware failures can be identified.
<snip>
In closing, I'll simply say this: If hardware, whether a mobo-down SATA chip, or a $100K SGI SAN RAID controller, allowed silent data corruption or transmission to occur, there would be no storage industry, and we'll all still be using pen and paper. The questions you're asking were solved by hardware and software engineers decades ago. You're fretting and asking about things that were solved decades ago.
Look at the plans for your favorite fs:
http://www.youtube.com/watch?v=FegjLbCnoBw
They're planning on doing metadata checksumming to be sure they don't receive corrupted metadata from the backend storage, and say that data validation is a storage subsystem *or* application problem.
You can't make sure you don't receive corrupted data. You take steps to mitigate the negative effects of it if and when it happens. The XFS devs are planning this for the future. If the problem were here now, this work would have already been done.
Hardly a solved problem..
It has been up to this point. The issue going forward is that current devices don't employ sufficient consistency checking to meet future needs. And the disk drive makers apparently don't want to consume the additional bits required to properly do this in the drives.
If they'd dedicate far more bits to ECC we might not have this issue. But since it appears this isn't going to change, kernel, filesystem and application developers are taking steps to mitigate it. Again, this "silent corruption" issue as described in the various academic papers is a future problem for most, not a current problem. It's only a current problem for those at the bleeding edge of large scale storage. Note that firmware bugs in individual products aren't part of this issue. Those will be with us forever in various products because humans make mistakes. No amount of filesystem or application code can mitigate those. The solution to that is standard best practices: snapshots, backups, or even mirroring all your storage across different vendor hardware.
-- Stan
On 12/04/2012 02:18, Stan Hoeppner wrote:
On 4/11/2012 11:50 AM, Ed W wrote:
Re XFS. Have you been watching BTRFS recently?
I will concede that despite the authors considering it production ready I won't be using it for my servers just yet. However, it benchmarks fairly similarly to XFS on single disk benchmarks and in certain cases (multi-threaded performance) can be somewhat better. I haven't yet seen any benchmarks on larger disk arrays, eg 6+ disks, so no idea how it scales up. Basically what I have seen seems "competitive"
Links?
http://btrfs.ipv5.de/index.php?title=Main_Page#Benchmarking
See the regular Phoronix benchmarks in particular. However, I believe these are all single disk?
I don't have such hardware spare to benchmark, but I would be interested to hear from someone who benchmarks your RAID1+linear+XFS suggestion, especially if they have compared a cutting edge btrfs kernel on the same array?
http://btrfs.boxacle.net/repository/raid/history/History_Mail_server_simulat...
This is with an 8-wide LVM stripe over 8 hardware RAID0 arrays of 17 drives each. If the disks had been set up as a concat of 68 RAID1 pairs, XFS would have turned in numbers significantly higher, anywhere from a 100% increase to 500%.
My instinct is that this is an irrelevant benchmark for BTRFS because its performance characteristics for these workloads have changed so significantly? I would be far more interested in a 3.2 and then a 3.6/3.7 benchmark in a year's time
In particular recent benchmarks on Phoronix show btrfs exceeding XFS performance on heavily threaded benchmarks - however, I doubt this is representative of performance on a multi-disk benchmark?
It would be nice to see these folks update these results with a 3.2.6 kernel, as both BTRFS and XFS have improved significantly since 2.6.35. EXT4 and JFS have seen little performance work since.
My understanding is that there was a significant multi-thread performance boost for EXT4 in roughly the last year? I don't have a link to hand, but someone did some work to reduce lock contention (??) which I seem to recall made a very large difference on multi-user or multi-cpu workloads? I seem to recall that the summary was that it allowed Ext4 to scale up to a good fraction of XFS performance on "medium sized" systems? (I believe that XFS still continues to scale far better than anything else on large systems)
Point is that I think it's a bit unfair to say that little has changed on Ext4? It still seems to be developing faster than "maintenance only"
However, well OT... The original question was: anyone tried very recent BTRFS on a multi-disk system. Seems like the answer is no. My proposal is that it may be worth watching in the future
Cheers
Ed W
P.S. I have always been intrigued by the idea that a COW based filesystem could potentially implement much faster "RAID" parity, because it can avoid reading the whole stripe. The idea is that you treat unallocated space as "zero", which means you can compute the incremental parity with only a read/write of the checksum value (and with a COW filesystem you only ever update by rewriting to new "zero'd" space). I had in mind something like a fixed parity disk (RAID4?) and allowing the parity disk to be "write behind" cached in ram (ie exposed to risk of: power fails AND data disk fails at the same time). My code may not be following along for a while though...
On 04/10/12 08:00, Stan Hoeppner wrote:
Interestingly, I designed a COTS server back in January to handle at least 5k concurrent IMAP users, using best of breed components. If you or someone there has the necessary hardware skills, you could assemble this system and simply use it for NFS instead of Dovecot. The parts list: secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069985
Don't forget the Battery Backup Unit for RAID card !!!
On 4/10/2012 5:22 AM, Adrian Minta wrote:
On 04/10/12 08:00, Stan Hoeppner wrote:
Interestingly, I designed a COTS server back in January to handle at least 5k concurrent IMAP users, using best of breed components. If you or someone there has the necessary hardware skills, you could assemble this system and simply use it for NFS instead of Dovecot. The parts list: secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069985
Don't forget the Battery Backup Unit for RAID card !!!
Heh, thanks for the reminder Adrian. :)
I got to your email a little late--already corrected the omission. Yes, battery or flash backup for the RAID cache is always a necessity when doing write-back caching.
-- Stan
Putting XFS on a single RAID1 pair, as you seem to be describing above for the multiple "thin" node case, and hitting one node with parallel writes to multiple user mail dirs, you'll get less performance than EXT3/4 on that mirror pair--possibly less than half, depending on the size of the disks and thus the number of AGs created. The 'secret' to XFS performance with this workload is concatenation of spindles. Without it you can't spread the AGs--thus directories, thus parallel file writes--horizontally across the spindles--and this is the key. By spreading AGs 'horizontally' across the disks in a concat, instead of 'vertically' down a striped array, you accomplish two important things:
- You dramatically reduce disk head seeking by using the concat array. With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs evenly spaced vertically down each disk in the array, following the stripe pattern. Each user mailbox is stored in a different directory. Each directory was created in a different AG. So if you have 96 users writing their dovecot index concurrently, you have at worst case a minimum 192 head movements occurring back and forth across the entire platter of each disk, and likely not well optimized by TCQ/NCQ. Why 192 instead of 96? The modification time in the directory metadata must be updated for each index file, among other things.
Does the XFS allocator automatically distribute AGs in this way even when disk usage is extremely light, i.e., a freshly formatted system with user directories initially created, and then the actual mailbox contents copied into them?
If this is indeed the case, then what you describe is a wondrous revelation, since you're scaling out the number of simultaneous metadata reads+writes/second as you add RAID1 pairs, if my understanding of this is correct. I'm assuming, of course (I should look at the code), that metadata locks imposed by the filesystem "distribute" as the number of pairs increases - if it's all just one Big Lock, then that wouldn't be the case.
Forgive my laziness, as I could just experiment and take a look at the on-disk structures myself, but I don't have four empty drives handy to experiment with.
The bandwidth improvements due to striping (RAID0/5/6 style) are no help for metadata-intensive IO loads, and probably of little value even for mdbox loads, I suspect, unless the mdbox max size is set to something pretty large, no?
Have you tried other filesystems and seen if they distribute metadata in a similarly efficient and scalable manner across concatenated drive sets?
Is there ANY point to using striping at all, a la "RAID10", in this scenario? I'd have thought just making as many RAID1 pairs out of your drives as possible would be the ideal strategy - is this not the case?
=R=
On 4/7/2012 3:45 PM, Robin wrote:
Putting XFS on a single RAID1 pair, as you seem to be describing above for the multiple "thin" node case, and hitting one node with parallel writes to multiple user mail dirs, you'll get less performance than EXT3/4 on that mirror pair--possibly less than half, depending on the size of the disks and thus the number of AGs created. The 'secret' to XFS performance with this workload is concatenation of spindles. Without it you can't spread the AGs--thus directories, thus parallel file writes--horizontally across the spindles, and this is the key. By spreading AGs 'horizontally' across the disks in a concat, instead of 'vertically' down a striped array, you accomplish two important things:
- You dramatically reduce disk head seeking by using the concat array. With XFS on a RAID10 array of 24 2TB disks you end up with 24 AGs evenly spaced vertically down each disk in the array, following the stripe pattern. Each user mailbox is stored in a different directory. Each directory was created in a different AG. So if you have 96 users writing their Dovecot index concurrently, you have, worst case, a minimum of 192 head movements occurring back and forth across the entire platter of each disk, and likely not well optimized by TCQ/NCQ. Why 192 instead of 96? The modification time in the directory metadata must be updated for each index file, among other things.
Does the XFS allocator automatically distribute AGs in this way even when disk usage is extremely light, i.e., a freshly formatted system with user directories initially created, and then the actual mailbox contents copied into them?
It doesn't distribute AGs. A static number of AGs is created during mkfs.xfs. The inode64 allocator round-robins new directory creation across the AGs, and does the same with files created in those directories. Having the directory metadata and file extents in the same AG decreases head movement, and thus seek latency, for mixed metadata/extent high-IOPS workloads.
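A quick way to watch that round-robin in action, if you ever do get some spare drives (paths and user names here are hypothetical, and assume the filesystem is mounted with inode64):

$ for u in alice bob carol dave; do mkdir /srv/mail/$u; dd if=/dev/zero of=/srv/mail/$u/msg bs=64k count=1; done
$ xfs_bmap -v /srv/mail/*/msg

The AG column of xfs_bmap -v should show each user's file landing in a different allocation group--i.e., on a different RAID1 pair in the concat.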
If this is indeed the case, then what you describe is a wondrous revelation, since you're scaling out the number of simultaneous metadata reads+writes/second as you add RAID1 pairs, if my understanding of this is correct.
Correct. And adding more space and IOPS is uncomplicated. No chunk calculations, no restriping of the array. You simply grow the md linear array by adding the new disk device, then grow XFS to add the new free space to the filesystem. AFAIK this can be done infinitely, theoretically. I'm guessing md has a device count limit somewhere. If not, your bash line buffer might. ;)
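For example (device names and mount point hypothetical), adding one more mirror pair looks roughly like:

$ mdadm --create /dev/md7 --level=1 --raid-devices=2 /dev/sdn /dev/sdo
$ mdadm --grow /dev/md0 --add /dev/md7
$ xfs_growfs /srv/mail

The first command builds the new RAID1 pair, the second appends it to the linear array, and the third expands the mounted filesystem into the new space.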
I'm assuming, of course (I should look at the code), that metadata locks imposed by the filesystem "distribute" as the number of pairs increases - if it's all just one Big Lock, then that wouldn't be the case.
XFS locking is done as minimally as possible and is insanely fast. I've not come across any reported performance issues relating to it. And yes, any single metadata lock will occur in a single AG on one mirror pair using the concat setup.
Forgive my laziness, as I could just experiment and take a look at the on-disk structures myself, but I don't have four empty drives handy to experiment with.
Don't sweat it. All of this stuff is covered in the XFS Filesystem Structure Guide, exciting reading if you enjoy a root canal while watching snails race: http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html...
The bandwidth improvements due to striping (RAID0/5/6 style) are no help for metadata-intensive IO loads, and probably of little value even for mdbox loads, I suspect, unless the mdbox max size is set to something pretty large, no?
The problem with striped parity RAID is not allocation, which takes place in free space and is pretty fast. The problem is the extra read seeks and bandwidth of the RMW cycle when you modify an existing stripe. Updating a single flag in a Dovecot index causes md or the hardware RAID controller to read the entire stripe into buffer space or RAID cache, modify the flag byte, recalculate parity, then write the whole stripe and parity block back out across all the disks.
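To put rough numbers on it (a hypothetical 6-drive RAID5 with a 64KB chunk, and the full-stripe behavior described above): the stripe holds 5 x 64KB = 320KB of data plus a 64KB parity chunk. Flipping one 4KB flag means reading ~320KB, recomputing parity, and writing ~384KB back across all 6 spindles--roughly 100:1 write amplification to change 4KB.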
With a linear concat of RAID1 pairs we're simply rewriting a single 4KB filesystem block, maybe only a single 512B sector. I'm at the edge of my knowledge here. I don't know exactly how Timo does the index updates. Regardless of the method, the index update is light years faster with the concat setup as there is no RMW and full stripe writeback as with the RAID5/6 case.
Have you tried other filesystems and seen if they distribute metadata in a similarly efficient and scalable manner across concatenated drive sets?
EXT, any version, does not. ReiserFS does not. Both require disk striping to achieve any parallelism. With concat they both simply start writing at the beginning sectors of the first RAID1 pair, and 4 years later maybe reach the last pair as they fill up the volume. ;) JFS has a more advanced allocation strategy than EXT or ReiserFS, though not as advanced as XFS. I've never read of a concat example with JFS and I've never tested it. It's all but a dead filesystem at this point anyway--fewer than two dozen commits in 8 years last I checked, and those were simple bug fixes and changes to keep it building on new kernels. If it's not suffering bit rot now, I'm sure it will be in the near future.
Is there ANY point to using striping at all, a la "RAID10", in this scenario? I'd have thought just making as many RAID1 pairs out of your drives as possible would be the ideal strategy - is this not the case?
If you're using XFS, and your workload is overwhelmingly mail, RAID1+concat is the only way to fly, and it flies. If the workload is not mail, say large file streaming writes, then you're limited to 100-200MB/s, a single drive of throughput, as each file is written to a single directory on a single AG on a single disk. For streaming write performance you'll need striping. If you have many concurrent large streaming writes, you'll want to concat multiple striped arrays.
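For reference, the whole RAID1+concat build is only a handful of commands (disk names and mount point hypothetical; a 12-disk version of the baseline config):

$ mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
$ mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
$ mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdf /dev/sdg
$ mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdh /dev/sdi
$ mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sdj /dev/sdk
$ mdadm --create /dev/md6 --level=1 --raid-devices=2 /dev/sdl /dev/sdm
$ mdadm --create /dev/md0 --level=linear --raid-devices=6 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6
$ mkfs.xfs /dev/md0
$ mount -o inode64 /dev/md0 /srv/mail

With 12TB of net space, mkfs.xfs defaults to 1TB allocation groups, so you get the 12 AGs--two per mirror pair--without specifying anything.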
-- Stan
participants (11):
- Adrian Minta
- Charles Marcus
- Dirk Jahnke-Zumbusch
- Ed W
- Emmanuel Noobadmin
- Jan-Frode Myklebust
- Jim Lawson
- Maarten Bezemer
- Robin
- Stan Hoeppner
- Timo Sirainen