[Dovecot] Better to use a single large storage server or multiple smaller for mdbox?
Maarten Bezemer
mcbdovecot at robuust.nl
Fri Apr 13 22:10:04 EEST 2012
On Fri, 13 Apr 2012, Ed W wrote:
> On 13/04/2012 13:33, Stan Hoeppner wrote:
>>> What I meant wasn't the drive throwing uncorrectable read errors but
>>> the drives are returning different data that each think is correct or
>>> both may have sent the correct data but one of the set got corrupted
>>> on the fly. After reading the articles posted, maybe the correct term
>>> would be the controller receiving silently corrupted data, say due to
>>> bad cable on one.
>> This simply can't happen. What articles are you referring to? If the
>> author is stating what you say above, he simply doesn't know what he's
>> talking about.
> It quite clearly can??!
I totally agree with Ed here. Drives sure can and sometimes really do
return different data, without reporting errors. Also, data can get
corrupted on any of the busses or chips it passes through.
The math about 10^15 or 10^16 bit error rates and all that stuff is not
only about array sizes. It's also about how much data you transfer.
I've seen silent corruption on a few systems myself. (Luckily, only three
times in a couple of years.) Those systems were only in the 2TB-5TB size
category, which is substantially smaller than the 67TB claimed elsewhere.
Yet, statistically, that's well within normal probability levels.
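To put a rough number on those figures (my back-of-the-envelope, not
something from this thread, and the datasheet numbers are for *reported*
unrecoverable errors; the silent cases come on top of that):

  1 error per 10^15 bits read  =  1 per 1.25 * 10^14 bytes  ~=  1 per 125 TB read
  1 error per 10^16 bits read  ~=  1 per 1.25 PB read

So a 2TB-5TB system that gets read end to end a few dozen times over its
lifetime is already in that ballpark, before counting anything that can go
wrong on cables, controllers, or RAM.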
Linux mdraid only reads one mirror as long as the drives don't return an
error. That's easy to check: aggregate read speeds go well beyond what a
single drive can deliver. If the kernel had to read all (possibly more than
two) mirrors, compare them, and make a decision based on that comparison,
things would be horribly slow. Hardware raid typically takes exactly the
same approach. That goes for Areca, 3ware, and LSI, which cover most of the
regular (i.e. non-SAN) professional hardware raid setups.
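A quick, non-destructive way to see this for yourself (the names /dev/md0,
sdb and sdc are just examples for a two-disk raid1):

  # terminal 1: watch per-member traffic on the two mirror halves
  iostat -x sdb sdc 2
  # terminal 2: read a few GB straight off the array
  dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct
  # if md read *both* mirrors and compared them, sdb and sdc would each
  # show the full 4 GB; in practice the reads hit one member (or get
  # split between them) -- the traffic is never doubled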
If you don't believe it, don't take my word for it: test it yourself.
Cleanly power down a raid1 array, take the individual drives, put them into
a simple desktop machine, and write different data to both using some raw
disk writing tool like dd. Then put the drives back into the raid1 array,
power it up, and re-read the information. You'll see that data from both
drives gets intermixed, as parts of the reads come from one disk and parts
come from the other. Only when you order the raid array to do a
verification pass will it start screaming and yelling. At least, I hope it
will...
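For the impatient, the same experiment can be done in place with the array
stopped instead of moving the drives to another box. A rough sketch, on a
scratch array only; /dev/sdX, /dev/sdY, the md0 name and the 1 GiB offset
are made-up examples (the offset just has to land inside the data area,
well past the md superblock):

  mdadm --stop /dev/md0
  dd if=/dev/zero    of=/dev/sdX bs=1M count=16 seek=1024 oflag=direct
  dd if=/dev/urandom of=/dev/sdY bs=1M count=16 seek=1024 oflag=direct
  mdadm --assemble /dev/md0 /dev/sdX /dev/sdY
  dd if=/dev/md0 of=/dev/null bs=1M            # reads back fine, no error reported
  echo check > /sys/block/md0/md/sync_action   # now ask for a verification pass
  cat /proc/mdstat                             # wait for the check to finish...
  cat /sys/block/md0/md/mismatch_cnt           # ...and here the mismatches show up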
But as explained elsewhere, silent corruption can occur at numerous
places. If you don't have an explicit checksumming/checking mechanism,
there are indeed cases that will haunt you unless you do regular scrubbing
or at least regular verification runs. Heck, that's why distro mdadm
packages come with cron jobs to do just that, and hardware raid controllers
have similar scheduling capabilities.
Of course, scrubbing/verification is not going to magically protect you
from all problems. But you will at least get a notification when it detects
them.
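For md, the scheduled run can be as simple as the following (a sketch of
what the distro cron jobs boil down to; the schedule, array name and mail
address are examples, not the literal packaged file):

  # /etc/cron.d/md-scrub (example)
  # kick off a check in the small hours of the 1st of the month...
  30 2 1 * *  root  echo check > /sys/block/md0/md/sync_action
  # ...and complain later if the pass found mirrors that disagree
  30 8 1 * *  root  [ "$(cat /sys/block/md0/md/mismatch_cnt)" = "0" ] || echo "md0: mirrors disagree" | mail -s "md0 scrub" root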
>>> If the controller compares the two sectors from the drives, it may be
>>> able to tell us something is wrong but there isn't anyway for it to
>>> know which one of the sector was a good read and which isn't, or is
>>> there?
>> Yes it can, and it does.
>
> No it definitely does not!! At least not with linux software raid and I don't
> believe on commodity hardware controllers either! (You would be able to tell
> because the disk IO would be doubled)
Obviously, with two copies that disagree, there is no way to tell which
version of the story is correct unless you are biased to believe one of the
storytellers and distrust the other. You would have to add a checksum layer
for that. (And hope the checksum itself isn't the part that got corrupted!)
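That layer doesn't have to be fancy to be useful; even an out-of-band
manifest gives you a third storyteller to arbitrate with (the paths below
are made-up examples):

  cd /srv/important-data
  find . -type f -print0 | xargs -0 sha256sum > /var/lib/manifests/data.sha256
  # later, from the same top directory on either copy of the data:
  cd /srv/important-data
  sha256sum --quiet -c /var/lib/manifests/data.sha256 \
      || echo "this copy (or the manifest itself) is the corrupted one"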
>> To answer the questions
>> you're asking will require me to teach you the basics of hardware
>> signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet
>> transmission error detection protocols, disk drive firmware error
>> recovery routines, etc, etc, etc.
I'm quite familiar with the basics of these protocols. I'm also quite
familiar with the flaws in several implementations of "seemingly
straightforward protocols". More often than not, there's a pressing need
to get new devices onto the market before the competition has something
similar and you lose your advantage, and that results in suboptimal
implementations of all those fine protocols and algorithms. And let's face
it: flaws in error recovery routines often don't surface until someone
actually needs those routines. As long as drives (or any other device) are
functioning as expected, everything is all right. But as soon as something
starts to get flaky, error recovery has to kick in, and it may just as well
fail to do the right thing.
Just consider the real-world analogy of politicians. They do or say
something stupid every once in a while, and error recovery (a.k.a. damage
control) has to kick in. But even those well-trained professionals, with
decades of experience in the political arena, sometimes simply fail to do
the right thing. They may have overlooked some pesky detail, or they may
take actions that don't have the expected outcome because... indeed, things
work differently in damage-control mode, and the only law you can trust is
physics: you always go down when you can't stay on your feet.
With hard drives, raid controllers, mainboards, data buses, it's exactly
the same. If _something_ isn't working as it should, how should we know
which part of it we _can_ trust?
>> In closing, I'll simply say this: If hardware, whether a mobo-down SATA
>> chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
>> or transmission to occur, there would be no storage industry, and we'll
>> all still be using pen and paper. The questions you're asking were
>> solved by hardware and software engineers decades ago. You're fretting
>> and asking about things that were solved decades ago.
Isn't it just "worked around" by adding more layers of checksumming and
more redundancy into the mix? Don't believe the "storage industry" just
because they tell you it's OK. It simply is not OK. You might want to ask
people in the data and computing cluster business for their opinion on
"storage industry professionals"...
Timo's suggestion to add checksums to mailboxes/metadata could help to
(at least) report these types of failures. Re-reading from different
storage, when available, could also recover the data that got corrupted,
but I'm not sure what the best way to handle these situations would be. If
you know there is a corruption problem on one of your storage locations,
you might want to switch it to read-only ASAP. Automagically trying to
recover might not be the best thing to do. Given all the different use
cases, I think that should at least be configurable :-P
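Until something like that exists inside Dovecot, the "report, don't
auto-repair" policy can at least be approximated out of band. A sketch, not
a Dovecot feature; the manifest path and the /srv/mail mountpoint are
made-up examples, and in practice you would only checksum files that are
supposed to be static:

  cd /srv/mail || exit 1
  if ! sha256sum --quiet -c /var/lib/manifests/mailstore.sha256; then
      logger -p mail.crit "mail store failed checksum verification"
      mount -o remount,ro /srv/mail    # stop further writes; let a human decide
  fi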
--
Maarten