[Dovecot] Better to use a single large storage server or multiple smaller for mdbox?
Maarten Bezemer
mcbdovecot at robuust.nl
Fri Apr 13 22:10:04 EEST 2012
On Fri, 13 Apr 2012, Ed W wrote:
> On 13/04/2012 13:33, Stan Hoeppner wrote:
>>> What I meant wasn't the drive throwing uncorrectable read errors but
>>> the drives are returning different data that each think is correct or
>>> both may have sent the correct data but one of the set got corrupted
>>> on the fly. After reading the articles posted, maybe the correct term
>>> would be the controller receiving silently corrupted data, say due to
>>> bad cable on one.
>> This simply can't happen. What articles are you referring to? If the
>> author is stating what you say above, he simply doesn't know what he's
>> talking about.
> It quite clearly can??!
I totally agree with Ed here. Drives sure can and sometimes really do
return different data, without reporting errors. Also, data can get
corrupted on any of the busses or chips it passes through.
The math about 10^15 or 10^16 bit error rates and all that stuff is not
only about array sizes. It's also about how much data you transfer.
I've seen silent corruption on a few systems myself. (Luckily, only three
times in a couple of years.) Those systems were only in the 2TB-5TB size
category, which is substantially smaller than the 67TB claimed elsewhere.
Yet, statistically, that's well within normal probability levels.
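To put a rough number on those figures (my back-of-the-envelope, not
something from this thread, and the datasheet numbers are for *reported*
unrecoverable errors; the silent cases come on top of that):

  1 error per 10^15 bits read  =  1 per 1.25 * 10^14 bytes  ~=  1 per 125 TB read
  1 error per 10^16 bits read  ~=  1 per 1.25 PB read

So a 2TB-5TB system that gets read end to end a few dozen times over its
lifetime is already in that ballpark, before counting anything that can go
wrong on cables, controllers, or RAM.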
Linux mdraid only reads one mirror as long as the drives don't return an
error. That's easy to check: aggregate read speeds go well beyond what a
single drive can deliver. If the kernel had to read all (possibly more than
two) mirrors, compare them, and make a decision based on that comparison,
things would be horribly slow. Hardware raid typically takes exactly the
same approach. That goes for Areca, 3ware, and LSI, which cover most of the
regular (i.e. non-SAN) professional hardware raid setups.
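A quick, non-destructive way to see this for yourself (the names /dev/md0,
sdb and sdc are just examples for a two-disk raid1):

  # terminal 1: watch per-member traffic on the two mirror halves
  iostat -x sdb sdc 2
  # terminal 2: read a few GB straight off the array
  dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct
  # if md read *both* mirrors and compared them, sdb and sdc would each
  # show the full 4 GB; in practice the reads hit one member (or get
  # split between them) -- the traffic is never doubled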
If you don't believe it, don't take my word for it: test it yourself.
Cleanly power down a raid1 array, take the individual drives, put them into
a simple desktop machine, and write different data to both using some raw
disk writing tool like dd. Then put the drives back into the raid1 array,
power it up, and re-read the information. You'll see that data from both
drives gets intermixed, as parts of the reads come from one disk and parts
come from the other. Only when you order the raid array to do a
verification pass will it start screaming and yelling. At least, I hope it
will...
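For the impatient, the same experiment can be done in place with the array
stopped instead of moving the drives to another box. A rough sketch, on a
scratch array only; /dev/sdX, /dev/sdY, the md0 name and the 1 GiB offset
are made-up examples (the offset just has to land inside the data area,
well past the md superblock):

  mdadm --stop /dev/md0
  dd if=/dev/zero    of=/dev/sdX bs=1M count=16 seek=1024 oflag=direct
  dd if=/dev/urandom of=/dev/sdY bs=1M count=16 seek=1024 oflag=direct
  mdadm --assemble /dev/md0 /dev/sdX /dev/sdY
  dd if=/dev/md0 of=/dev/null bs=1M            # reads back fine, no error reported
  echo check > /sys/block/md0/md/sync_action   # now ask for a verification pass
  cat /proc/mdstat                             # wait for the check to finish...
  cat /sys/block/md0/md/mismatch_cnt           # ...and here the mismatches show up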
But as explained elsewhere, silent corruption can occur at numerous
places. If you don't have an explicit checksumming/checking mechanism,
there are indeed cases that will haunt you unless you do regular scrubbing
or at least regular verification runs. Heck, that's why distro mdadm
packages come with cron jobs to do just that, and hardware raid controllers
have similar scheduling capabilities.
Of course, scrubbing/verification is not going to magically protect you
from all problems. But you will at least get a notification when it detects
them.
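For md, the scheduled run can be as simple as the following (a sketch of
what the distro cron jobs boil down to; the schedule, array name and mail
address are examples, not the literal packaged file):

  # /etc/cron.d/md-scrub (example)
  # kick off a check in the small hours of the 1st of the month...
  30 2 1 * *  root  echo check > /sys/block/md0/md/sync_action
  # ...and complain later if the pass found mirrors that disagree
  30 8 1 * *  root  [ "$(cat /sys/block/md0/md/mismatch_cnt)" = "0" ] || echo "md0: mirrors disagree" | mail -s "md0 scrub" root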
>>> If the controller compares the two sectors from the drives, it may be
>>> able to tell us something is wrong but there isn't anyway for it to
>>> know which one of the sector was a good read and which isn't, or is
>>> there?
>> Yes it can, and it does.
>
> No it definitely does not!! At least not with linux software raid and I don't
> believe on commodity hardware controllers either! (You would be able to tell
> because the disk IO would be doubled)
Obviously, with two copies that disagree, there is no way to tell which
version of the story is correct unless you are biased to believe one of the
storytellers and distrust the other. You would have to add a checksum layer
for that. (And hope the checksum itself isn't the part that got corrupted!)
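That layer doesn't have to be fancy to be useful; even an out-of-band
manifest gives you a third storyteller to arbitrate with (the paths below
are made-up examples):

  cd /srv/important-data
  find . -type f -print0 | xargs -0 sha256sum > /var/lib/manifests/data.sha256
  # later, from the same top directory on either copy of the data:
  cd /srv/important-data
  sha256sum --quiet -c /var/lib/manifests/data.sha256 \
      || echo "this copy (or the manifest itself) is the corrupted one"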
>> To answer the questions
>> you're asking will require me to teach you the basics of hardware
>> signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet
>> transmission error detection protocols, disk drive firmware error
>> recovery routines, etc, etc, etc.
I'm quite familiar with the basics of these protocols. I'm also quite
familiar with the flaws in several implementations of "seemingly
straightforward protocols". More often than not, there's a pressing need
to get new devices onto the market before the competition has something
similar and you lose your advantage, and that results in suboptimal
implementations of all those fine protocols and algorithms. And let's face
it: flaws in error recovery routines often don't surface until someone
actually needs those routines. As long as drives (or any other device) are
functioning as expected, everything is all right. But as soon as something
starts to get flaky, error recovery has to kick in, and it may just as well
fail to do the right thing.
Just consider the real-world analogy of politicians. They do or say
something stupid every once in a while, and error recovery (a.k.a. damage
control) has to kick in. But even those well-trained professionals, with
decades of experience in the political arena, sometimes simply fail to do
the right thing. They may have overlooked some pesky detail, or they may
take actions that don't have the expected outcome because... indeed, things
work differently in damage-control mode, and the only law you can trust is
physics: you always go down when you can't stay on your feet.
With hard drives, raid controllers, mainboards, data buses, it's exactly
the same. If _something_ isn't working as it should, how should we know
which part of it we _can_ trust?
>> In closing, I'll simply say this: If hardware, whether a mobo-down SATA
>> chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
>> or transmission to occur, there would be no storage industry, and we'll
>> all still be using pen and paper. The questions you're asking were
>> solved by hardware and software engineers decades ago. You're fretting
>> and asking about things that were solved decades ago.
Isn't it just "worked around" by adding more layers of checksumming and
more redundancy into the mix? Don't believe the "storage industry" just
because they tell you it's OK. It simply is not OK. You might want to ask
people in the data and computing cluster business for their opinion on
"storage industry professionals"...
Timo's suggestion to add checksums to mailboxes/metadata could help to
(at least) report these types of failures. Re-reading from different
storage, when available, could also recover the data that got corrupted,
but I'm not sure what the best way to handle these situations would be. If
you know there is a corruption problem on one of your storage locations,
you might want to switch it to read-only ASAP. Automagically trying to
recover might not be the best thing to do. Given all the different use
cases, I think that should at least be configurable :-P
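Until something like that exists inside Dovecot, the "report, don't
auto-repair" policy can at least be approximated out of band. A sketch, not
a Dovecot feature; the manifest path and the /srv/mail mountpoint are
made-up examples, and in practice you would only checksum files that are
supposed to be static:

  cd /srv/mail || exit 1
  if ! sha256sum --quiet -c /var/lib/manifests/mailstore.sha256; then
      logger -p mail.crit "mail store failed checksum verification"
      mount -o remount,ro /srv/mail    # stop further writes; let a human decide
  fi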
--
Maarten