On Fri, 13 Apr 2012, Ed W wrote:
> On 13/04/2012 13:33, Stan Hoeppner wrote:
>>> What I meant wasn't the drive throwing uncorrectable read errors, but the drives returning different data that each thinks is correct, or both having sent the correct data with one copy getting corrupted on the fly. After reading the articles posted, maybe the correct term would be the controller receiving silently corrupted data, say due to a bad cable on one.
>> This simply can't happen. What articles are you referring to? If the author is stating what you say above, he simply doesn't know what he's talking about.
> It quite clearly can??!
I totally agree with Ed here. Drives sure can, and sometimes really do, return different data without reporting errors. And data can get corrupted on any of the buses or chips it passes through.
The math about one error per 10^15 or 10^16 bits and all that stuff is not only about array sizes; it applies to every bit of data you transfer as well.
I've seen silent corruption on a few systems myself (luckily, only three times in a couple of years). Those systems were only in the 2-5 TB size category, substantially smaller than the 67 TB claimed elsewhere, yet statistically that is well within normal probability.
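To put some rough numbers on that, here's a quick back-of-the-envelope sketch (Python). It's just my own illustration: the error rates are the usual vendor-quoted orders of magnitude, I'm treating errors as independent, and I'm counting 1 TB as 8e12 bits. The exact figures aren't the point, the orders of magnitude are.

import math

def error_odds(terabytes, ber):
    """Expected number of bad bits, and the chance of at least one, when
    moving `terabytes` of data through a channel with bit error rate `ber`."""
    bits = terabytes * 8e12                    # 1 TB counted as 8e12 bits
    expected = bits * ber
    # 1 - (1 - ber)**bits, computed without losing precision for tiny ber
    p_any = -math.expm1(bits * math.log1p(-ber))
    return expected, p_any

for tb in (5, 67):
    for ber in (1e-15, 1e-16):
        expected, p_any = error_odds(tb, ber)
        print("%3d TB at BER %g: %.3f expected bad bits, "
              "%.1f%% chance of at least one" % (tb, ber, expected, p_any * 100))

At a few TB per scrub or backup run, those chances add up quickly over the lifetime of a system.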
Linux mdraid reads only one mirror as long as the drives don't return an error. That's easy to check: the aggregate read speed of a RAID1 can go way beyond a single drive's read speed, which would be impossible if every block were read from all mirrors. If the kernel had to read all (possibly more than two) mirrors, compare them, and make a decision based on that comparison, things would be horribly slow. Hardware RAID typically takes exactly the same approach; this goes for Areca, 3ware, and LSI, which cover most of the regular (i.e. non-SAN) professional hardware RAID setups.
If you don't believe it, don't take my word for it: test it yourself. Cleanly power down a RAID1 array, take the individual drives, put them into a simple desktop machine, and write different data to each of them using a raw disk-writing tool like dd. Then put the drives back into the RAID1 array, power it up, and re-read the information. You'll see data from both drives intermixed, as parts of the reads come from one disk and parts from the other. Only when you order the array to do a verification pass will it start screaming and yelling. At least, I hope it will...
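To make that concrete, here's a toy model (my own sketch in Python, nothing like md's real code; real md picks a leg based on head position and load, I just pick one at random). Reads are served from a single leg and never cross-checked, so diverging the legs behind the array's back goes unnoticed until a verify pass compares them:

import random

class ToyRaid1:
    """Toy model of a two-way mirror: each read is served from one leg
    and is never cross-checked against the other."""

    def __init__(self, size):
        self.legs = [bytearray(size), bytearray(size)]

    def write(self, offset, data):
        for leg in self.legs:                   # normal writes go to both legs
            leg[offset:offset + len(data)] = data

    def read(self, offset, length):
        leg = random.choice(self.legs)          # pick one leg, no comparison
        return bytes(leg[offset:offset + length])

    def verify(self):
        """The scrub/check pass: the only place the legs get compared."""
        return [i for i, (a, b) in enumerate(zip(*self.legs)) if a != b]

md = ToyRaid1(8)
md.write(0, b"AAAAAAAA")
md.legs[1][0:8] = b"BBBBBBBB"                   # "dd" straight onto one member

print(set(md.read(0, 8) for _ in range(20)))    # you'll (almost certainly) see both versions
print("mismatching offsets:", md.verify())      # only a verify pass notices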
But as explained elsewhere, silent corruption can occur in numerous places. Without an explicit checksumming/checking mechanism, there are indeed cases that will come back to haunt you unless you do regular scrubbing, or at least regular verification runs. Heck, that's why Linux mdadm ships with a cron job to do exactly that, and hardware RAID controllers have similar scheduling capabilities.
Of course, scrubbing/verification is not going to magically protect you from all problems, but at least you get a notification when it detects one.
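For Linux md specifically, you can also kick off such a check by hand. A small sketch (assuming an array named md0 and root privileges) that pokes the same sysfs knobs the distro cron jobs use:

import time

MD = "/sys/block/md0/md"                        # adjust to your array

def sysfs(name):
    with open(MD + "/" + name) as f:
        return f.read().strip()

with open(MD + "/sync_action", "w") as f:
    f.write("check\n")                          # start a read-and-compare pass

time.sleep(1)                                   # give md a moment to start
while sysfs("sync_action") != "idle":           # wait for the scrub to finish
    time.sleep(10)

mismatches = int(sysfs("mismatch_cnt"))
if mismatches:
    print("WARNING: %d mismatched sectors on md0, time to investigate" % mismatches)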
>>> If the controller compares the two sectors from the drives, it may be able to tell us something is wrong, but there isn't any way for it to know which one of the sectors was a good read and which wasn't, or is there?
>> Yes it can, and it does.
> No it definitely does not!! At least not with Linux software raid, and I don't believe on commodity hardware controllers either! (You would be able to tell, because the disk IO would be doubled.)
Obviously there is no way to tell which version of a story is correct unless you have a reason to trust one of the storytellers and distrust the other. You would have to add a checksum layer for that. (And hope the checksum isn't the part that got corrupted!)
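A minimal illustration of that point (again my own sketch, assuming a per-block SHA-256 digest stored somewhere on the side, not any real on-disk format): with two bare copies you can only detect a difference, but with a stored checksum you can also pick the copy that still matches it.

import hashlib

def pick_good_copy(copy_a, copy_b, stored_digest):
    """Two bare copies only let you detect a difference; a checksum stored
    alongside them lets you pick the copy that still matches."""
    if copy_a == copy_b:
        return copy_a                           # nothing to arbitrate
    for copy in (copy_a, copy_b):
        if hashlib.sha256(copy).hexdigest() == stored_digest:
            return copy                         # this one still checks out
    raise IOError("both copies (or the checksum itself) are corrupt")

good = b"important mail data"
digest = hashlib.sha256(good).hexdigest()       # recorded at write time
print(pick_good_copy(good, b"important mail d\xffta", digest))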
>> To answer the questions you're asking will require me to teach you the basics of hardware signaling protocols, SCSI, SATA, Fibre Channel, and Ethernet transmission error detection protocols, disk drive firmware error recovery routines, etc, etc, etc.
I'm quite familiar with the basics of those protocols. I'm also quite familiar with the flaws in several implementations of such "seemingly straightforward protocols". More often than not there's a pressing need to get a new device onto the market before the competition has something similar and you lose your advantage, and more often than not that results in suboptimal implementations of all those fine protocols and algorithms. And let's face it: flaws in error recovery routines often don't surface until someone actually needs those routines. As long as drives (or any other devices) are functioning as expected, everything is all right. But as soon as something starts to get flaky, error recovery has to kick in, and it may just as well fail to do the right thing.
Just consider the real-world analogy of politicians. They do or say something stupid every once in a while, and error recovery (a.k.a. damage control) has to kick in. But even those well-trained professionals, with decades of experience in the political arena, sometimes simply fail to do the right thing. They may have overlooked some pesky detail, or they may take actions that don't have the expected outcome, because... indeed, things work differently in damage-control mode, and the only law you can trust is physics: you always go down when you can't stay on your feet.
With hard drives, RAID controllers, mainboards, and data buses it's exactly the same. If _something_ isn't working as it should, how do we know which part of it we _can_ still trust?
>> In closing, I'll simply say this: if hardware, whether a mobo-down SATA chip or a $100K SGI SAN RAID controller, allowed silent corruption of data at rest or in transit, there would be no storage industry, and we'd all still be using pen and paper. The questions you're asking were solved by hardware and software engineers decades ago; you're fretting about things that were solved long ago.
Isn't it just "worked around" by adding more layers of checksumming and more redundancy into the mix? Don't believe the "storage industry" just because they tell you it's OK. It simply is not OK. You might want to ask people in the data and computing cluster business for their opinion on "storage industry professionals"...
Timo's suggestion to add checksums to mailboxes/metadata could help to (at least) report these kinds of failures. Re-reading from a different storage location, when available, could also recover data that got corrupted, but I'm not sure what the best way to handle such situations would be. If you know one of your storage locations has a corruption problem, you might want to switch it to read-only ASAP; automagically trying to recover might not be the best thing to do. Given all the different use cases, I think that should at least be configurable :-P
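Just to illustrate the idea (this is my own sketch, not Dovecot code; the two-location layout and the digest-recorded-at-delivery-time are assumptions): verify a message against a checksum stored when it was delivered, fall back to the other copy only on a mismatch, and don't rewrite anything automatically.

import hashlib

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def read_verified(primary_path, replica_path, digest_at_delivery):
    """Return a path to a copy of the message that still matches the checksum
    recorded at delivery time. Deliberately does not rewrite anything: if the
    primary is bad it only reports and falls back, leaving the recovery policy
    (read-only, repair, alert) up to the admin."""
    if sha256_of(primary_path) == digest_at_delivery:
        return primary_path
    print("checksum mismatch on %s, falling back to replica" % primary_path)
    if sha256_of(replica_path) == digest_at_delivery:
        return replica_path
    raise IOError("both copies fail verification; consider going read-only")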
-- Maarten