On Fri, Apr 13, 2012 at 07:33:19AM -0500, Stan Hoeppner wrote:
What I meant wasn't a drive throwing uncorrectable read errors, but the drives returning different data that each thinks is correct, or both sending the correct data with one copy getting corrupted on the fly. After reading the articles posted, maybe the correct term would be the controller receiving silently corrupted data, say due to a bad cable on one of them.
This simply can't happen. What articles are you referring to? If the author is stating what you say above, he simply doesn't know what he's talking about.
It has happened to me, with RAID5 not RAID1. It was a firmware bug in the RAID controller that caused the array to become silently corrupted. The HW reported everything green -- but the filesystem was reporting lots of strange errors. This LUN was part of a larger filesystem striped over multiple LUNs, so parts of the fs were OK, while other parts were corrupt.
It was this bug:
http://delivery04.dhe.ibm.com/sar/CMA/SDA/02igj/7/ibm_fw1_ds4kfc_07605200_an...
- Fix 432525 - CR139339 Data corruption found on drive after reconstruct from GHSP (Global Hot Spare)
<snip>
In closing, I'll simply say this: If hardware, whether a mobo-down SATA chip or a $100K SGI SAN RAID controller, allowed silent data corruption or transmission to occur, there would be no storage industry, and we'd all still be using pen and paper. The questions you're fretting and asking about were solved by hardware and software engineers decades ago.
Look at what the plans are for your favorite fs:
http://www.youtube.com/watch?v=FegjLbCnoBw
They're planning on doing metadata checksumming to be sure they don't receive corrupted metadata from the backend storage, and say that data validation is a storage subsystem *or* application problem.
Hardly a solved problem..
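To make the metadata checksumming idea concrete, here is a minimal sketch of what it buys you. This is not the filesystem's actual on-disk format or code: the function names `seal` and `verify`, the 4-byte little-endian prefix, and the choice of plain CRC32 from Python's zlib are all my own assumptions for illustration. The point is only that a checksum stored alongside the block lets the reader notice when the backend hands back something other than what was written, even if the hardware reports everything green.

```python
import struct
import zlib


def seal(payload: bytes) -> bytes:
    """Prefix a metadata block with a CRC32 of its payload (hypothetical layout)."""
    return struct.pack("<I", zlib.crc32(payload)) + payload


def verify(block: bytes) -> bytes:
    """Return the payload, or raise ValueError if the stored CRC does not match."""
    (stored,) = struct.unpack("<I", block[:4])
    payload = block[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("metadata checksum mismatch")
    return payload


if __name__ == "__main__":
    block = seal(b"inode 42: size=4096")

    # Simulate silent corruption somewhere between write and read: flip one bit.
    damaged = bytearray(block)
    damaged[-1] ^= 0x01

    verify(block)  # passes silently
    try:
        verify(bytes(damaged))
    except ValueError as err:
        print("detected:", err)
```

Note that this only detects the corruption, it does not repair it, and it covers metadata only. File data would still need validation somewhere else in the stack.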
-jf