On 13/04/2012 13:33, Stan Hoeppner wrote:
What I meant wasn't the drive throwing uncorrectable read errors but the drives are returning different data that each think is correct or both may have sent the correct data but one of the set got corrupted on the fly. After reading the articles posted, maybe the correct term would be the controller receiving silently corrupted data, say due to bad cable on one.
This simply can't happen. What articles are you referring to? If the author is stating what you say above, he simply doesn't know what he's talking about.
It quite clearly can??!
Just grab your drive, lever the connector off a little until the connection is flaky, and off you go. *THIS* type of problem I have heard of, and you can find examples easily with a quick Google search of any hobbyist storage board. Other very common examples are failing PSUs and other interference causing explicit disk errors (and once the error rate goes up, some errors will make it past the checksum).
Note this is NOT what I was originally asking about. My interest is more in the case where the hardware is working reliably and, as you agree, the error levels are vastly lower. However, it would be incredibly foolish to claim that it's not trivial to construct a scenario where bad hardware causes plenty of silent corruption.
If the controller simply returns the fastest result, it could be the bad sector and that doesn't protect the integrity of the data right?
I already answered this in a previous post.
Not obviously?!
I will also add my understanding that Linux software RAID 1, 5 and 6 *DO NOT* read all disks on a normal read and hence will not be aware when disks hold different data. In fact, with software RAID you need to run a regular "scrub" job to check this consistency.
I also believe that most commodity hardware RAID implementations work exactly the same way and that a background scrub is needed to detect inconsistent arrays. However, feel free to correct that understanding.
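For anyone following along, here is a rough sketch of what such a scrub job looks like with Linux md. It relies on the md sysfs files sync_action and mismatch_cnt; the array name md0 and the helper names are only assumptions for illustration, and writing to sync_action needs root.

```python
from pathlib import Path


def start_scrub(md_name="md0", sysfs_root="/sys/block"):
    """Ask the md driver to start a read-only consistency check.

    md_name is an assumed array name; adjust it for your system.
    """
    action = Path(sysfs_root) / md_name / "md" / "sync_action"
    action.write_text("check\n")


def read_mismatch_count(md_name="md0", sysfs_root="/sys/block"):
    """Return the mismatch counter reported by the md driver."""
    counter = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(counter.read_text().strip())
```

A cron job could call start_scrub() periodically and raise an alert if read_mismatch_count() is non-zero once the check has finished.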
if the controller gets 1st half from one drive and 2nd half from the other drive to speed up performance, we could still get the corrupted half and the controller itself still can't tell if the sector it got was corrupted isn't it?
No, this is not correct.
I definitely think you are wrong and Emmanuel is right?
If the controller gets a good read from the disk then it will trust that read and will NOT check the result with the other disk (or parity in the case of RAID5/6). If that read was incorrect for some reason then the data will be passed as good.
If the controller compares the two sectors from the drives, it may be able to tell us something is wrong but there isn't anyway for it to know which one of the sector was a good read and which isn't, or is there?
Yes it can, and it does.
No, it definitely does not!! At least not with Linux software RAID, and I don't believe it does on commodity hardware controllers either! (You would be able to tell, because the disk IO would be doubled.)
Linux software RAID 1 isn't that smart: it reads only one disk and trusts the answer if the read did not trigger an error. It does not check the other disk except during an explicit disk scrub.
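To make that behaviour concrete, here is a toy model of it. This is NOT the md driver's code: the class name, the round-robin choice of disk and the dict-based "disks" are all assumptions made purely for illustration.

```python
class Raid1Model:
    """Toy RAID 1 pair: each disk is a dict of sector number -> bytes."""

    def __init__(self, sectors):
        self.disks = [dict(sectors), dict(sectors)]
        self.next_disk = 0

    def read(self, sector):
        # Normal read path: ask ONE disk and trust whatever it returns.
        disk = self.disks[self.next_disk]
        # Alternate disks (assumed policy, only to show both outcomes).
        self.next_disk = (self.next_disk + 1) % 2
        return disk[sector]

    def scrub(self):
        # Explicit scrub: read BOTH copies and list the sectors that differ.
        a, b = self.disks
        return sorted(s for s in a if a[s] != b[s])


array = Raid1Model({0: b"hello", 1: b"world"})
array.disks[1][1] = b"w0rld"   # silent corruption on the second disk
first = array.read(1)          # served by disk 0
second = array.read(1)         # served by disk 1, no error raised
mismatches = array.scrub()     # only the scrub notices the difference
```

In this model the second read hands back the corrupted copy without complaint, and the difference only shows up when scrub() compares the two disks.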
Emmanuel, Ed, we're at a point where I simply don't have the time nor inclination to continue answering these basic questions about the base level functions of storage hardware.
You mean those "answers" like: "I answered that in another thread" or "you need to read 'those' articles again"?
Referring to some unknown and hard-to-find previous emails is not the same as answering.
Also, you are wandering off on extreme tangents. The question is simple:
- Disk 1 Read good, checksum = A
- Disk 2 Read good, checksum = B
Disks are a RAID 1 pair. How do we know which disk is correct? Please specify the RAID 1 implementation and the mechanism used with any answer.
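Spelled out as a sketch, the problem looks like this. The function name and the idea of an independently stored CRC are hypothetical, used only to show what extra information would be needed; nothing here claims that any particular RAID 1 implementation keeps such a checksum.

```python
import zlib


def pick_good_copy(copy_a, copy_b, stored_crc=None):
    """Return the trustworthy copy, or None if there is no way to decide."""
    if copy_a == copy_b:
        return copy_a
    if stored_crc is None:
        # Plain mirror: two differing copies, both read "good", no tie-breaker.
        return None
    for candidate in (copy_a, copy_b):
        # Hypothetical independent checksum stored somewhere else.
        if zlib.crc32(candidate) == stored_crc:
            return candidate
    return None  # neither copy matches the independent checksum


good = b"payload"
bad = b"paylo4d"
undecided = pick_good_copy(good, bad)
resolved = pick_good_copy(bad, good, zlib.crc32(good))
```

With only the two copies to go on, the function has to give up; it can only pick a winner once a third, independent piece of information is supplied.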
To answer the questions you're asking will require me to teach you the basics of hardware signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet transmission error detection protocols, disk drive firmware error recovery routines, etc, etc, etc.
I really think not... A simple statement of:
- Each sector on disk has a certain sized checksum
- Controller checks checksum on read
- Sent back over SATA connection, with a certain sized checksum
- After that you are on your own vs corruption
...Should cover it I think?
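As a rough sketch of that chain: CRC-32 from zlib stands in for both the on-disk check and the link check here, which is an assumption for illustration only (real drives use their own ECC schemes), and all the function names are made up.

```python
import zlib


def disk_read(stored_payload, stored_crc):
    """Drive-side check: fail loudly if the medium returned bad bits."""
    if zlib.crc32(stored_payload) != stored_crc:
        raise IOError("uncorrectable read error")
    return stored_payload


def link_send(payload):
    """Sender attaches a link-level CRC to the frame."""
    return payload, zlib.crc32(payload)


def link_receive(payload, crc):
    """Receiver rejects frames that were damaged in flight."""
    if zlib.crc32(payload) != crc:
        raise IOError("link CRC error")
    return payload


sector = b"important data"
on_disk_crc = zlib.crc32(sector)
frame, crc = link_send(disk_read(sector, on_disk_crc))

# Damage in flight is caught by the link check.
try:
    link_receive(b"imp0rtant data", crc)
    caught_on_link = False
except IOError:
    caught_on_link = True

# Damage AFTER the last check (e.g. in host memory) is not covered by anything.
received = link_receive(frame, crc)
in_memory = received.replace(b"data", b"dat4")
```

Each check protects only its own hop. Once the data is past the last one, nothing in this chain would notice a change.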
In closing, I'll simply say this: If hardware, whether a mobo-down SATA chip, or a $100K SGI SAN RAID controller, allowed silent data corruption or transmission to occur, there would be no storage industry, and we'll all still be using pen and paper. The questions you're asking were solved by hardware and software engineers decades ago. You're fretting and asking about things that were solved decades ago.
So why are so many people getting excited about it now?
Note, there have been plenty of shoddy disk controller implementations
before today, i.e. there exists hardware on sale with *known* defects.
Despite that, the industry continues without collapse. Now you claim
that corruption which is silent, and which people only tend to notice
much later and under certain edge conditions, can't be possible
because it would cause the industry to collapse?
...Not buying your logic...
Ed W