On 4/14/2012 5:00 AM, Ed W wrote:
On 14/04/2012 04:48, Stan Hoeppner wrote:
On 4/13/2012 10:31 AM, Ed W wrote:
You mean those "answers" like: "you need to read 'those' articles again"
Referring to some unknown and hard to find previous emails is not the same as answering?
No, referring to this:
On 4/12/2012 5:58 AM, Ed W wrote:
The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own.
Is it not a correct assumption that you read this in articles? If you read this in books, scrolls, or chiseled tablets, my apologies for assuming it was articles.
WHAT?!! The original context was that you wanted me to learn some very specific thing that you accused me of misunderstanding, and then it turns out that the thing I'm supposed to learn comes from re-reading every email, every blog post, every video, every slashdot post, every wiki, every ... that mentions ZFS's reason for including end to end checksumming?!!
No, the original context was your town crier statement that the sky is falling due to silent data corruption. I pointed out that this is not the case, currently, that most wouldn't see this until quite a few years down the road. I provided facts to back my statement, which you didn't seem to grasp or comprehend. I pointed this out and your top popped with a cloud of steam.
Please stop wasting our time and get specific
Whose time am I wasting, Ed? You're the primary person on this list who wastes everyone's time with these drawn-out threads, usually unrelated to Dovecot. I have been plenty specific. The problem is you lack the knowledge and understanding of hardware communication. You're upset because I'm not pointing out the knowledge you seem to lack? Is that not a waste of everyone's time? Would that not be even "more insulting"? Causing even more excited/heated emails from you?
You have taken my email, which contained a specific question that has been asked of you multiple times now, and yet you insist on only answering irrelevant details with a pointed and personal dig in each answer. The rudeness is unnecessary, and your evasiveness does not fill me with confidence that you actually know the answer...
Ed, I have not been rude. I've been attempting to prevent you dragging us into the mud, which you've done, as you often do. How specific would you like me to get? This is what you seem to be missing:
Drives perform per-sector CRC before transmitting data to the HBA. ATA, SATA, SCSI, SAS, and Fibre Channel devices and HBAs all perform CRC on wire data. The PCI/PCI-X/PCIe buses/channels and Southbridge all perform CRC on wire data. HyperTransport and Intel's proprietary links also perform CRC on wire transmissions. Server memory is protected by ECC, some by ChipKill which can tolerate double bit errors.
With today's systems and storage densities, with error correcting code on all data paths within the system, and on the drives themselves, "silent data corruption" is not an issue--in the absence of defective hardware or a bug, which are not relevant to the discussion.
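For readers unfamiliar with how a per-link CRC check catches a transmission error, here is a minimal sketch. It assumes Python's zlib CRC-32 as a stand-in; real SATA, SAS, and PCIe links compute their own CRCs in hardware, and the function names and the 512-byte "sector" are hypothetical.

```python
import zlib


def send_with_crc(payload: bytes) -> tuple[bytes, int]:
    """Sender side of one link: attach a CRC-32 to the payload."""
    return payload, zlib.crc32(payload)


def receive(payload: bytes, crc: int) -> bool:
    """Receiver side: recompute the CRC and compare with the one sent."""
    return zlib.crc32(payload) == crc


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit error on the wire (illustration only)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# A hypothetical 512-byte sector travelling across one link.
sector = bytes(range(256)) * 2
payload, crc = send_with_crc(sector)

ok_clean = receive(payload, crc)                    # intact transfer
ok_corrupt = receive(flip_bit(payload, 1234), crc)  # one bit flipped in transit
```

A single flipped bit changes the CRC, so the receiver can reject the frame and request a retransmit. Note that each such check only covers the one link it runs on.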
For the benefit of anyone reading this via email archives or whatever, I think the conclusion we have reached is that: modern systems are now a) a complex sum of pieces, any of which can cause an error to be injected,
Errors occur all the time. And they're corrected nearly all of the time, on modern complex systems. Silent errors do not occur frequently, usually not at all, on most modern systems.
b) the level of error correction which was originally specified as being sufficient is now starting to be reached in real systems,
FSVO 'real systems'. The few occurrences of "silent data corruption" I'm aware of have been documented in academic papers published by researchers working at taxpayer-funded institutions. In the case of CERN, the problem was a firmware bug in the Western Digital drives that caused an issue with the 3Ware controllers. This kind of thing happens when using COTS DIY hardware in the absence of proper load validation testing. So this case doesn't really fit the Henny Penny silent data corruption scenario as a firmware bug caused it. One that should have been caught and corrected during testing.
In the other cases I'm aware of, all were HPC systems which generated SDC under extended high loads, and these SDCs nearly all occurred somewhere other than the storage systems--CPUs, RAM, interconnect, etc. HPC apps tend to run the CPUs, interconnects, storage, etc, at full bandwidth for hours at a time, across tens of thousands of nodes, so the probability of SDC is much higher simply due to scale.
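The scale argument can be sketched with a little arithmetic. This assumes each node independently has the same small chance of a silent error during a run; the per-node probability and the node count below are made-up illustration values, not measurements.

```python
def prob_at_least_one(p_per_node: float, nodes: int) -> float:
    """Chance that at least one of `nodes` independent nodes hits an error."""
    return 1.0 - (1.0 - p_per_node) ** nodes


# Hypothetical per-node probability of one silent error during a run.
P_HYPOTHETICAL = 1e-6

single = prob_at_least_one(P_HYPOTHETICAL, 1)        # one standalone server
cluster = prob_at_least_one(P_HYPOTHETICAL, 20_000)  # a large HPC cluster
```

With these assumed numbers the cluster-wide chance is roughly 20,000 times the single-node chance, which is the point about scale: the per-node risk is unchanged, but a large cluster sees many more opportunities for it to occur.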
possibly even consumer systems.
Possibly? If you're going to post pure conjecture why not say "possibly even iPhones or Androids"? There's no data to back either claim. Stick to the facts.
There is no "solution", however, the first step is to enhance "detection". Various solutions have been proposed, all increase cost, computation or have some disadvantage - however, one of the more promising detection mechanisms is an end to end checksum, which will then have the effect of augmenting ALL the steps in the chain, not just one specific step. As of today, only a few filesystems offer this, roll on more adopting it
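As an illustration of the end to end checksum idea, here is a minimal sketch. It assumes SHA-256 and an in-memory dictionary standing in for the whole storage stack; the class and method names are hypothetical, and this is not how ZFS or Btrfs implement their checksumming.

```python
import hashlib


class ChecksummedStore:
    """Toy store that keeps a digest with each block and verifies it on read."""

    def __init__(self) -> None:
        self._blocks: dict[str, tuple[bytes, bytes]] = {}  # key -> (data, digest)

    def write(self, key: str, data: bytes) -> None:
        # Checksum is computed at the top of the stack, before the data
        # travels through any lower layer.
        self._blocks[key] = (data, hashlib.sha256(data).digest())

    def read(self, key: str) -> bytes:
        data, digest = self._blocks[key]
        # Verification happens after the data has come back up through
        # every layer, so damage introduced anywhere below is detected.
        if hashlib.sha256(data).digest() != digest:
            raise IOError(f"checksum mismatch for {key!r}")
        return data

    def corrupt(self, key: str, byte_index: int) -> None:
        """Simulate silent corruption somewhere below (illustration only)."""
        data, digest = self._blocks[key]
        damaged = bytearray(data)
        damaged[byte_index] ^= 0x01
        self._blocks[key] = (bytes(damaged), digest)
```

The checksum detects the corruption but does not repair it; repair would need a redundant copy to fall back on.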
So after all the steam blowing, we're back to where we started. I disagree with your assertion that this is an issue that we--meaning "average" users not possessing 1PB storage systems or massive clusters--need to be worried about TODAY. I gave sound reasons as to why this is the case. You've given us 'a couple of academic papers say the sky is falling so I'm repeating the sky is falling'. Without apparently truly understanding the issue.
The data available and the experience of the vast majority of IT folks backs my position--which is why that's my position. There is little to no data supporting your position.
I say this isn't going to be an issue for average users, if at all, for a few years to come. You say it's here now. That's a fairly minor point of disagreement to cause such a heated (on your part) lengthy exchange.
BTW, if you see anything I've stated as rude you've apparently not been on the Interwebs long. ;)
-- Stan