On 2012-06-29 12:07 PM, Ed W lists@wildgooses.com wrote:
On 29/06/2012 12:15, Charles Marcus wrote:
Depends on what you mean exactly by 'incorrect'...
I'm sorry, this wasn't meant to be an attack on you,
No worries - it wasn't taken that way - I simply disagreed with the main point you were making, and still do. While I agree there is some truth to the issue you have raised, I just don't see it as quite the disaster-in-waiting that you do.

I have been running small RAID setups for quite a while. Many years ago I inherited an older RAID5 (with NO hot spare) that gave me fits for about a month: drives would randomly 'fail', a rebuild (which took a few HOURS, and that was with 120GB drives - small by today's standards) would fix it, then another drive would drop out 2 or 3 days later, and so on. I finally found an identical replacement controller on ebay (an old 3ware card), and swapping it in fixed the problem. I also had one instance in a RAID10 setup I configured myself a few years ago where one of the pairs had some errors after an unclean shutdown (this was after about 3 years of 24/7 operation on a mail server) and went into automatic rebuild, which went smoothly - and was much faster than the RAID5 rebuilds had been, even though the drives were much bigger.
So, yes, while I acknowledge the risk, it is the risk we all run storing data on hard drives.
I thought I was pointing out what is now fairly obvious stuff, but it's only recently that the maths has been popularised by the common blogs on the interwebs. Whilst I guess not everyone read the flurry of blog articles about this last year, I think it's due to be repeated with increasing frequency as we go forward:
The most recent article which prompted all of the above is, I think, this one: http://queue.acm.org/detail.cfm?id=1670144 More here (BAARF = Battle Against Any Raid Five/Four): http://www.miracleas.com/BAARF/
I'll find time to read these over the next week or two, thanks...
Intel have a whitepaper which says:
Intelligent RAID 6 Theory Overview And Implementation
RAID 5 systems are commonly deployed for data protection in most business environments.
While that may have been true many years ago, I don't think it is today. I wouldn't touch RAID5 with a ten foot pole, but yes, maybe there are still people who use it for some reason - and maybe there are some corner cases where it is even desirable?
However, RAID 5 systems only tolerate a single drive failure, and the probability of encountering latent defects [i.e. UREs, among other problems] of drives approaches 100 percent as disk capacity and array width increase.
Well, this is definitely true, but I wouldn't touch RAID5 today.
And to be clear - with RAID5/RAID1 there is a very significant probability that, once your first disk has failed, in the process of replacing that disk you will discover an unrecoverable read error on a remaining drive, and hence you have lost some data...
Well, this is true, but the part of your comment that I was responding to and challenging was that the entire RAID just 'died' and you lost ALL of your data.
That is simply not true on modern systems.
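The underlying odds are easy enough to sketch, though, for anyone who wants rough numbers. A quick back-of-the-envelope calculation (Python, purely illustrative - it assumes the commonly quoted consumer-drive spec of one URE per 10^14 bits read, and treats errors as independent per-bit events, which real drives aren't):

import math

URE_RATE = 1e-14  # spec-sheet figure: probability of a URE per bit read

def p_at_least_one_ure(bytes_read):
    # 1 - (1 - p)^bits, via log1p/expm1 for numerical stability
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-URE_RATE))

TB = 1e12  # decimal terabyte, as drives are marketed

# Reading the one surviving 2 TB mirror during a RAID1 rebuild:
print(f"{p_at_least_one_ure(2 * TB):.1%}")  # ~14.8%
# Reading three surviving 2 TB drives during a RAID5 rebuild:
print(f"{p_at_least_one_ure(6 * TB):.1%}")  # ~38.1%

So on paper the odds of a completely clean rebuild fall off alarmingly fast as capacities grow - which is your point - even if in practice a URE usually costs you a sector or a file rather than the whole array, which is mine.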
So the vulnerability is not the first failed disk, but discovering subsequent problems during the rebuild.
True, but this applies to every RAID mode (RAID6 included).
No - RAID6 has a dramatically lower chance of this happening than RAID1/5. This is the real insight, and I think it's important that this (fairly obvious in retrospect) idea becomes widely known and understood by those who manage arrays.
RAID6 needs a failed drive and *two* subsequent errors *per stripe* to lose data. RAID5/1 simply need one subsequent error *per array* to lose data. Quite a large difference!
Interesting... I'll look at this more closely then, thanks.
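In the meantime, a toy model already shows the size of the gap (same illustrative assumptions as above - one URE per 10^14 bits, independent errors - plus a made-up geometry of 8 x 2TB drives with 256 KiB strips):

import math

URE_RATE = 1e-14
STRIP = 256 * 1024  # bytes per strip (assumed)
# chance that reading a single strip hits a URE:
q = -math.expm1(STRIP * 8 * math.log1p(-URE_RATE))

def p_raid5_rebuild_fails(n, stripes):
    # one drive down, zero redundancy left: ONE error anywhere is fatal
    strips_read = (n - 1) * stripes
    return -math.expm1(strips_read * math.log1p(-q))

def p_raid6_rebuild_fails(n, stripes):
    # one drive down, one parity left: a stripe only dies on TWO+ errors
    m = n - 1
    stripe_ok = (1 - q) ** m + m * q * (1 - q) ** (m - 1)
    return 1 - stripe_ok ** stripes

stripes = int(2e12 / STRIP)  # stripes per 2 TB drive
print(f"RAID5: {p_raid5_rebuild_fails(8, stripes):.1%}")   # ~67%
print(f"RAID6: {p_raid6_rebuild_fails(8, stripes):.1e}")   # ~7e-08

Same hardware, same error rate, but requiring two errors in the same stripe instead of one error anywhere in the array moves the rebuild-failure odds by about seven orders of magnitude.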
Also, one big disadvantage of RAID5/6 is the rebuild times
Hmm, at least theoretically both need a full linear read of the other disks, so the time for an idle array should be similar in both cases. Agree though that for an active array raid5/6 generally causes more drives to read/write, hence yes, the impact is probably greater.
No 'probably' to it. It is definitely greater, even comparing the smallest possible setups (three drives for RAID5, four for RAID6 or RAID10). And as the number of disks in the array increases, the difference grows dramatically. With RAID10, when a drive fails and a rebuild occurs, only ONE drive must be read (the failed drive's mirror partner) - with RAID5/6, most if not *all* of the surviving drives must be read (depending on how it is configured, I guess).
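To put toy numbers on it (assuming 2 TB drives and a naive controller that reads everything during a rebuild):

def rebuild_read_tb(level, n_drives, drive_tb=2.0):
    # TB that must be read to rebuild one failed drive (simplistic model)
    if level == "raid10":
        return drive_tb                   # just the mirror partner
    if level in ("raid5", "raid6"):
        return (n_drives - 1) * drive_tb  # every surviving drive
    raise ValueError(level)

for n in (4, 8, 16):
    print(n, rebuild_read_tb("raid10", n), rebuild_read_tb("raid5", n))
# 4   2.0   6.0
# 8   2.0  14.0
# 16  2.0  30.0

The RAID10 rebuild cost stays flat as the array grows; the RAID5/6 cost grows linearly with the number of drives.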
However, don't miss the big picture: with raid1/5 your risk is a second error occurring anywhere on the array, but with raid6 your risk is *two* errors per stripe - i.e. you can lose a whole second drive and still continue rebuilding with raid6.
And the same is true of RAID10, as long as the second drive to fail isn't the one currently being remirrored.
I think you have proven your case that a RAID6 is statistically a little less likely to suffer a catastrophic cascading disk failure scenario than RAID10.
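To be fair to RAID10 on the whole-drive-failure front, though: only the failed drive's mirror partner is fatal, so its exposure to a second failure actually shrinks as the array grows. A quick sketch (assuming independent, uniformly distributed failures - optimistic for drives from the same batch):

def p_second_failure_fatal_raid10(n_drives):
    # the one fatal drive is the dead drive's mirror partner,
    # i.e. 1 of the n-1 survivors
    return 1 / (n_drives - 1)

for n in (4, 8, 16):
    print(f"{n} drives: {p_second_failure_fatal_raid10(n):.1%}")
# 4 drives: 33.3%, 8 drives: 14.3%, 16 drives: 6.7%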
I personally feel that raid arrays *are* very fragile. Backups are often the only option when you get multi-drive failures (even if the array is theoretically repairable). However, it's about the best option we have right now, so all we can do is be aware of the limitations...
And since backups are stored on drives (well, mine are, I stopped using tape long ago), they have the same associated risks... but of course I agree with you that they are absolutely essential.
Additionally, I have very much suffered this situation of a failing RAID5 which was somehow hanging together with just the odd uncorrectable read error reported here and there (say once a month). I copied off all the data and then, as an experiment, replaced one disk in this otherwise-working array - which triggered a cascade of discovered errors all over the disks, and rebuilding was basically impossible.
Sounds like you had a bad controller to me... and yes, when a controller goes bad, lots of weirdness and 'very bad things' can occur.
Roll on btrfs I say...
+1000 ;)
--
Best regards,
Charles