[Dovecot] RAID1+md concat+XFS as mailstorage

Ed W lists at wildgooses.com
Fri Jun 29 19:07:56 EEST 2012

On 29/06/2012 12:15, Charles Marcus wrote:
> On 2012-06-28 4:35 PM, Ed W <lists at wildgooses.com> wrote:
>> On 28/06/2012 17:54, Charles Marcus wrote:
>>> RAID10 also statistically has a much better chance of surviving a
>>> multi drive failure than RAID5 or 6, because it will only die if two
>>> drives in the same pair fail, and only then if the second one fails
>>> before the hot spare is rebuilt.
>> Actually this turns out to be incorrect... Curious, but there you go!
> Depends on what you mean exactly by 'incorrect'...

I'm sorry, this wasn't meant to be an attack on you. I thought I was 
pointing out what is now fairly obvious stuff, but it's only recently 
that the maths has been popularised by the common blogs on the 
interwebs.  Whilst I guess not everyone read the flurry of blog 
articles about this last year, I expect the point to be repeated with 
increasing frequency as we go forward:

The most recent article which prompted all of the above is, I think, this one:
More here (BAARF = the Battle Against Any Raid Four/Five)

There are also some badly phrased ZDNet articles if you google "raid 5 
stops working in 2009"

Intel have a whitepaper which says:

    Intelligent RAID 6 Theory Overview And Implementation

    RAID 5 systems are commonly deployed for data protection in most
    business environments. However, RAID 5 systems only tolerate a
    single drive failure, and the probability of encountering latent
    defects [i.e. UREs, among other problems] of drives approaches 100
    percent as disk capacity and array width increase.

The upshot is that:
- Drives often fail slowly rather than bang/dead
- You will only scrub the array on a frequency F, which means that 
faults can develop since the last scrub (good on you if you actually 
remembered to set an automatic regular scrub...)
- Once you decide to pull a disk for some reason and replace it, then 
with RAID1/5 (RAID1 is a kind of degenerate form of RAID5) you are 
exposed: if a *second* error is detected during the rebuild then the 
array is inconsistent and there is no way to rebuild all of it correctly
- My experience is that linux-raid will stop the rebuild if a second 
error is detected during the rebuild, but with some understanding it's 
possible to proceed (accepting that data loss has therefore occurred).  
However, some hardware controllers will kick out the whole array if a 
rebuild error is discovered and some will not; given that the 
probability of a second error being discovered during a rebuild is 
significantly non-zero, it's worth worrying about this and figuring 
out what you would do if it happens...
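
To put rough numbers on the rebuild risk, here is a back-of-envelope 
sketch in Python.  The URE rate, disk size, and disk count are 
illustrative assumptions (one unrecoverable error per 1e14 bits is the 
figure usually quoted for consumer drives), not measurements of any 
particular array:

```python
# Back-of-envelope: probability of hitting at least one unrecoverable
# read error (URE) while reading the surviving disks of a degraded
# RAID5 array.  All figures below are illustrative assumptions.

URE_RATE = 1e-14     # UREs per bit read (typical consumer drive spec)
DISK_TB = 2          # capacity of each surviving disk, in TB
N_SURVIVORS = 3      # disks that must be read in full to rebuild

bits_read = N_SURVIVORS * DISK_TB * 1e12 * 8
p_clean_rebuild = (1 - URE_RATE) ** bits_read

print(f"bits read during rebuild : {bits_read:.2e}")
print(f"P(no URE, rebuild OK)    : {p_clean_rebuild:.1%}")
print(f"P(at least one URE)      : {1 - p_clean_rebuild:.1%}")
```

With these assumptions the rebuild has very roughly a one-in-three 
chance of tripping over a URE; scale the disk size or count up and it 
gets worse quickly.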

> I'm fairly sure that you do not mean that my comment that 'having a 
> hot spare is good' is incorrect,

Well, a hot spare seems like a good idea, but the point is that by the 
time it kicks in you have already lost parity protection.  Rebuilding 
onto the spare is effectively a full disk scrub, and the probability 
of discovering a second error somewhere on the remaining array is 
non-zero, at which point your array has lost data.  So it's not about 
how quickly you get the spare in, so much as the significant 
probability that you have two drives with errors, but only one drive 
of protection.

RAID6 increases this protection *quite substantially*, because if a 
second error is found on a stripe, then you still haven't lost data.  
However, a *third* error on a single stripe will lose data.

The bad news: estimates suggest that in around 7+ years drive sizes 
will become large enough that RAID6 is insufficient to give a 
reasonable probability of successfully repairing a single failed disk. 
At that point there is a significant probability that the single 
failed disk cannot be replaced in a RAID6 array, because of the high 
probability of *two* additional defects being discovered on the same 
stripe of the remaining array.  Therefore many folks are requesting 
that 3-disk parity be implemented (RAID7?)
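
A crude way to see why RAID6 helps so much is to compare the two 
geometries under the same illustrative sector-error model (all the 
numbers here are assumptions for the sake of the sketch, not 
measurements):

```python
# Rough comparison of the chance of losing data while rebuilding one
# failed disk, RAID5 vs RAID6.  Treats each sector read as an
# independent Bernoulli trial; figures are illustrative only.
import math

URE_PER_BIT = 1e-14           # assumed consumer-class URE rate
SECTOR_BITS = 4096 * 8        # 4 KiB sectors
SURVIVORS = 4                 # disks read in full during the rebuild
STRIPES = int(2e12 / 4096)    # sector-sized stripe chunks per 2 TB disk

# P(URE reading one sector); log1p/expm1 keep tiny numbers accurate.
p_sector = -math.expm1(SECTOR_BITS * math.log1p(-URE_PER_BIT))

# RAID5: one URE on any survivor, anywhere in the array, loses data.
p_stripe_r5 = -math.expm1(SURVIVORS * math.log1p(-p_sector))
p_loss_r5 = -math.expm1(STRIPES * math.log1p(-p_stripe_r5))

# RAID6: only two or more UREs on the *same* stripe lose data.
p_stripe_r6 = sum(
    math.comb(SURVIVORS, k) * p_sector**k * (1 - p_sector)**(SURVIVORS - k)
    for k in range(2, SURVIVORS + 1))
p_loss_r6 = -math.expm1(STRIPES * math.log1p(-p_stripe_r6))

print(f"RAID5 rebuild data-loss probability: {p_loss_r5:.1%}")
print(f"RAID6 rebuild data-loss probability: {p_loss_r6:.1e}")
```

Under these assumptions the RAID5 rebuild fails tens of percent of 
the time, while the RAID6 figure is many orders of magnitude smaller, 
because the two errors have to coincide on one stripe.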

> 'Sometimes'... '...under some circumstances...' - hey, it's all a 
> crapshoot anyway, all you can do is try to make sure the dice aren't 
> loaded against you.

And to be clear: with RAID5/RAID1 there is a very significant 
probability that, in the process of replacing your first failed disk, 
you will discover an unrecoverable error on a remaining drive and 
hence lose some data...

> Also, modern enterprise SAS drives and RAID controllers do have 
> hardware based algorithms to protect data integrity (much better than 
> consumer grade drives at least).

I can't categorically disagree, but I would check those claims 
carefully.  My understanding is that there is minimal additional 
protection from "enterprise" stuff, and by that I'm thinking of 
quality gear that I can buy from the likes of newegg/ebuyer, not the 
custom SAN products from certain big-name providers.  It seems 
possible that the big-name SAN providers implement additional 
protection, but at that point we are talking custom hardware and it's 
hard to analyse (or even get the full details).
My limited understanding is that "enterprise" quality buys you only:
- almost identical drives, but with a longer warranty and tighter 
quality control. We might hope for internal changes that improve 
longevity, but there is only minimal evidence of this
- drives have certain firmware features which can be an advantage, 
e.g. TLER-type features
- drives have (more) bad block reallocation sectors available, hence 
you won't get bad block warnings as quickly (which could be good or 
bad...)
- controllers might have ECC RAM in their cache

However, whilst we might desire features which reduce the probability 
of failed block reads/writes, in practice I'm not aware that the 
common LSI controllers (et al.) offer this, and so I don't think you 
get any useful additional protection from "enterprise" stuff.

For example, remember the Google survey of drives from their data 
centers a few years back (and several similar studies), which observed 
that enterprise drives showed no real difference in failure 
characteristics from non-enterprise drives, and that SMART was a 
fairly poor predictor of failing drives...

>> So the vulnerability is not the first failed disk, but discovering
>> subsequent problems during the rebuild.
> True, but this applies to every RAID mode (RAID6 included). 

No: RAID6 has a dramatically lower chance of this happening than 
RAID1/5.  This is the real insight, and I think it's important that 
this fairly obvious (in retrospect) idea becomes widely known and 
understood by those who manage arrays.

RAID6 needs a failed drive and *two* subsequent errors *on the same 
stripe* to lose data.  RAID5/1 simply needs one subsequent error 
*anywhere in the array*.  Quite a large difference!
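
The per-stripe vs per-array distinction can also be illustrated with 
a tiny Monte Carlo sketch (toy scale, all numbers purely illustrative):

```python
# Monte Carlo sketch: with one disk already failed, scatter a couple
# of latent bad sectors at random across the survivors and see which
# geometry loses data.  RAID5/1 loses data on the first bad sector
# found during the rebuild; RAID6 loses data only when two bad sectors
# land on the same stripe.
import random

STRIPES = 1_000_000    # stripes per disk (toy number, not realistic)
BAD_SECTORS = 2        # latent defects discovered during the rebuild
TRIALS = 2_000

random.seed(0)
raid6_losses = 0
for _ in range(TRIALS):
    stripes_hit = [random.randrange(STRIPES) for _ in range(BAD_SECTORS)]
    if len(set(stripes_hit)) < BAD_SECTORS:  # two errors share a stripe
        raid6_losses += 1

# RAID5 loses data in every trial: one error anywhere is fatal.
print(f"RAID5: data lost in {TRIALS} of {TRIALS} trials")
print(f"RAID6: data lost in {raid6_losses} of {TRIALS} trials")
```

The RAID6 count is almost always zero here, because the two defects 
have to collide on one stripe out of a million.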

> Also, one big disadvantage of RAID5/6 is the rebuild times (sometimes 
> can take many hours, or even days depending on drive sizes) - it is 
> the stress of the rebuild that often causes a second drive failure, 
> thereby killing your RAID, and RAID10 rebuilds happen *much* faster 
> that RAID5/6 rebuilds (and are less stressful), so there is much less 
> chance of losing another disk during a rebuild.

Hmm, at least theoretically both need a full linear read of the other 
disks, so the rebuild time for an idle array should be similar in both 
cases.  Agreed, though, that for an active array RAID5/6 generally 
causes more drives to read/write, so yes, the impact is probably 
greater.

However, don't miss the big picture: with RAID1/5 your risk is a 
second error occurring anywhere on the array, but with RAID6 your risk 
is *two* errors on the same stripe, i.e. you can lose a whole second 
drive and still continue rebuilding.

>> This certainly correlates with my (admittedly limited) experiences.
>> Disk array scrubbing on a regular basis seems like a mandatory
>> requirement (but how many people do..?) to have any chance of
>> actually repairing a failing raid1/5 array
> Regular scrubbing is something I will give some thought to, but again, 
> your remarks are not 100% accurate... RAID is not quite so fragile as 
> you make it out to be.

We humans are all far too shaped by our own limited experiences, and 
I'm the same.

I personally feel that raid arrays *are* very fragile.  Backups are 
often the only real option when you get multi-drive failures (even if 
theoretically the array is repairable).  However, RAID is about the 
best option we have right now, so all we can do is be aware of the 
limitations...

Additionally, I have very much suffered this situation: a failing 
RAID5 which was somehow hanging together, with just the odd 
uncorrectable read error reported here and there (once a month, say). 
I copied off all the data and then, as an experiment, replaced one 
disk in this otherwise working array; this triggered a cascade of 
discovered errors all over the disks and rebuilding was basically 
impossible.  I was expecting it to fail, of course, and had 
proactively copied off the data, but my point is that until then all I 
had were hints of failure and the odd UCE report.  Presumably my data 
was being quietly corrupted in the background though, and the 
recovered data (low value) is likely peppered with read errors...  
Scary if it had been high-value data...

Remember, remember: RAID1/5/6 does NOT do parity checking on read.  
Only fancy filesystems like ZFS (and perhaps btrfs) do an end-to-end 
check which can spot a read error.  If a write fails or a disk error 
corrupts a sector, you will NOT find out about it until you scrub your 
array: a normal read of the corrupted sector simply returns the bad 
data.  Worse, when you rewrite that sector the parity is recomputed 
and the original error becomes undetectable.  The same applies if you 
rewrite *any* block in a stripe containing a corrupted block: the 
parity gets updated to imply the corrupted block isn't corrupted 
anymore, and even a scrub can no longer find it...
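
That failure mode can be demonstrated with a toy XOR-parity stripe in 
Python; this is a sketch of the mechanism only, not real md code:

```python
# Toy XOR-parity stripe (RAID4/5-style) showing why silent corruption
# is invisible to a normal read, visible to a scrub, and then hidden
# again once the stripe's parity is recomputed by a rewrite.

def parity(blocks):
    """XOR all blocks together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [bytearray(b"AAAA"), bytearray(b"BBBB"), bytearray(b"CCCC")]
par = parity(data)                   # parity written at stripe-write time

def scrub_finds_error():
    return parity(data) != par

# A disk error silently corrupts one byte of block 1.
data[1][0] ^= 0xFF

# A normal read just returns the block -- no parity check, no warning.
print("read of block 1:", bytes(data[1]))
print("scrub finds error:", scrub_finds_error())

# Now rewrite block 2; parity is recomputed over the whole stripe,
# corrupted block included, so the corruption is "blessed".
data[2] = bytearray(b"DDDD")
par = parity(data)
print("scrub finds error after rewrite:", scrub_finds_error())
```

The scrub reports the mismatch before the rewrite and nothing 
afterwards, even though block 1 still holds the corrupted byte.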

Roll on btrfs I say...


Ed W
