[Dovecot] RAID1+md concat+XFS as mailstorage
Hello!
Somewhere on the mailing list I've seen RAID1 + md concat + XFS being promoted as mail storage. Does anybody in here actually use this setup?
I decided to give it a try, but ended up unable to recover any data off the surviving pairs of the linear array when the *first* of the RAID1 pairs went down.
thanks!
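For reference, the layout being discussed is typically built along these lines - a rough sketch with made-up device names, not the poster's actual commands; the agcount value is only there to illustrate spreading XFS allocation groups across the mirror pairs:

  # two (or more) RAID1 pairs
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
  # join the pairs end-to-end with md linear (a concat, no striping)
  mdadm --create /dev/md0 --level=linear --raid-devices=2 /dev/md1 /dev/md2
  # XFS on top; with equal pairs, roughly two allocation groups land on each pair
  mkfs.xfs -d agcount=4 /dev/md0
  mount /dev/md0 /srv/mail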
On 28/06/2012 13:01, Костырев Александр Алексеевич wrote:
Hello!
somewhere in maillist I've seen RAID1+md concat+XFS being promoted as mailstorage. Does anybody in here actually use this setup?
I've decided to give it a try, but ended up with not being able to recover any data off survived pairs from linear array when _the_first of raid1 pairs got down.
This is the configuration endorsed by Stan Hoeppner. His description of the benefits is quite compelling, but real-world feedback is interesting to hear.
Note that you wouldn't get anything back from a similar fail of a RAID10 array either (unless we are talking temporary removal and re-insertion?)
Ed W
Note that you wouldn't get anything back from a similar fail of a RAID10 array either
I wasn't aware of that; that's interesting.
(unless we are talking temporary removal and re-insertion?)
Nope, I'm talking about a complete pair crash, when both disks die. I do understand that the probability of such an outcome (two disks in the same pair crashing) is not high, but when we have 12 or 24 disks in storage...
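As a rough back-of-envelope for that worry: if two drives out of 24 fail at essentially the same time, and the failures are independent and uniformly random (a simplification that ignores rebuild windows and correlated failures), the chance they land in the same mirror pair is small:

  # 12 pairs, 24 drives: pairs / C(24,2)
  awk 'BEGIN { pairs=12; d=2*pairs; printf "P(both failures hit one pair) = %.1f%%\n", 100*pairs/(d*(d-1)/2) }'
  # prints ~4.3%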
(unless we are talking temporary removal and re-insertion?) nope, I'm talking about complete pair's crash when two disks die. I do understand that's the possibility of such outcome (when two disks in the same pair crash) is not high, but when we have 12 or 24 disks in storage...
Then make 6-12 filesystems. The overall probability of a double disk failure is the same, but you will lose only 1/6-1/12 of the data.
On 28/06/2012 13:46, Wojciech Puchar wrote:
(unless we are talking temporary removal and re-insertion?) nope, I'm talking about complete pair's crash when two disks die. I do understand that's the possibility of such outcome (when two disks in the same pair crash) is not high, but when we have 12 or 24 disks in storage...
then may 6-12 filesystems. overall probability of double disk failure is same, but you will loose 1/6-1/12 of data.
But the compromise is that you gain the complexity of maintaining more filesystems and of needing to figure out how to split your data across them.
The options today however seem to be only:
- RAID6 (suffers slow write speeds, especially for smaller files)
- RAID1 pairs with striping (raid0) over the top. (doesn't achieve max speeds for small files. 2 disk failures a problem. No protection against "silent corruption" of 1 disk)
- RAID1 pairs, plus some kind of intelligent overlay filesystem, eg md-linear+XFS / BTRFS. With the filesystem aware of the underlying arrangement it can theoretically optimise file placement and dramatically increase write speeds for small files in the same manner that RAID-0 theoretically achieves. (However, still no protection against "silent" single drive corruption unless btrfs perhaps adds this in the future?)
So given that the statistics show us that 2-disk failures are much more common than we expect, and that "silent corruption" is likely occurring within (larger) real-world file stores, there really aren't many battle-tested options that can protect against this - really only RAID6 right now, and that has significant limitations...
RAID1+XFS sounds very interesting. Curious to hear about some failure testing on this now. Also I'm watching btrfs with a 12 month+ view.
Cheers
Ed W
- RAID1 pairs, plus some kind of intelligent overlay filesystem, eg md-linear+XFS / BTRFS. With the filesystem aware of the underlying arrangement it can theoretically optimise file placement and dramatically increase write speeds for small files in the same manner that RAID-0 theoretically achieves. (However, still no protection against "silent" single drive corruption unless btrfs perhaps adds this in the future?)
not only "silent" single drive corruption problem but as I stated in start of topic - crash of first pair.
On 28/06/2012 14:06, Костырев Александр Алексеевич wrote:
- RAID1 pairs, plus some kind of intelligent overlay filesystem, eg md-linear+XFS / BTRFS. With the filesystem aware of the underlying arrangement it can theoretically optimise file placement and dramatically increase write speeds for small files in the same manner that RAID-0 theoretically achieves. (However, still no protection against "silent" single drive corruption unless btrfs perhaps adds this in the future?) not only "silent" single drive corruption problem but as I stated in start of topic - crash of first pair.
Bad things are going to happen if you lose a complete chunk of your filesystem. I think the current state of the world is that you should assume that, realistically, you will be looking to your backups if you lose the wrong 2 disks in a RAID1 or RAID10 array.
However, the thing which worries me more with multi-disk arrays is accidental disconnection of multiple disks, e.g. a backplane fails, or a multi-lane connector is accidentally unplugged. Linux md RAID often seems to have the ability to reconstruct arrays after such accidents. I don't have more recent experience with hardware controller arrays, but I have (sadly) found that such a situation is terminal on some older hardware controllers...
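For what it's worth, the usual md recovery dance after that kind of accidental disconnect is roughly the following (example device names; only force assembly once the superblock event counts look close and you have a backup):

  cat /proc/mdstat                                            # see what md thinks survived
  mdadm --examine /dev/sdb /dev/sdc | grep -E 'Event|State'   # compare superblock event counts
  mdadm --stop /dev/md1
  mdadm --assemble --force /dev/md1 /dev/sdb /dev/sdc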
Interested to hear other failure modes (and successful rescues) from RAID1+linear+XFS setups?
Cheers
Ed W
On 2012-06-28 12:20 PM, Ed W lists@wildgooses.com wrote:
Bad things are going to happen if you loose a complete chunk of your filesystem. I think the current state of the world is that you should assume that realistically you will be looking to your backups if you loose the wrong 2 disks in a raid1 or raid10 array.
Which is a very good reason to have at least one hot spare in any RAID setup, if not 2.
RAID10 also statistically has a much better chance of surviving a multi drive failure than RAID5 or 6, because it will only die if two drives in the same pair fail, and only then if the second one fails before the hot spare is rebuilt.
--
Best regards,
Charles
On 28/06/2012 17:54, Charles Marcus wrote:
On 2012-06-28 12:20 PM, Ed W lists@wildgooses.com wrote:
Bad things are going to happen if you loose a complete chunk of your filesystem. I think the current state of the world is that you should assume that realistically you will be looking to your backups if you loose the wrong 2 disks in a raid1 or raid10 array.
Which is a very good reason to have at least one hot spare in any RAID setup, if not 2.
RAID10 also statistically has a much better chance of surviving a multi drive failure than RAID5 or 6, because it will only die if two drives in the same pair fail, and only then if the second one fails before the hot spare is rebuilt.
Actually this turns out to be incorrect... Curious, but there you go!
Search Google for a recent very helpful exposé on this. Basically RAID10 can sometimes tolerate multi-drive failure, but on average RAID6 appears less likely to trash your data, plus under some circumstances it better survives recovering from a single failed disk in practice.
The executive summary is something like: when RAID5 fails, because at that point you effectively do a RAID "scrub", you tend to suddenly notice a bunch of other hidden problems which were lurking, and your rebuild fails (this happened to me...). RAID1 has no better bad-block detection than assuming the non-bad disk is perfect (so it won't spot latent unscrubbed errors), and again if you hit a bad block during the rebuild you lose the whole of your mirrored pair.
So the vulnerability is not the first failed disk, but discovering subsequent problems during the rebuild. This certainly correlates with my (admittedly limited) experiences. Disk array scrubbing on a regular basis seems like a mandatory requirement (but how many people do..?) to have any chance of actually repairing a failing RAID1/5 array.
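For anyone wanting the mechanics, a Linux md scrub is just a write to sysfs (example device name; many distros ship a monthly cron job that does exactly this):

  echo check > /sys/block/md1/md/sync_action    # read and compare both halves of the mirror
  cat /proc/mdstat                              # watch progress
  cat /sys/block/md1/md/mismatch_cnt            # non-zero means the copies disagreed somewhere
  # echo repair > /sys/block/md1/md/sync_action # optionally rewrite the inconsistent blocks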
Digressing, but it occurs to me that there would be a potentially large performance improvement if spinning disks could do a read/rewrite cycle with the disk only moving a minimal distance (my understanding is this can't happen at present without a full revolution of the disk). Then you could rewrite parity blocks extremely quickly without re-reading a full stripe...
Anyway, it's a challenging problem, and basically the observation is that large disk arrays are going to have a moderate tail risk of failure whether you use RAID10 or RAID5 (RAID6 giving a decent practical improvement in real reliability, but at a cost in write performance).
Cheers
Ed W
The executive summary is something like: when raid5 fails, because at that point you effectively do a raid "scrub" you tend to suddenly notice a bunch of other hidden problems which were lurking and your rebuild fails (this
And no RAID will protect you from every failure. You have to do backups. EOT
On 2012-06-28 4:35 PM, Ed W lists@wildgooses.com wrote:
On 28/06/2012 17:54, Charles Marcus wrote:
RAID10 also statistically has a much better chance of surviving a multi drive failure than RAID5 or 6, because it will only die if two drives in the same pair fail, and only then if the second one fails before the hot spare is rebuilt.
Actually this turns out to be incorrect... Curious, but there you go!
Depends on what you mean exactly by 'incorrect'...
I'm fairly sure that you do not mean that my comment that 'having a hot spare is good' is incorrect, so that leaves my last comment above...
I'm far from an expert (Stan? Where are you? I'm looking forward to your comments here), but...
Search google for a recent very helpful expose on this. Basically RAID10 can sometimes tolerate multi-drive failure, but on average raid6 appears less likely to trash your data, plus under some circumstances it better survives recovering from a single failed disk in practice
'Sometimes'... '...under some circumstances...' - hey, it's all a crapshoot anyway, all you can do is try to make sure the dice aren't loaded against you.
The executive summary is something like: when raid5 fails, because at that point you effectively do a raid "scrub" you tend to suddenly notice a bunch of other hidden problems which were lurking and your rebuild fails (this happened to me...). RAID1 has no better bad block detection than assuming the non bad disk is perfect (so won't spot latent unscrubbed errors), and again if you hit a bad block during the rebuild you loose the whole of your mirrored pair.
Not true (at least not for real hardware based RAID controllers that I have ever worked with)... yes, it may revert to degraded mode, but you don't just 'lose' the RAID if the rebuild fails.
You can then run filesystem check tools on the system, hopefully find/fix the bad sectors, then rebuild the array - I have had to do/done this before myself, so I know that this is possible.
Also, modern enterprise SAS drives and RAID controllers do have hardware based algorithms to protect data integrity (much better than consumer grade drives at least).
So the vulnerability is not the first failed disk, but discovering subsequent problems during the rebuild.
True, but this applies to every RAID mode (RAID6 included). Also, one big disadvantage of RAID5/6 is the rebuild times (they can sometimes take many hours, or even days, depending on drive sizes) - it is the stress of the rebuild that often causes a second drive failure, thereby killing your RAID, and RAID10 rebuilds happen *much* faster than RAID5/6 rebuilds (and are less stressful), so there is much less chance of losing another disk during a rebuild.
This certainly correlates with my (admittedly limited) experiences. Disk array scrubbing on a regular basis seems like a mandatory requirement (but how many people do..?) to have any chance of actually repairing a failing raid1/5 array
Regular scrubbing is something I will give some thought to, but again, your remarks are not 100% accurate... RAID is not quite so fragile as you make it out to be.
--
Best regards,
Charles
On 29/06/2012 12:15, Charles Marcus wrote:
On 2012-06-28 4:35 PM, Ed W lists@wildgooses.com wrote:
On 28/06/2012 17:54, Charles Marcus wrote:
RAID10 also statistically has a much better chance of surviving a multi drive failure than RAID5 or 6, because it will only die if two drives in the same pair fail, and only then if the second one fails before the hot spare is rebuilt.
Actually this turns out to be incorrect... Curious, but there you go!
Depends on what you mean exactly by 'incorrect'...
I'm sorry, this wasn't meant to be an attack on you; I thought I was pointing out what is now fairly obvious stuff, but it's only recently that the maths has been popularised by the common blogs on the interwebs. Whilst I guess not everyone read the flurry of blog articles about this last year, I think it's due to be repeated with increasing frequency as we go forward:
The most recent article which prompted all of the above is, I think, this one: http://queue.acm.org/detail.cfm?id=1670144 More here (BAARF = Battle Against Any Raid Five): http://www.miracleas.com/BAARF/
There are some badly phrased ZDnet articles also if you google "raid 5 stops working in 2009"
Intel have a whitepaper which says:
Intelligent RAID 6 Theory Overview And Implementation
RAID 5 systems are commonly deployed for data protection in most business environments. However, RAID 5 systems only tolerate a single drive failure, and the probability of encountering latent defects [i.e. UREs, among other problems] of drives approaches 100 percent as disk capacity and array width increase.
The upshot is that:
- Drives often fail slowly rather than bang/dead
- You will only scrub the array on a frequency F, which means that faults can develop since the last scrub (good on you if you actually remembered to set an automatic regular scrub...)
- Once you decide to pull a disk for some reason to replace it, then with RAID1/5 (raid1 is a kind of degenerate form of raid5) you are exposed in that if a *second* error is detected during the rebuild then you are inconsistent and have no way to correctly rebuild your entire array
- My experience is that linux-raid will stop the rebuild if a second error is detected during rebuild, but with some understanding it's possible to proceed (obviously understanding that data loss has therefore occurred). However, some hardware controllers will kick out the whole array if a rebuild error is discovered - some will not - but given the probability of a second error being discovered during rebuild is significantly non-zero, it's worth worrying over this and figuring out what you do if it happens...
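To put a rough number on that exposure - assuming the commonly quoted consumer-drive spec of one unrecoverable read error per 1e14 bits, and a rebuild that has to read 12 TB from the surviving drives (a simplification; real URE rates and workloads vary):

  awk 'BEGIN { bits = 12 * 1e12 * 8; p = exp(bits * log(1 - 1e-14));
               printf "P(rebuild reads 12 TB with no URE) = %.0f%%\n", 100*p }'
  # prints ~38%, i.e. around a 60% chance of hitting at least one latent error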
I'm fairly sure that you do not mean that my comment that 'having a hot spare is good' is incorrect,
Well, a hot spare seems like a good idea, but the point is that the situation will be that you have lost parity protection. At that point you effectively run a disk scrub to rebuild the array. The probability of discovering a second error somewhere on your remaining array is non-zero, and hence your array has lost data. So it's not about how quickly you get the spare in, so much as the significant probability that you have two drives with errors, but only one drive of protection.
RAID6 increases this protection *quite substantially*, because if a second error is found on a stripe, then you still haven't lost data. However, a *third* error on a single stripe will lose data.
The bad news: estimates suggest that drive sizes will become large enough that RAID6 is insufficient to give a reasonable probability of successfully repairing a single failed disk in around 7+ years' time. At that point there becomes a significant probability that the single failed disk cannot be successfully replaced in a RAID6 array, because of the high probability of *two* additional defects being discovered on the same stripe of the remaining array. Therefore many folks are requesting 3-disk parity to be implemented (RAID7?)
'Sometimes'... '...under some circumstances...' - hey, it's all a crapshoot anyway, all you can do is try to make sure the dice aren't loaded against you.
And to be clear - RAID5/RAID1 has a very significant probability that once your first disk has failed, in the process of replacing that disk you will discover an unrecoverable error on your remaining drive and hence you have lost some data...
Also, modern enterprise SAS drives and RAID controllers do have hardware based algorithms to protect data integrity (much better than consumer grade drives at least).
I can't categorically disagree, but I would check your claims carefully. My understanding is that there is minimal additional protection from "enterprise" stuff, and by that I'm thinking of quality gear that I can buy from the likes of newegg/ebuyer, not the custom SAN products from certain big-name providers. It seems possible that the big-name SAN providers implement additional protection, but at that point we are talking custom hardware and it's hard to analyse (or even get the full details).
My limited understanding is that "enterprise" quality buys you only:
- almost identical drives, but with a longer warranty and tighter quality control. We might hope for internal changes that improve longevity, but there is only minimal evidence of this
- drives have certain firmware features which can be an advantage, e.g. TLER-type features
- drives have (more) bad block reallocation sectors available, hence you won't get bad block warnings as quickly (which could be good or bad...)
- controllers might have ECC ram in the cache ram
However, whilst we might desire features which reduce the probability of failed block reads/writes, I'm not aware that the common LSI controllers (et al.) offer this, and so in practice I don't think you get any useful additional protection from "enterprise" stuff?
For example, remember a few years back the Google survey of drives from their data centres (and several others), where they observed that enterprise drives showed no real difference in failure characteristics from non-enterprise drives, and that SMART was a fairly poor predictor of failing drives...
So the vulnerability is not the first failed disk, but discovering subsequent problems during the rebuild.
True, but this applies to every RAID mode (RAID6 included).
No, RAID6 has a dramatically lower chance of this happening than RAID1/5. This is the real insight, and I think it's important that this (obvious in retrospect) idea becomes widely known and understood by those who manage arrays.
RAID6 needs a failed drive and *two* subsequent errors *per stripe* to lose data. RAID5/1 simply needs one subsequent error *per array* to lose data. Quite a large difference!
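A back-of-envelope comparison, using made-up but plausible numbers (8x 4TB drives, 512 KiB chunks, the 1e-14 URE spec, uniform independent errors, and ignoring whole-drive failures during the rebuild):

  awk 'BEGIN { p=1e-14; disk=4e12*8; surviving=7; chunk=524288*8;
               e = surviving*disk*p;                  # expected UREs read during a degraded rebuild
               printf "RAID5: P(>=1 URE anywhere) ~ %.0f%%\n", 100*(1-exp(-e));
               stripes = disk/chunk;
               printf "RAID6: expected stripes with >=2 UREs ~ %.1e\n", e*e/(2*stripes) }'
  # prints roughly 89% for RAID5 versus ~3e-07 for RAID6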
Also, one big disadvantage of RAID5/6 is the rebuild times (sometimes can take many hours, or even days depending on drive sizes) - it is the stress of the rebuild that often causes a second drive failure, thereby killing your RAID, and RAID10 rebuilds happen *much* faster that RAID5/6 rebuilds (and are less stressful), so there is much less chance of losing another disk during a rebuild.
Hmm, at least theoretically both need a full linear read of the other disks. The time for an idle array should be similar in both cases. Agree though that for an active array the raid5/6 generally causes more drives to read/write, hence yes, the impact is probably greater.
However, don't miss the big picture, your risk is a second error occurring anywhere on the array with raid1/5, but with raid 6 your risk is *two* errors per stripe, ie you can fail a whole second drive and still continue rebuilding with raid6
This certainly correlates with my (admittedly limited) experiences. Disk array scrubbing on a regular basis seems like a mandatory requirement (but how many people do..?) to have any chance of actually repairing a failing raid1/5 array
Regular scrubbing is something I will give some thought to, but again, your remarks are not 100% accurate... RAID is not quite so fragile as you make it out to be.
We humans are all far too shaped by our own limited experiences. I'm the same.
I personally feel that raid arrays *are* very fragile. Backups are often the option when you get multi-drive failures (even if theoretically the array is repairable). However, it's about the best option we have right now, so all we can do is be aware of the limitations...
Additionally, I have very much suffered this situation of a failing RAID5 which was somehow hanging together with just the odd uncorrectable read error reported here and there (once a month, say). I copied off all the data and then, as an experiment, replaced one disk in this otherwise working array, which triggered a cascade of discovered errors all over the disk, and rebuilding was basically impossible. I was expecting it to fail, of course, and had proactively copied off the data, but my point is that at that stage all I had were hints of failure and the odd UCE report. Presumably my data was being quietly corrupted in the background, though, and the recovered data (low value) is likely peppered with read errors... Scary if it had been high-value data...
Remember, remember: Raid5/6/1 does NOT do parity checking on read... Only fancy filesystems like ZFS and perhaps btrfs do an end to end check which can spot a read error... If your write fails or a disk error corrupts a sector, then you will NOT find out about it until you scrub your array... Reading the corrupted sector will read the error and when you rewrite you will correct the parity and the original error will then be undetectable... Same effect actually if you just rewrite any block in the stripe containing a corrupted block, the parity gets updated to imply the corrupted block isn't corrupted anymore, now it's undetectable to a scrub...
Roll on btrfs I say...
Cheers
Ed W
On 2012-06-29 12:07 PM, Ed W lists@wildgooses.com wrote:
On 29/06/2012 12:15, Charles Marcus wrote:
Depends on what you mean exactly by 'incorrect'...
I'm sorry, this wasn't meant to be an attack on you,
No worries - it wasn't taken that way - I simply disagreed with the main point you were making, and still do. While I do agree there is some truth to the issue you have raised, I just don't see it as quite the disaster-in-waiting that you do. I have been running small RAID setups for quite a while. Many years ago I inherited an older RAID5 (with NO hot spare) that gave me fits for about a month: drives would randomly 'fail', but a rebuild - which took a few HOURS, even with drives that were small by today's standards (120GB) - would fix it, then another one would drop out 2 or 3 days later, etc. I finally found an identical replacement controller on ebay (an old 3ware card), and once it was replaced the problem was fixed. I also had one instance in a RAID10 setup I configured myself a few years ago where one of the pairs had some errors on an unclean shutdown (this was after about 3 years of 24/7 operation on a mail server) and went into automatic rebuild, which went smoothly (and was much faster than the RAID5 rebuilds even though the drives were much bigger).
So, yes, while I acknowledge the risk, it is the risk we all run storing data on hard drives.
I thought I was pointing out what is now fairly obvious stuff, but it's only recently that the maths has been popularised by the common blogs on the interwebs. Whilst I guess not everyone read the flurry of blog articles about this last year, I think it's due to be repeated in increasing frequency as we go forward:
The most recent article which prompted all of the above is I think this one: http://queue.acm.org/detail.cfm?id=1670144 More here (BARF = Battle Against Raid 5/4) http://www.miracleas.com/BAARF/
I'll find time to read these over the next week or two, thanks...
Intel have a whitepaper which says:
Intelligent RAID 6 Theory Overview And Implementation
RAID 5 systems are commonly deployed for data protection in most business environments.
While maybe true many years ago, I don't think this is true today. I wouldn't touch RAID5 with a ten foot pole, but yes, maybe there are still people who use it for some reason - and maybe there are some corner cases where it is even desirable?
However, RAID 5 systems only tolerate a single drive failure, and the probability of encountering latent defects [i.e. UREs, among other problems] of drives approaches 100 percent as disk capacity and array width increase.
Well, this is definitely true, but I wouldn't touch RAID5 today.
And to be clear - RAID5/RAID1 has a very significant probability that once your first disk has failed, in the process of replacing that disk you will discover an unrecoverable error on your remaining drive and hence you have lost some data...
Well, this is true, but the part of your comment that I was responding to and challenging was that the entire RAID just 'died' and you lost ALL of your data.
That is simply not true on modern systems.
So the vulnerability is not the first failed disk, but discovering subsequent problems during the rebuild.
True, but this applies to every RAID mode (RAID6 included).
No, see RAID6 has a dramatically lower chance of this happening than RAID1/5. See this is the real insight and I think it's important that this fairly (obvious in retrospect) idea becomes widely known and understood to those who manage arrays.
RAID6 needs a failed drive and *two* subsequent errors *per stripe* to lose data. RAID5/1 simply need one subsequent error *per array* to lose data. Quite a large difference!
Interesting... I'll look at this more closely then, thanks.
Also, one big disadvantage of RAID5/6 is the rebuild times
Hmm, at least theoretically both need a full linear read of the other disks. The time for an idle array should be similar in both cases. Agree though that for an active array the raid5/6 generally causes more drives to read/write, hence yes, the impact is probably greater.
No 'probably' to it. It is definitely greater, even comparing the smallest possible RAID setups (4 drives is the minimum for each). But as the size of (number of disks in) the array increases, the difference increases dramatically. With RAID10, when a drive fails and a rebuild occurs, only ONE drive must be read (remirrored) - in a RAID5/6, most if not *all* of the drives must be read from (depends on how it is configured, I guess).
However, don't miss the big picture, your risk is a second error occurring anywhere on the array with raid1/5, but with raid 6 your risk is *two* errors per stripe, ie you can fail a whole second drive and still continue rebuilding with raid6
And it is the same with a RAID10, as long as the second drive failure isn't the one currently being remirrored.
I think you have proven your case that a RAID6 is statistically a little less likely to suffer a catastrophic cascading disk failure scenario than RAID10.
I personally feel that raid arrays *are* very fragile. Backups are often the option when you get multi-drive failures (even if theoretically the array is repairable). However, it's about the best option we have right now, so all we can do is be aware of the limitations...
And since backups are stored on drives (well, mine are, I stopped using tape long ago), they have the same associated risks... but of course I agree with you that they are absolutely essential.
Additionally I have very much suffered this situation of a failing RAID5 which was somehow hanging together with just the odd uncorrectable read error reported here and there (once a month say). I copied off all the data and then as an experiment replaced one disk in this otherwise working array, which then triggered a cascade of discovered errors all over the disk and rebuilding was basically impossible.
Sounds like you had a bad controller to me... and yes, when a controller goes bad, lots of weirdness and 'very bad things' can occur.
Roll on btrfs I say...
+1000 ;)
--
Best regards,
Charles
On 06/28/12 05:56, Ed W wrote:
So given the statistics show us that 2 disk failures are much more common than we expect, and that "silent corruption" is likely occurring within (larger) real world file stores, there really aren't many battle tested options that can protect against this - really only RAID6 right now and that has significant limitations...
Has anyone tried or benchmarked ZFS, perhaps ZFS+NFS as backing store for spools? Sorry if I've missed it and this has already come up. We're using Netapp/NFS, and are likely to continue to do so but still curious.
-K
On 2012-06-29 2:19 AM, Wojciech Puchar wojtek@wojtek.tensor.gdynia.pl wrote:
Has anyone tried or benchmarked ZFS, perhaps ZFS+NFS as backing store for
yes. long time ago. ZFS isn't useful for anything more than a toy. I/O performance is just bad.
Please stop with the FUD... 'long time ago'? No elaboration on what implementation/platform you 'played with'?
With a proper implementation, ZFS is an excellent, mature, reliable option for storage... maybe not quite the fastest/highest performing screaming speed demon, but enterprises are concerned with more than just raw performance - in fact, data integrity tops the list.
http://www.nexenta.com/corp/nexentastor
Yes, the LINUX version has a long way to go (due to stupid licensing restrictions it must be rewritten from scratch to get into the kernel), but personally I'm chomping at the bit for BTRFS, which looks like it is coming closer to usability for production systems (just got a basic fsck tool which now just needs to be perfected).
--
Best regards,
Charles
On 6/28/2012 7:15 AM, Ed W wrote:
On 28/06/2012 13:01, Костырев Александр Алексеевич wrote:
somewhere in maillist I've seen RAID1+md concat+XFS being promoted as mailstorage. Does anybody in here actually use this setup?
I've decided to give it a try, but ended up with not being able to recover any data off survived pairs from linear array when _the_first of raid1 pairs got down.
The failure of the RAID1 pair was due to an intentional breakage test. Your testing methodology was severely flawed. The result is the correct expected behavior of your test methodology. Proper testing will yield a different result.
One should not be surprised that something breaks when he intentionally attempts to break it.
This is the configuration endorsed by Stan Hoeppner.
Yes. It works very well for metadata heavy workloads, i.e. maildir.
-- Stan
So, you say that one should use this configuration in production in the hope that such a failure will never happen?
On 6/30/2012 6:17 AM, Костырев Александр Алексеевич wrote:
So, you say that one should use this configuration in production with hope that such failure would never happen?
No, I'm saying you are trolling. A concat of RAID1 pairs has reliability identical to RAID10. I don't see you ripping a mirror pair from a RAID10 array and saying RAID10 sucks. Your argument has several flaws.
In a production environment, a dead drive will be replaced and rebuilt before the partner fails. In a production environment, the mirror pairs will be duplexed across two SAS/SATA controllers.
Duplexing the mirrors makes a concat/RAID1, and a properly configured RAID10, inherently more reliable than RAID5 or RAID6, which simply can't be protected against controller failure.
Stating that the concat/RAID1 configuration is unreliable simply shows your ignorance of storage system design and operation.
-- Stan
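For completeness, the routine replacement Stan describes is a couple of commands per mirror pair (a sketch with example device names):

  mdadm /dev/md1 --fail /dev/sdb --remove /dev/sdb   # drop the dying member
  mdadm /dev/md1 --add /dev/sdf                      # add the replacement; it resyncs from its partner
  cat /proc/mdstat                                   # only this one pair rebuilds; the rest of the concat is untouched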
On 2012-07-01 3:17 AM, Stan Hoeppner stan@hardwarefreak.com wrote:
In a production environment, the mirror pairs will be duplexed across two SAS/SATA controllers.
Duplexing the mirrors makes a concat/RAID1, and a properly configured RAID10, inherently more reliable than RAID5 or RAID6, which simply can't be protected against controller failure.
Stan, am I correct that this - dual/redundant controllers - is the reason that a real SAN is more reliable than just running local storage on a mid/high-end server?
--
Best regards,
Charles
On 7/1/2012 5:48 AM, Charles Marcus wrote:
On 2012-07-01 3:17 AM, Stan Hoeppner stan@hardwarefreak.com wrote:
In a production environment, the mirror pairs will be duplexed across two SAS/SATA controllers.
Duplexing the mirrors makes a concat/RAID1, and a properly configured RAID10, inherently more reliable than RAID5 or RAID6, which simply can't be protected against controller failure.
Stan, am I correct that this - dual/redundant controllers - is the reason that a real SAN is more reliable than just running local storage on a mod-high end server?
In this case I was simply referring to using two PCIe SAS HBAs in a server, mirroring drive pairs across the HBAs with md, then concatenating the RAID1 pairs with md --linear. This gives protection against all failure modes. You can achieve the former with RAID5/6 but not the latter. Consider something like:
2x http://www.lsi.com/products/storagecomponents/Pages/LSISAS9200-8e.aspx
2x http://www.dataonstorage.com/dataon-products/6g-sas-jbod/dns-1640-2u-24-bay-...
48x Seagate ST9300605SS 300GB SAS 10k RPM
This hardware yields a high IOPS, high concurrency, high performance mail store. Drives are mirrored across HBAs and JBODs. Each HBA is connected to an expander/controller in both chassis, yielding full path redundancy. Each controller can see every disk in both enclosures. With this setup and SCSI multipath, you have redundancy against drive, HBA, cable, expander, and chassis failure. You can't get any more redundant than that. And you can't achieve this with RAID5/6, unless you simply mirror two of them across the HBAs/chassis. But that eliminates the whole reason for RAID5/6--space/cost efficiency.
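In practice that duplexing just means each mirror takes one member from each HBA/enclosure - a sketch with made-up multipath alias names:

  # one member from JBOD A, one from JBOD B, for every RAID1 pair
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/mapper/jbodA-slot01 /dev/mapper/jbodB-slot01
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/mapper/jbodA-slot02 /dev/mapper/jbodB-slot02
  # ...and so on for the remaining slots, then concatenate the pairs with --level=linear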
People like Костырев Александр Алексеевич may wish to waste 3 more HBAs, JBOD chassis, and 24 more drives for a 3-way md mirror, to protect against the 1000 year scenario of two drives in the same mirror pair failing before the first is rebuilt.
A quality SAN head with dual redundant controllers will give you all of the above for all RAID levels, assuming you have multiple network connections (FC or iSCSI). You'd need two HBAs in the host, one connected to each controller. This is a direct connect scenario. In a fabric scenario, you'd have an independent FC or iSCSI switch on each path. And of course you need SCSI multipath configured on the host.
-- Stan
participants (6)
- Charles Marcus
- Ed W
- Kelsey Cummings
- Stan Hoeppner
- Wojciech Puchar
- Костырев Александр Алексеевич