[Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

Jan-Frode Myklebust janfrode at tanso.net
Sat Apr 14 13:04:22 EEST 2012


On Fri, Apr 13, 2012 at 07:33:19AM -0500, Stan Hoeppner wrote:
> > 
> > What I meant wasn't the drive throwing uncorrectable read errors but
> > the drives are returning different data that each think is correct or
> > both may have sent the correct data but one of the set got corrupted
> > on the fly. After reading the articles posted, maybe the correct term
> > would be the controller receiving silently corrupted data, say due to
> > bad cable on one.
> 
> This simply can't happen.  What articles are you referring to?  If the
> author is stating what you say above, he simply doesn't know what he's
> talking about.

It has happened to me, with RAID5 rather than RAID1. It was a firmware
bug in the RAID controller that caused the array to become silently
corrupted. The HW reported everything green -- but the filesystem was
reporting lots of strange errors. This LUN was part of a larger
filesystem striped over multiple LUNs, so parts of the fs were OK, while
other parts were corrupt.

It was this bug:

   http://delivery04.dhe.ibm.com/sar/CMA/SDA/02igj/7/ibm_fw1_ds4kfc_07605200_anyos_anycpu.chg
   - Fix 432525 - CR139339  Data corruption found on drive after
     reconstruct from GHSP (Global Hot Spare)


<snip>

> In closing, I'll simply say this:  If hardware, whether a mobo-down SATA
> chip, or a $100K SGI SAN RAID controller, allowed silent data corruption
> or transmission to occur, there would be no storage industry, and we'll
> all still be using pen and paper.  The questions you're asking were
> solved by hardware and software engineers decades ago.  You're fretting
> and asking about things that were solved decades ago.

Look at what the plans are for your favorite fs:

	http://www.youtube.com/watch?v=FegjLbCnoBw

They're planning on doing metadata checksumming to be sure they don't
receive corrupted metadata from the backend storage, and say that data
validation is a storage subsystem *or* application problem. 

Hardly a solved problem.
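
To make the "application problem" part concrete, here is a minimal sketch
of application-level validation, assuming the application stores a CRC32
in front of each record. It is only an illustration: the function names
(store, load, read_mirrored) and the record layout are hypothetical and
are not taken from Dovecot, XFS, or any RAID firmware. It also shows why
a plain mirror cannot tell which of two differing copies is the good one
unless a checksum travels with the data.

```python
import zlib


def store(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 (hypothetical record layout)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def load(record: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    stored = int.from_bytes(record[:4], "big")
    payload = record[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: silent corruption detected")
    return payload


def read_mirrored(copy_a: bytes, copy_b: bytes) -> bytes:
    """Return the first mirror copy that verifies; fail if neither does."""
    for copy in (copy_a, copy_b):
        try:
            return load(copy)
        except ValueError:
            continue
    raise ValueError("both copies failed verification")


# One copy is intact; the other has a single bit flipped "in flight".
good = store(b"mail body")
damaged = bytearray(good)
damaged[-1] ^= 0x01
bad = bytes(damaged)

# The storage layer would return either copy without complaint;
# the checksum is what lets the reader pick the intact one.
recovered = read_mirrored(bad, good)
```

CRC32 is enough to catch accidental bit flips like the one above, but it
is not a defence against deliberate tampering; a cryptographic hash would
be the usual choice if that matters.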


  -jf


