[Dovecot] SSD drives are really fast running Dovecot

Thu Jan 20 08:06:06 EET 2011

Ed W put forth on 1/17/2011 12:23 PM:
> On 17/01/2011 02:20, Stan Hoeppner wrote:
>> Ed W put forth on 1/16/2011 4:11 PM:
>>>> Using XFS with delayed logging mount option (requires kernel 2.6.36 or later).
>>>>
>>>> XFS has natively used delayed allocation for quite some time, coalescing
>>>> multiple pending writes before pushing them into the buffer cache.  This not
>>>> only decreases physical IOPS, but it also decreases filesystem fragmentation by
>>>> packing more files into each extent.  Decreased fragmentation means fewer disk
>>>> seeks required per file read, which also decreases physical IOPS.  This also
>>>> greatly reduces the wasted space typical of small file storage.  Works very
>>>> well with maildir, but also with the other mail storage formats.
>>> What happens if you pull out the wrong cable in the rack, kernel lockup/oops,
>>> power failure, hot swap disk pulled, or something else which causes an
>>> unexpected loss of a few seconds of written data?
>> Read the XFS FAQ.  These questions have been answered hundreds of times since
>> XFS was released in IRIX in 1994.  I'm not your personal XFS tutor.
> 
> Why the hostile reply?

If you think the above is "hostile" you have lived a privileged and sheltered
life, and I envy you. :)  That isn't "hostile" but a combination of losing
patience and being blunt.  "Hostile" is "f--k you!".  Obviously I wasn't being
"hostile".

> The question was deeper than your response?

Do you want to troll or learn something?

Prior to 2007 there was a bug in XFS that caused filesystem corruption upon
power loss under some circumstances--actual FS corruption, not simply zeroing of
files that hadn't been fully committed to disk.  Many (uneducated) folks in the
Linux world still to this day tell others NOT to use XFS because "Power loss
will always corrupt your file system."  Some probably know better but are EXT or
JFS (or god forbid, BTRFS) fans and spread FUD regarding XFS.  This is amusing
considering XFS is hands down the best filesystem available on any platform,
ZFS included.  Others are simply ignorant and repeat what they've heard without
looking for current information.

Thus, when you asked the question the way you did, you appeared to be trolling,
just like the aforementioned souls who do the same.  So I directed you to the
XFS FAQ where all of the facts are presented and all of your questions would be
answered, from the authoritative source, instead of wasting my time on a troll.

>>> Surely your IOPs are hard limited by the number of fsyncs (and size of any
>>> battery backed ram)?
>> Depends on how your applications are written and how often they call fsync.  Do
>> you mean BBWC?  WRT delayed logging BBWC is mostly irrelevant.  Keep in mind
>> that for delayed logging to have a lot of metadata writes in memory someone, or
>> many someones, must be doing something like an 'rm -rf' or equivalent on a large
>> dir with many thousands of files.  Even in this case, the processing is _very_
>> fast.
> 
> You have completely missed my point.

No, I haven't.

> Your data isn't safe until it hits the disk.  There are plenty of ways to spool
> data to ram rather than committing it, but they are all vulnerable to data loss
> until the data is written to disk.

The delayed logging code isn't a "ram spooler", although that is a mild side
effect.  Apparently I didn't explain it fully, or precisely.  And keep in mind,
I'm not the dev who wrote the code.  So I'm merely repeating my recollection of
the description from the architectural document and what was stated on the XFS
list by the author, Dave Chinner of Red Hat.

> You wrote: "filesystem metadata write operations are pushed almost entirely into
> RAM", but if the application requests an fsync then you still have to write it
> to disk?  As such you are again limited by disk IO, which itself is limited by
> the performance of the device (and temporarily accelerated by any persistent
> write cache).  Hence my point that your IOPs are generally limited by the number
> of fsyncs and any persistent write cache?

In my desire to be brief I didn't fully/correctly explain how delayed logging
works.  I attempted a simplified explanation that I thought most would
understand.  Here is the design document:
http://oss.sgi.com/archives/xfs/2010-05/msg00329.html

Early performance numbers:
http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
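
To make the fsync point from the quoted text concrete, here is a minimal
sketch of a maildir-style delivery that does not report success until the
data has been pushed to stable storage.  This is only an illustration under
assumptions: the function name deliver_durably and the tmp/new layout are
hypothetical, and it is not a description of how Dovecot or XFS actually
implements anything.  It shows why an application that calls fsync per
message stays bounded by the storage device, whatever the filesystem does
with its metadata in RAM.

```python
import os
import tempfile


def deliver_durably(maildir, name, data):
    """Hypothetical helper: write data under maildir/tmp, fsync it,
    then rename it into maildir/new and fsync that directory."""
    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)

    # O_EXCL makes the create fail if the name already exists.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        # Until this returns, the data may exist only in the page cache.
        os.fsync(fd)
    finally:
        os.close(fd)

    # The rename is atomic within one filesystem, but the new directory
    # entry is itself metadata that may still be sitting in memory.
    os.rename(tmp_path, new_path)

    # Flush the directory so the rename survives a sudden power loss.
    dir_fd = os.open(os.path.join(maildir, "new"), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return new_path
```

Each call above waits on two fsyncs, so the delivery rate of an application
written this way is capped by how fast the device (or a persistent write
cache in front of it) can complete them.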

> As I write this email I'm struggling with getting a server running again that
> has just been rudely powered down due to a UPS failing (power was fine, UPS
> failed...).  This isn't such a rare event (IMHO) and hence I think we do need to
> assume that at some point every machine will suffer a rude and unexpected event
> which loses all in-progress write cache.  I have no complaints about XFS in
> general, but I think it's important that filesystem designers in general
> give some thought to this event and recovering from it?

Rest assured this is a top priority.  Ever heard of SGI by chance?  They sell
supercomputers with 1024 CPUs, 16 terabytes of RAM, and petabyte FC RAID
systems, in a shared memory NUMA configuration, i.e. "SMP", but the memory access
times aren't symmetric.  In short, it's a 1024 CPU server--that costs something
like $4+ million USD.  SGI was the creator of XFS in 93/94 and open sourced it
in 2000 when they decided to move from MIPS/IRIX to Itanium/Linux.  SGI has used
nothing but XFS since 1994 on all their systems.  NASA currently has almost a
petabyte of XFS storage, and 10 petabytes of CXFS storage.  CXFS is the
proprietary clustered version of XFS.

NASA is but one high profile XFS user on this planet.  There are hundreds of
others, including many US Government labs of all sorts.  With customers such as
these, data security/reliability is a huge priority.

> Please try not to be so hostile in your email construction - we aren't all
> idiots here, and even if we were, your writing style is not conducive to us
> wanting to learn from your apparent wealth of experience?

You're overreacting.  Saying "I'm not your personal XFS tutor" is not being
hostile.  Heh, if you think that was hostile, go live on NANAE for a few days or
a week and report back on what real hostility is. ;)

-- 
Stan