Ed W put forth on 1/17/2011 12:23 PM:
On 17/01/2011 02:20, Stan Hoeppner wrote:
Ed W put forth on 1/16/2011 4:11 PM:
Using XFS with delayed logging mount option (requires kernel 2.6.36 or later).
XFS has natively used delayed allocation for quite some time, coalescing multiple pending writes before pushing them into the buffer cache. This not only decreases physical IOPS, but it also decreases filesystem fragmentation by packing more files into each extent. Decreased fragmentation means fewer disk seeks required per file read, which also decreases physical IOPS. This also greatly reduces the wasted space typical of small file storage. Works very well with maildir, but also with the other mail storage formats.

What happens if you pull out the wrong cable in the rack, a kernel lockup/oops, power failure, a hot-swap disk pulled, or something else which causes an unexpected loss of a few seconds of written data?

Read the XFS FAQ. These questions have been answered hundreds of times since XFS was released in Irix in 1994. I'm not your personal XFS tutor.
Why the hostile reply?
If you think the above is "hostile", you have lived a privileged and sheltered life, and I envy you. :) That isn't "hostile" but a combination of losing patience and being blunt. "Hostile" is "f--k you!". Obviously I wasn't being "hostile".
The question was deeper than your response?
Do you want to troll or learn something?
Prior to 2007 there was a bug in XFS that caused filesystem corruption upon power loss under some circumstances--actual FS corruption, not simply zeroing of files that hadn't been fully committed to disk. Many (uneducated) folk in the Linux world still to this day tell others to NOT use XFS because "Power loss will always corrupt your file system." Some probably know better but are EXT or JFS (or god forbid, BTRFS) fans and spread FUD regarding XFS. This is amusing considering XFS is hands down the best filesystem available on any platform, including ZFS. Others are simply ignorant and repeat what they've heard without looking for current information.
Thus, when you asked the question the way you did, you appeared to be trolling, just like the aforementioned souls who do the same. So I directed you to the XFS FAQ where all of the facts are presented and all of your questions would be answered, from the authoritative source, instead of wasting my time on a troll.
Surely your IOPS are hard limited by the number of fsyncs (and the size of any battery-backed RAM)?

Depends on how your applications are written and how often they call fsync. Do you mean BBWC? WRT delayed logging, BBWC is mostly irrelevant. Keep in mind that for delayed logging to have a lot of metadata writes in memory, someone, or many someones, must be doing something like an 'rm -rf' or equivalent on a large dir with many thousands of files. Even in that case, the processing is _very_ fast.
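To make the fsync point concrete, here's a minimal sketch of what a delivery agent does when it wants a message to be durable. It isn't taken from any particular MDA; the path and message contents are made up and error handling is trimmed:

/* Minimal sketch: write one message durably, the way a delivery agent
   might.  Path and message are hypothetical; error handling trimmed. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/var/mail/example/new/msg.tmp";   /* hypothetical */
    const char msg[] = "From: a@example.com\r\n\r\nhello\r\n";

    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, msg, sizeof msg - 1) < 0) { perror("write"); return 1; }

    /* Until fsync() returns, the message exists only in RAM/cache.
       Delayed logging doesn't change that; it batches the journal
       (metadata) traffic that surrounds operations like this. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}

Each such fsync() is physical I/O the storage has to retire, which is exactly why delivery rate tracks fsync rate, and why a persistent write cache helps absorb the flushes.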
You have completely missed my point.
No, I haven't.
Your data isn't safe until it hits the disk. There are plenty of ways to spool data to ram rather than committing it, but they are all vulnerable to data loss until the data is written to disk.
The delayed logging code isn't a "ram spooler", although that is a mild side effect. Apparently I didn't explain it fully, or precisely. And keep in mind, I'm not the dev who wrote the code. So I'm merely repeating my recollection of the description from the architectural document and what was stated on the XFS list by the author, Dave Chinner of Red Hat.
You wrote: "filesystem metadata write operations are pushed almost entirely into RAM", but if the application requests an fsync then you still have to write it to disk? As such you are again limited by disk I/O, which itself is limited by the performance of the device (and temporarily accelerated by any persistent write cache). Hence my point that your IOPS are generally limited by the number of fsyncs and any persistent write cache?
In my desire to be brief I didn't fully/correctly explain how delayed logging works. I attempted a simplified explanation that I thought most would understand. Here is the design document: http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
Early performance numbers: http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
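And for anyone who wants to try the option from the start of this thread: it's just a mount option. Here's a minimal sketch via mount(2); the device and mount point are hypothetical, and it's equivalent to "mount -o delaylog /dev/sdb1 /srv/mail" (or putting delaylog in the fstab options) on a kernel that has the code:

/* Minimal sketch: mount an XFS volume with delayed logging enabled.
   Device and mount point are hypothetical. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* The last argument is the filesystem-specific option string. */
    if (mount("/dev/sdb1", "/srv/mail", "xfs", 0, "delaylog") != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}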
As I write this email I'm struggling to get a server running again that has just been rudely powered down due to a failing UPS (the power was fine, the UPS failed...). This isn't such a rare event (IMHO), and hence I think we do need to assume that at some point every machine will suffer a rude and unexpected event which loses all in-progress write cache. I have no complaints about XFS in general, but I think it's important that filesystem designers give some thought to this kind of event and to recovering from it?
Rest assured this is a top priority. Ever heard of SGI by chance? They sell supercomputers with 1024 CPUs, 16 terabytes of RAM, and petabyte FC RAID systems, in a shared memory NUMA configuration, i.e. "SMP", but with asymmetric memory access times. In short, it's a 1024 CPU server that costs something like $4+ million USD. SGI created XFS in '93/'94 and open sourced it in 2000 when they decided to move from MIPS/IRIX to Itanium/Linux. SGI has used nothing but XFS on all their systems since 1994. NASA currently has almost a petabyte of XFS storage, and 10 petabytes of CXFS storage. CXFS is the proprietary clustered version of XFS.
NASA is but one high profile XFS user on this planet. There are hundreds of others, including many US Government labs of all sorts. With customers such as these, data security/reliability is a huge priority.
Please try not to be so hostile in your email construction - we aren't all idiots here, and even if we were, your writing style is not conducive to us wanting to learn from your apparent wealth of experience?
You're overreacting. Saying "I'm not your personal XFS tutor" is not being hostile. Heh, if you think that was hostile, go live on NANAE for a few days or a week and report back on what real hostility is. ;)
-- Stan