[Dovecot] Best filesystem?
Hi everyone! What's the best filesystem to use for the mail spool? I'm debating between Debian with xfs or FreeBSD with zfs. I'm not sure which way to go. I'm migrating from cyrus. I have about 50 users so it's not a large setup.
Cheers! Monika
-- Monika Janek Systems Administrator, Side Effects Software Toronto, Ontario Canada 416-504-9876 x207 www.sidefx.com
Quoting Monika Janek <mjanek@sidefx.com>:
Hi everyone! What's the best filesystem to use for the mail spool? I'm
Yikes! Opening that can of worms again...
debating between Debian with xfs or FreeBSD with zfs. I'm not sure which way to go. I'm migrating from cyrus. I have about 50 users so it's not a large setup.
I would say both XFS and ZFS are fine. Pick the OS you want (Debian or FreeBSD) and then use the filesystem that comes with that OS. At the size you state, and the filesystems you list, the OS choice will be much more important than the filesystem is.
If you really want to pick a file system rather than an OS, then it would depend on the administrative/management features of the OS, not the purpose (mail spool). You would pick whichever is best for your needs as far as managing it goes.
Cheers! Monika
-- Eric Rostetter The Department of Physics The University of Texas at Austin
Go Longhorns!
Thanks! I like both...hmm..which one do I like better? :)
On 01/27/2011 03:59 PM, Eric Rostetter wrote:
Quoting Monika Janek <mjanek@sidefx.com>:
Hi everyone! What's the best filesystem to use for the mail spool? I'm
Yikes! Opening that can of worms again...
debating between Debian with xfs or FreeBSD with zfs. I'm not sure which way to go. I'm migrating from cyrus. I have about 50 users so it's not a large setup.
I would say both XFS and ZFS are fine. Pick the OS you want (Debian or FreeBSD) and then use the filesystem that comes with that OS. At the size you state, and the filesystems you list, the OS choice will be much more important than the filesystem is.
If you really want to pick a file system rather than an OS, then it would depend on the administrative/management features of the OS, not the purpose (mail spool). You would pick whichever is best for your needs as far as managing it goes.
Cheers! Monika
-- Monika Janek Systems Administrator, Side Effects Software Toronto, Ontario Canada 416-504-9876 x207 www.sidefx.com
On 1/27/11 2:59 PM -0600 Eric Rostetter wrote:
Quoting Monika Janek <mjanek@sidefx.com>:
Hi everyone! What's the best filesystem to use for the mail spool? I'm
Yikes! Opening that can of worms again...
lol
debating between Debian with xfs or FreeBSD with zfs. I'm not sure which way to go. I'm migrating from cyrus. I have about 50 users so it's not a large setup.
I would say both XFS and ZFS are fine. Pick the OS you want (Debian or FreeBSD) and then use the filesystem that comes with that OS.
+1. At that scale it doesn't matter.
On Fri, Jan 28, 2011 at 6:49 AM, Monika Janek <mjanek@sidefx.com> wrote:
Hi everyone! What's the best filesystem to use for the mail spool? I'm debating between Debian with xfs or FreeBSD with zfs. I'm not sure which way to go. I'm migrating from cyrus. I have about 50 users so it's not a large setup.
Cheers! Monika
xfs is not very nice to you if you lose power. It's not as bad as it used to be, but it still gives you 0-byte files, so make sure you have a good UPS to issue a safe shutdown of the server. If you do, xfs is better; we're using CentOS.
Nick Edwards put forth on 1/28/2011 9:47 PM:
xfs is not very nice to you if you lose power. It's not as bad as it used to be, but it still gives you 0-byte files, so make sure you have a good UPS to issue a safe shutdown of the server. If you do, xfs is better; we're using CentOS.
It would be nice if you spoke from experience or documentation instead of rumor. I've been using XFS on Postfix/Dovecot servers for quite some time and have had power loss on at least 2 occasions with zero data loss or FS corruption. What you state is factually incorrect. Stop spreading XFS FUD please, Nick.
From: http://xfs.org/index.php/XFS_FAQ
"Q: Why do I see binary NULLS in some files after recovery when I unplugged the power?
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you'll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the xfs_bmap(8) command)."
Note "the same is true of other filesystems". Since 2007, XFS is no worse nor better at handling power loss than any other journaling filesystem. If power loss is a frequent concern in any organization, using any filesystem, you had best be fsync'ing all writes. No filesystems today cause data loss due to power loss. Data loss due to power loss today is caused by the Linux buffer cache having pending disk writes held in memory. When the power goes, so do the contents of RAM, so you loose that data. This is filesystem independent.
-- Stan
On Sun, Jan 30, 2011 at 4:00 AM, Stan Hoeppner <stan@hardwarefreak.com>wrote:
Nick Edwards put forth on 1/28/2011 9:47 PM:
xfs is not very nice to you if you lose power. It's not as bad as it used to be, but it still gives you 0-byte files, so make sure you have a good UPS to issue a safe shutdown of the server. If you do, xfs is better; we're using CentOS.
It would be nice if you spoke from experience or documentation instead of rumor.
I do speak from experience; it is also a /very/ well known fact. For someone who rants on and on and on with self-justification, you sure don't actually know a lot, do you?
Not that it is any of anyone's business, but we operated XFS for years before we moved to a dedicated hardware NFS server, because we too did not think it could be as bad as we had heard. It was. At that time we housed our equipment off-site, so we were dependent on the colo's UPS and generators, which failed often enough for us to see the mess it causes. These days we host in-house and won't ever outsource hosting again; if it fails now, we have only ourselves to blame. Luckily, we have had perfect uptime since that move and have never needed snapshots to recover.
I've been using XFS on Postfix/Dovecot servers for quite some time and have had power loss on at least 2 occasions with zero data loss or FS corruption. What you state is factually incorrect. Stop spreading XFS FUD please, Nick.
BS. How many users do you have? How busy are your servers? Obviously not much of either, I'd say, based on what you claim on this list. Until you have real-world experience - meaning something greater than the puny userbase you have - please do not spout what you think as gospel; you only make yourself look like an even bigger ranting fool.
Here's another hint: never believe everything you read on Wikipedia or Google; so much of it is from people who have less of a clue than you.
Timo moderated some other clown last week; it is a pity he did not moderate a few others. I have seen so much rubbish from you, and your 555 pages of ranting replies trying to blind those who have not woken up to you yet; it is astounding. Wietse put you in your place on the Postfix list late last year for ranting on about stuff you have no idea about, but you clearly did not learn a thing from that either.
People like yourself are in the category of "a little information in the hands of some can be very, very dangerous" for newbies.
nick
Nick Edwards put forth on 1/29/2011 7:14 PM:
On Sun, Jan 30, 2011 at 4:00 AM, Stan Hoeppner <stan@hardwarefreak.com>wrote:
Nick Edwards put forth on 1/28/2011 9:47 PM:
xfs is not very nice to you if you lose power. It's not as bad as it used to be, but it still gives you 0-byte files, so make sure you have a good UPS to issue a safe shutdown of the server. If you do, xfs is better; we're using CentOS.
It would be nice if you spoke from experience or documentation instead of rumor.
I do speak from experience; it is also a /very/ well known fact. For someone who rants on and on and on with self-justification, you sure don't actually know a lot, do you?
It's interesting that you cut the "fact" out of my reply, then proceeded to rant against me personally, instead of discussing the subject at hand. Here, I'll add the XFS power loss fact back in, as that is the subject of this thread currently: From: http://xfs.org/index.php/XFS_FAQ
"Q: Why do I see binary NULLS in some files after recovery when I unplugged the power?
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1..."
That proves you factually incorrect. Now, if you were using XFS prior to May 2007, or using an antique kernel in a Linux distro (RHEL/CentOS) that shipped well after that date but without the patches, I can see how you might have had issues. Patching systems is not the responsibility of the devs, however. If you were running un-patched systems after May 2007, that's not the fault of XFS.
Given the problem you describe was fixed almost 4 years ago, do you feel it's proper to continue to denigrate XFS today, spreading FUD 4 years after the issue was resolved?
Also, I didn't quote anything from Wikipedia. It seems you have confused the XFS FAQ website with Wikipedia simply because it uses Wiki software and thus has the same look/feel as Wikipedia. Are you saying that one can't trust the content on any site running Wiki software because it simply "looks like" Wikipedia?
-- Stan
On Sun, 2011-01-30 at 16:13 -0600, Stan Hoeppner wrote:
be, but it still gives you 0 byte files, so make sure you have a good UPS .. "Q: Why do I see binary NULLS in some files after recovery when I unplugged the power?
0 byte files != NULL bytes in files. My guess is it's the same problem as described in http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-lengt...
On Mon, Jan 31, 2011 at 12:40:11AM +0200, Timo Sirainen wrote:
On Sun, 2011-01-30 at 16:13 -0600, Stan Hoeppner wrote:
be, but it still gives you 0 byte files, so make sure you have a good UPS .. "Q: Why do I see binary NULLS in some files after recovery when I unplugged the power?
0 byte files != NULL bytes in files. My guess is it's the same problem as described in http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-lengt...
But is it relevant for dovecot ? Isn't dovecot doing the necessary fsync()'s, so this should really be a non-issue ?
-jf
On Mon, 2011-01-31 at 00:03 +0100, Jan-Frode Myklebust wrote:
0 byte files != NULL bytes in files. My guess is it's the same problem as described in http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-lengt...
But is it relevant for dovecot ? Isn't dovecot doing the necessary fsync()'s, so this should really be a non-issue ?
Depends on what mail_fsync has been set to. As long as it's not "never", then it should be non-issue for Dovecot.
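Concretely, a minimal sketch of the settings in question (defaults as understood here; verify against the documentation for your Dovecot version):

    # Dovecot v2.x (dovecot.conf)
    mail_fsync = optimized    # default; "always" is the safest choice (and the usual advice with NFS),
                              # "never" disables fsync and risks losing recently saved mail on a crash

    # Dovecot v1.x equivalent
    fsync_disable = no        # default; setting it to "yes" trades data safety for performance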
Timo Sirainen wrote:
On Mon, 2011-01-31 at 00:03 +0100, Jan-Frode Myklebust wrote:
But is it relevant for dovecot ? Isn't dovecot doing the necessary fsync()'s, so this should really be a non-issue ?
Depends on what mail_fsync has been set to. As long as it's not "never", then it should be non-issue for Dovecot.
Timo, is this the mail_fsync conf item (I guess not)?:
MainConfig - Dovecot Wiki "fsync_disable = no Don't use fsync() or fdatasync() calls. This makes the performance better at the cost of potential data loss if the server (or the file server) goes down."
http://wiki1.dovecot.org/MainConfig
Is mail_fsync a v2 item? We're using Dovecot v1, for now. Presumably
fsync_disable = no
is the default, so that fsyncs take place?
As I understand it, Dovecot rebuilds its indexes if they become corrupted and, if that's the case, then there is no filesystem vulnerability in respect of those. We're using maildir. How soon after each mail message is written, moved, renamed, etc, does Dovecot issue fsyncs? Is there much 'commit-delay' up to that point, which might be a vulnerability window?
Finally, and I do apologise for all the questions, we're wishing to move to NFS. (At the moment we have a 'one box' Dovecot solution, but this makes upgrade of OS, upgrade of Dovecot, or upgrade of storage always a problem. We have already exported the new XFS filestore over NFS - but Dovecot is not (yet) using it, that's the next step for us.) Does the fsync solution we've been discussing work just as well when the XFS filestore is exported over NFS?
regards, Ron
Ron Leach wrote:
Finally, and I do apologise for all the questions, we're wishing to move to NFS. (At the moment we have a 'one box' Dovecot solution, but this makes upgrade of OS, upgrade of Dovecot, or upgrade of storage always a problem. We have already exported the new XFS filestore over NFS - but Dovecot is not (yet) using it, that's the next step for us.) Does the fsync solution we've been discussing work just as well when the XFS filestore is exported over NFS?
I realise replying to self is not the best thing to do, but I do not want to waste people's time. NFS is handled with quite different system calls, and has quite different 'sequential behaviour' (for want of a better phrase). Deep in the comments on Ted Ts'o's post - referred to previously by Timo - are these remarks:
Delayed allocation and the zero-length file problem | Thoughts by Ted
"In comment #49, Ted says:
First of all, that’s not NFS’s model; NFS v2/v3 requires that each write RPC call not return until the data has hit stable storage. So in fact it’s a stronger requirement than alloc on close.
This statement is misleading.
Firstly, it accurately describes the NFSv2 semantics, but no sane person deliberately uses NFSv2 anymore so the statement is of no help in the real world.
The NFSv3 semantics are more flexible. The v3 WRITE RPC adds a flag which allows the client to say whether it wants the data on stable storage before the RPC returns, i.e. whether to do the old slow thing that was the only way with NFSv2. This flag is hardly ever used by clients (O_SYNC or a “sync” mount enable it, as does O_DIRECT in most circumstances). Instead clients will typically send a bunch of WRITE calls with data, and then a COMMIT call which does the actual forcing of data to server-side stable storage. This is significantly faster than the NFSv2 model.
NFSv4 behaves like NFSv3 in this regard, but adds a further feature called file delegations which complicate the picture even further.
Now to actually answer Frank’s question. But first some background.
The NFS protocol doesn’t know about any block-level behaviour like allocation, it works entirely on files. When the allocation occurs is a server-side implementation detail entirely. Having the data on stable storage is indeed a stronger requirement than forcing allocation, but the server could choose to do the allocation at any time from the start of unstable WRITE RPC to the end of the COMMIT RPC, which can be a window of several seconds.
NFS however does keep data in the client which has been written by applications on the client but not yet sent to the server. If the lifetime of the application is short and the file is small, this could include the entire data of the file.
NFS clients practice a behaviour known as CTO or "Close-To-Open consistency", which is a very weak form of inter-client file cache consistency. This means that when the application close()s the last fd the client will perform the equivalent of an fsync(), i.e. issue WRITEs to the server for any dirty data remaining in the client and a COMMIT to force that data to stable storage on the server. In other words, when close() returns to the app, the data is safe on the server. This is the behaviour on close() that Frank refers to above. Note that this is a much tighter constraint than POSIX requires."
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-lengt... [Comment # 84]
Further, the NFS shares can be mounted on the client with a 'sync' option that forces physical writes before returning to the caller. Though this would be horrifically slow in any high load (network transmission times, disc io queues etc), in our situation of low load we could consider using this option to minimise the potential for email loss due to crash or power fail.
One further optimization, not relevant to Dovecot or email, but worth mentioning in the (unlikely) event that anyone is really this interested, if we were to split our XFS share into 'two' shares, one for email, and the other for general data storage, then we could apply 'sync' only to the XFS share for email (hence ensuring immediate writes) and not to the other share for general storage.
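As a sketch of how that split might look on the client side (the hostname, export paths and mount points below are made up; the option names are the standard Linux NFS client ones):

    # /etc/fstab on the Dovecot host
    filer:/export/mail   /srv/mail   nfs   rw,hard,intr,sync   0 0   # mail store: synchronous writes
    filer:/export/data   /srv/data   nfs   rw,hard,intr        0 0   # general data: normal (async) client caching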
Unless I'm wrong about something here, I think this closes the NFS-related concern about XFS and Dovecot and loss of email.
regards, Ron
On 1/31/11 5:23 PM +0000 Ron Leach wrote:
Ron Leach wrote:
Finally, and I do apologise for all the questions, we're wishing to move to NFS. (At the moment we have a 'one box' Dovecot solution, but this makes upgrade of OS, upgrade of Dovecot, or upgrade of storage always a problem. We have already exported the new XFS filestore over NFS - but Dovecot is not (yet) using it, that's the next step for us.) Does the fsync solution we've been discussing work just as well when the XFS filestore is exported over NFS? ...
Further, the NFS shares can be mounted on the client with a 'sync' option that forces physical writes before returning to the caller. Though this would be horrifically slow in any high load (network transmission times, disc io queues etc), in our situation of low load we could consider using this option to minimise the potential for email loss due to crash or power fail.
One further optimization, not relevant to Dovecot or email, but worth mentioning in the (unlikely) event that anyone is really this interested, if we were to split our XFS share into 'two' shares, one for email, and the other for general data storage, then we could apply 'sync' only to the XFS share for email (hence ensuring immediate writes) and not to the other share for general storage.
Unless I'm wrong about something here, I think this closes the NFS-related concern about XFS and Dovecot and loss of email.
You're wrong.
Yes, NFS semantics "guarantee" commit, but what happens is that the underlying filesystem (e.g. ext3, xfs with metadata spooling) lies to the kernel about what has been committed. The NFS call returns, but the data has not actually been committed. There have even been NFS server tweaks for some implementations that themselves lie to the client, ie w/o depending on filesystem lies, for performance reasons.
These are all valid performance tweaks but most times people don't understand the effect on data integrity.
So the summary is, just because you're using NFS doesn't mean anything wrt data integrity or data loss. You still have to understand the underlying filesystem issue and you also have to understand the NFS server issues.
But, e.g., if you're using Netapp (WAFL), you will have high performance as well as correct NFS semantics wrt data loss.
Frank Cusack wrote:
On 1/31/11 5:23 PM +0000 Ron Leach wrote:
Further, the NFS shares can be mounted on the client with a 'sync' option that forces physical writes before returning to the caller. Though this would be horrifically slow in any high load (network transmission times, disc io queues etc), in our situation of low load we could consider using this option to minimise the potential for email loss due to crash or power fail.
Unless I'm wrong about something here, I think this closes the NFS-related concern about XFS and Dovecot and loss of email.
You're wrong.
Yes, NFS semantics "guarantee" commit, but what happens is that the underlying filesystem (e.g. ext3, xfs with metadata spooling) lies to the kernel about what has been committed. The NFS call returns, but the data has not actually been committed. There have even been NFS server tweaks for some implementations that themselves lie to the client, ie w/o depending on filesystem lies, for performance reasons.
Oh, dear.
So does that mean we're lost, having built these XFS data servers? Is XFS, then, the 'wrong' choice for email integrity across crashes and power fail, even when using NFS in low load systems?
Fortunately, these two machines are only being used for backup at the moment - we haven't migrated the application 'live' data stores - yet. So we could rebuild them. With ZFS, maybe.
All we want to do is not lose emails.
What does everyone else do? Lose emails?
regards, Ron
Ron Leach put forth on 1/31/2011 5:00 PM:
What does everyone else do? Lose emails?
No. We have decent power backup systems and management interfaces so systems don't abruptly lose power. We also use good hardware with a good mainline kernel driver track record.
I think you forgot about the two failure scenarios that you need to worry about in this thread:
- Kernel/system crash
- Power loss
If you're using decent hardware with decent drivers that have been fleshed out over the years in mainline, you can forget #1. If you have decent UPS units, management interfaces, and shutdown software, you don't need to worry about #2.
After those two are covered, the only thing you need to worry about is hardware going flaky. In that case, nothing will save you but good backups. Thankfully most hardware today is pretty reliable (system boards, HBAs, etc).
-- Stan
On 1/31/2011 3:00 PM, Ron Leach wrote:
All we want to do is not lose emails.
What does everyone else do? Lose emails?
regards, Ron
I've followed several of the filesystem threads here with interest - and tips & insight on improving performance are always exciting. But I'm very confused with regards to some of the questions about data integrity. Filesystem selection can be at least as serious a commitment as marriage - but if the goal is to avoid data loss due to power failure, it seems the solution to the problem is not having the problem in the first place.
I'm responsible for running a massive server farm (1 box) for an extraordinary number of users (at least 5 active accounts), so I may have a distorted view of reality. But I will say the only time I've lost mail has been through OP error (at least since I left MS Exchange behind...shudder). And since that idiot IT guy looks an awful lot like my mirror image...
I'm sure those OPs with larger budgets might have some hardware suggestions for reducing the chance of hardware failure leading to data loss (I mean, other than using good components, installed properly with oversized cooling - and possibly proactive upgrade/replacements prior to anticipated lifetime failure - how can you ELIMINATE the possibility of a CPU/controller/HD just deciding to blow up?)
My first exposure to the wonders of UPS's had nothing to do with keeping the system on while the lights were out. Having moved from a location where the power was relatively stable, I came to a new city and it turns out power quality sucks - and so do the grounds. I had constant failures, crashes, and general unstable behaviour from my PC. As soon as I added a UPS (with line conditioning) - problems magically disappeared.
If you have a proper-sized UPS, combined with notification from the UPS to the servers to perform orderly shutdowns - including telling the application servers to shutdown prior to the storage servers, etc. - doesn't that render the (possibly more than theoretical) chances of data loss due to power interruption a moot point?
-- Daniel
Daniel L. Miller wrote:
On 1/31/2011 3:00 PM, Ron Leach wrote:
All we want to do is not lose emails.
What does everyone else do? Lose emails?
I'm responsible for running a massive server farm (1 box) for an extraordinary number of users (at least 5 active accounts), so I may have a distorted view of reality. But I will say the only time I've lost mail has been through OP error (at least since I left MS Exchange behind...shudder). And since that idiot IT guy looks an awful lot like my mirror image...
Daniel, very nicely put.
In my experience also - aside from failures of reticulated power - most problems come from maintenance staff error. Someone already posted that people can pull the wrong cable, or switch off the wrong item, etc. Let's keep this in mind ...
I'm sure those OPs with larger budgets might have some hardware suggestions for reducing the chance of hardware failure leading to data loss (I mean, other than using good components, installed properly with oversized cooling - and possibly proactive upgrade/replacements prior to anticipated lifetime failure - how can you ELIMINATE the possibility of a CPU/controller/HD just deciding to blow up?)
Exactly, you can't. But that doesn't mean you can't very substantially reduce the impact of those problems. So, in these circumstances, one thing you can do is reduce the vulnerability - the susceptibility, if you will - of the data to these types of system failure (which cannot be eliminated, as you say). Additionally, you can try to arrange a minimum recovery capability even when failure is totally catastrophic.
You can protect against HD failure by using RAID, and achieve a certain level of assurance, possibly something very close to 100% in respect of that particular failure.
Since the HDs can be considered 'secure' (well, something very close to 100% available), data can be that secure 'provided' it is written to the HD. Since failures can occur at any time, the smaller the time that data exists that is 'not' on the HD, compared to the time that data 'is' on the HD, the less 'likely' that data will be lost when one of these unpreventable system failures occurs.
In filesystems that immediately write data to the HD there is, in principle, no period when data is 'unwritten'. But (and you can see what's coming) with filesystems that wait 30 seconds before writing to disk the data that the application 'thinks' has been safely written, there is a 30 second 'window' of vulnerability to one of these events. On a large system with a lot of transactions there might 'always' be some data sitting waiting to be written, and therefore whenever one of these 'uneliminatable' events occurs, data will be lost.
Let's assume, for a moment, there is a message every 5 seconds, so there are 6 email messages waiting to go to disk in each 30 second window. (For a very large corporation, the email arrival rate may be much larger, of course.)
So, adding the number of 'serious' operator mistakes that might be expected per machine per year (shall we say 1?) to the likelihood of electronic component failure (shall we say 50,000 hr MTBF, so roughly 0.2 events per year), we might expect 1.2 'events' per year. 1.2 x 6 messages is 7 email messages lost per year (7.2, actually). Due to the vulnerability window being 30 seconds. (Many more in the case of a greater message arrival rate, for a large corporate.)
Now let's see how many messages are lost if the filesystem writes to disk every 5 seconds, instead of every 30 seconds. The vulnerability window in this case is 5 seconds, and we'll have 1 message during that time. Same 'number' of events each year - 1.2 - so we'll lose 1.2 x 1 message, that's 1 message (1.2, actually). So with different filesystem behaviours, we can reduce the numbers of lost messages each year, and reduce the 'likelihood' that any particular message will be lost.
Assuming that a message availability target might be, say, fewer than 1 message lost in 10^8, the impact of each of the parameters in this calculation becomes important. Small differences in operator error rates, in vulnerability windows, and in equipment MTBFs, can make very large differences to the probability of meeting the availability targets.
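For anyone who wants to replay that arithmetic with their own figures, a small sketch (the rates used are only the assumptions stated above):

    # expected messages lost per year = (events per year) x (messages sitting in the commit window)
    events_per_year = 1.0 + (365.0 * 24.0) / 50000.0   # ~1 operator error + ~0.18 hardware failures (50,000 hr MTBF)
    arrival_interval = 5.0                             # one message every 5 seconds

    for window in (30.0, 5.0):                         # commit-delay window in seconds
        msgs_at_risk = window / arrival_interval
        print("%2.0f s window: %.1f messages lost per year" % (window, events_per_year * msgs_at_risk))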
And I haven't even mentioned UPSs, yet.
If you have a proper-sized UPS, combined with notification from the UPS to the servers to perform orderly shutdowns - including telling the application servers to shutdown prior to the storage servers, etc. - doesn't that render the (possibly more than theoretical) chances of data loss due to power interruption a moot point?
UPSs are a great help, but they are not failure-immune. They too, can fail, and will fail. They may just suddenly switch off, or they may fail to provide the expected duration of service, or they may fail to operate when the reticulated power does fail. We can add their failure rate into the calculations. I haven't any figures for them, but I'd guess at 3 years MTBF, so let's say another 0.3 events per year. We could redo the calculations above, with 1.5, now, instead of 1.2 - but I don't think we need to, on this list. (Of course, if we don't use a UPS, we'll have a seriously high event rate with every power glitch or drop wreaking havoc, so the lost message calculation would be much greater.)
Daniel, I'm delighted but not in the least surprised that you haven't lost a message. But I fully expect you will sometime in your operation's life unless you use (a) redundant equipment (eg RAID) with (b) very minimal windows of vulnerability (which, following that other thread, means a filesystem that does immediately write to disk when it is asked to do so and, seemingly, not all high-performance filesystems do).
regards, Ron
At 23:43 +0000 1/2/11, Ron Leach wrote:
Since the HDs can be considered 'secure' (well, something v close to 100% available), data can be that secure 'provided' it is written to the HD. Since failures can occur at any time, the smaller the time that data exists that is 'not' on the HD, compared to the time that data 'is' on the HD, the less 'likely' that data will be lost when one of these unpreventable system failures occurs. In filesystems that immediately write data to the HD there is, in principle, no period when data is 'unwritten'. But, (and you can see what's coming), with filesystems that wait 30 seconds before writing to disk the data that the application 'thinks' has been safely written, then there is a 30 second 'window' of vulnerability to one of these events. On a large system with a lot of transactions, there might 'always' be some data that's sitting waiting to be written, and therefore whenever one of these 'uneliminatable' events occurs, data will be lost. Let's assume, for a moment, there is a message every 5 seconds, so there are 6 email messages waiting to go to disk in each 30 second window. (For a very large corporation, the email arrival rate may be much larger, of course.)
As Stan says, strictly, any buffering delay in writing is independent of filesystem. It depends on the operating system and the drivers supplied for the filesystem. In practice, the access provided to the filesystem by the operating system may force a link between filesystem choice and delayed writes.
The Unix sync flush to disc is traditionally performed every 30 secs - by the wall-clock, not 30 secs after the data was queued to write. This means that the mean (average?) delay is 15 secs, not 30.
UPSs are a great help, but they are not failure-immune. They too, can fail, and will fail. They may just suddenly switch off, or they may fail to provide the expected duration of service, or they may fail to operate when the reticulated power does fail. We can add their failure rate into the calculations. I haven't any figures for them, but I'd guess at 3 years MTBF, so let's say another 0.3 events per year. We could redo the calculations above, with 1.5, now, instead of 1.2 - but I don't think we need to, on this list. (Of course, if we don't use a UPS, we'll have a seriously high event rate with every power glitch or drop wreaking havoc, so the lost message calculation would be much greater.)
That's why the more expensive machines have multiple power supplies. Dual power supplies fed by two UPSs from different building feeds greatly reduce the chance of failure due to PSU, UPS or local power distribution board failure. One power distribution company client even had the equivalent of two power stations, but not many can manage that.
David
-- David Ledger - Freelance Unix Sysadmin in the UK. HP-UX specialist of hpUG technical user group (www.hpug.org.uk) david.ledger@ivdcs.co.uk www.ivdcs.co.uk
If you have a proper-sized UPS, combined with notification from the UPS to the servers to perform orderly shutdowns - including telling the application servers to shutdown prior to the storage servers, etc. - doesn't that render the (possibly more than theoretical) chances of data loss due to power interruption a moot point?
UPSs are a great help, but they are not failure-immune. They too, can fail, and will fail. They may just suddenly switch off, or they may fail to provide the expected duration of service, or they may fail to operate when the reticulated power does fail. We can add their failure rate into the calculations. I haven't any figures for them, but I'd guess at 3 years MTBF, so let's say another 0.3 events per year. We could redo the calculations above, with 1.5, now, instead of 1.2 - but I don't think we need to, on this list. (Of course, if we don't use a UPS, we'll have a seriously high event rate with every power glitch or drop wreaking havoc, so the lost message calculation would be much greater.)
Daniel, I'm delighted but not in the least surprised that you haven't lost a message. But I fully expect you will sometime in your operation's life unless you use (a) redundant equipment (eg RAID) with (b) very minimal windows of vulnerability (which, following that other thread, means a filesystem that does immediately write to disk when it is asked to do so and, seemingly, not all high-performance filesystems do).
Just to add a note about power and 'knowledge' - I built my first OpenSolaris server with a decent-size ZFS array, re-using a 'retired' case and power supply a couple of years ago. It drove me crazy at first - I didn't even have it in production and ZFS kept failing random disks at random intervals. I happened to stumble across a post from another user who had the same problem, and it turned out to be a 'poor' power supply. Sure enough, a brand new power supply 'fixed' the problem. Did I lose any data in the past? I have no idea; maybe it was temp data, maybe it culminated in a Windows crash or odd OS error. All I know is that ZFS, in a roundabout way, found a problem I would never have known I had. I love ZFS; its snapshots are the closest thing I've found to my beloved Novell SALVAGE command ;)
Rick
On 31.1.2011, at 12.34, Ron Leach wrote:
MainConfig - Dovecot Wiki "fsync_disable = no Don't use fsync() or fdatasync() calls. This makes the performance better at the cost of potential data loss if the server (or the file server) goes down."
http://wiki1.dovecot.org/MainConfig
Is mail_fsync a v2 item? We're using Dovecot v1, for now. Presumably
fsync_disable = no
is the default, so that fsyncs take place?
Right.
As I understand it, Dovecot rebuilds its indexes if they become corrupted and, if that's the case, then there is no filesystem vulnerability in respect of those. We're using maildir. How soon after each mail message is written, moved, renamed, etc, does Dovecot issue fsyncs? Is there much 'commit-delay' up to that point, which might be a vulnerability window?
Success isn't returned to dovecot-lda or IMAP APPEND call until the mail has been fsynced. As long as the disk doesn't lie and the filesystem doesn't lie, there is zero data loss when fsyncing isn't disabled with Dovecot.
Finally, and I do apologise for all the questions, we're wishing to move to NFS. (At the moment we have a 'one box' Dovecot solution, but this makes upgrade of OS, upgrade of Dovecot, or upgrade of storage always a problem. We have already exported the new XFS filestore over NFS - but Dovecot is not (yet) using it, that's the next step for us.) Does the fsync solution we've been discussing work just as well when the XFS filestore is exported over NFS?
fsync() makes sure that the data is sent to the NFS server. I don't know if the NFS protocol itself has an fsync()-like call that guarantees the data is written to disk on the server, but I very much doubt it does. So I don't think NFS will help with any data guarantees.
BTW. I'm pretty tired of reading about (or mostly skipping over) filesystem messages in this list. How about moving all this stuff to a wiki page where you can fight it out? Then in future related messages just point to the wiki link. Here's a suggestion for the new page: http://wiki2.dovecot.org/FileSystems
On 2/1/11 3:49 AM +0200 Timo Sirainen wrote:
fsync() makes sure that the data is sent to NFS server. I don't know if NFS protocol itself has a fsync() call that guarantees that the data is written on disk on the server, but I very much doubt it does. So I don't think NFS will help with any data guarantees.
I think it was Stan who pointed this out (sorry if that's a misattribution), but all calls in v2 are synchronous, while v3 and v4 have specific calls (invoked by a local fsync()) for which the NFS protocol requires that the data be committed to disk, so that fsync() semantics are preserved.
But as I noted earlier, if the filesystem lies to the NFS stack, or the NFS stack intentionally lies to the client, this may not be true.
As far as the NFS protocol is concerned though, there is such a call.
Timo Sirainen put forth on 1/31/2011 7:49 PM:
BTW. I'm pretty tired of reading about (or mostly skipping over) filesystem messages in this list. How about moving all this stuff to a wiki page where you can fight it out? Then in future related messages just point to the wiki link. Here's a suggestion for the new page: http://wiki2.dovecot.org/FileSystems
Sorry for the noise Timo. Maybe Frank and I, maybe some others, can try to write something up together in a coherent manner for that wiki page.
-- Stan
On Tue, 1 Feb 2011 03:49:04 +0200 Timo Sirainen <tss@iki.fi> articulated:
BTW. I'm pretty tired of reading about (or mostly skipping over) filesystem messages in this list. How about moving all this stuff to a wiki page where you can fight it out? Then in future related messages just point to the wiki link. Here's a suggestion for the new page: http://wiki2.dovecot.org/FileSystems
Hallelujah
+1
-- Jerry ✌ Dovecot.user@seibercom.net
Disclaimer: off-list followups get on-list replies or get ignored. Please do not ignore the Reply-To header.
Timo Sirainen put forth on 1/30/2011 4:40 PM:
On Sun, 2011-01-30 at 16:13 -0600, Stan Hoeppner wrote:
be, but it still gives you 0 byte files, so make sure you have a good UPS .. "Q: Why do I see binary NULLS in some files after recovery when I unplugged the power?
0 byte files != NULL bytes in files. My guess is it's the same problem as described in http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-lengt...
Yes, very similar Timo.
To be clear, for any subscribers who haven't followed all of the various filesystem and data security threads, with any modern *nix system, you WILL lose data when power fails. How much depends on how many writes to disk were in flight when the power failed, and how one has their RAID controller and inside-the-disk caches configured, whether using barriers, etc.
I believe I mentioned this when discussing the merits of XFS and ZFS with Frank, who stated Solaris/ZFS were immune to this, to which I called BS. They aren't immune, as Ted Ts'o clearly states. For those who don't know, Ted Ts'o is an MIT Ph.D., is a longtime developer and maintainer of EXT2/3/4, and is to this day an active Linux kernel hacker/developer on filesystems and storage drivers.
-- Stan
On Sun, 2011-01-30 at 17:07 -0600, Stan Hoeppner wrote:
0 byte files != NULL bytes in files. My guess is it's the same problem as described in http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-lengt...
Yes, very similar Timo.
To be clear, for any subscribers who haven't followed all of the various filesystem and data security threads, with any modern *nix system, you WILL lose data when power fails. How much depends on how many writes to disk were in flight when the power failed, and how one has their RAID controller and inside-the-disk caches configured, whether using barriers, etc.
But it's not just about how much data is lost. It's also about whether any existing data is unexpectedly lost. That's why people were complaining about ext4: suddenly, renaming a file over another might lose both the old and the new file's contents if power got lost, while with ext3 either the old or the new data stayed behind. Then they did all kinds of things to ext4 to fix this / make it less likely.
I don't know how likely that is with XFS. Probably one way to test would be something like:
- Create 100 files of 1 MB size.
- sync
- Create a new file of 2 MB size & rename() it over a file created in step 1.
- Repeat 3 until all files are replaced
- Kill the power immediately after done
Then you can compare filesystems based on how many files there are whose size or content doesn't match.
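A rough sketch of those steps (Python 3; the directory is hypothetical, and the power cut itself obviously has to happen outside the script):

    import os

    testdir = "/mnt/xfs-test"

    # Steps 1-2: create 100 files of 1 MB and force them to disk.
    for i in range(100):
        with open(os.path.join(testdir, "file.%03d" % i), "wb") as f:
            f.write(b"\xaa" * (1024 * 1024))
    os.sync()

    # Steps 3-4: write a 2 MB replacement and rename() it over each original.
    for i in range(100):
        tmp = os.path.join(testdir, "new.%03d" % i)
        with open(tmp, "wb") as f:
            f.write(b"\xbb" * (2 * 1024 * 1024))
        os.rename(tmp, os.path.join(testdir, "file.%03d" % i))

    # Step 5: kill the power now, reboot, then check which files have the
    # wrong size or old/garbled contents.
    print("done - pull the plug")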
Timo Sirainen put forth on 1/30/2011 5:27 PM:
But it's not just about how much data is lost. It's also about if any existing data is unexpectedly lost. That's why people were complaining about ext4, because suddenly when renaming a file over another might lose both the old and the new file's contents if power got lost, while with ext3 either the old or the new data stayed behind. Then they did all kinds of things to ext4 to fix this / make it less likely.
People were complaining about EXT4 because EXT2/3 implemented features to "save bad programmers from themselves", even though it is NOT the job of the filesystem code to do so. EXT4 removed these safeguards and bad programmers who relied on EXT2/3 to cross their Ts and dot their Is for them threw fits when they realized EXT4 didn't do this for them any longer. Google "O_PONIES", and the blog entry from Eric Sandeen, an XFS developer, regarding O_PONIES: http://sandeen.net/wordpress/?p=42
XFS never had such "protections for bad programmers". The bulk of IRIX developers were well/over educated and well/over paid, usually working for the US government in one form or another. Such developers knew when to fsync or take other measures to make sure critical data hit the disks. I dare say the average Linux developer didn't/doesn't have quite the same level of education or proper mindset as IRIX devs. If they'd had such skill we'd not have seen the EXT2/3 to EXT4 problem Ted describes.
I don't know how likely that is with XFS. Probably one way to test would be something like:
- Create 100 files of 1 MB size.
- sync
- Create a new file of 2 MB size & rename() it over a file created in step 1.
- Repeat 3 until all files are replaced
- Kill the power immediately after done
Then you can compare filesystems based on how many files there are whose size or content doesn't match.
Depending on your hardware, you may need a lot larger test set than 100 files. If you don't sync between steps 4/5 you may not see anything except that the 100 overwrites never occurred, as those writes may all still be in the buffer cache when you pull the plug.
Assuming you can call pull_the_plug with "perfect" timing, I can't tell you what the exact results would be, as I've never tested this. You'll likely lose a pre-existing file depending on what inodes were committed to the journal without their respective files being written. I'm pretty sure this is one of those scenarios that prompt programming professors to teach "create new/delete old/rename new to old", instead of "rename/edit in place/overwrite".
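For what it's worth, the "create new / rename over old" pattern looks roughly like this when done with explicit syncs (a sketch only, not Dovecot's actual code):

    import os

    def replace_durably(path, data):
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # new contents are on stable storage before the rename
        os.rename(tmp, path)         # atomic: a crash leaves either the old or the new file
        dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
        try:
            os.fsync(dirfd)          # make the directory entry (the rename itself) durable
        finally:
            os.close(dirfd)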
I recall back in the day that WordPerfect and Microsoft Word both implemented making a temp copy of every file opened for edit, because crashes of said text editors would routinely corrupt the opened file. I believe MS Word and Open Office Writer still do the same today.
It seems some fundamentals haven't really changed that much in 25 years.
-- Stan
On 1/30/11 5:07 PM -0600 Stan Hoeppner wrote:
To be clear, for any subscribers who haven't followed all of the various filesystem and data security threads, with any modern *nix system, you WILL lose data when power fails. How much depends on how many writes to disk were in flight when the power failed, and how one has their RAID controller and inside-the-disk caches configured, whether using barriers, etc.
That's incorrect. When you fsync() a file, all sane modern filesystems guarantee no data loss, unless you tune that out administratively for performance reasons. If you use a log structured filesystem (like zfs or WAFL) you can optimize the performance as well. With other types of filesystems (like xfs), performance suffers severely under heavy sync write loads.
As a reference point, ext3 with default settings guarantees data loss under normal conditions so I do not consider it a sane filesystem. You can tune that behavior out (so that you preserve data), but in that case ext3 operates with sub-par performance.
I believe I mentioned this when discussing the merits of XFS and ZFS with Frank, who stated Solaris/ZFS were immune to this, to which I called BS. They aren't immune, as Ted Ts'o clearly states. For those who don't know, Ted Ts'o is an MIT Ph.D., is a longtime developer and maintainer of EXT2/3/4, and is to this day an active Linux kernel hacker/developer on filesystems and storage drivers.
Ted is a close acquaintance of mine, and if he indeed says what you said he says, he is wrong. More likely, he was simplifying or talking about certain cases, not the general case.
There are two ways to guarantee no data loss with zfs, one is to disable the ZIL (low performance) and the 2nd is to use a slog (high performance).
-frank
Frank Cusack put forth on 1/31/2011 3:06 PM:
On 1/30/11 5:07 PM -0600 Stan Hoeppner wrote:
To be clear, for any subscribers who haven't followed all of the various filesystem and data security threads, with any modern *nix system, you WILL lose data when power fails. How much depends on how many writes to disk were in flight when the power failed, and how one has their RAID controller and inside-the-disk caches configured, whether using barriers, etc.
That's incorrect. When you fsync() a file, all sane modern filesystems guarantee no data loss, unless you tune that out administratively for performance reasons. If you use a log structured filesystem (like zfs or WAFL) you can optimize the performance as well. With other types of filesystems (like xfs), performance suffers severely under heavy sync write loads.
This depends on how the dev does his syncs. If done intelligently, XFS performance won't suffer. In fact, the preferred write method to XFS for high performance applications is using O_DIRECT. Using O_DIRECT, correctly, with XFS, actually _increases_ write performance versus going through the buffer cache. So you get the best of both worlds: higher performance and data guaranteed on disk.
But not all applications use fsync, O_DIRECT, et al. The point I was making is that on any general system, you will likely have some applications/daemons writing without fsync or O_DIRECT, so you will likely suffer some data loss when the plug is pulled or the kernel crashes. If the timing of the crash is right you can even lose data when using fsync. Depends on how busy the system is and how many synced writes are in flight when the power drops. There truly aren't any guarantees that data will always be on disk. There are always corner cases where you will lose data. Thankfully, for most of us, most of the time, they are _extremely_ rare.
As a reference point, ext3 with default settings guarantees data loss under normal conditions so I do not consider it a sane filesystem. You can tune that behavior out (so that you preserve data), but in that case ext3 operates with sub-par performance.
I believe I mentioned this when discussing the merits of XFS and ZFS with Frank, who stated Solaris/ZFS were immune to this, to which I called BS. They aren't immune, as Ted Ts'o clearly states. For those who don't know, Ted Ts'o is an MIT Ph.D., is a longtime developer and maintainer of EXT2/3/4, and is to this day an active Linux kernel hacker/developer on filesystems and storage drivers.
Ted is a close acquaintance of mine, and if he indeed says what you said he says, he is wrong. More likely, he was simplifying or talking about certain cases, not the general case.
Read Ted's article I linked. I didn't misquote him. The simple point he was making is that unless devs specifically use fsync or other calls to guarantee their data is on disk, they will suffer data loss with any modern journaling filesystem when the power goes out or the system crashes. You seem to be assuming all devs use fsync. Apparently this is far from reality.
There are two ways to guarantee no data loss with zfs, one is to disable the ZIL (low performance) and the 2nd is to use a slog (high performance).
And exactly how does an external log device guarantee no data loss? External journal logs enhance performance but I've never heard of them being a magic cure for data loss. XFS allows external log devices as well, for performance.
-- Stan
On 1/31/11 9:11 PM -0600 Stan Hoeppner wrote:
Frank Cusack put forth on 1/31/2011 3:06 PM:
That's incorrect. When you fsync() a file, all sane modern filesystems guarantee no data loss, unless you tune that out administratively for performance reasons. If you use a log structured filesystem (like zfs or WAFL) you can optimize the performance as well. With other types of filesystems (like xfs), performance suffers severely under heavy sync write loads.
This depends on how the dev does his syncs. If done intelligently, XFS performance won't suffer. In fact, the preferred write method to XFS for high performance applications is using O_DIRECT. Using O_DIRECT, correctly, with XFS, actually _increases_ write performance versus going through the buffer cache. So you get the best of both worlds: higher performance and data guaranteed on disk.
Most applications don't work well with O_DIRECT. O_DIRECT is meant as a tunable for write-mostly applications and a few other specific classes. A mail store is decidedly not in that class of application. As a data point, zfs (and all log structured filesystems) does not support O_DIRECT because it doesn't make sense given the on-disk layout -- there is no performance benefit to be had.
But not all applications use fsync, O_DIRECT, et al. The point I was making is that on any general system, you will likely have some applications/daemons writing without fsync or O_DIRECT, so you will likely suffer some data loss when the plug is pulled or the kernel crashes. If the timing of the crash is right you can even lose data when using fsync. Depends on how busy the system is and how many synced writes are in flight when the power drops. There truly aren't any guarantees that data will always be on disk. There are always corner cases where you will lose data. Thankfully, for most of us, most of the time, they are _extremely_ rare.
*NO* there are not any corner cases that are not due to administrative knobs (e.g. always buffer metadata) or simply due to bugs. POSIX semantics require that when you call fsync(), data makes it to disk. Many filesystems implement this correctly, however in many cases performance is quite poor. So most applications do not fsync() data.
Read Ted's article I linked. I didn't misquote him. The simple point he was making is that unless devs specifically use fsync or other calls to guarantee their data is on disk, they will suffer data loss with any modern journaling filesystem when the power goes out or the system crashes. You seem to be assuming all devs use fsync. Apparently this is far from reality.
No I do not assume all applications (not devs) use fsync. Most don't. Most mail applications do, or as in dovecot's case, have a knob. If an app does not use fsync, that is not what I am calling data loss. Data loss is the expected behavior for those types of applications. Mail generally doesn't fall into that category.
There are two ways to guarantee no data loss with zfs, one is to disable the ZIL (low performance) and the 2nd is to use a slog (high performance).
And exactly how does an external log device guarantee no data loss? External journal logs enhance performance but I've never heard of them being a magic cure for data loss. XFS allows external log devices as well, for performance.
I'm not going to spoon feed you. [Sorry, I couldn't resist.]
On 1.2.2011, at 5.11, Stan Hoeppner wrote:
This depends on how the dev does his syncs. If done intelligently, XFS performance won't suffer. In fact, the preferred write method to XFS for high performance applications is using O_DIRECT. Using O_DIRECT, correctly, with XFS, actually _increases_ write performance versus going through the buffer cache. So you get the best of both worlds: higher performance and data guaranteed on disk.
O_DIRECT is completely useless for just about every application there is. It was written for Oracle, and I doubt there are many applications outside (SQL) databases that use it at all.
Read Ted's article I linked. I didn't misquote him. The simple point he was making is that unless devs specifically use fsync or other calls to guarantee their data is on disk, they will suffer data loss with any modern journaling filesystem when the power goes out or the system crashes. You seem to be assuming all devs use fsync. Apparently this is far from reality.
Ted also thinks everyone should be using SQL(ite) database rather than filesystems directly. Many people don't agree.
Timo Sirainen put forth on 1/31/2011 9:43 PM:
O_DIRECT is completely useless for just about every application there is. It was written for Oracle, and I doubt there are many applications outside (SQL) databases that use it at all.
It's not suitable at all for mail. I didn't imply that. I merely mentioned it as it is one of the calls other than fsync that guarantees the data hit the disk. Since we're discussing it, it is used outside databases, heavily in HPC and the scientific community, one example being satellite data feed capture, where it doesn't make sense to push multiple gigabytes per second through the buffer cache before hitting the disks.
Ted also thinks everyone should be using SQL(ite) database rather than filesystems directly. Many people don't agree.
I don't claim to have read all of Ted's writings so I can't really comment on this. I originally quoted his blog post because of his comments on fsync and the behavior of all modern filesystem with regard to data resiliency after power loss in response to a comment Frank made, IIRC. Dovecot does fsyncs by default so this doesn't apply obviously.
Again, sorry for the OT noise.
-- Stan
On 1/30/11 5:07 PM -0600 Stan Hoeppner wrote:
To be clear, for any subscribers who haven't followed all of the various filesystem and data security threads, with any modern *nix system, you WILL lose data when power fails.
No, you won't, at least not necessarily.
I know I'm replying with just about the same content multiple times but there are multiple messages where you are spreading this misinformation.
It is possible to configure a file system to not suffer from data loss on power loss, and for mail stores that is generally the desired behavior.
Frank Cusack put forth on 1/31/2011 3:13 PM:
On 1/30/11 5:07 PM -0600 Stan Hoeppner wrote:
To be clear, for any subscribers who haven't followed all of the various filesystem and data security threads, with any modern *nix system, you WILL lose data when power fails.
No, you won't, at least not necessarily.
I know I'm replying with just about the same content multiple times but there are multiple messages where you are spreading this misinformation.
It is possible to configure a file system to not suffer from data loss on power loss, and for mail stores that is generally the desired behavior.
Maybe not every time, but it should surely motivate OPs to look at their power continuity solution(s).
Even using fsync et al, you can still lose data with power loss. It all depends on what is in flight where, on which bus or cable, and whether the pulses made it to the platters. fsync is a best effort. It can't guarantee all the hardware was able to play its part correctly before the electrons stopped flowing to the disk head actuator or spindle motor.
This is common sense. Anyone with the slightest knowledge of electricity and background in electronics, and working with computers for any amount of time, should realize this.
There is no 100% guarantee. This is one reason why the massive power backup industry exists. The other is obviously avoiding downtime.
-- Stan
On 1/31/11 9:27 PM -0600 Stan Hoeppner wrote:
Frank Cusack put forth on 1/31/2011 3:13 PM:
On 1/30/11 5:07 PM -0600 Stan Hoeppner wrote:
To be clear, for any subscribers who haven't followed all of the various filesystem and data security threads, with any modern *nix system, you WILL lose data when power fails.
No, you won't, at least not necessarily.
I know I'm replying with just about the same content multiple times but there are multiple messages where you are spreading this misinformation.
It is possible to configure a file system to not suffer from data loss on power loss, and for mail stores that is generally the desired behavior.
Maybe not every time, but it should surely motivate OPs to look at their power continuity solution(s).
Even using fsync et al, you can still lose data with power loss. It all depends on what is in flight where, on which bus or cable, and whether the pulses made it to the platters. fsync is a best effort. It can't guarantee all the hardware was able to play its part correctly before the electrons stopped flowing to the disk head actuator or spindle motor.
This is common sense. Anyone with the slightest knowledge of electricity, a background in electronics, and any amount of time working with computers should realize this.
There is no 100% guarantee. This is one reason why the massive power backup industry exists. The other is obviously avoiding downtime.
Sigh. On that type of failure, fsync() doesn't return to the caller and the data is still elsewhere, queued for retransmission. Nothing is lost.
On 1/31/11 7:42 PM -0800 Frank Cusack wrote:
On 1/31/11 9:27 PM -0600 Stan Hoeppner wrote:
Frank Cusack put forth on 1/31/2011 3:13 PM:
On 1/30/11 5:07 PM -0600 Stan Hoeppner wrote:
To be clear, for any subscribers who haven't followed all of the various filesystem and data security threads, with any modern *nix system, you WILL lose data when power fails.
No, you won't, at least not necessarily.
I know I'm replying with just about the same content multiple times but there are multiple messages where you are spreading this misinformation.
It is possible to configure a file system to not suffer from data loss on power loss, and for mail stores that is generally the desired behavior.
Maybe not every time, but it should surely motivate OPs to look at their power continuity solution(s).
Even using fsync et al, you can still lose data with power loss. It all depends on what is in flight where, on which bus or cable, and whether the pulses made it to the platters. fsync is a best effort. It can't guarantee all the hardware was able to play its part correctly before the electrons stopped flowing to the disk head actuator or spindle motor.
This is common sense. Anyone with the slightest knowledge of electricity, a background in electronics, and any amount of time working with computers should realize this.
There is no 100% guarantee. This is one reason why the massive power backup industry exists. The other is obviously avoiding downtime.
Sigh. On that type of failure, fsync() doesn't return to the caller and the data is still elsewhere, queued for retransmission. Nothing is lost.
I should add, it is common these days to have disks that lie about data making it to the platter. Most disks are tunable that way (writeback cache) and some even horrendously come with that as the default. zfs accounts for this and DTRT in all cases.
Timo Sirainen put forth on 1/30/2011 4:40 PM:
On Sun, 2011-01-30 at 16:13 -0600, Stan Hoeppner wrote:
be, but it still gives you 0 byte files, so make sure you have a good UPS ..
"Q: Why do I see binary NULLS in some files after recovery when I unplugged the power?
0 byte files != NULL bytes in files.
IIRC, the former is the current (correct) XFS behavior (which you quoted from an old email, not from today), and the latter is the behavior fixed by the 2007 patch. The text in the FAQ can be a bit confusing, as neither the before nor the after behavior is thoroughly explained.
My guess is it's the same problem as described in http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-lengt...
Similar, but not quite exactly the same. For practical purposes they are functionally the same problem as seen by an OP, so the exact cause is a bit irrelevant from that perspective. XFS, along with all the journaling filesystems, still "suffers" from this delayed allocation dilemma during a power loss or crash, as I've stated previously on this list.
If you want performance, there is a required sacrifice. In this case, delayed allocation gives the performance, but it sacrifices some "on disk" guarantees WRT power loss or a crash. Again, this isn't the same issue as the XFS bug fixed in 2007. An XFS system today will still suffer data loss due to power loss/crash if there is write data in the Linux buffer cache, which is the case for all Linux journaling filesystems, not just XFS, as Ted Ts'o so eloquently points out in his blog post.
-- Stan
On Mon, Jan 31, 2011 at 8:13 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
Nick Edwards put forth on 1/29/2011 7:14 PM:
On Sun, Jan 30, 2011 at 4:00 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
Nick Edwards put forth on 1/28/2011 9:47 PM:
xfs is not very nice to you if you lose power, it's not as bad as it used to be, but it still gives you 0 byte files, so make sure you have a good UPS to issue a safe shutdown of the server, if you do, xfs is better, using CentOS.
It would be nice if you spoke from experience or documentation instead of rumor.
I do speak from experience; it is also a /very/ well-known fact. For someone who rants on and on with self-justification, you sure don't actually know a lot, do you?
It's interesting that you cut the "fact" out of my reply, then proceeded to rant
It's interesting you still have not accepted the facts, but rather rely on an FAQ
HINT: not all bugs are ironed out
But it's even more interesting that I bothered to read your rantings, at least until I got to the part you had already said, and cut out some bits. Do you like hearing yourself rant? If it's irrelevant to what I have to say, then I cut it, just like everyone else does all the time.
Second hint: you might enjoy wasting everyone's time with 5 pages of blah blah blah and your tiring rants, but we have better things to do with our time. From the private replies I've had, no less than six, yes, six, other people took the time to email me in private agreeing; Christ knows how many more also agree but did not bother to comment on it.
Please stop quoting stuff you yourself have no direct experience with. Get with it: some people still see these problems, and just because you don't does NOT mean it DOES NOT still exist. People are sick of your multi-page rants trying to make yourself out to be some sort of important expert that you clearly are not.
I do not know how to make it any clearer to you, perhaps you should go ask the 8yo next door.
do not waste my time again.
On 1/29/11 12:00 PM -0600 Stan Hoeppner wrote:
From: http://xfs.org/index.php/XFS_FAQ
"Q: Why do I see binary NULLS in some files after recovery when I unplugged the power?
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you'll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the xfs_bmap(8) command)."
Note "the same is true of other filesystems".
No it isn't. Not all file systems buffer writes this way.
Since 2007, XFS is no worse nor better at handling power loss than any other journaling filesystem. If power loss is a frequent concern in any organization, using any filesystem, you had best be fsync'ing all writes.
Well, unless of course you are OK with losing data. There are many, many applications that can sustain data loss, so for those applications you can tune the filesystem for highest performance, and even if power loss is frequent you just don't care.
What you meant to say was, if data loss is a concern.
No filesystems today cause data loss due to power loss. Data loss due to power loss today is caused by the Linux buffer cache holding pending disk writes in memory. When the power goes, so do the contents of RAM, so you lose that data. This is filesystem independent.
No it isn't. The buffer cache isn't independent of the filesystem.
Nick Edwards wrote:
xfs is not very nice to you if you lose power, it's not as bad as it used to be, but it still gives you 0 byte files,
I began to worry about this after that other thread showed XFS's considerable strengths, and this one weakness. Coincidentally, we had just built a couple of XFS servers in raid1 configurations (the second is purely an rdiff-backup server for the first, and both are raid1).
Because our work is frequently the subject of very close legal scrutiny, we're utterly paranoid about losing email - that's why we've created those two redundant servers.
I remember Stan (in the other thread) also saying that write delays due to caching were more or less built into the kernel anyway, so XFS may not be alone in this problem. What I am not (yet) sure about is whether XFS is any 'more' vulnerable than others, or any 'more' catastrophically damaged than others, due to power failure. Has any analysis of this been published?
However, like the OP, our scale is quite small, and this (potentially) gives us one advantage over those very large users: we could forgo some 'performance' if there were options in XFS that could reduce its 'vulnerability'. I looked at the XFS FAQ, and at several of the archived messages on the XFS list, but could not see any create options or mount options that would reduce or inhibit the 'vulnerability window' (but I'm no expert on filesystems, or the kernel, so maybe I didn't understand what the FAQ was telling me). I would appreciate any suggestions from those who use and know XFS.
so make sure you have a good UPS to issue a safe shutdown of the server,
We are very susceptible to power outages, duration anything from 12 seconds to 14 hours (we're not in a city) and never notified in advance. We use APC desktop UPS for workstations and the few servers we have, and we then shut down. For security, the shutdown needs to be automatic so that it takes effect if the site is unmanned - overnight, for example.
'Absolutely secure email' needs the speed of XFS, the performance of XFS on multitudes of small files, and the fault-tolerance of some kind of non-volatile storage coupled with positive confirmation of successful writes. One day, maybe.
Until then, email needs UPSs, it seems.
regards, Ron
Hi all,
Pardon me for chiming in...
Until then, email needs UPSs, it seems.
You cannot rely on commercial power without a UPS in any critical system. As a plus, a UPS filters the mains, giving your PSU a big relief.
What worries me more about this filesystem war (let me follow this bad behaviour and throw in ext4...) is the fact that no one has mentioned a battery-backed RAID controller.
Cheers, Robert
Ron Leach put forth on 1/30/2011 5:00 AM:
Nick Edwards wrote:
xfs is not very nice to you if you lose power, it's not as bad as it used to be, but it still gives you 0 byte files,
I began to worry about this after that other thread showed XFS's considerable strengths, and this one weakness. Coincidentally, we had just built a couple of XFS servers in raid1 configurations (the second is purely an rdiff-backup server for the first, and both are raid1).
Because our work is frequently the subject of very close legal scrutiny, we're utterly paranoid about losing email - that's why we've created those two redundant servers.
I remember Stan (in the other thread) also saying that write delays due to caching were more or less built into the kernel anyway, so XFS may not be alone in this problem. What I am not (yet) sure about is whether XFS is any 'more' vulnerable than others, or any 'more' catastrophically damaged than others, due to power failure. Has any analysis of this been published?
It's all in the XFS FAQ. See #23 for the power fail issue patch. Did you read the other excellent XFS resources available?
Users guide: http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.htm...
File system structure: http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html...
Training labs: http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html
However, like the OP, our scale is quite small, and this (potentially) gives us one advantage over those very large users: we could forgo some 'performance' if there were options in XFS that could reduce its 'vulnerability'. I looked at the XFS FAQ, and at several of the archived messages on the XFS list, but could not see any create options or mount options that would reduce or inhibit the 'vulnerability window' (but I'm no expert on filesystems, or the kernel, so maybe I didn't understand what the FAQ was telling me). I would appreciate any suggestions from those who use and know XFS.
Once again: http://xfs.org/index.php/XFS_FAQ See #23
There is nothing to configure to make XFS 'more resilient' to power failure. There was a bug that caused problems after power failure. Again, the bug was fixed in May 2007, almost 4 years ago. That's the Jurassic period in internet time folks. There is no 'vulnerability window'.
Now you understand my frustration with Nick for spreading FUD.
so make sure you have a good UPS to issue a safe shutdown of the server,
We are very susceptible to power outages, duration anything from 12 seconds to 14 hours (we're not in a city) and never notified in advance. We use APC desktop UPS for workstations and the few servers we have, and we then shut down. For security, the shutdown needs to be automatic so that it takes effect if the site is unmanned - overnight, for example.
Aren't you using the net-enabled APC units? They have a NIC slot for exactly this purpose. You install monitoring software on your physical hosts (some operating systems, such as Linux distributions, ship with such software) and configure it. When wall power fails, the UPS goes into battery mode, and when the battery hits a configurable amount of remaining capacity, it sends a packet to all connected/configured hosts telling them to shut down. This has been available since the mid 1990s, and it's a fabulous, necessary feature. It is not limited to APC; many UPS vendors offer such network capabilities.
'Absolutely secure email' needs the speed of XFS, the performance of XFS on multitudes of small files, and the fault-tolerance of some kind of non-volatile storage coupled with positive confirmation of successful writes. One day, maybe.
Until then, email needs UPSs, it seems.
Anything, everything, needs a UPS, except a laptop or smartphone (anything w/an inbuilt battery). "Online" models are best, and typically more expensive.
-- Stan
Stan Hoeppner wrote:
Did you read the other excellent XFS resources available?
Users guide: http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.htm...
File system structure: http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html...
Training labs: http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html
I don't know why I missed these; thank you for pointing them out. Following Ted Ts'o's blog post mentioned by Timo, I'll read up on inodes and allocation so that I understand what is, and is not, committed, and when. What I do worry about is a crash or power loss during the period between the time an application requests some data to be written and the time the filesystem actually completes writing it with all the file allocation data correct (and therefore tolerant of a crash from then on).
I thought I read in the XFS mount options something that suggested there was up to a 30 second window for this commit, which is a relatively long lump of time out of our UPS availability. Here it is, not a mount option but a 'sysctl' (I expect these are discussed in the docs you pointed me to, above).
git.kernel.org - linux/kernel/git/torvalds/linux-2.6.git/blob - Documentation/filesystems/xfs.txt:
"fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
The interval at which the xfssyncd thread flushes metadata out to disk. This thread will flush log activity out, and do some processing on unlinked inodes."
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Do...
Why does this period matter for the UPS availability time? Because the available time has, of course, to be allocated first to the application machines to close their applications, before the file servers can be asked to 'commit' any delayed allocations and close down themselves (I don't want the file servers to close down while Dovecot, or any other application, still has relevant data not yet written to the file servers).
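As a quick check on that window, here is a minimal sketch of my own (it assumes the standard /proc mapping of the fs.xfs.xfssyncd_centisecs sysctl) that reads the current interval and prints it in seconds, so it can be budgeted against the UPS runtime:

  /* Read the current XFS metadata flush interval and print it in
   * seconds.  The default of 3000 centiseconds is the 30-second window
   * mentioned above. */
  #include <stdio.h>

  int main(void)
  {
      const char *path = "/proc/sys/fs/xfs/xfssyncd_centisecs";
      FILE *f = fopen(path, "r");
      long centisecs;

      if (f == NULL || fscanf(f, "%ld", &centisecs) != 1) {
          perror(path);
          return 1;
      }
      fclose(f);

      printf("xfssyncd flush interval: %ld.%02ld seconds\n",
             centisecs / 100, centisecs % 100);
      return 0;
  }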
regards, Ron
Ron Leach put forth on 1/31/2011 4:06 AM:
git.kernel.org - linux/kernel/git/torvalds/linux-2.6.git/blob - Documentation/filesystems/xfs.txt:
"fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)
The interval at which the xfssyncd thread flushes metadata out to disk. This thread will flush log activity out, and do some processing on unlinked inodes."
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Do...
Why does this period matter for the UPS availability time? Because the available time has, of course, to be allocated first to the application machines to close their applications, before the file servers can be asked to 'commit' any delayed allocations and close down themselves (I don't want the file servers to close down while Dovecot, or any other application, still has relevant data not yet written to the file servers).
You need to read all about the xfssyncd thread before you jump to conclusions about what the tunable above actually does, and does not, do, and how that may or may not relate to your specific concerns here.
Note that when you do a shutdown, the kernel will flush all buffers, and XFS will automatically push all writes to disk. The tunable you're looking at above is, IIRC, a housekeeping parameter, not a normal-operations parameter. For instance, if you create 10k directories, that's not going to fit in the XFS log. Once the log fills up, the inodes at the head of the log start getting flushed to disk, and new inodes come in at the tail of the log. If fs.xfs.xfssyncd_centisecs were a "set in stone" parameter, you'd never get any performance out of XFS. IIRC the maximum log size is 128MB. Again, once it fills, it starts writing out immediately.
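For completeness, a minimal sketch (illustrative only; the distro shutdown scripts and /bin/sync already do this for you) of what that final flush amounts to:

  /* The last step a UPS-triggered shutdown script might run before
   * unmounting: push every dirty buffer the kernel holds out to the
   * devices.  On Linux, sync() waits for the writeback to complete,
   * though the drives' own write caches are a separate matter, as
   * noted earlier in the thread. */
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      sync();   /* flush dirty data and metadata for all mounted filesystems */
      puts("buffer cache flushed; safe to unmount and power off");
      return 0;
  }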
Also, if Ted's article told you anything, it is that any application writing critical data needs to use fsync or O_DIRECT, because no filesystem itself makes any guarantees about when data hits the disk. OK, EXT3 did, for various ill-conceived reasons. In a way, I guess, the tunable above does something similar to what EXT3 did, but only for metadata in this case.
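To make that concrete, a rough sketch (illustrative filenames and helper name; this is not Dovecot's actual code) of the write-temp-file / fsync / rename pattern Ted's article recommends for applications that replace a file's contents:

  /* Crash-safe "write new, fsync, rename over old" pattern.  After the
   * rename, the file is either the complete old version or the complete
   * new version, never a zero-length or half-written one. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static int write_file_safely(const char *dir, const char *tmp,
                               const char *final, const char *data)
  {
      int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
      if (fd < 0) return -1;

      if (write(fd, data, strlen(data)) != (ssize_t)strlen(data) ||
          fsync(fd) < 0) {               /* data + inode on stable storage */
          close(fd);
          return -1;
      }
      if (close(fd) < 0) return -1;

      if (rename(tmp, final) < 0) return -1;

      /* fsync the containing directory so the rename itself is durable */
      int dfd = open(dir, O_RDONLY | O_DIRECTORY);
      if (dfd < 0) return -1;
      if (fsync(dfd) < 0) { close(dfd); return -1; }
      close(dfd);
      return 0;
  }

  int main(void)
  {
      if (write_file_safely(".", "important.dat.tmp", "important.dat",
                            "example contents\n") < 0) {
          perror("write_file_safely");
          return 1;
      }
      return 0;
  }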
-- Stan
participants (14)
- Daniel L. Miller
- David Ledger
- Eric Rostetter
- Frank Cusack
- Frank Cusack
- Jan-Frode Myklebust
- Jerry
- Monika Janek
- Nick Edwards
- Rick Romero
- Robert Joosten
- Ron Leach
- Stan Hoeppner
- Timo Sirainen