Daniel L. Miller wrote:
On 1/31/2011 3:00 PM, Ron Leach wrote:
All we want to do is not lose emails.
What does everyone else do? Lose emails?
I'm responsible for running a massive server farm (1 box) for an extraordinary number of users (at least 5 active accounts) so I may have a distorted view of reality. But I will say the only time I've lost mail has been through OP error (at least since I left MS Exchange behind...shudder). And since that idiot IT guy looks an awful lot like my mirror image...
Daniel, very nicely put.
In my experience also - aside from failures of reticulated power - most problems come from maintenance staff error. Someone already posted that people can pull the wrong cable, or switch off the wrong item, etc. Let's keep this in mind ...
I'm sure those OPs with larger budgets might have some hardware suggestions for reducing the chance of hardware failure leading to data loss (I mean, other than using good components, installed properly with oversized cooling - and possibly proactive upgrade/replacements prior to anticipated lifetime failure - how can you ELIMINATE the possibility of a CPU/controller/HD just deciding to blow up?)
Exactly, you can't. But that doesn't mean you can't very substantially reduce the impact of those problems. So, in these circumstances, one thing you can do is reduce the vulnerability - the susceptibility, if you will - of the data to these types of system failure (which cannot be eliminated, as you say). Additionally, you can try to arrange a minimum recovery capability even when failure is totally catastrophic.
You can protect against HD failure by using RAID, and achieve a certain level of assurance, possibly something very close to 100% in respect of that particular failure.
Since the HDs can be considered 'secure' (well, something very close to 100% available), data can be that secure 'provided' it has been written to the HD. Since failures can occur at any time, the less time data spends 'not' yet on the HD, the less 'likely' it is to be lost when one of these unpreventable system failures occurs. In filesystems that immediately write data to the HD there is, in principle, no period when data is 'unwritten'. But (and you can see what's coming) with filesystems that wait 30 seconds before writing to disk the data that the application 'thinks' has been safely written, there is a 30-second 'window' of vulnerability to one of these events. On a large system with a lot of transactions, there might 'always' be some data sitting waiting to be written, so whenever one of these 'uneliminatable' events occurs, data will be lost.

Let's assume, for a moment, there is a message every 5 seconds, so there are 6 email messages waiting to go to disk in each 30-second window. (For a very large corporation, the message arrival rate may be much higher, of course.)
So, adding the number of 'serious' operator mistakes that might be expected per machine per year (shall we say 1?) to the likelihood of electronic component failure (shall we say 50,000 hr MTBF, so roughly 0.2 events per year), we might expect 1.2 'events' per year. 1.2 x 6 messages is 7 email messages lost per year (7.2, actually), because the vulnerability window is 30 seconds. (Many more for a greater message arrival rate, in the large-corporate case.)
Now let's see how many messages are lost if the filesystem writes to disk every 5 seconds, instead of every 30 seconds. The vulnerability window in this case is 5 seconds, and we'll have 1 message during that time. Same 'number' of events each year - 1.2 - so we'll lose 1.2 x 1 message, that's 1 message (1.2, actually). So with different filesystem behaviours, we can reduce the numbers of lost messages each year, and reduce the 'likelihood' that any particular message will be lost.
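For anyone who wants to play with the numbers, here is that arithmetic as a small Python sketch - the inputs (1 operator mistake a year, 50,000 hr MTBF, a message every 5 seconds) are just the illustrative guesses above, not measurements from any real installation:

HOURS_PER_YEAR = 8760

def events_per_year(operator_errors_per_year, hardware_mtbf_hours):
    # 'Serious' events per machine per year: operator mistakes plus
    # hardware failures estimated crudely from the MTBF.
    return operator_errors_per_year + HOURS_PER_YEAR / hardware_mtbf_hours

def messages_lost_per_year(events, window_seconds, message_interval_seconds):
    # Each event is assumed to wipe out whatever is still sitting,
    # unwritten, in the vulnerability window at that moment.
    return events * (window_seconds / message_interval_seconds)

events = events_per_year(1, 50_000)
print(events)                                 # ~1.18 (rounded to 1.2 above)
print(messages_lost_per_year(events, 30, 5))  # ~7 per year with a 30 s window
print(messages_lost_per_year(events, 5, 5))   # ~1.2 per year with a 5 s window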
Assuming that a message availability target might be, say, fewer than 1 message lost in 10^8, the impact of each of the parameters in this calculation becomes important. Small differences in operator error rates, in vulnerability windows, and in equipment MTBFs, can make very large differences to the probability of meeting the availability targets.
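Continuing the same sketch - and it is only a sketch, assuming events strike at random moments and take out exactly one window's worth of messages - the per-message loss probability falls out directly, and the message arrival rate actually cancels:

SECONDS_PER_YEAR = 365 * 24 * 3600

def per_message_loss_probability(events, window_seconds):
    # Expected losses per year divided by messages per year; the message
    # arrival rate appears in both and cancels out.
    return events * window_seconds / SECONDS_PER_YEAR

print(per_message_loss_probability(1.2, 30))  # ~1.1e-06, roughly 1 in 10^6
print(per_message_loss_probability(1.2, 5))   # ~1.9e-07, roughly 2 in 10^7

Against a 1-in-10^8 target, even the 5-second window with 1.2 events a year is more than an order of magnitude short - which is exactly why those parameters matter so much.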
And I haven't even mentioned UPSs, yet.
If you have a properly sized UPS, combined with notification from the UPS to the servers to perform orderly shutdowns - including telling the application servers to shut down prior to the storage servers, etc. - doesn't that render the (possibly more than theoretical) chances of data loss due to power interruption a moot point?
UPSs are a great help, but they are not failure-immune. They, too, can fail, and will fail: they may suddenly switch off, they may not provide the expected duration of service, or they may fail to operate when the reticulated power itself fails. We can add their failure rate into the calculations. I haven't any figures for them, but I'd guess at a 3-year MTBF, so let's say another 0.3 events per year. We could redo the calculations above with 1.5 now, instead of 1.2 - but I don't think we need to, on this list. (Of course, if we don't use a UPS at all, we'll have a seriously high event rate, with every power glitch or drop wreaking havoc, so the lost-message figure would be much greater.)
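(For anyone who does want the figures: the sketch above takes the extra, guessed, 0.3 UPS failures a year without any changes.)

events_with_ups_failures = 1.2 + 0.3                            # 1.5 'events' a year
print(messages_lost_per_year(events_with_ups_failures, 30, 5))  # 9.0 messages/year, 30 s window
print(messages_lost_per_year(events_with_ups_failures, 5, 5))   # 1.5 messages/year, 5 s window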
Daniel, I'm delighted, but not in the least surprised, that you haven't lost a message. But I fully expect you will at some point in your operation's life, unless you use (a) redundant equipment (e.g. RAID) with (b) very minimal windows of vulnerability (which, following that other thread, means a filesystem that really does write to disk immediately when it is asked to do so - and, seemingly, not all high-performance filesystems do).
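To make that 'minimal window of vulnerability' concrete at the application level, here is a generic sketch (not anyone's actual MTA, and with the path handling simplified) of what 'write it to disk when asked' means: don't acknowledge the message until fsync() has returned for both the file and its directory.

import os

def store_message(path, data):
    # Write a message and force it to stable storage before the sender
    # is told it has been accepted. Whether the bytes really reach the
    # platters still depends on the filesystem and the drive honouring
    # fsync(), which is the point made in that other thread.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()               # push Python's buffer into the kernel
        os.fsync(f.fileno())    # ask the kernel to push it to the disk
    # fsync the containing directory as well, so the new directory entry
    # itself survives a crash.
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)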
regards, Ron