Daniel L. Miller wrote:
On 1/31/2011 3:00 PM, Ron Leach wrote:
All we want to do is not lose emails.
What does everyone else do? Lose emails?
I'm responsible for running a massive server farm (1 box) for an extraordinary number of users (at least 5 active accounts) so I may have a distorted view of reality. But I will say the only time I've lost mail has been through OP error (at least since I left MS Exchange behind...shudder). And since that idiot IT guy looks an awful lot like my mirror image...
Daniel, very nicely put.
In my experience also - aside from failures of reticulated power - most problems come from maintenance staff error. Someone already posted that people can pull the wrong cable, or switch off the wrong item, etc. Let's keep this in mind ...
I'm sure those OPs with larger budgets might have some hardware suggestions for reducing the chance of hardware failure leading to data loss (I mean, other than using good components, installed properly with oversized cooling - and possibly proactive upgrade/replacements prior to anticipated lifetime failure - how can you ELIMINATE the possibility of a CPU/controller/HD just deciding to blow up?)
Exactly, you can't. But that doesn't mean you can't very substantially reduce the impact of those problems. So, in these circumstances, one thing you can do is reduce the vulnerability - the susceptibility, if you will - of the data to these types of system failure (which cannot be eliminated, as you say). Additionally, you can try to arrange a minimum recovery capability even when failure is totally catastrophic.
You can protect against HD failure by using RAID, and achieve a certain level of assurance, possibly something very close to 100% in respect of that particular failure.
Since the HDs can be considered 'secure' (well, something very close to 100% available), data can be that secure 'provided' it has been written to the HD. Since failures can occur at any time, the less time data spends 'not' yet on the HD, the less 'likely' it is to be lost when one of these unpreventable system failures occurs. In filesystems that immediately write data to the HD there is, in principle, no period when data is 'unwritten'. But (and you can see what's coming) with filesystems that wait 30 seconds before writing to disk the data that the application 'thinks' has been safely written, there is a 30-second 'window' of vulnerability to one of these events. On a large system with a lot of transactions, there might 'always' be some data sitting waiting to be written, so whenever one of these 'uneliminatable' events occurs, data will be lost.

Let's assume, for a moment, there is a message every 5 seconds, so there are 6 email messages waiting to go to disk in each 30-second window. (For a very large corporation, the message arrival rate may be much higher, of course.)
So, adding the number of 'serious' operator mistakes that might be expected per machine per year (shall we say 1?) to the likelihood of electronic component failure (shall we say 50,000 hr MTBF, so roughly 0.2 events per year), we might expect 1.2 'events' per year. 1.2 x 6 messages is 7 email messages lost per year (7.2, actually), because the vulnerability window is 30 seconds. (Many more for a greater message arrival rate, in the large-corporate case.)
Now let's see how many messages are lost if the filesystem writes to disk every 5 seconds, instead of every 30 seconds. The vulnerability window in this case is 5 seconds, and we'll have 1 message during that time. Same 'number' of events each year - 1.2 - so we'll lose 1.2 x 1 message, that's 1 message (1.2, actually). So with different filesystem behaviours, we can reduce the numbers of lost messages each year, and reduce the 'likelihood' that any particular message will be lost.
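For anyone who wants to play with the numbers, here is that arithmetic as a small Python sketch - the inputs (1 operator mistake a year, 50,000 hr MTBF, a message every 5 seconds) are just the illustrative guesses above, not measurements from any real installation:

HOURS_PER_YEAR = 8760

def events_per_year(operator_errors_per_year, hardware_mtbf_hours):
    # 'Serious' events per machine per year: operator mistakes plus
    # hardware failures estimated crudely from the MTBF.
    return operator_errors_per_year + HOURS_PER_YEAR / hardware_mtbf_hours

def messages_lost_per_year(events, window_seconds, message_interval_seconds):
    # Each event is assumed to wipe out whatever is still sitting,
    # unwritten, in the vulnerability window at that moment.
    return events * (window_seconds / message_interval_seconds)

events = events_per_year(1, 50_000)
print(events)                                 # ~1.18 (rounded to 1.2 above)
print(messages_lost_per_year(events, 30, 5))  # ~7 per year with a 30 s window
print(messages_lost_per_year(events, 5, 5))   # ~1.2 per year with a 5 s window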
Assuming that a message availability target might be, say, fewer than 1 message lost in 10^8, the impact of each of the parameters in this calculation becomes important. Small differences in operator error rates, in vulnerability windows, and in equipment MTBFs, can make very large differences to the probability of meeting the availability targets.
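Continuing the same sketch - and it is only a sketch, assuming events strike at random moments and take out exactly one window's worth of messages - the per-message loss probability falls out directly, and the message arrival rate actually cancels:

SECONDS_PER_YEAR = 365 * 24 * 3600

def per_message_loss_probability(events, window_seconds):
    # Expected losses per year divided by messages per year; the message
    # arrival rate appears in both and cancels out.
    return events * window_seconds / SECONDS_PER_YEAR

print(per_message_loss_probability(1.2, 30))  # ~1.1e-06, roughly 1 in 10^6
print(per_message_loss_probability(1.2, 5))   # ~1.9e-07, roughly 2 in 10^7

Against a 1-in-10^8 target, even the 5-second window with 1.2 events a year is more than an order of magnitude short - which is exactly why those parameters matter so much.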
And I haven't even mentioned UPSs, yet.
If you have a properly sized UPS, combined with notification from the UPS to the servers to perform orderly shutdowns - including telling the application servers to shut down prior to the storage servers, etc. - doesn't that render the (possibly more than theoretical) chances of data loss due to power interruption a moot point?
UPSs are a great help, but they are not failure-immune. They, too, can fail, and will fail: they may suddenly switch off, they may not provide the expected duration of service, or they may fail to operate when the reticulated power itself fails. We can add their failure rate into the calculations. I haven't any figures for them, but I'd guess at a 3-year MTBF, so let's say another 0.3 events per year. We could redo the calculations above with 1.5 now, instead of 1.2 - but I don't think we need to, on this list. (Of course, if we don't use a UPS at all, we'll have a seriously high event rate, with every power glitch or drop wreaking havoc, so the lost-message figure would be much greater.)
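(For anyone who does want the figures: the sketch above takes the extra, guessed, 0.3 UPS failures a year without any changes.)

events_with_ups_failures = 1.2 + 0.3                            # 1.5 'events' a year
print(messages_lost_per_year(events_with_ups_failures, 30, 5))  # 9.0 messages/year, 30 s window
print(messages_lost_per_year(events_with_ups_failures, 5, 5))   # 1.5 messages/year, 5 s window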
Daniel, I'm delighted, but not in the least surprised, that you haven't lost a message. But I fully expect you will at some point in your operation's life, unless you use (a) redundant equipment (e.g. RAID) with (b) very minimal windows of vulnerability (which, following that other thread, means a filesystem that really does write to disk immediately when it is asked to do so - and, seemingly, not all high-performance filesystems do).
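To make that 'minimal window of vulnerability' concrete at the application level, here is a generic sketch (not anyone's actual MTA, and with the path handling simplified) of what 'write it to disk when asked' means: don't acknowledge the message until fsync() has returned for both the file and its directory.

import os

def store_message(path, data):
    # Write a message and force it to stable storage before the sender
    # is told it has been accepted. Whether the bytes really reach the
    # platters still depends on the filesystem and the drive honouring
    # fsync(), which is the point made in that other thread.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()               # push Python's buffer into the kernel
        os.fsync(f.fileno())    # ask the kernel to push it to the disk
    # fsync the containing directory as well, so the new directory entry
    # itself survives a crash.
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)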
regards, Ron