[Dovecot] Maintaining data integrity through proper power supplies (slightly referencing Best filesystem)

Wed Feb 2 12:00:30 EET 2011

At 23:43 +0000 1/2/11, Ron Leach wrote:
>Since the HDs can be considered 'secure' (well, something v close to 
>100% available), data can be that secure 'provided' it is written to 
>the HD.  Since failures can occur at any time, the smaller the time 
>that data exists that is 'not' on the HD, compared to the time that 
>data 'is' on the HD, the less 'likely' that data will be lost when 
>one of these unpreventable system failures occurs.  In filesystems 
>that immediately write data to the HD there is, in principle, no 
>period when data is 'unwritten'.  But, (and you can see what's 
>coming), with filesystems that wait 30 seconds before writing to 
>disk the data that the application 'thinks' has been safely written, 
>then there is a 30 second 'window' of vulnerability to one of these 
>events.  On a large system with a lot of transactions, there might 
>'always' be some data that's sitting waiting to be written, and 
>therefore whenever one of these 'uneliminatable' events occurs, data 
>will be lost.  Let's assume, for a moment, there is a message every 
>5 seconds, so there are 6 email messages waiting to go to disk in 
>each 30 second window.  (For a very large corporation, the email 
>arrival rate may be much larger, of course.)

As Stan says, strictly speaking, any buffering delay in writing is 
independent of the filesystem. It depends on the operating system and 
the drivers supplied for the filesystem. In practice, the access the 
operating system provides to the filesystem may force a link between 
filesystem choice and delayed writes.

The Unix sync flush to disc is traditionally performed every 30 secs 
by the wall clock, not 30 secs after the data was queued for writing. 
Data queued just after a flush waits nearly 30 secs, data queued just 
before one waits almost no time, so the mean delay is 15 secs, not 30.
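
To put numbers on that, here is a minimal sketch in Python. It assumes 
writes are queued at times spread evenly across the flush interval, and 
it reuses Ron's example rate of one message every 5 secs. The function 
and variable names are made up for illustration.

```python
FLUSH_INTERVAL = 30.0    # secs between wall-clock flushes (figure from the text)
MESSAGE_INTERVAL = 5.0   # secs between messages (Ron's example rate)


def mean_flush_delay(interval, samples=100_000):
    """Average wait until the next flush for writes queued evenly
    across one flush interval (assumption: uniform arrival times)."""
    step = interval / samples
    total = 0.0
    for i in range(samples):
        arrival = (i + 0.5) * step     # midpoint of each small time slice
        total += interval - arrival    # time left until the flush fires
    return total / samples


mean_delay = mean_flush_delay(FLUSH_INTERVAL)
mean_pending = mean_delay / MESSAGE_INTERVAL        # average messages at risk
worst_pending = FLUSH_INTERVAL / MESSAGE_INTERVAL   # just before a flush

print(f"mean delay    : {mean_delay:.1f} secs")
print(f"mean pending  : {mean_pending:.1f} messages")
print(f"worst pending : {worst_pending:.1f} messages")
```

Under those assumptions the average number of messages exposed when a 
failure hits is about 3, with 6 as the worst case just before a flush.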

>UPSs are a great help, but they are not failure-immune.  They too, 
>can fail, and will fail.  They may just suddenly switch off, or they 
>may fail to provide the expected duration of service, or they may 
>fail to operate when the reticulated power does fail.  We can add 
>their failure rate into the calculations.  I haven't any figures for 
>them, but I'd guess at 3 years MTBF, so let's say another 0.3 events 
>per year.  We could redo the calculations above, with 1.5, now, 
>instead of 1.2 - but I don't think we need to, on this list.  (Of 
>course, if we don't use a UPS, we'll have a seriously high event 
>rate with every power glitch or drop wreaking havoc, so the lost 
>message calculation would be much greater.)

That's why the more expensive machines have multiple power supplies. 
Dual power supplies fed by two UPSs from different building feeds 
greatly reduce the chance of failure due to PSU, UPS or local power 
distribution board failure. One power distribution company client 
even had the equivalent of two power stations, but not many can 
manage that.
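
As a rough illustration of how much two independent feeds help, here is 
a sketch in Python using Ron's guess of 0.3 UPS failures per year. The 
24-hour repair time, the assumption that the two feeds fail 
independently, and the function name are all assumptions for 
illustration, not measured figures. It uses the usual approximation for 
two redundant units: both are down together at a rate of roughly 
2 x rate^2 x repair time.

```python
HOURS_PER_YEAR = 8760.0


def dual_feed_outage_rate(failures_per_year, repair_hours):
    """Approximate rate (per year) at which two independent, identical
    feeds are both down at once: 2 * rate^2 * repair_time.
    Assumes failures are independent and repair time is short
    compared with the time between failures."""
    repair_years = repair_hours / HOURS_PER_YEAR
    return 2.0 * failures_per_year ** 2 * repair_years


single_rate = 0.3   # failures per year for one UPS (Ron's guess)
dual_rate = dual_feed_outage_rate(single_rate, repair_hours=24.0)  # 24 h is assumed

print(f"single feed: {single_rate:.4f} outages per year")
print(f"dual feed  : {dual_rate:.6f} outages per year")
```

The exact figure matters less than the shape of the result: the 
combined rate depends on the square of the single-feed rate, so it 
drops by orders of magnitude as long as the feeds really are 
independent.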

David

-- 
David Ledger - Freelance Unix Sysadmin in the UK.
HP-UX specialist of hpUG technical user group (www.hpug.org.uk)
david.ledger at ivdcs.co.uk
www.ivdcs.co.uk