On Sat, 2011-01-15 at 10:41 +0000, Ed W wrote:
One of the systems to fail was a firewall running off SSD.
SSD or CF?
That doesn't make a lot of difference. They're all broadly similar. There are better devices and worse devices, but they're mostly crap.
And as I said earlier, even if you think you've worked out which is which, it may change from batch to batch of what is allegedly the *same* product.
It would appear it's also possible to damage some flash memory by powering off at the wrong moment?
Almost all of them will fail hard if you do any serious power-fail testing on them. It's not a hardware failure; it's just that their *internal* file system is corrupt and needs a fsck (or just wiping and starting again). But since you can't *access* the underlying medium, all you can do is cry and buy a new one.
The fun thing is that their internal garbage collection could be triggered by a *read* from the host computer, or could even happen purely due to a timeout of some kind. So there is *never* a time when you know it's "safe to power off because I haven't written to it for 5 minutes".
Yes, it's perfectly possible to design journalling file systems that *are* resilient to power failure. But the "file systems" inside these devices are generally written by the same crack-smoking hobos that write PC BIOSes; you don't expect quality software here.
By putting a logic analyser on some of these devices to watch what they're *actually* doing on the flash when they garbage-collect, we've found some really nasty algorithms. When garbage-collecting, one of them would read from the 'victim' eraseblock into RAM, then erase the victim block while the data were still only held in RAM, so that a power failure at that moment would lose them. And then, just to make sure its race window was nice and wide, it would pick a *second* victim block and copy data from there into the freshly-erased block, before erasing that second block and *finally* writing the data from RAM back to it. It's just scary :)
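
To make that race window concrete, here is a rough toy model of the sequence described above. It is only a sketch under stated assumptions, not the real firmware: the block names "A", "B" and "spare", the function names, and the copy-before-erase alternative in safe_gc are all made up for illustration. It also only models power cuts *between* steps, not a cut in the middle of an erase or a write.

```python
# Toy model (illustrative only): flash is a dict of block name -> contents,
# where None means "erased". RAM is volatile and is lost on power failure.

def unsafe_gc(flash, ram):
    """The sequence described above, one step per yield."""
    ram["buf"] = flash["A"]        # read first victim A into RAM
    yield "read A into RAM"
    flash["A"] = None              # erase A; its data now lives only in RAM
    yield "erase A"
    flash["A"] = flash["B"]        # copy second victim B into the erased A
    yield "copy B into A"
    flash["B"] = None              # erase B
    yield "erase B"
    flash["B"] = ram.pop("buf")    # finally write A's old data from RAM to B
    yield "write RAM into B"

def safe_gc(flash, ram):
    """Hypothetical alternative: copy to a spare block before erasing."""
    flash["spare"] = flash["A"]    # data-A now exists twice on flash
    yield "copy A into spare"
    flash["A"] = None              # only now is it safe to erase A
    yield "erase A"

def fresh_flash():
    return {"A": "data-A", "B": "data-B", "spare": None}

def lost_after_power_cut(gc, cut_after):
    """Run gc, cut power after `cut_after` steps (1-based), return lost data."""
    flash, ram = fresh_flash(), {}
    for done, _label in enumerate(gc(flash, ram), start=1):
        if done == cut_after:
            break                  # power fails here; RAM contents are gone
    surviving = {v for v in flash.values() if v is not None}
    return {"data-A", "data-B"} - surviving

def race_window(gc):
    """Labels of the steps after which a power cut loses data."""
    labels = list(gc(fresh_flash(), {}))
    return [label for n, label in enumerate(labels, start=1)
            if lost_after_power_cut(gc, n)]

if __name__ == "__main__":
    print("unsafe:", race_window(unsafe_gc))
    print("safe:  ", race_window(safe_gc))
```

In this model the unsafe ordering loses the first victim's data if power goes away after any of three of its five steps (everything from the first erase until the final write-back), whereas the copy-before-erase ordering has no such step. Real devices are obviously more complicated than two blocks and a dict, but the ordering problem is the same.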
-- dwmw2