[Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

Thu Apr 12 15:10:20 EEST 2012

On 12/04/2012 12:09, Timo Sirainen wrote:
> On 12.4.2012, at 13.58, Ed W wrote:
>
>> The claim by ZFS/BTRFS authors and others is that data silently "bit rots" on its own. The claim is therefore that you can have a RAID1 pair where neither drive reports a hardware failure, but each gives you different data?
> That's one reason why I planned on adding a checksum to each message in dbox. But I forgot to actually do that. I guess I could add it for new messages in some upcoming version. Then Dovecot could optionally verify the checksum before returning the message to the client, and if it detects corruption perhaps automatically read it from some alternative location (e.g. if dsync replication is enabled, ask another replica). And Dovecot index files really should have had some small (8/16/32-bit) checksums of stuff as well.
>

I have to say, I haven't actually seen this happen. Do any of your
big mailstore contacts (e.g. Rackspace) observe this?

I think it's worth thinking about the failure cases before implementing
anything, to be honest. Just sticking in a checksum may not help anyone
unless it covers the right data and is stored in the right place.

Off the top of my head:
- Someone butchers the file on disk (disk error or someone edits it with vi)
- Restore of some files goes subtly wrong, e.g. the tool tries to be
clever and fails, a snapshot is taken mid-write, etc.
- Filesystem crash (sudden power loss): how do we deal with partial
writes?
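
To make the partial-write and corruption cases concrete, here is a
minimal sketch in Python. It is NOT dbox's on-disk format: the record
layout (length + SHA-256 in front of the message) and the names
pack_record, read_record and read_with_fallback are all made up for
illustration. The point is only that storing the length next to the
checksum lets a reader tell a truncated write from flipped bits, and
that a failed check can fall through to another copy:

```python
import hashlib
import struct

# Hypothetical record header: 4-byte big-endian payload length + SHA-256.
HEADER = struct.Struct(">I32s")


class CorruptRecord(ValueError):
    """Raised when a stored record fails its length or checksum test."""


def pack_record(message: bytes) -> bytes:
    """Prefix the message with its length and SHA-256 digest."""
    digest = hashlib.sha256(message).digest()
    return HEADER.pack(len(message), digest) + message


def read_record(blob: bytes) -> bytes:
    """Return the message, or raise CorruptRecord if it is damaged."""
    if len(blob) < HEADER.size:
        raise CorruptRecord("truncated header")
    length, digest = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        # Fewer bytes on disk than the header promised: partial write.
        raise CorruptRecord("truncated payload")
    if hashlib.sha256(payload).digest() != digest:
        raise CorruptRecord("checksum mismatch")
    return payload


def read_with_fallback(copies) -> bytes:
    """Try each copy (e.g. local, then replica) until one verifies."""
    for blob in copies:
        try:
            return read_record(blob)
        except CorruptRecord:
            continue
    raise CorruptRecord("no intact copy found")
```

Note this only catches damage below the checksum. If a restore tool
hands back an old but internally consistent file, the check passes.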

Things I might like to do *if* there were some suitable "checksums" 
available:
- Use the checksum as some kind of GUID either for the whole message,
the message minus the headers, or individual MIME sections
- Use the checksums to assist with replication speed/efficiency (dsync
or custom IMAP commands)
- File RFCs for new IMAP features along the "Lemonade" lines which allow
clients to have faster recovery from corrupted offline states...
- Single instance storage (presumably already done, and of course this 
has some subtleties in the face of deliberate attack)
- Possibly duplicate email suppression (but really this is an LDA 
problem...)
- Storage backends where emails are redundantly stored and might not ALL 
be on a single server (find me the closest copy of email X) - 
derivations of this might be interesting for compliance archiving of 
messages?
- Fancy key-value storage backends might use checksums as part of the 
key value (either for the whole or parts of the message)
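
As a rough illustration of the "checksum as key" idea, here is a small
Python sketch that derives three kinds of key from one raw message:
whole message, message minus headers, and each MIME leaf part. The
function name message_keys and the choice of SHA-256 are my
assumptions, not anything Dovecot does today. A cryptographic hash is
used because a weak checksum as a single-instance-storage key would
invite deliberate collisions:

```python
import hashlib
from email import message_from_bytes, policy


def message_keys(raw: bytes) -> dict:
    """Return hex digests for the whole message, its body, and MIME parts."""
    # Headers end at the first empty line (CRLF CRLF on the wire).
    header_end = raw.find(b"\r\n\r\n")
    body = raw[header_end + 4:] if header_end != -1 else b""

    keys = {
        "whole": hashlib.sha256(raw).hexdigest(),
        "body": hashlib.sha256(body).hexdigest(),
        "parts": [],
    }

    msg = message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        # decode=True undoes base64/quoted-printable, so the same
        # attachment hashes identically however it was encoded.
        payload = part.get_payload(decode=True) or b""
        keys["parts"].append(hashlib.sha256(payload).hexdigest())
    return keys
```

Two deliveries of the same mail that differ only in added headers get
different "whole" keys but the same "body" key, which is the property
duplicate suppression or single instance storage would want.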

The mail server has always looked like a kind of key-value store to my
eye.  However, traditional key-value stores aren't usually optimised for
"streaming reads", hence Dovecot seems like a "key-value store,
optimised for sequential high speed streaming access to the
values"...  Whilst it seems increasingly unlikely that a traditional
key-value store will work well to replace, say, mdbox, I wonder if it's
not worth looking at the replication strategies of key-value stores to
see if those ideas couldn't lead to new features for mdbox?
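
One such idea, sketched very loosely in Python: if each replica keeps a
map of message GUID to checksum, two replicas can work out what to
exchange by comparing those maps rather than the messages themselves.
The function plan_sync and the shape of the maps are hypothetical, and
this is not how dsync works. Real key-value stores tend to refine the
same comparison with hash trees so that the maps need not be shipped in
full:

```python
def plan_sync(local: dict, remote: dict):
    """Compare two {guid: checksum} maps and say what needs to move.

    Returns (to_fetch, to_send, conflicts), each a sorted list of GUIDs.
    """
    # Present remotely but missing here: fetch from the other replica.
    to_fetch = sorted(g for g in remote if g not in local)
    # Present here but missing remotely: send to the other replica.
    to_send = sorted(g for g in local if g not in remote)
    # Present on both sides with different checksums: one copy is
    # damaged or stale, and something else must decide which.
    conflicts = sorted(
        g for g in local.keys() & remote.keys() if local[g] != remote[g]
    )
    return to_fetch, to_send, conflicts
```

The "conflicts" list is where per-message checksums would earn their
keep: each side can verify its own copy and the intact one wins.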

Cheers

Ed W