[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem

Ed W lists at wildgooses.com
Wed Aug 12 20:42:02 EEST 2009


>>> CouchDB seems like it would still be more difficult than necessary to 
>>> scale. I'd really just want something that distributes the load and 
>>> disk usage evenly across all servers and allows easily plugging in 
>>> more servers and it automatically rebalances the load. CouchDB seems 
>>> like much of that would have to be done manually (or building scripts 
>>> to do it).
>>>       
>> Ahh fair enough - I thought it being massively multi-master would allow 
>> simply querying different machines for different users.  Not a perfect 
>> scale-out, but good enough for a whole class of requirements...
>>     
>
> If users' all mails are stuck on a particular cluster of servers, it's
> possible that suddenly several users in those servers starts increasing
> their disk load or disk usage and starts killing the performance /
> available space for others. If a user's mails were spread across 100
> servers, this would be much less likely.
>   

Sure - I'm not a CouchDB expert, but I think the point is that we would 
need to check the replication options, because you could simply balance 
requests across all the servers holding a given user's data.  I'm 
assuming that the data would be replicated across more than one server 
and that there would be some way of choosing which server to use for a 
given user.
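In rough pseudo-Python, I mean something like this - all the server names and the replica map are made up for illustration, this isn't anything CouchDB actually provides:

```python
import hashlib

# Hypothetical replica map: which servers hold a copy of each
# user's mail database.  Purely illustrative data.
REPLICAS = {
    "alice": ["mail1.example.com", "mail4.example.com", "mail7.example.com"],
}

def server_for(user: str) -> str:
    """Deterministically spread users across their replica servers
    by hashing the user name over the replica set."""
    servers = sorted(REPLICAS[user])
    digest = hashlib.sha1(user.encode()).digest()
    return servers[digest[0] % len(servers)]
```

The point being that each user always lands on the same server (good for caching), but different users spread across the cluster.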

I only know CouchDB from having glanced at the website some time back, 
but I liked the way it looks and thinks like Lotus Notes.  (I did love 
building things with that tool about 15 years ago - the replication was 
just years ahead of its time.  The robustness was extraordinary: I 
remember when the IRA blew up a chunk of Manchester (including one of 
our servers), everyone just went home, started using the Edinburgh or 
London office servers, and carried on as though nothing had happened...)

Actually, its materialised views are rather clever also...

>   
>>> Hmm. I don't really see how it looks like log structured storage.. But 
>>> you do know that multi-dbox is kind of a maildir/mbox hybrid, right?
>>>       
>> Well the access is largely append only, with some deletes and noise at 
>> the writing end, but largely the older storage stays static with much 
>> longer gaps between deletes (and extremely infrequent edits)
>>     
>
> Ah, right. I guess if you think about it from a "single user's mails"
> point of view.
>   

Well, single folder really


>> So maildir is optimised really for deletes, but improves random access 
>> for a subset of operations.  Mailbox is optimised for writes and seems 
>> like it's generally fast for most operations except deletes (people do 
>> worry about having a lot of eggs in one basket, but I think this is 
>> really a symptom of other problems at work).  Mailbox also has improved 
>> packing for small messages and probably has improved cache locality on 
>> certain read patterns
>>     
>
> Yes, this is why I'm also using mbox on dovecot.org for mailing list
> archives.
>   

Actually I use maildir, but apart from delete performance - and deletes 
are usually rare - mailbox seems better for nearly all usage patterns.

It seems that if it were possible to "solve" delete performance, then 
mailbox would become the preferred choice for many requirements (while 
we're at it, let's also solve the backup problem, where the whole file 
changes every day).


>> So one obvious hybrid would be a mailbox type structure which perhaps 
>> splits messages up into variable sized sub mailboxes based on various 
>> criteria, perhaps including message age, type of message or message 
>> size...?  The rapid write delete would happen at the head, perhaps even 
>> as a maildir layout and gradually the storage would become larger and 
>> ever more compressed mailboxes as the age/frequency of access/etc declines.
>>
>> Perhaps this is exactly dbox?
>>     
>
> Something like that. In dbox you have one storage directory containing
> all mailboxes' mails (so that copying can be done by simple index
> updates). Then you have a bunch of files, each about n MB (configurable,
> 2 MB by default). Expunging initially only marks the message as expunged
> in index. Then later (or immediately, configurable) you run a cronjob
> that goes through all dboxes and actually removes the used space by
> recreating those dbox files.
>   

Yeah, sounds good.
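Just to check I've understood the expunge flow you describe, here's a toy sketch - all the names are hypothetical, not Dovecot's real data structures:

```python
# Toy model of the dbox expunge flow: expunging only flags the
# message in the index (cheap); a later pass rewrites the storage
# file without the flagged messages, reclaiming the space.

class DboxFile:
    def __init__(self, messages):
        self.messages = list(messages)   # message bodies in the file
        self.expunged = set()            # offsets flagged in the "index"

    def expunge(self, i):
        self.expunged.add(i)             # index update only, file untouched

    def purge(self):
        """The cronjob step: recreate the file without expunged mails."""
        self.messages = [m for i, m in enumerate(self.messages)
                         if i not in self.expunged]
        self.expunged.clear()

box = DboxFile(["msg-a", "msg-b", "msg-c"])
box.expunge(1)
box.purge()
# box.messages is now ["msg-a", "msg-c"]
```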

You might consider some kind of "head optimisation": we can already 
assume that the latest chunk of mail will be noisy, with a mixture of 
deletes, appends, etc.  Typically mail arrives, gets responded to, and 
gets deleted quickly, but I would *guess* that if a mail survives for 
XX hours in a mailbox, it's likely to stay there for quite a long time, 
until some kind of purge event happens (the user goes on a purge, an 
archive task runs, etc.).
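The tiering decision itself is trivial - something like this, where the threshold stands in for the "XX hours" above (the 48-hour value is made up):

```python
# Hypothetical "head optimisation": mails younger than a threshold
# live in a noisy head store (cheap appends/deletes); anything older
# is promoted into larger, more compressed archive files.
HEAD_MAX_AGE = 48 * 3600   # seconds; placeholder for "XX hours"

def tier_for(delivered_at: float, now: float) -> str:
    """Pick the storage tier for a message by its age."""
    return "head" if now - delivered_at < HEAD_MAX_AGE else "archive"
```

The interesting part would be tuning the threshold against real delete-age distributions.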


Sounds good anyway


Oh, have you considered some "optional" API calls in the storage API?  
The logic would be to let a backend do something clever and split the 
message up in some way, e.g. storing headers separately from bodies, or 
bodies carved up into MIME parts.  The motivation would be optimising 
for a particular access pattern.  E.g. for an SQL database it may well 
be sensible to split headers from the message body in order to optimise 
searching - the current API may not be able to take advantage of that?
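To sketch what I mean by "optional" - a backend advertises a fast path if it has one, and callers fall back otherwise.  None of these names are from Dovecot's actual API:

```python
# Hypothetical optional storage-API call: a backend MAY implement
# split header storage; callers probe for the optional method and
# fall back to fetching the whole message.

class FlatBackend:
    def fetch_message(self, uid):
        return "Subject: hi\r\n\r\nbody text"

class SplitBackend(FlatBackend):
    def fetch_headers(self, uid):        # optional fast path, e.g. a
        return "Subject: hi"             # separate headers table in SQL

def headers(backend, uid):
    """Use the optional fast path when present, else parse the
    headers out of the full message."""
    fast = getattr(backend, "fetch_headers", None)
    if fast is not None:
        return fast(uid)
    return backend.fetch_message(uid).split("\r\n\r\n", 1)[0]
```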

Ed W

