[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem

Mon Sep 28 20:23:36 EEST 2009

Timo Sirainen wrote:
> On Mon, 2009-09-28 at 09:00 -0700, paulmon wrote:
>   
>> My current thinking is a key/value store as you've proposed.  Something like
>> Hadoop components or Project Voldamort.  Voldamort might be a better fit
>> from what I've read. 
>>     
>
> My understanding of Hadoop is that it's more about distributed computing
> instead of storage.
>   

I believe it's possible to use it to ask lots of machines to parse a bit 
of database and then get the answer back from all of them.  eg some 
people are alleged to be using it to parse huge log files in sensible 
time by splitting up their log files across lots of machines and asking 
each of them to do a bit of filtering...

I'm out of my depth at this point - only read the executive summary...

>> My current thinking if having the local delivery break messages up into
>> their component pieces, headers, from address, to address, spam scores, body
>> etc into various key:value relationships.  
>>     
>
> I was planning on basically just storing key=username/message-guid,
> value=message pairs instead of splitting it up. Or perhaps split header
> and body, but I think piecing it smaller than those just makes the
> performance worse. To get different headers quickly there would still be
> dovecot.index.cache (which would be in a some quick in-memory storage
> but also stored in the database).
>   

This can presumably be rephrased as:

- access times are say 10ms
- linear read times are say 60MB/sec
- Therefore don't break up a message into more than 0.010s * 60MB = 
600KB (ish) chunks or your seek times dominate simply doing linear reads 
and throwing away what you don't need...
- Obviously insert whatever timings you like and re-run the numbers, eg 
if you have some fancy pants flash drive then insert shorter seek times

However, these numbers and some very limited knowledge of how a small 
bunch of email clients seem to behave would suggest that the following 
is also worth optimising to varying degrees (please don't overlook 
someone wanting to implement some backend to try these ideas):

Theory: Attachments larger than K are worth breaking out according to 
the formula above
Justification: Above actually a fairly small attachment size it's 
cheaper to do a seek than to linear scan to the next mail message. For 
some storage designs this might be helpful (mbox type packing).  
Additionally some users have suggested that they want to try and single 
instance popular attachments, so "K" might be customisable, or better 
yet some design might choose to keep a cache of attachment fingerprints 
and de-dup them when a dup is next seen..

Theory: breakout (all) headers from bodies
Justification: scanning headers seems a popular task and dovecot keeps a 
local database to optimise the common case.  Reseeks would be slow 
though and some storage designs might be able to optimise and get fast 
linear seeks across all headers (eg pack them as per mbox and compress 
them?)

Theory: breakout individual headers
Justification: err... not got a good case for this one, but some of 
these fancy key value databases are optimised for creating views on 
certain headers across certain messages.  I imagine this won't fly in 
practice, but it seems a shame not to try it... Definitely anyone 
implementing an SQL database option will want to try it though... (bet 
it's slow though...)

Theory: pack message bodies together as per mbox
Justification: mbox seems faster, compresses better and all round seems 
better than maildir for access speed, except in certain circumstances 
such as deletes.  Dovecot already seems to optimise some corner cases by 
just marking messages dead without deleting them, so clearly there is 
tremendous scope for improvement here (dbox going down this route?).  
Some bright spark might design some backend which uses multiple mbox 
files to overcome the huge hit when "defragging" and it may well be that 
by incorporating eg splitting out larger attachments, and lightly 
compressing, then some workloads might see some really good performance! 
(Could be really interesting for archive mailboxes, etc?)

Just my 2p...

Ed