[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem

Timo Sirainen tss at iki.fi
Wed Aug 12 20:17:09 EEST 2009


On Wed, 2009-08-12 at 17:46 +0100, Ed W wrote:
> My expectation then is that with local index and sql message storage 
> that the performance should be very reasonable for a large class of 
> users... (ok, other problems perhaps arise)

If messages are stored in SQL as dumb blobs, then the performance is
probably comparable to that of any other database I'm considering.
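A minimal sketch of the "dumb blob" idea, using sqlite3 purely for illustration; the schema, table, and function names here are hypothetical, not anything Dovecot actually uses:

```python
import sqlite3

# Hypothetical schema: each message is an opaque blob keyed by a GUID.
# The database does no parsing of the mail; all IMAP-level state lives
# in Dovecot's own index files, so any blob store would work the same way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mails (guid TEXT PRIMARY KEY, data BLOB)")

def save_mail(guid, raw):
    conn.execute("INSERT INTO mails (guid, data) VALUES (?, ?)", (guid, raw))

def fetch_mail(guid):
    row = conn.execute("SELECT data FROM mails WHERE guid = ?",
                       (guid,)).fetchone()
    return row[0] if row else None

save_mail("msg-0001", b"From: ...\r\n\r\nbody")
```

Since the backend only ever sees opaque bytes, swapping SQL for any other key-value store is a matter of reimplementing these two calls.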

> > Yes, S3 should be possible. With dbox it could even be used to store 
> > the old mails and keep new mails in lower latency storage.
> 
> Mogile doesn't handle S3, but I always thought it would be terrific to 
> be able to have one copy of your data on fast local storage, but to be 
> able to use slower (sometimes cheaper) storage for backups or less 
> valuable data (eg older messages), ie replicating some data to other 
> storage boxes

dsync can do the replication, dbox can have primary/secondary partitions
for message data (if mail is not found from primary, it's looked up from
secondary). All that's needed is lib-storage backend for S3, or using
some filesystem layer to it. :)
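The primary/secondary split described above ended up expressed in Dovecot configuration as an alternate storage path; a sketch, assuming the mdbox `ALT=` syntax and with example paths:

```conf
# Primary storage on fast local disk, secondary (ALT) on slower/cheaper
# storage. If a mail file is not found in the primary path, it is looked
# up from the ALT path. Both paths here are examples only.
mail_location = mdbox:~/mdbox:ALT=/slow-storage/%u/mdbox
```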

> > CouchDB seems like it would still be more difficult than necessary to 
> > scale. I'd really just want something that distributes the load and 
> > disk usage evenly across all servers and allows easily plugging in 
> > more servers and it automatically rebalances the load. CouchDB seems 
> > like much of that would have to be done manually (or building scripts 
> > to do it).
> 
> Ahh fair enough - I thought it being massively multi-master would allow 
> simply querying different machines for different users.  Not a perfect 
> scale-out, but good enough for a whole class of requirements...

If all of a user's mails are stuck on one particular cluster of servers,
it's possible that several users on those servers suddenly start
increasing their disk load or disk usage, killing the performance /
available space for everyone else. If each user's mails were spread
across 100 servers, this would be much less likely.
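The difference between the two placement strategies can be sketched with a toy hash-based placement function; the cluster size, hash choice, and names are all illustrative assumptions, not Dovecot code:

```python
import hashlib

N_SERVERS = 100  # example cluster size

def server_for(user, msg_guid=None):
    """Pick a server. Per-user placement (no msg_guid) pins all of a
    user's mail to one server; per-message placement spreads it out."""
    key = user if msg_guid is None else f"{user}/{msg_guid}"
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_SERVERS

# Per-user: every message lands on the same server (a potential hotspot).
assert len({server_for("ed") for _ in range(10)}) == 1

# Per-message: a user's messages scatter across most of the cluster,
# so one heavy user adds only a little load to each server.
servers = {server_for("ed", f"msg-{i}") for i in range(1000)}
assert len(servers) > 50
```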

> > Hmm. I don't really see how it looks like log structured storage.. But 
> > you do know that multi-dbox is kind of a maildir/mbox hybrid, right?
> 
> Well the access is largely append only, with some deletes and noise at 
> the writing end, but largely the older storage stays static with much 
> longer gaps between deletes (and extremely infrequent edits)

Ah, right. I guess if you think about it from a "single user's mails"
point of view.

> So maildir is optimised really for deletes, but improves random access 
> for a subset of operations.  Mailbox is optimised for writes and seems 
> like it's generally fast for most operations except deletes (people do 
> worry about having a lot of eggs in one basket, but I think this is 
> really a symptom of other problems at work).  Mailbox also has improved 
> packing for small messages and probably has improved cache locality on 
> certain read patterns

Yes, this is why I'm also using mbox on dovecot.org for mailing list
archives.

> So one obvious hybrid would be a mailbox type structure which perhaps 
> splits messages up into variable sized sub mailboxes based on various 
> criteria, perhaps including message age, type of message or message 
> size...?  The rapid write delete would happen at the head, perhaps even 
> as a maildir layout and gradually the storage would become larger and 
> ever more compressed mailboxes as the age/frequency of access/etc declines.
> 
> Perhaps this is exactly dbox?

Something like that. In dbox you have one storage directory containing
all mailboxes' mails (so that copying can be done with simple index
updates). The directory holds a bunch of files, each about n MB in size
(configurable, 2 MB by default). Expunging initially only marks the
message as expunged in the index. Later (or immediately, configurable) a
cronjob goes through all the dbox files and actually reclaims the space
by recreating them without the expunged messages.
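The repack step can be modelled with a toy rewrite-and-rename pass; the one-record-per-line file format here is invented for illustration and is nothing like dbox's real on-disk format:

```python
import os

def repack(path, expunged_guids):
    """Rewrite a storage file, dropping expunged messages, then
    atomically replace the original. Expunge itself only flags the
    message in the index; this pass is what reclaims the disk space."""
    kept = []
    with open(path, "rb") as f:
        for line in f:                      # one "guid TAB data" record per line
            guid, _, _data = line.partition(b"\t")
            if guid.decode() not in expunged_guids:
                kept.append(line)
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.writelines(kept)
    os.rename(tmp, path)                    # atomic replace on POSIX

# Usage sketch: two messages stored, the first one expunged.
with open("m.1", "wb") as f:
    f.write(b"g1\tfirst mail\n")
    f.write(b"g2\tsecond mail\n")
repack("m.1", {"g1"})
```

Writing to a temporary file and renaming means readers always see either the old or the new file, never a half-rewritten one.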

> It would also be interesting to consider separating message headers from 
> body content.  Have heavy localisation of message headers, and slower 
> higher latency access to the message body.  Would this improve access 
> speeds in general?  

Probably not much. Usually, I think, clients download a specific set of
headers, and those can be looked up from the dovecot.index.cache file.
Although if a header that isn't already in the cache is looked up from
all messages, it would be faster to scan the headers if they were packed
together separately. But then again, that would make downloading a full
message somewhat slower, since it would be split into two places.

I don't really know, but my feeling is that it wouldn't help all that
much.

> Also the mime structure could be torn apart to store 
> attachments individually - the motivation being single instance storage 
> of large attachments with identical content...  Anyway, these seem like 
> very speculative directions...

Yes, this is also something in dbox's far future plans.

> > I haven't really done any explicit benchmarks, but there are a few 
> > reasons why I think low-latency for indexes is really important:
> 
> I think low latency for indexes is a given.  You appear to have 
> architected the system so that all responses are delivered from the 
> index and baring an increase in index efficiency the remaining time is 
> spent doing the initial generation and maintenance of those indexes.  I 
> would have thought bar downloading an entire INBOX that the access time 
> of individual mails was very much secondary?

There are of course clients that download lots of mails, one command at
a time. I guess some kind of predictive prefetching could help with
those.

> > Yes, resolving conflicts due to split brain merging back is something 
> > I really want to make work as well as it can. The backend database can 
> > hopefully again help here (by noticing there was a conflict and 
> > allowing the program to resolve it).
> 
> In general conflict resolution is thrown back to the application, so 
> likely this is going to become a dovecot problem.  It seems that the 
> general class of problem is too hard to solve at the storage side

Right. I really want to be able to handle the conflict resolution
myself.

> > This is also one of its goals :) Even if I make a mistake in choosing 
> > a bad database first, it should be somewhat easy to implement another 
> > backend again. The backend FS API will be pretty simple. Basically 
> > it's going to be:
> 
> I wouldn't get too held back by posix semantics.  For sure they are 
> memorable, but definitely consider that transactions are the key to any 
> kind of database performance improvement and make sure you can batch 
> together stuff to make good use of the backend.  Your "flush" command 
> seems to be the implicit end of transaction, but I guess give it plenty 
> of thought that you might have a super slow system (eg S3) and the 
> backend might want the flexibility to mark something "kind of done", 
> while uploading for 30 seconds in the background, then marking it 
> properly done once S3 actually acks the data saved?

Well, the API's details can change, but those are the basic operations
it needs. The important points are that it doesn't need to overwrite
data in existing files and it doesn't need locking, but it does need
atomic appends with the ability to know the offset where the append
actually saved the data.
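Those three properties (no overwrites, no locks, append that reports its offset) can be sketched as a tiny wrapper over O_APPEND; this is a minimal single-local-writer illustration, not Dovecot's actual lib-storage API, and a real backend on NFS or S3 would need considerably more care:

```python
import os

class AppendOnlyFile:
    """Minimal sketch of the append-only backend contract: existing
    data is never overwritten, no locks are taken, and append() returns
    the offset where the data actually landed. Assumes a single local
    writer per file; names are illustrative."""

    def __init__(self, path):
        # O_APPEND: every write goes to the current end of the file.
        self.fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)

    def append(self, data):
        offset = os.fstat(self.fd).st_size  # end of file before our write
        os.write(self.fd, data)
        return offset

    def flush(self):
        os.fsync(self.fd)                   # the implicit "end of transaction"

    def close(self):
        os.close(self.fd)

# Usage sketch: each append reports where its record starts, so the
# index can reference "file m.1, offset N" without ever re-reading it.
f = AppendOnlyFile("storage.dbox")
off1 = f.append(b"first record")
off2 = f.append(b"second record")
f.flush()
f.close()
```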


More information about the dovecot mailing list