This is something I figured out a few months ago, mainly because this one guy at work (hi, Stu) kept telling me my multi-master replication plan sucked and we should use some existing scalable database. (I guess it didn't go exactly like that, but that's the result anyway.)
So, my current plan is based on a couple of observations:
Index files are really more like memory dumps. They're already in an optimal format for keeping in memory, so they can just be mmap()ed and used as-is. Translating them to some other format would only make things more complex and slower.
I can change all the indexing and dbox code so that it requires no locks and never overwrites files. Only a few filesystem operations are needed, primarily the ability to atomically append to a file.
Index and mail data are very different. Index data is accessed constantly and must have very low latency, or performance will be horrible. It should practically always be in memory on the local machine, and accessing it normally shouldn't require any network lookups.
Mail data, on the other hand, is written once and usually read maybe once or a couple of times. Caching mail data in memory probably doesn't help all that much. Latency isn't such a horrible issue either, as long as multiple mails can be fetched at once / in parallel, so there's only a single latency wait.
So the high level plan is:
1. Change the index/cache/log file formats in a way that allows lockless writes.
2. Abstract out the filesystem access in the index and dbox code and implement regular POSIX filesystem support.
3. Make lib-storage able to access mails in parallel and send multiple "get mail" requests in advance.
(3.5. Implement an async I/O filesystem backend.)
4. Implement a multi-master filesystem backend for index files. The idea is that all servers accessing the same mailbox must talk to each other over the network, and every time something changes, push the change to the other servers. This is actually very similar to my previous multi-master plan. One of the servers accessing the mailbox would still act as a master and handle conflict resolution and writing the indexes to disk every once in a while.
5. Implement a filesystem backend for dbox and permanent index storage using some scalable distributed database, such as maybe Cassandra. This is the part I've thought about the least, but it's also the part I hope to (mostly) outsource to someone else. I'm not going to write a distributed database from scratch.
This actually should solve several issues:
Scalability, of course. It'll be as scalable as the distributed database being used to store mails.
NFS reliability! Even if you don't care about any of these alternative databases, this still solves the NFS caching problems. You'd keep using the regular POSIX FS API (or the async FS API) but with the in-memory index "cache", so only a single server is writing to a mailbox's indexes at a time.
Shared mailboxes. Since the filesystem API is abstracted, it should be possible to add another layer that handles accessing other users' mails from both local and remote servers. This should finally make it easy to support shared mailboxes with system users.