[Dovecot] RFC: Storing indexes in a (remote) database to increase dovecot scalability

Mark Zealey Mark.Zealey at webfusion.com
Mon Aug 10 13:20:12 EEST 2009


Hi all,

This is just an idea I had over the weekend: instead of storing the
indexes/caches on disk, would it be possible to have an option to store
them in a remote database (eg mysql)? There are several issues we
currently have with scaling dovecot, and I think storing indexes in an
external database could alleviate most of them. In no particular order:

1) One major issue we have with dovecot is the number of iops it
requires to keep its index files in sync on our big disk arrays. If
these lived in a mysql database we'd have much better ways of managing
that load, for example tuning how aggressively updates are flushed to
disk, handling growth with partitioning/sharding, adding read-only slave
instances, or putting the indexes/caches on dedicated ssd storage. I
know we could create a separate ssd disk array and export that over nfs
too, but see issue (2) below. Storing indexes in a mysql database would
also let us choose a different durability level for them (ie if the
database server crashes we lose perhaps a second of index changes,
whereas the email itself we can ensure is always kept fully in sync on
our other disk arrays) - see the first sketch after this list.

2) nfs locking issues. No matter how hard we try, we always run into
some concurrency problem over nfs; it just doesn't seem designed to
handle that sort of workload well. I've not tried nfsv4, and I believe
there are some improvements there, but nfs still wasn't designed for
storing databases under heavy concurrent access. A proper database
server (eg mysql, postgres), on the other hand, thrives in exactly this
environment, especially when coupled with a transactional storage
backend such as innodb, which handles deadlock detection and the like
for you (see the second sketch after this list).

3) As we store all of our mail over nfs, each export is effectively a
single-threaded connection to the server. No matter how fast the central
nfs server is, you realistically cannot get above about 10k nfs ops/sec
even on the fastest of local networks, because of latency. With mysql
you could have multiple parallel connections to the central database
store (one per process?), which would leave more nfs ops available for
actually serving the mail files.

4) Caching. This is much more tuneable on a database server than in a
filesystem: in mysql there are plenty of options for tuning the various
innodb/query/key caches, whereas over nfs there are very few, even on
advanced netapps etc.
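
To make (1) a bit more concrete, here is a rough sketch of the sort of
durability knob I have in mind. The host name, credentials and the
choice of the MySQL-python client are just placeholders, and in practice
you'd probably set this in my.cnf rather than at runtime:

    # Hypothetical sketch: relax transaction durability on a MySQL
    # instance holding only index/cache data.  With
    # innodb_flush_log_at_trx_commit=2 the log is written on every
    # commit but only fsync()ed roughly once a second, so a crash loses
    # at most about a second of index updates - the trade-off described
    # in (1).  The mail storage itself stays fully synced elsewhere.
    import MySQLdb

    conn = MySQLdb.connect(host="index-db.example.com", user="dovecot",
                           passwd="secret", db="dovecot_indexes")
    conn.cursor().execute("SET GLOBAL innodb_flush_log_at_trx_commit = 2")
    conn.close()

And for (2), a sketch of the concurrency handling you get for free from
a transactional backend: when two processes update the same rows and
InnoDB picks one as the deadlock victim, that client simply retries
instead of fighting over nfs locks. The table and column names here are
purely hypothetical:

    # Hypothetical sketch: retry an index update if InnoDB aborts our
    # transaction as the deadlock victim (MySQL error 1213,
    # ER_LOCK_DEADLOCK).
    def update_flags(conn, mailbox_id, uid, flags, retries=3):
        for attempt in range(retries):
            try:
                cur = conn.cursor()
                cur.execute("UPDATE index_records SET flags = %s"
                            " WHERE mailbox_id = %s AND uid = %s",
                            (flags, mailbox_id, uid))
                conn.commit()
                return
            except MySQLdb.OperationalError as e:
                if e.args[0] == 1213:  # roll back and try again
                    conn.rollback()
                    continue
                raise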

I know some of you store all your indexes locally on a single server
and keep the emails on a remote central filestore. Whilst this is a good
idea, it doesn't scale very well: if that one server fails, everything
has to be reindexed as clients log in, which causes incredible load
because dovecot has to re-read most of the messages (in pop3 mode at
least) to work out sizes, message counts etc. In my experience this
reindexing would take out our email system for the best part of a day.
I'm sure we could set something up with drbd mirroring or a central san
array with ssd's for the index data, but that intrinsically does not
scale as easily as a replicating database cluster could. There are also
failover issues which are very easily solved with a database system (eg
mysql with an active/passive dual-master setup).

Also, I know that most of these points refer to nfs, and that there are
alternatives (eg a san plus gfs or some other scalable filesystem), but
most of those are relatively new technologies whereas nfs is pretty much
the standard for remote file storage; I'd certainly consider it a lot
more reliable than the likes of gfs or lustre. Your typical large-scale
mail setup runs over nfs, and that is the environment in which I've run
into these scaling issues. I've also heard that gfs doesn't cope well
with 'hot spots', which indexes tend to be, although I've no first-hand
experience of this. So, although this post only really mentions nfs &
mysql (the two technologies I'm most familiar with), I think this is a
much more general problem; I'm not advocating either technology in
particular beyond them being most people's first choices for ease of
use.

What I'm really proposing is a way to separate the two different types
of data that dovecot uses. Indexes & caches see small, frequent
transactions, and whilst these are relatively important, dropping the
last few seconds of them doesn't seriously matter. Emails, on the other
hand, are larger, infrequently changed, and it's important they are kept
fully in sync - if we accept a message it should be guaranteed that we
don't lose it. At the end of the day it seems to me that a filesystem is
the ideal place to store emails, whereas a database server would be the
ideal place to store indexes/caches. Obviously storing everything on a
filesystem is the easiest to set up, so I'm not suggesting we abandon
that route, but for truly large infrastructures I think being able to
use a database system would have significant advantages in each of the
four areas above.
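
To illustrate the split, here is a purely illustrative sketch of what
per-message index/cache rows might look like in SQL. The table and
column names are my own invention - this is not Dovecot's actual index
format, just the rough shape of the data (uid, flags, sizes, cached
header fields) that currently lives in dovecot.index /
dovecot.index.cache:

    # Hypothetical schema sketch only - not Dovecot's real index layout.
    import MySQLdb

    schema = """
    CREATE TABLE IF NOT EXISTS index_records (
        mailbox_id   INT UNSIGNED NOT NULL,
        uid          INT UNSIGNED NOT NULL,
        flags        INT UNSIGNED NOT NULL DEFAULT 0,
        vsize        INT UNSIGNED,    -- virtual (CRLF) size, eg for pop3
        cache_fields MEDIUMBLOB,      -- serialized cached header data
        PRIMARY KEY (mailbox_id, uid)
    ) ENGINE=InnoDB
    """

    conn = MySQLdb.connect(host="index-db.example.com", user="dovecot",
                           passwd="secret", db="dovecot_indexes")
    conn.cursor().execute(schema)
    conn.close()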

Any thoughts?

Mark

--
Mark Zealey -- Platform Architect
Product Development * Webfusion
123-reg.co.uk, webfusion.co.uk, donhost.co.uk, supanames.co.uk


