[Dovecot] RFC: Storing indexes in a (remote) database to increase dovecot scalability
Hi all,
This is just an idea I had over the weekend: I was wondering whether, instead of storing the indexes/caches on disk, it would be possible to have an option to store them in a remote database (e.g. MySQL). There are several issues that we currently have with scaling Dovecot, and I think that if it could store indexes in an external database we could alleviate most of them. In no particular order:
1. One major issue we have with Dovecot is the number of IOPS it requires to keep its files in sync on our big disk arrays. If this were a MySQL database we'd have better ways of keeping these in sync, for example tuning how aggressively it flushes updates to disk, growth via partitioning/sharding, read-only slave instances, dedicated SSD storage for indexes/caches, etc. (see the sketch after this list). I know that we could create a separate SSD disk array and export that over NFS too, but see issue (2) below. If indexes were stored in a MySQL database we could choose a different reliability level for the indexes (i.e. if the database server crashes we lose one second of index changes, whereas for the email itself we can ensure it is always kept in sync on our other disk arrays).
2. NFS locking issues. No matter how hard we try, we always find there are some issues with concurrency over NFS. It really doesn't seem designed to handle these issues very well. I've not tried NFSv4, and I believe there are some improvements there, but at the same time NFS wasn't really designed to store databases under heavy concurrency. A proper database server (e.g. MySQL, PostgreSQL), on the other hand, thrives in this type of environment, especially when coupled with a transactional storage backend such as InnoDB, which includes deadlock detection and handling.
3. As we store all of our mail over NFS, each export is effectively a single-threaded connection to the server. No matter how fast the central NFS server, you realistically cannot exceed about 10k NFS ops/sec on even the fastest local network, due to latency. With MySQL you could have multiple parallel connections to the central database store (one per process?), which would leave more NFS ops available for actually serving the mail files.
4. Caching. Much more tunable on a database server than in a filesystem: e.g. in MySQL there are lots of options for tuning the various InnoDB/query/key caches, whereas on NFS there are very few, even on advanced NetApp filers.
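To make points (1)-(3) concrete, here's a minimal sketch of what an index store in MySQL might look like. The dovecot_index table, its schema and store_index_record() are names I've made up for illustration, not an existing Dovecot interface:

    # Hypothetical index-blob table; all names and the schema are illustrative.
    import mysql.connector
    from mysql.connector import errorcode

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS dovecot_index (
        mailbox_guid BINARY(16) NOT NULL,
        seq          INT UNSIGNED NOT NULL,
        data         BLOB NOT NULL,
        PRIMARY KEY (mailbox_guid, seq)
    ) ENGINE=InnoDB
    """

    def store_index_record(conn, guid, seq, blob, retries=3):
        # Point (2): InnoDB detects deadlocks itself and aborts one of the
        # transactions; the client just retries. There is no NFS-style stale
        # lock file to clean up.
        for attempt in range(retries):
            try:
                cur = conn.cursor()
                cur.execute("REPLACE INTO dovecot_index "
                            "(mailbox_guid, seq, data) VALUES (%s, %s, %s)",
                            (guid, seq, blob))
                conn.commit()
                return
            except mysql.connector.Error as e:
                conn.rollback()
                if (e.errno != errorcode.ER_LOCK_DEADLOCK
                        or attempt == retries - 1):
                    raise

    # Points (1) and (3): one connection per IMAP/POP3 process, and with
    # innodb_flush_log_at_trx_commit=2 the log is flushed to disk roughly
    # once a second, so a host crash loses at most ~1s of index changes.
    conn = mysql.connector.connect(host="db1", user="dovecot",
                                   password="...", database="mail")
    cur = conn.cursor()
    cur.execute(SCHEMA)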
I know some of you store all your indexes locally on a single server and store the emails on a remote central filestore. Whilst this is a good idea, it doesn't scale very well: if that one server fails, everything has to be reindexed as clients log in, which causes incredible load, as Dovecot has to re-read most of the messages (in POP3 mode at least) in order to work out message sizes and counts etc. In my experience this reindexing would take out our email system for the best part of a day. I'm sure we could set something up with DRBD mirroring or a central SAN array with SSDs for the index data, but this intrinsically does not scale as easily as a replicating database cluster could. There are also failover issues, which can be solved very easily with a database system (e.g. MySQL with an active/passive dual-master setup).
Also, I know that most of these points refer to NFS, and I know there are alternative approaches (e.g. SAN + GFS or some other scalable filesystem); however, most of these are relatively new technologies, whereas NFS is pretty much the standard for remote file storage. Certainly I'd consider it a lot more reliable than the likes of GFS or Lustre. Your typical large-scale mail setup runs over NFS, and that's the sort of environment in which I've run into these scaling issues. I've heard that GFS also doesn't cope very well with 'hot spots', which indexes tend to be, although I've no first-hand experience of this. So, although this post only really mentions NFS and MySQL (the two technologies I'm most familiar with), I think this is a much more general problem; I'm not advocating either technology in particular, beyond their being most people's first choice for ease of use.
What I'm really proposing is a way to separate the two different types of data that Dovecot uses. Indexes and caches see small, frequent transactions, and whilst these are relatively important, dropping the past few seconds of them doesn't seriously matter. Emails, on the other hand, are larger and infrequently changed, and it's important that they are kept fully in sync: if we accept a message, it should be guaranteed that we don't lose it. At the end of the day it seems to me that a filesystem is the ideal place to store emails, whereas a database server would be the ideal place to store indexes/caches. Obviously, for speed of setup, simply storing everything on a filesystem is easy, so I'm not suggesting we abandon that route; but for truly large infrastructures I think being able to use a database system would have significant advantages in each of the four areas above.
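As a sketch of that split (reusing the hypothetical store_index_record() from earlier): the mail write is made durable before we acknowledge the message, while the index write is allowed to lag by up to a second:

    import os

    def deliver_mail(path, message_bytes):
        # Mail must never be lost once accepted: write, fsync, then ACK.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            os.write(fd, message_bytes)
            os.fsync(fd)   # durable on the mail filestore before we ACK
        finally:
            os.close(fd)

    # Index/cache update: losing the last second of these in a DB crash is
    # acceptable, because the index can be rebuilt from the mail files.
    # store_index_record(conn, guid, seq, blob)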
Any thoughts?
Mark
--
Mark Zealey, Platform Architect
Product Development * Webfusion
123-reg.co.uk, webfusion.co.uk, donhost.co.uk, supanames.co.uk
I just wrote a "Scalability plans" mail, which explains what I've been thinking about recently. I think it would solve most of your problems.
On Mon, 2009-08-10 at 11:20 +0100, Mark Zealey wrote:
- One major issue we have with Dovecot is the number of IOPS it requires to keep its files in sync on our big disk arrays. [...]
Seems pretty complex to me, but it should be possible to implement a MySQL FS backend and keep indexes/mails there. They would pretty much have to be just blobs anyway, but I guess that's not an issue for you?
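For illustration, a blob-per-file backend could look roughly like this from the outside (the fs_blobs table and function names are made up here, not Dovecot's actual FS API):

    # Hypothetical sketch: whole files stored as opaque blobs keyed by path.
    #
    # CREATE TABLE fs_blobs (
    #     path VARCHAR(255) PRIMARY KEY,
    #     data LONGBLOB NOT NULL
    # ) ENGINE=InnoDB;

    def fs_put(conn, path, blob):
        cur = conn.cursor()
        cur.execute("REPLACE INTO fs_blobs (path, data) VALUES (%s, %s)",
                    (path, blob))
        conn.commit()

    def fs_get(conn, path):
        cur = conn.cursor()
        cur.execute("SELECT data FROM fs_blobs WHERE path = %s", (path,))
        row = cur.fetchone()
        return row[0] if row is not None else None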
- NFS locking issues. No matter how hard we try, we always find there are some issues with concurrency over NFS.
Solved again.
- As we store all of our mail over NFS, each export is effectively a single-threaded connection to the server. [...]
lib-storage parallelization with an async I/O backend should also solve this. Even when using MySQL, lib-storage parallelization would be required to take advantage of multiple connections.
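As a toy illustration of that point (plain Python, not Dovecot code): with a pool of workers, independent index lookups overlap instead of queueing behind one synchronous connection:

    from concurrent.futures import ThreadPoolExecutor
    import mysql.connector

    def fetch_index(mailbox_guid):
        # One short-lived connection per lookup, for simplicity; a real
        # implementation would pool and reuse connections.
        conn = mysql.connector.connect(host="db1", user="dovecot",
                                       password="...", database="mail")
        try:
            cur = conn.cursor()
            cur.execute("SELECT seq, data FROM dovecot_index "
                        "WHERE mailbox_guid = %s", (mailbox_guid,))
            return cur.fetchall()
        finally:
            conn.close()

    mailboxes = [b"guid-1", b"guid-2", b"guid-3"]  # placeholder GUIDs
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(fetch_index, mailboxes))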
- Caching. Much more tunable on a database server than in a filesystem. [...]
I'm not sure about this. I think the in-memory index cache would solve the worst problem.
Participants (2): Mark Zealey, Timo Sirainen