Re: [Dovecot] Replication plans
On Fri, May 18, 2007 1:42 am, Troy Benjegerdes <hozer@hozed.org> said:
I'm going to throw out a warning that it's my feeling that replication has ended many otherwise worthwhile projects. Once you go down that rabbit hole, you end up finding out the hard way that you just can't avoid the stability, performance, complexity, and whatever problems. .. I've found myself pretty much in the same "all roads lead to the filesystem" situation. I don't want to replicate just imap, I want to replicate the build directory with my source code, my email, and my MP3 files.
One of the problems with the clustered file system approach seems to be that accessing Dovecot's index, cache and control files is slow over the network. For speed, you ideally want the index, cache and control files on local disk... but still replicated to another server.
So what about tackling this replication problem from a different angle... Make it Dovecot's job to replicate the index and control files between servers, and make it the file system's job to replicate just the mail data. This would require further disconnecting the index and control files from the mail data, so that less syncing is required; i.e., remove the need to check directory mtimes and to compare directory listings against the index, and instead assume that the indexes are always correct. Periodically you could still check whether a sync is needed, but you'd do this much less frequently.
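To illustrate the "assume the indexes are always correct" idea, here's a minimal C sketch of a periodic sync check: trust the index by default, and only compare the directory mtime against the last-synced mtime after some interval has passed. The struct, function names, and interval are illustrative assumptions, not Dovecot's actual API:

```c
/* Sketch: trust the index most of the time, and only re-check the
 * mail directory's mtime every SYNC_INTERVAL_SECS. All names here
 * are illustrative, not Dovecot internals. */
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

#define SYNC_INTERVAL_SECS 300   /* re-check at most every 5 minutes */

struct mailbox_state {
	time_t last_sync_check;  /* when we last compared mtimes */
	time_t synced_mtime;     /* directory mtime at last full sync */
};

/* Return true if a full directory-vs-index sync is worth doing now. */
static bool sync_check_needed(struct mailbox_state *box, const char *dir)
{
	struct stat st;
	time_t now = time(NULL);

	/* Within the interval: trust the index, skip the stat() entirely. */
	if (now - box->last_sync_check < SYNC_INTERVAL_SECS)
		return false;
	box->last_sync_check = now;

	if (stat(dir, &st) < 0)
		return true;  /* can't tell; play it safe and sync */
	if (st.st_mtime != box->synced_mtime) {
		box->synced_mtime = st.st_mtime;
		return true;
	}
	return false;
}
```

The point is that the stat() and directory-listing comparison become a rare background check instead of a per-access cost, which is what makes keeping mail data on a networked filesystem tolerable.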
I agree that there are already great solutions available for replicated storage, so this would allow us to take advantage of these solutions for the bulk of our storage without impacting the speed of IMAP.
Bill
On Fri, 2007-05-18 at 12:20 -0400, Bill Boebel wrote:
So what about tackling this replication problem from a different angle... Make it Dovecot's job to replicate the index and control files between servers, and make it the file system's job to replicate just the mail data. This would require further disconnecting the index and control files from the mail data, so that less syncing is required; i.e., remove the need to check directory mtimes and to compare directory listings against the index, and instead assume that the indexes are always correct. Periodically you could still check whether a sync is needed, but you'd do this much less frequently.
This would practically mean that you want either cydir or dbox storage.
This kind of hybrid replication / clustered-filesystem implementation would simplify the replication a bit, but all the difficult things related to UID conflicts etc. will still be there. So I wouldn't mind implementing this, but I think sending the message content over a TCP socket as well wouldn't add much more code.
The clustered filesystem could probably be used to simplify some things though. For example, UID allocation could be done by renaming a "uid-<next uid>" file: if the rename() succeeded, you allocated the UID; otherwise someone else did, and you'll have to find the new filename and try again. But I'm not sure if this kind of special-case handling would be good. Unless of course I decide to use the same thing for non-replicated cydir/dbox.
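The rename()-based allocation could look roughly like the C sketch below. It assumes rename() is atomic across nodes on the clustered filesystem; the filenames, the readdir()-based rescan, and the error handling are illustrative assumptions, not anything Dovecot actually does:

```c
/* Sketch of rename()-based UID allocation on a clustered filesystem
 * where rename() is atomic. A "uid-<next uid>" file records the next
 * free UID; winning the rename to "uid-<next uid + 1>" claims it. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <errno.h>
#include <dirent.h>
#include <stdlib.h>

/* Scan dir for the current "uid-<n>" file; return n, or 0 if missing. */
static uint32_t find_next_uid(const char *dir)
{
	DIR *d = opendir(dir);
	struct dirent *de;
	uint32_t uid = 0;

	if (d == NULL)
		return 0;
	while ((de = readdir(d)) != NULL) {
		if (strncmp(de->d_name, "uid-", 4) == 0) {
			uid = (uint32_t)strtoul(de->d_name + 4, NULL, 10);
			break;
		}
	}
	closedir(d);
	return uid;
}

/* Try to claim a UID by renaming uid-<n> to uid-<n+1>.
 * Returns the allocated UID, or 0 on failure. */
static uint32_t allocate_uid(const char *dir)
{
	char oldpath[512], newpath[512];
	uint32_t uid;

	for (;;) {
		uid = find_next_uid(dir);
		if (uid == 0)
			return 0;
		snprintf(oldpath, sizeof(oldpath), "%s/uid-%u", dir, uid);
		snprintf(newpath, sizeof(newpath), "%s/uid-%u", dir, uid + 1);
		if (rename(oldpath, newpath) == 0)
			return uid;   /* we won the race: this UID is ours */
		if (errno != ENOENT)
			return 0;     /* unexpected error */
		/* Someone else renamed the file first; rescan and retry. */
	}
}
```

The appeal of this trick is that it needs no lock files or lock server: the filesystem's own rename atomicity serializes the allocators, and a loser simply observes the new filename and retries.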
On Fri, May 18, 2007 at 12:20:13PM -0400, Bill Boebel wrote:
On Fri, May 18, 2007 1:42 am, Troy Benjegerdes <hozer@hozed.org> said:
I'm going to throw out a warning that it's my feeling that replication has ended many otherwise worthwhile projects. Once you go down that rabbit hole, you end up finding out the hard way that you just can't avoid the stability, performance, complexity, and whatever problems. .. I've found myself pretty much in the same "all roads lead to the filesystem" situation. I don't want to replicate just imap, I want to replicate the build directory with my source code, my email, and my MP3 files.
One of the problems with the clustered file system approach seems to be that accessing Dovecot's index, cache and control files is slow over the network. For speed, you ideally want the index, cache and control files on local disk... but still replicated to another server.
Don't assume that the network is slower than disk. Both InfiniBand and 10 Gigabit Ethernet have roughly 10-20 times the raw bandwidth of a single disk spindle, and around 100-1000 times lower latency if you can get the data out of another node's RAM (10 or 100 microseconds instead of 10 milliseconds for a disk seek).
If what you want is speed, you want to keep the data in RAM, or at least in the RAM-backed OS buffer cache. If the index, cache, and control files can be replicated to every node and still leave, say, half the memory for actual message data, you win. If the replicated data files start pushing each other out of memory, you lose, and would be better off with the proxy approach where each node is responsible for a portion of the index, cache, and control files.
For what it's worth, AFS 'replicates' the file data to a local disk cache. Linux NFS with cachefs will also support a local disk-cache backed network filesystem. Where AFS (and probably NFS+cachefs) falls down is when the files (or directories) change a lot and you have to go back to the server all the time to fetch a new version. So maildir is a big win, except when a new message gets delivered and the clients all have to fetch a new directory list from the fileserver.
So what about tackling this replication problem from a different angle... Make it Dovecot's job to replicate the index and control files between servers, and make it the file system's job to replicate just the mail data. This would require further disconnecting the index and control files from the mail data, so that less syncing is required; i.e., remove the need to check directory mtimes and to compare directory listings against the index, and instead assume that the indexes are always correct. Periodically you could still check whether a sync is needed, but you'd do this much less frequently.
I agree that there are already great solutions available for replicated storage, so this would allow us to take advantage of these solutions for the bulk of our storage without impacting the speed of IMAP.
I suppose that to really reduce the mtime lookups and syncing, you'd probably need to use dbox, so that there's no possibility of some other program accessing the maildirs.
participants (3)
- Bill Boebel
- Timo Sirainen
- Troy Benjegerdes