[Dovecot] Replication plans

17 May 2007

      Several companies have been interested in getting replication support
for Dovecot. It looks like I could begin implementing it in a few months
after some other changes. So here's how I'm planning on doing it:
What needs to be replicated:

Saving new mails (IMAP, deliver)
Copying existing mails
Expunges
Flag and keyword changes
Mailbox creations, deletions and renames
Subscription list
IMAP extension changes, such as ACLs

I'll first talk about only master/slave configuration. Later then
master/multi-slave and multi-master.
Comments?
Basic design
Since the whole point of master-slave replication would be to get a
reliable service, I'll want to make the replication as reliable as
possible. It would be easier to implement much simpler master-slave
replication, but in error conditions that would almost guarantee that
some messages get lost. I want to avoid that.
Both slave and master would be running a dovecot-replication process.
They would talk to each others via TCP connection. Master would be
feeding changes (transactions) to slave, and slave would keep replying
how far it's committed the changes. So there would be an increasing
sequence number sent with each transaction.
Master keeps all the changes in memory until slave has replied that it
has committed the changes. If the memory buffer gets too large (1MB?)
because the slave is handling the input too slowly or because it's
completely dead, the master starts writing the buffer to a file. Once
the slave is again responding the changes are read from the file and
finally the file gets deleted.
If the file gets too large (10MB?) it's deleted and slave will require a
resync. Master always keeps track of "user/mailbox -> last transaction
sequence" in memory. When the slave comes back up and tells the master
its last committed sequence, this allows the master to resync only those
mailboxes that had changed. If this mapping doesn't exist (master was
restarted), all the users' mailboxes need to be resynced.
The resyncing will also have to deal with new mailboxes, ACL changes and
such. I'm not completely sure yet how the resyncing will work. Probably
the only important thing here is that it would be good if it would
resolve any message UID conflicts (even though they shouldn't happen,
see below) by comparing either the messages' MD5 contents, or with
maildir it would be enough to just compare the maildir filename.
It's possible that the slave doesn't die completely but only some
transactions fail. This shouldn't normally happen, so the easiest way to
handle it would be to try a full resync, and if that still fails stop
the whole slave. Another way would be to just mark that one user or
mailbox as dirty and try to resync it once in a while.
imap, pop3 and deliver would connect to dovecot-replication either using
UNIX sockets or POSIX message queue. Queue could be faster, but sockets
would allow better security checks. Or I suppose it would be possible to
send some kind of a small authentication token in each message with
queues. The communication protocol would be binary and the existing
dovecot.index.log format would be used as much as possible (making it
simple for slave to append changes to mailboxes' dovecot.index.log).
dovecot-replication process would need read/write access to all users'
mailboxes. So either it would run as root or it would need to have at
least group-permission rights to all mailboxes. A bit more complex
solution would be to use multiple processes each running with their own
UIDs, but I think I won't implement this yet.
The replication would require support for shared mailboxes. That way the
mailboxes could be accessed using the configured shared namespaces, for
example "shared/<username>/INBOX". (LMTP server could be implemented in
a similar way BTW)
Configuration
I'm not yet sure how this should work. The simplest possibility would be
just to configure a single slave and everything would be replicated
there.
But it should be possible to split users into multiple slaves (still one
slave/user). The most configurable way to do this would be to have
userdb return the slave host. Hmm. Actually that could work pretty well
if the userdb returned slave's name and each slave would have a separate
socket/queue where dovecot-replication listens in. I'm not sure if they
could all still be handled by a single dovecot-replication process or if
there should be a separate one for each of them. The configuration would
then be something like:
slave for usernames beginning with letters A..K
replication users_a_k {
host = slave1.example.com
other options?
}
L..Z and others
replication users_l_z_rest {
host = slave2.example.com
}
Saving
This is the most important thing to get right, and also the most complex
one. Besides replicating mails that are being saved via Dovecot, I think
also externally saved mails should be replicated when they're first
seen. This is somewhat related to doing an initial sync to a slave.
The biggest problem with saving is how to robustly handle master
crashes. If you're just pushing changes from master to slave and the
master dies, it's entirely possible that some of the new messages that
were already saved in master didn't get through to slave. So when you
switch to using slave, those messages are gone. The biggest problem with
this is that when saving new mails, they'll be assigned UIDs that were
already used by the original master. If a locally caching client had
already seen the UID from the original master, it'll think the message
is the original one instead of the new one. So the newly added message
pretty much gets lost, and if the user could expunge the message
(thinking it's the original one) without ever even seeing it.
This UID conflict could be avoided by making the master tell slave
whenever it's saving new mails. The latency between master-slave
communication could be avoided with a bit of a trick:

When beginning to save/copy a mail, tell slave to increase UID
counter by one. Slave replies with the new highest-seen-UID counter.
UID allocation is done after all the mails are saved/copied within a
transaction. At this stage wait until slave has replied that it has
reserved UIDs at least as high as we're currently allocating (other
processes could have caused it to get even higher).
If save/copy is aborted, tell the slave to decrease the UID counter
by the number of aborted messages.
UID counter desync shouldn't happen, but if it does handle that also
somehow.

Then the UID allocation latency depends on how long it takes to write
the message, and where the message comes from. With IMAP APPEND there's
probably no wait required, unless the APPEND command and the whole
message arrives in one IP packet. With deliver and copying it might have
to wait a while.
So the UID allocation should work with minimal possible latency. It
could use a separate TCP connection to slave and maybe a separate high
priority process in the slave that would handle only UID allocations.
The above will make sure that conflicting UIDs aren't used, but it
doesn't make sure that messages themselves aren't lost. If the master's
hard disk failed completely and it can't be resynced anymore, the
messages saved there are lost. This could be avoided by making the
saving wait until the message is saved to slave:

save mail to disk, and at the same time also send it to slave
allocate UID(s) and tell to slave what they were, wait for "mail
saved" reply from slave before committing the messages to mailbox
permanently

This would however add more latency than the simpler UID allocation
counter. I don't know if it should be optional which way is used.
Perhaps I'll implement the full save-sync first and if it seems too slow
then I'll implement optionally the UID allocation counter.
Save messages from imap/deliver would contain at least:

user name (unless using UNIX sockets, then it's already known)
mailbox name
UIDVALIDITY
4a) UID
4b) global UID, filename, offset and size of the message

4a) would be used with the UID allocation counter behavior. The message
would be sent after the message was saved and UIDs were allocated.
4b) would be used with the wait-slave-to-save behavior. The message
would be sent immediately whenever new data was received for the
message, so possibly multiple times for each message. The last block
would then contain size=0. After saving all the mails in a transaction,
there would be a separate "allocate UIDs x..y for global UIDs a, b, c,
d, etc." message which the slave then replies and master forwards the
reply to imap/deliver.
If the slave is down, or if a timeout (a few seconds?) occurs while
waiting for the reply from slave, imap/deliver could just go on saving
the message. If this is done it's possible that if the master also dies
soon, there will be UID conflicts. But since this should be a rare error
condition, it's unlikely that the master will die soon after a
misbehaving slave.
Copying
Copying has similar UID conflict problems than saving. The main
difference is that there's no need to transfer message contents, which
makes it faster to handle copies reliably.
Copying could be handled with the UID allocation counter way described
above. Since the message already exists in the server, there's no way
that the message gets lost. So it would be possible to allow master and
slave to go into desync, because the user would still find the message
from the original mailbox.
Move operation is done with "copy + mark \deleted + expunge"
combination. This could be problematic if the master had managed to do
all that before slave saw the COPY at all. Since the message isn't yet
copied to the new mailbox in slave, clients wouldn't see it there, but
locally caching clients might not see the message in the original
mailbox either because they had already received EXPUNGE notification
for it. So this should be avoided in some way.
Perhaps it would be better to just treat copies the same way as saves:
Require a notification from slave before replying to client with OK
message. Alternatively the wait could be done at EXPUNGE command, as
described below.
Expunges
Losing expunge transactions wouldn't be too bad, because the worst that
can happen is that the user needs to expunge the message again. Except
that having an expunged message suddenly appear again in the middle of
the mailbox violates IMAP protocol. I don't know if/how that will break
clients.
Solution here would again be that before EXPUNGE notifications are sent
to client we'll wait for reply from slave that it had also processed the
expunge.
Flag and keyword changes
Most likely doesn't matter if some of them are lost. If someone really
cares about them, it would be possible to again do the slave-waiting for
either all the changes or just when some specific flags/keywords change.
Other changes
Mailbox creations, deletions, renames and subscription changes could all
be handled in a similiar way to flag/keyword changes.
Handling changes implemented by IMAP extensions such as ACL is another
thing. It's probably not worth the trouble to implement a special
replication handler for each plugin. These changes most likely will
happen somewhat rarely though, so they could be implemented by sending
the whole IMAP command to slave and have it execute the command using a
regular imap process. The commands could be marked with some
"replication" flag so Dovecot knows that they need to be sent to slave.
Although instead of executing a new imap process for each user (probably
each command), it could be possible to use one imap process and just
change the mailbox name in the command. So instead of eg. "setacl
mailbox (whatever)" the IMAP process would execute "setacl
shared/username/mailbox (whatever)". That would require telling Dovecot
what parameter contains the mailbox name. Should be pretty easy to
implement, unless some IMAP extensions want multiple mailbox names or
the position isn't static. I can't think of any such extensions right
now though.
Master/multi-slave
Once the master/slave is working, support for multiple slaves could be
added. The main thing to be done here is to have the slaves select which
one of them is the new master, and then they'll have to sync the
mailboxes with each others, because each of them might have seen some
messages that others hadn't. This is assuming that save/etc. operations
that are waiting for a reply from slave would wait for only one reply
from the first slave that happens to respond. The number of slave
replies to wait for could be also configurable, in case you're worried
of a simultaneous master+slave crash.
Multi-master
After master/multi-slave is working, we're nearly ready for a full
multi-master operation. The master farm will need global locking.
Perhaps there is a usable existing implementation which could be used.
In any case there needs to be a way to keeps track of what master owns
what locks.
The easiest way to implement multi-master is to assume that a single
user isn't normally accessed from different servers at the same time.
This wouldn't be a requirement, but it would make the performance
better. For most users you could handle this by making sure that load
balancer transfers all connections from the same IP to the same server.
Or you could use a Dovecot proxy that transfers the connections.
So, once the user is accessed only from one server, the operation
becomes pretty much like master/multi-slave. You'll just need to be able
to dynamically change the current master server for the user. This can
be done by grabbing a global per-user lock before modifying the mailbox.
The lock wouldn't have to be released immediately though. If no other
server needs the lock, there's no point in locking/unlocking constantly.
Instead the lock server could request the current lock owner to release
the lock whenever another server wants it.

[Dovecot] Replication plans

Timo Sirainen

Basic design

Configuration

slave for usernames beginning with letters A..K

other options?

L..Z and others

Saving

Copying

Expunges

Flag and keyword changes

Other changes

Master/multi-slave

Multi-master