I'll probably be implementing multi-master replication this summer. My previous thoughts about it are here: http://dovecot.org/list/dovecot/2007-December/027284.html
Below is a description of how the replication protocol will probably work. It should work just as well for master-slave and multi-master setups. Comments welcome. I'll write later a separate mail about how it'll be implemented to Dovecot.
Goals
If a single (or configurable number of) server dies, no mails must be lost that have been reported to IMAP/SMTP clients as being successfully stored.
Must be able to automatically recover from a server desynchronization, such as:
- server has been offline for a long time
- some mail files have been manually added/deleted
- corrupted data/mail files if they're noticed
In multi-master setup if the link between servers die, the servers must be able to proceed autonomously (kind of conflicts with goal 1 though). When the link comes back, the changes are replicated as soon as possible.
Normal IMAP commands must not be able to cause desynchronization between servers. For example making conflicting flag changes simultaneously in two servers must not result in the servers having different flags.
Must perform well with at least 3 servers in a multi-master setup, preferrably still with tens of servers.
Latency shouldn't be increased noticeably when using servers distributed into 3 or more data centers. Must be usable over high-latency links (in optimistic async mode).
In normal operation send minimal incremental changes.
Protocol
There are two major parts in the protocol: Handling the normal incremental changes and fixing desynchronization.
Originally I thought that maybe the changes could be sent using the same format as Dovecot's transaction logs, but now I'm beginning to think that it's probably not that good idea. The code reuse potential is quite minimal and the format would still have to be extended in several ways so that it won't be directly compatible anyway.
So I'm thinking the protocol could be something text-based. The main benefit is that text-based protocols are easier to debug. Stream compression should drop most of the extra overhead if bandwidth is a problem.
The exact on-wire protocol anyway doesn't matter in the design that's discussed below.
The commands have tags similar to IMAP commands, because some commands may have to be forwarded to other servers and it may take a while to get a reply. During the wait the server may process other commands.
Mailbox master
Each mailbox has a single master server selected. In multi-master setups the master server may be moved by having the destination server simply request it from the current master. The current master must then give it up. If link is lost to the current master, one of the remaining servers will become the new master within the remaining servers.
Each server must be connected to the current master server. Since each mailbox can have a different master server, this typically means all servers are connected to each others. However it's possible to create setups where server connects to only one other server, which in turn connect to more servers. This is useful if there are bandwidth bottlenecks between some servers. This kind of a situation can also happen if in a network A-B-C the link A-C dies, but A-B and B-C continues to work. Because of this all commands must be able to function in a way that the server proxies them to the current master, instead of failing the command and trying to make it the caller's problem to resend the command to the actual master.
When the cluster starts up, a single server is selected as the root master for all mailboxes. If a server doesn't know who the current mailbox master is, it asks from the root. All servers cache the currently known mailbox masters to avoid constant requests to the root.
If the root server dies, another server is selected as the root. Because the new root doesn't know what masters have been requested (and asking all of them from all servers would just waste bandwidth), all the servers are expected to flush their master caches and drop their own master status. The new root doesn't respond to any requests before all servers have notified that they've dropped being a master.
The master status doesn't have to be at mailbox level granularity. It could just as well be configured to move at user, domain or even global level. Perhaps this could be done dynamically, so that higher granularity is used when the master is beginning to change too often between servers.
Mailbox ID
Mailbox IDs are session-specific numbers dynamically assigned for user+mailbox+UIDVALIDITY combinations. All connections have different mailbox IDs. Also send and receive directions have different IDs. This allows the sender to easily replace existing IDs to point to new mailboxes without causing any confusion.
MBOX:
- Mailbox ID
- User name
- Mailbox name
- Mailbox UIDVALIDITY
- Mailbox UIDNEXT
- Mailbox message count
If the receiver finds out it has a different UIDVALIDITY, the mailbox requires a full resync. Message count and UIDNEXT may also be used to determine if replication servers are out of sync.
Requesting master status
MASTER-MOVE:
- Mailbox ID
- [Destination SID] (if forwarding)
The command is sent to the last known master for the mailbox. The server will keep forwarding the command until it reaches the current master. During the forwarding other servers may want to request something from the master. These requests must be delayed by the forwarding servers until the move is finished.
Saving messages
SAVE:
- Mailbox IDs
- Received date
- [IMAP UID] (only if we're the master)
- Global UID (stays the same when copying the message)
- Message text Reply:
- [IMAP UID] (only if not specified in parameters)
- [Current mailbox master SID] (if it was moved)
If current server is not the master, the SAVE is sent to the master which gives the message its UID. The master server then replicates the message to other servers with the UID parameter set.
The mailbox master may have already changed by the time server receives a save request. If server receives a SAVE without IMAP UID parameter, it's responsible for finding out the new mailbox master and sending a new SAVE request to it. Once the new master replies with the IMAP UID, the server can reply to the original SAVE request, also providing the new master SID so the future requests can be sent there directly.
To be sure the message doesn't get lost, the server should not reply OK to the IMAP/SMTP client until it has received a reply from SAVE.
Copying messages
COPY:
- Source mailbox ID
- Destination mailbox ID
- Source IMAP UID
- Global UID
- [Destination IMAP UID] (only if we're the master)
- Destination received date Reply:
- [Destination IMAP UID] (only if not specified in parameters)
- [Current mailbox master SID] (if it was moved)
Source mailbox ID + source IMAP UID identifies the message to be copied. It's expected to contain the given global UID (which is just an extra sanity check). Otherwise it works the same way as SAVE.
Since the message already exists, it's probably not necessary to wait for a reply before replying OK to originating IMAP client.
Expunging messages
EXPUNGE:
- Mailbox ID
- UID range (No reply)
Expunges also have to be sent via master server (the same way as SAVE) to avoid COPY command failing in some servers because it was just expunged.
Changing message flags
STORE:
- Mailbox ID
- UID range
- Added flags/keywords
- Removed flags/keywords
- [Current modseq] (master sends)
- [Highest modseq of the messages before this change] (non-master sends)
- [flag: this is a CONDSTORE STORE UNCHANGEDSINCE] (non-master may send) [Reply:
- UIDs where STORE was rejected to (if CONDSTORE flag was used) ]
Stores also have to be sent via master server to avoid flag desynchronization. Master first checks if it has higher modseqs in the messages. Then it applies all the changes and forwards the changes to other servers. For messages that had higher initial modseqs their flags are sent to the server sending the STORE to fix a potential desync.
If CONDSTORE flag is set, the change is rejected for messages that had a higher modseq. Non-masters shouldn't reply to a STORE UNCHANGEDSINCE command before the change has been replicated to master and the rejections have been processed.
Mailbox synchronization
If a mailbox is determined to have changed externally (e.g. network connection down for too long, causing replication logs to get full) the mailbox state needs to be synchronized between servers.
SYNC:
- UIDVALIDITY
- UIDNEXT
- Message count
- For each message:
- UID
- Global UID
- Modseq
- Flags and keywords
- Received date Reply:
- (Sync finished)
Receiving server compares the parameters with its own mailbox state. If it finds previously unseen global UIDs, their message texts are requested:
FETCH:
- Mailbox ID
- UID Reply:
- Message text
SAVE, EXPUNGE and STORE commands are used to synchronize the mailbox.
A special case is when two servers have been saving messages independently from each others. In this case it's possible that the servers have used the same UIDs for different messages (different global UIDs). These need to be resolved by giving both conflicting UIDs new unused UIDs, otherwise IMAP clients may show them as wrong messages from their caches.
FIXME: If the other server had expunged a conflicting UID it still should be given a new UID. How do we find out this has happened?