[Dovecot] Replication protocol design #2
Changes:
- Added goal 8 and rewrote mailbox synchronization plan.
- Added new SELECT command to change active mailbox and removed mailbox ID from command parameters
Goals
If a single (or configurable number of) server dies, no mails must be lost that have been reported to IMAP/SMTP clients as being successfully stored.
Must be able to automatically recover from a server desynchronization, such as:
- server has been offline for a long time
- some mail files have been manually added/deleted
- corrupted data/mail files if they're noticed
In multi-master setup if the link between servers die, the servers must be able to proceed autonomously (kind of conflicts with goal 1 though). When the link comes back, the changes are replicated as soon as possible.
Normal IMAP commands must not be able to cause desynchronization between servers. For example making conflicting flag changes simultaneously in two servers must not result in the servers having different flags.
Must perform well with at least 3 servers in a multi-master setup, preferrably still with tens of servers.
Latency shouldn't be increased noticeably when using servers distributed into 3 or more data centers. Must be usable over high-latency links (in optimistic async mode).
In normal operation send minimal incremental changes.
Make non-incremental synchronization fast enough to allow replication with users' untrusted Dovecot servers which may connect only rarely. Support also super high latency replication (e.g. using USB sticks).
Protocol
There are two major parts in the protocol: Handling the normal incremental changes and fixing desynchronization.
Originally I thought that maybe the changes could be sent using the same format as Dovecot's transaction logs, but now I'm beginning to think that it's probably not that good idea. The code reuse potential is quite minimal and the format would still have to be extended in several ways so that it won't be directly compatible anyway.
So I'm thinking the protocol could be something text-based. The main benefit is that text-based protocols are easier to debug. Stream compression should drop most of the extra overhead if bandwidth is a problem.
The exact on-wire protocol anyway doesn't matter in the design that's discussed below.
The commands have tags similar to IMAP commands, because some commands may have to be forwarded to other servers and it may take a while to get a reply. During the wait the server may process other commands.
Mailbox master
Each mailbox has a single master server selected. In multi-master setups the master server may be moved by having the destination server simply request it from the current master. The current master must then give it up. If link is lost to the current master, one of the remaining servers will become the new master within the remaining servers.
Each server must be connected to the current master server. Since each mailbox can have a different master server, this typically means all servers are connected to each others. However it's possible to create setups where server connects to only one other server, which in turn connect to more servers. This is useful if there are bandwidth bottlenecks between some servers. This kind of a situation can also happen if in a network A-B-C the link A-C dies, but A-B and B-C continues to work. Because of this all commands must be able to function in a way that the server proxies them to the current master, instead of failing the command and trying to make it the caller's problem to resend the command to the actual master.
When the cluster starts up, a single server is selected as the root master for all mailboxes. If a server doesn't know who the current mailbox master is, it asks from the root. All servers cache the currently known mailbox masters to avoid constant requests to the root.
If the root server dies, another server is selected as the root. Because the new root doesn't know what masters have been requested (and asking all of them from all servers would just waste bandwidth), all the servers are expected to flush their master caches and drop their own master status. The new root doesn't respond to any requests before all servers have notified that they've dropped being a master.
The master status doesn't have to be at mailbox level granularity. It could just as well be configured to move at user, domain or even global level. Perhaps this could be done dynamically, so that higher granularity is used when the master is beginning to change too often between servers.
Mailbox ID
Mailbox IDs are session-specific numbers dynamically assigned for user+mailbox+UIDVALIDITY combinations. All connections have different mailbox IDs. Also send and receive directions have different IDs. This allows the sender to easily replace existing IDs to point to new mailboxes without causing any confusion.
MBOX:
- Mailbox ID
- User name
- Mailbox name
- Mailbox UIDVALIDITY Reply:
- [Mailbox UIDVALIDITY] (if changed - command failed)
- Mailbox UIDNEXT
- Highest modseq
If the receiver finds out it has a different UIDVALIDITY, the mailbox requires a full resync. UIDNEXT and modseq can be used to determine if replication servers are out of sync.
After mailbox ID has been assigned, the active mailbox can be changed using a command:
SELECT:
- Mailbox ID
Requesting master status
MASTER-MOVE:
- [Destination SID] (if forwarding)
The command is sent to the last known master for the selected mailbox. The server will keep forwarding the command until it reaches the current master. During the forwarding other servers may want to request something from the master. These requests must be delayed by the forwarding servers until the move is finished.
Saving messages
SAVE:
- Received date
- [IMAP UID] (only if we're the master)
- Global UID (stays the same when copying the message)
- Message text Reply:
- [IMAP UID] (only if not specified in parameters)
- [Current mailbox master SID] (if it was moved)
If current server is not the master, the SAVE is sent to the master which gives the message its IMAP UID. The master server then replicates the message to other servers with the IMAP UID parameter set.
The mailbox master may have already changed by the time server receives a save request. If server receives a SAVE without IMAP UID parameter, it's responsible for finding out the new mailbox master and sending a new SAVE request to it. Once the new master replies with the IMAP UID, the server can reply to the original SAVE request, also providing the new master SID so the future requests can be sent there directly.
To be sure the message doesn't get lost, the server should not reply OK to the IMAP/SMTP client until it has received a reply from SAVE.
Copying messages
FIXME: Source or destination mailbox ID parameter should probably be removed and the selected mailbox's ID used instead.
COPY:
- Source mailbox ID
- Destination mailbox ID
- Source IMAP UID
- Global UID
- [Destination IMAP UID] (only if we're the master)
- Destination received date Reply:
- [Destination IMAP UID] (only if not specified in parameters)
- [Current mailbox master SID] (if it was moved)
Source mailbox ID + source IMAP UID identifies the message to be copied. It's expected to contain the given global UID (which is just an extra sanity check). Otherwise it works the same way as SAVE.
Since the message already exists, it's probably not necessary to wait for a reply before replying OK to originating IMAP client.
Expunging messages
EXPUNGE:
- IMAP UIDs (No reply)
Expunges also have to be sent via master server (the same way as SAVE) to avoid COPY command failing in some servers because it was just expunged.
Changing message flags
STORE:
- IMAP UIDs
- Added flags/keywords
- Removed flags/keywords
- [Current modseq] (master sends)
- [Highest modseq of the messages before this change] (non-master sends)
- [flag: this is a CONDSTORE STORE UNCHANGEDSINCE] (non-master may send) [Reply:
- IMAP UIDs where STORE was rejected to (if CONDSTORE flag was used) ]
Stores also have to be sent via master server to avoid flag desynchronization. Master first checks if it has higher modseqs in the messages. Then it applies all the changes and forwards the changes to other servers. For messages that had higher initial modseqs their flags are sent to the server sending the STORE to fix a potential desync.
If CONDSTORE flag is set, the change is rejected for messages that had a higher modseq. Non-masters shouldn't reply to a STORE UNCHANGEDSINCE command before the change has been replicated to master and the rejections have been processed.
Mailbox synchronization
Network connections to other servers are always initiated by the same servers. From replication point of view they could be thought of as client/server.
The clients keep track of the last seen server state for all mailboxes:
- UIDVALIDITY
- UIDNEXT
- Highest modseq
Client requests mailbox ID with MBOX command. If UIDVALIDITY changed, state is reset by setting UIDNEXT=1 and highest modseq=0. MBOX command is then resent using the updated UIDVALIDITY.
If server's UIDNEXT (server-UIDNEXT) is higher than last seen UIDNEXT (old-UIDNEXT), it means the server has added new messages and we want to fetch them. If the client's UIDNEXT (client-UIDNEXT) is higher than the old-UIDNEXT, it means the client has added new messages and they need to be sent to the server. All messages with UID >= old-UIDNEXT and UID < min(client-UIDNEXT, server-UIDNEXT) have an UID conflict and they must be given new unused UIDs.
The synchronization is done in two batches of commands. Client first selects the mailbox and then starts the sync:
SYNC-BEGIN:
- server's last highest modseq
- client's current highest modseq
- QRESYNC-like sequence + IMAP UID lists for optimizing EXPUNGE replies
Client then proceeds to send all its changes using SAVE, EXPUNGE and STORE commands. The sync is finished with SYNC-END command.
Server replies with:
- List of expunged IMAP UIDs (EXPUNGE command):
- IMAP UIDs
- Added messages (SAVE command?):
- IMAP UID
- Global UID
- Modseq
- Flags and keywords
- Received date
- [Message text]
- Changed flags/keywords/modseqs (STORE commands):
- IMAP UID
- Modseq
- Changed flags and keywords (or if not easily determined, their current values)
- IMAP UID conflict fixes (UIDFIX command):
- Source IMAP UIDs
- Destination IMAP UIDs
- Sync finish:
- Updated UIDNEXT
- Updated highest modseq
Server may also reply with STORE commands that change nothing except modseqs. This is needed to get the modseqs synchronized.
If in high-latency + high-bandwidth mode the message texts are always sent. The message texts are also sent if the server can determine that the client couldn't have previously seen the message (i.e. the message was saved, not copied). Otherwise they have to be requested for the messages with previously unseen global UIDs:
FETCH:
- IMAP UIDs Reply:
- Message texts
By the time client receives the reply from server, it may already have done further changes. This means it may have to fix modseqs and IMAP UIDs internally as well.
On Thu, 2008-05-01 at 14:36 +0300, Timo Sirainen wrote:
Support also super high latency replication (e.g. using USB sticks).
If it wasn't clear how this works:
"Client" and "server" are both Dovecots, client just has no/slow network connection.
Client runs a command to create a replication batch file that's stored on USB memory stick (or whatever).
The USB stick is plugged into another machine with a higher network bandwidth to the server.
The replication batch file is copied to the server and a command is run to process it. The command creates a reply batch file.
The reply file is copied to the USB stick and the file gets moved to the client.
Client runs a command to import the reply file.
This works because the mailbox synchronization doesn't require more than one roundtrip.
- Make non-incremental synchronization fast enough to allow replication with users' untrusted Dovecot servers which may connect only rarely. Support also super high latency replication (e.g. using USB sticks).
The USB stick sync in particular would be a MASSIVELY powerful feature for an admittedly small userbase!
Ed W
Added "current modseq" parameter to all commands and MODSEQ command. Other things missing from the replication protocol are related to mailbox list, i.e. figuring out which mailboxes to replicate, which are newly created, deleted, renamed, etc. There will probably have to be a new "unique global mailbox ID" which is preserved across renames.
I think global message UIDs will also become 128bit IDs instead of 64bit. 128bit GIDs could be generated from:
- 64bit timestamp with nanosecond resolution (or microsecond if clock_gettime() not available). When delivering multiple mails within a process make sure this value increases always.
- 32bit process ID
- 32bit server ID (configurable? based on IPv4 address? 48bit MAC address by reducing PID/timestamp by 16bits? is exposing these to a normal user a security problem?)
I think it's pretty safe to assume that two processes can't deliver a mail within the same nanosecond using the same PID.
Modseq changing
MODSEQ:
- IMAP UIDs
- Updated modseq
When doing expunges, stores or whatever else in future updates modseqs, there's a problem with keeping them synchronized between all servers. This command allows fixing them so that the entire cluster will have the modseq from the server that assigned the highest modseq for the change. This means that some servers may temporarily assign a lower modseq for a message only to be soon updated to a slightly higher modseq.
When master is processing commands if a command's modseq parameter <= highest-modseq, the master sends back a MODSEQ which changes the message's modseq to highest-modseq+1. The command is forwarded to other servers using the updated modseq.
When server receives a MODSEQ command, the modseq is updated only for the messages which currently have a lower modseq. The modseq must be updated also if the message has already been expunged, so that syncing (replication or QRESYNC) can send expunges correctly.
On 5/1/2008 7:36 AM, Timo Sirainen wrote:
Changes:
Hi Timo,
The replication work sounds way cool and I look forward to being able to take advantage of it, but...
I'm also very keen on seeing full support for Shared Folders and client managed ACLs...
Has this fallen to the back-burner? I seem to recall this was planned for 1.2, but haven't heard much about it since...
Thanks for all your hard work!
--
Best regards,
Charles
On May 1, 2008, at 7:46 PM, Charles Marcus wrote:
I'm also very keen on seeing full support for Shared Folders and
client managed ACLs...Has this fallen to the back-burner? I seem to recall this was
planned for 1.2, but haven't heard much about it since...
At least some kind of a shared mailbox support is required for
replication. I was thinking that there would be one replication
process which can write (and read) all users' all mailboxes.
Other than that ACLs aren't high on my TODO list.
participants (3)
-
Charles Marcus
-
Ed W
-
Timo Sirainen