[Dovecot] Replication protocol design

29 Apr 2008

      I'll probably be implementing multi-master replication this summer. My
previous thoughts about it are here:
http://dovecot.org/list/dovecot/2007-December/027284.html
Below is a description of how the replication protocol will probably
work. It should work just as well for master-slave and multi-master
setups. Comments welcome. I'll write later a separate mail about how
it'll be implemented to Dovecot.
Goals

If a single (or configurable number of) server dies, no mails must be
lost that have been reported to IMAP/SMTP clients as being successfully
stored.

Must be able to automatically recover from a server desynchronization,
such as:

server has been offline for a long time
some mail files have been manually added/deleted
corrupted data/mail files if they're noticed

In multi-master setup if the link between servers die, the servers must
be able to proceed autonomously (kind of conflicts with goal 1 though).
When the link comes back, the changes are replicated as soon as possible.

Normal IMAP commands must not be able to cause desynchronization between
servers. For example making conflicting flag changes simultaneously in two
servers must not result in the servers having different flags.

Must perform well with at least 3 servers in a multi-master setup,
preferrably still with tens of servers.

Latency shouldn't be increased noticeably when using servers distributed
into 3 or more data centers. Must be usable over high-latency links (in
optimistic async mode).

In normal operation send minimal incremental changes.

Protocol
There are two major parts in the protocol: Handling the normal incremental
changes and fixing desynchronization.
Originally I thought that maybe the changes could be sent using the same
format as Dovecot's transaction logs, but now I'm beginning to think that
it's probably not that good idea. The code reuse potential is quite minimal
and the format would still have to be extended in several ways so that it
won't be directly compatible anyway.
So I'm thinking the protocol could be something text-based. The main
benefit is that text-based protocols are easier to debug. Stream
compression should drop most of the extra overhead if bandwidth is a
problem.
The exact on-wire protocol anyway doesn't matter in the design that's
discussed below.
The commands have tags similar to IMAP commands, because some commands may
have to be forwarded to other servers and it may take a while to get a
reply. During the wait the server may process other commands.
Mailbox master
Each mailbox has a single master server selected. In multi-master setups
the master server may be moved by having the destination server simply
request it from the current master. The current master must then give it
up. If link is lost to the current master, one of the remaining servers
will become the new master within the remaining servers.
Each server must be connected to the current master server. Since each
mailbox can have a different master server, this typically means all
servers are connected to each others. However it's possible to create
setups where server connects to only one other server, which in turn
connect to more servers. This is useful if there are bandwidth bottlenecks
between some servers. This kind of a situation can also happen if in a
network A-B-C the link A-C dies, but A-B and B-C continues to work. Because
of this all commands must be able to function in a way that the server
proxies them to the current master, instead of failing the command and
trying to make it the caller's problem to resend the command to the actual
master.
When the cluster starts up, a single server is selected as the root master
for all mailboxes. If a server doesn't know who the current mailbox master
is, it asks from the root. All servers cache the currently known mailbox
masters to avoid constant requests to the root.
If the root server dies, another server is selected as the root. Because
the new root doesn't know what masters have been requested (and asking all
of them from all servers would just waste bandwidth), all the servers are
expected to flush their master caches and drop their own master status. The
new root doesn't respond to any requests before all servers have notified
that they've dropped being a master.
The master status doesn't have to be at mailbox level granularity. It could
just as well be configured to move at user, domain or even global level.
Perhaps this could be done dynamically, so that higher granularity is used
when the master is beginning to change too often between servers.
Mailbox ID
Mailbox IDs are session-specific numbers dynamically assigned for
user+mailbox+UIDVALIDITY combinations. All connections have different
mailbox IDs. Also send and receive directions have different IDs. This
allows the sender to easily replace existing IDs to point to new mailboxes
without causing any confusion.
MBOX:

Mailbox ID
User name
Mailbox name
Mailbox UIDVALIDITY
Mailbox UIDNEXT
Mailbox message count

If the receiver finds out it has a different UIDVALIDITY, the mailbox
requires a full resync. Message count and UIDNEXT may also be used to
determine if replication servers are out of sync.
Requesting master status
MASTER-MOVE:

Mailbox ID
[Destination SID] (if forwarding)

The command is sent to the last known master for the mailbox. The server
will keep forwarding the command until it reaches the current master.
During the forwarding other servers may want to request something from the
master. These requests must be delayed by the forwarding servers until the
move is finished.
Saving messages
SAVE:

Mailbox IDs
Received date
[IMAP UID] (only if we're the master)
Global UID (stays the same when copying the message)
Message text
Reply:
[IMAP UID] (only if not specified in parameters)
[Current mailbox master SID] (if it was moved)

If current server is not the master, the SAVE is sent to the master which
gives the message its UID. The master server then replicates the message to
other servers with the UID parameter set.
The mailbox master may have already changed by the time server receives a
save request. If server receives a SAVE without IMAP UID parameter, it's
responsible for finding out the new mailbox master and sending a new SAVE
request to it. Once the new master replies with the IMAP UID, the server
can reply to the original SAVE request, also providing the new master SID
so the future requests can be sent there directly.
To be sure the message doesn't get lost, the server should not reply OK to
the IMAP/SMTP client until it has received a reply from SAVE.
Copying messages
COPY:

Source mailbox ID
Destination mailbox ID
Source IMAP UID
Global UID
[Destination IMAP UID] (only if we're the master)
Destination received date
Reply:
[Destination IMAP UID] (only if not specified in parameters)
[Current mailbox master SID] (if it was moved)

Source mailbox ID + source IMAP UID identifies the message to be copied.
It's expected to contain the given global UID (which is just an extra
sanity check). Otherwise it works the same way as SAVE.
Since the message already exists, it's probably not necessary to wait for a
reply before replying OK to originating IMAP client.
Expunging messages
EXPUNGE:

Mailbox ID
UID range
(No reply)

Expunges also have to be sent via master server (the same way as SAVE) to
avoid COPY command failing in some servers because it was just expunged.
Changing message flags
STORE:

Mailbox ID
UID range
Added flags/keywords
Removed flags/keywords
[Current modseq] (master sends)
[Highest modseq of the messages before this change] (non-master sends)
[flag: this is a CONDSTORE STORE UNCHANGEDSINCE] (non-master may send)
[Reply:
UIDs where STORE was rejected to (if CONDSTORE flag was used)
]

Stores also have to be sent via master server to avoid flag
desynchronization. Master first checks if it has higher modseqs in the
messages. Then it applies all the changes and forwards the changes to other
servers. For messages that had higher initial modseqs their flags are sent
to the server sending the STORE to fix a potential desync.
If CONDSTORE flag is set, the change is rejected for messages that had a
higher modseq. Non-masters shouldn't reply to a STORE UNCHANGEDSINCE
command before the change has been replicated to master and the rejections
have been processed.
Mailbox synchronization
If a mailbox is determined to have changed externally (e.g. network
connection down for too long, causing replication logs to get full) the
mailbox state needs to be synchronized between servers.
SYNC:

UIDVALIDITY
UIDNEXT
Message count
For each message:
UID
Global UID
Modseq
Flags and keywords
Received date
Reply:

(Sync finished)

Receiving server compares the parameters with its own mailbox state. If it
finds previously unseen global UIDs, their message texts are requested:
FETCH:

Mailbox ID
UID
Reply:
Message text

SAVE, EXPUNGE and STORE commands are used to synchronize the mailbox.
A special case is when two servers have been saving messages independently
from each others. In this case it's possible that the servers have used the
same UIDs for different messages (different global UIDs). These need to be
resolved by giving both conflicting UIDs new unused UIDs, otherwise IMAP
clients may show them as wrong messages from their caches.
FIXME: If the other server had expunged a conflicting UID it still should
be given a new UID. How do we find out this has happened?

[Dovecot] Replication protocol design

Timo Sirainen

Goals

Protocol

Mailbox master

Mailbox ID

Requesting master status

Saving messages

Copying messages

Expunging messages

Changing message flags

Mailbox synchronization