[Dovecot] Dovecot clustering with dsync-based replication

Tue Feb 28 16:03:54 EET 2012

This document describes a design for a dsync-replicated Dovecot cluster.
This design can be used to build at least two different types of dsync
clusters, which are both described here. Ville has also drawn overview
pictures of these two setups, see
http://www.dovecot.org/img/dsync-director-replication.png and
http://www.dovecot.org/img/dsync-director-replication-ssh.png

First of all, why dsync replication instead of block level filesystem
replication?

 - dsync won't replicate filesystem corruption.

 - A cold restart of replication won't go through all of the data in the
disks, but instead quickly finds out what has changed.

 - Split brain won't result in downtime or losing any data. If both
sides did changes, the changes are merged without data loss.

 - If using more than 2 storages, the users' replicas can be divided
among the other storages. So if one storage goes down, the extra load is
shared by all the other storages, not just one.

Replication mail plugin
-----------------------

This is a simple plugin based on notify plugin. It listens for all
changes that happen to mailboxes (new mails, flag changes, etc.) Once it
sees a change, it sends an asynchronous (username, priority)
notification to replication-notify-fifo. The priority can be either high
(new mails) or low (everything else).

Optionally the replication plugin can also support synchronous
replication of new mail deliveries. In this way it connects to
replication-notify UNIX socket, tells it to replicate the user with sync
(=highest) priority and waits until it is done or
replication_sync_timeout occurs. The IMAP/LMTP client won't see an "OK"
reply until the mail is replicated (or the replication has failed). The
synchronous replication probably adds a noticeable delay, so it might
not be acceptable for IMAP, but might be for LMTP.

So, what is listening in those replication-notify* sockets? It depends
on if Dovecot is running on director-based setup or not.

Aggregator
----------

When running in Dovecot director-based setup, all of the Dovecot
backends (where replication plugin runs) also run "aggregator" process.
Its job is very simple: It proxies the notifications from mail plugin
and sends them via a single TCP connection to the replicator process
running in Dovecot proxies. This is simply an optimization to avoid tons
of short lived TCP connections directly from replication plugin to
director server.

When not running in Dovecot director setup (i.e. there is only a single
Dovecot instance that handles all of the users), there is no point in
having an aggregator proxy, because the replicator process is running on
the same server. In this kind of setup the replicator process directly
listens on the replication-notify* sockets.

Replicator
----------

The initial design for replicator isn't very complex either: It keeps a
priority queue of all users, and replicates those users at the top of
the queue. Notifications about changes to user's mailboxes (may) move
the user up in the priority queue. If the user at the top of the queue
already has been replicated "recently enough", the replicator stops its
work until new changes arrive or the "recently enough" is no longer
that.

dsync can do two types of syncs: quick syncs and full syncs. A quick
sync trusts indexes and does the replication with the least amount of
work and network traffic. A quick sync is normally enough to replicate
all changes, but just in case something has gone wrong there's also the
full sync option, which guarantees that the mailboxes end up being fully
synced. A full sync is slower though, and uses more network traffic.

The priority queue is sorted by:

 1. Priority (updated by a notification from replication plugin)

 2. If priority!=none: Last fast sync (those users are replicated first
whose last replication time is oldest)

 2. If priority=none: Last full sync (these users should already be
fully synced, but do a full sync for them once in a while anyway)

All users get added to the replication queue at replicator startup with
"none" priority. The list of users is looked up via userdb iteration. If
the previous replication state is found from a disk dump, it's used to
update the priorities, last_*_sync timestamps and other replication
state. Replicator process creates such dumps periodically [todo: every
few mins? maybe a setting?].

Replicator starts replicating users at the top of the queue, setting
their priorities to "none" before starting. This means that if another
change notification arrives during replication, the priority is bumped
up and no changes get lost. replication_max_conns setting specifies how
many users are replicated simultaneously. If the user's last_full_sync
is older than replication_full_sync_interval setting, a full sync is
done instead of a fast sync. If the user at the top of the queue has
"none" priority and the last_full_sync is newer than
replication_full_sync_interval, the replication stops. [todo: it would
be nice to prefer doing all the full syncs at night when there's
hopefully less disk I/O]

(A global replication_max_conns setting isn't optimal in proxy-based
setup, where different backend servers are doing the replication. There
it should maybe be a per-backend setting. Then again, it doesn't account
for the replica servers that also need to do replication work. Also to
properly handle this each backend should have its own replication queue,
but this requires doing a userdb lookup for each user to find out their
replication server, and this would need to be done periodically in case
the backend changes, which can easily happen often with director-based
setup. So all in all, none of this is being done in the initial
implementation. Ideally the users are distributed in a way that a global
replication queue would work well enough.)

In director-based setup each director runs a replicator server, but only
one of them (master) actually asks the backends to do the replication.
The rest of them just keep track of what's happening, and if the master
dies or hangs, one of the others becomes the new master. The server with
lowest IP address is always the master. The replicators are connected to
a ring like the directors, using the same director_servers setting. The
communication between them is simply about notifications of what's
happening to users' priorities. Preferably the aggregators would always
connect to the master server, but this isn't required. In general
there's not much that can go wrong, since it's not a problem if two
replicators request a backend to start replication for the same user or
if the replication queue states aren't identical.

If the replication is running too slowly [todo: means what exactly?],
log a warning and send an email to admin.

So, how does the actual replication happen? Replicator connects to
doveadm server and sends a "sync -u user at domain" command. In
director-based setup the doveadm server redirects this command to the
proper backend.

doveadm sync
------------

This is an independent feature from all of the above. Even with none of
it implemented, you could run this to replicate a user. Most of this is
already implemented. The only problem is that currently you need to
explicitly tell it where to sync. So, when the destination isn't
specified, it could do a userdb lookup and use the returned
"mail_replica" field as the destination. Multiple (sequentially
replicated) destinations could be supported by returning
"mail_replica2", "mail_replica3" etc. field.

In NFS-based (or shared filesystem-based in general) setup the
mail_replica setting is identical to mail_location setting. So your
primary mail_location would be in /storage1/user/Maildir, while the
secondary mail_replica would be in /storage2/user/Maildir. Simple.

In non-NFS-based setup two Dovecot servers talk dsync protocol to each
others. Currently dsync already supports SSH-based connections. It would
also be easy to implement direct TCP-based connections between two
doveadm servers. In future these connections could be SSL-encrypted.
Initially I'm only supporting SSH-based connections, as they're already
implemented. So what does the mail_replica setting look like in this
kind of a setup? I'm not entirely sure. I'm thinking that it could be
either "ssh:host" or "ssh:user at host", where user is the SSH login user
(this is opposite of the current doveadm sync command line usage). In
future then it could support also tcp:host[:port]. Both of these ssh:
and tcp: prefixes would also be supported by doveadm sync command line
usage (and perhaps the prefixless user at domain be deprecated).

dsync can run without any long lived locking and it typically works
fine. In case mailbox was modified during dsync, the replicas may not
end up being identical, but nothing breaks. dsync currently usually
notices this and logs a warning. When these conflicting changes was
caused by imap/pop3/lda/etc. this isn't a problem, they've already
notified replicator already to perform another sync that will fix it.

Running two dsyncs at the same time is more problematic though, mainly
related to new emails. Both dsyncs notice that mail X needs to be
replicated, so both save it and it results in having a duplicate. To
avoid this, there should be a dsync-lock. If this lock exists, dsync
should wait until the previous dsync is done and then do it again, just
in case there were more changes since the previous sync started.

This should conclude everything needed for replication itself.

High-availability NFS setup
---------------------------

Once you have replication, it's of course nice if the system
automatically recovers from a broken storage. In NFS-based setups the
idea is to do soft mounts, so if the NFS server goes away things start
failing with EIO errors, which Dovecot notices and switches to using the
secondary storage(s).

In v2.1.0 Dovecot already keeps track of mounted filesystems. Initially
they're all marked as "online". When multiple I/O errors occur in a
filesystem [todo: how many exactly? where are these errors checked, all
around in the code or checking the log?] the mountpoint is marked as
"offline" and the connections accessing that storage are killed [todo:
again how exactly?].

Another job for replication plugin is to hook into namespace creation.
If mail_location points to a mountpoint marked as "offline", it's
replaced with mail_replica. This way the user can access mails from the
secondary storage without downtime. If the replica isn't fully up to
date, this means that some of the mails (or other changes) may
temporarily be lost. These will come back again after the original
storage has come back up and replication has finished its job. So as
long as mails aren't lost in the original storage, there won't be any
permanent mail loss.

When an offline storage comes back online, its mountpoint's status is
initially changed to "failover" (as opposed to "online"). During this
state the replication plugin works a bit differently when the user's
primary mail_location is in this storage: It first checks if the user is
fully replicated, and if so uses the primary storage, otherwise it uses
the replica storage. Long running IMAP protocesses check the replication
state periodically and kill themselves once the user is replicated, to
move back to primary storage.

Once replicator notices that all users have been replicated, it tells
the backends' to change the "failover" state to "online" (via doveadm
server).

High-availability non-NFS setup
-------------------------------

One possibility is to use Dovecot proxies, which know which servers are
down. Instead of directing users to those servers, it would direct them
to replica servers. The server states could be handled similar to NFS
setup's online vs. failover vs. offline states.

Another possibility would be to do the same as above, except without
separate proxy servers. Just make "mail.example.com" DNS point to two IP
addresses, and if one Dovecot notices that it's not the user's primary
server, it proxies to the secondary server, unless it's down. If one IP
is down, clients hopefully connect to the other.