[Dovecot] Clustering (replication and proxying) plans for the future

Tue Oct 24 19:31:22 UTC 2006

I wrote this mostly so that I won't forget my thoughts. Most likely
won't be implemented anytime soon.

I haven't thought through all the locking issues. It would require a
distributed lock manager in any case. I think there are already several
existing ones, so one of them could probably be used for Dovecot.

Each mail server would consult a master server for getting namespace
configuration and location for shared/public mailboxes. If the data is
stored in some central SQL server there could just as well be multiple
master servers (or even each server be their own master).

Whenever trying to open a shared mailbox that doesn't exist in the
current server, the imap process would automatically create a new
connection for the destination server and start proxying the data. The
proxy would be a somewhat dummy proxy, which just forwards the client's
commands to it and doesn't even try to understand the replies from the
remote server. So admin will have to be careful to use only servers
which have all the capabilities that were announced by the original
server.

So, then the replication. Each namespace could be configured to be
available either by proxying or by replication:

namespace public {
  prefix = example/
  location = proxy:imap.example.com
}
namespace public {
  prefix = example/
  replicated_to = host1.example.com host2.example.com
  # If multi-master, we'll need to use global locking. If not, the
  # replicated_to servers contain only slaves.
  multi_master = yes
  location = maildir:~/Maildir
}

The way this would work is that each imap process connects via UNIX
socket to a replication process, which gets input from all the imap
processes. The replication process keeps track of how much has been sent
to each other server and also the acks it has received from them. So if
the network dies temporarily it'll start writing the changes to the disk
until it reaches some limit, after which a complete resync will be
necessary.

Dovecot already writes all the changes to a mailbox first to a
transaction log file and and only afterwards it updates the actual
mailbox. The contents of the new mails aren't stored there though.

The replication could work simply by sending the transaction logs'
contents to the replicatio process which passes it onto other servers,
which finally sync their local mailboxes based on that data. Since
Dovecot already is able to sync mailboxes based on the transaction log's
contents this should be pretty easy to implement.

Of course the new mails' contents also have to be sent. This could be
prioritized lower than the transaction traffic, so that each server
always has very up-to-date view of the mailbox metadata, but not
necessarily the contents of all the mails.

If the server finds itself in a situation that it doesn't have some
specific mail, it'll send a request to the replication process to fetch
it ASAP from another server. The reply will then take the highest
priority in the queue.

Normally when opening or reading the mailbox there's really no need to
do any global locking. The server will show the mailbox's state to be
what it has last received from the replication process.

Whenever modifying the mailbox a global lock is required. Here's also
the problem that if the network is down between two servers, they both
think that they got the global lock. If they did any conflicting
changes, they'll have to be resolved when both of them again see each
others.

Dovecot's transactions in general aren't order-dependent so most of the
conflicts can be resolved by just having the replication master decide
which way the transactions were done and then sending the transactions
back to all of the servers in that order.

Adding new messages is a problem however. UIDs must be always growing
and they must be unique. I think Cyrus solved this by having 64bit
global UIDs which really uniquely identify the messages (ie. first
32bits contain server-specific UID), and then the 32bit IMAP UIDs. If
there are any conflicts with the IMAP UIDs, they're given new UIDs
(which invalidates clients' cache for those messages, but that can't be
helped).

Resyncing works basically by requesting a list of all the mailboxes,
opening each one of them and requesting new transactions from the log
files with <UID validity>, <log file sequence>, <log file offset>
triplet. If UID validity has changed or log file no longer exists, a
full resync needs to be done. Also if the log file contains a lot of
changes, it's probably faster to do a full resync. Full resync would
then be just sending the whole dovecot.index file.

Conflict resolving could also require a full resync if transaction logs
were lost. That'd work again by the master receiving everyone's
dovecot.index files, deciding what to do with them and sending the
modified version back to everyone.

Each server has different transaction logs, so the log file sequences
can't be directly used between servers. So the replication process
probably will have to keep track of every server's log seq/offset. Also
when receiving full dovecot.index files they'll have to be updated in
some way.

Then there's the problem of how to implement the receiving side. If the
user already has a process connected to the replication process it could
be instructed to update the mailboxes. But since this doesn't happen all
that often, there also needs to be some replication-writer process which
writes all the received transactions to the mailboxes.

Now, when exactly should these replication-writer processes be created
then? If all users have the same UNIX user ID, then there could be a
spool of processes always running and doing the writing. Although this
worries me a bit since the same process is handling different users'
data.

More secure way, and the only way if users have different UNIX UIDs, is
to launch a separate writer process for each user. It probably isn't a
great idea to do that immediately when receiving data for a user, but to
queue data for the user for a while and then once enough has been
received or some timeout has occurred, the writer process would be
executed.

There probably should be some shared in-memory cache with a memory limit
of some megabytes, after which data is written to disk up to a
configurable amount. Only after that is full or timeout for the user is
reached, the writer process is executed and the data is sent to it.

All this replication stuff could also be used to implement a remote mail
storage (ie. smarter proxying). The received indexes/transactions could
simply be kept in memory, and all the mails that aren't already cached
in memory could just be requested from the master whenever needed. I'm
not sure if there's much point in doing this over dummy proxying though.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : http://dovecot.org/pipermail/dovecot/attachments/20061024/c0a0b38e/attachment-0001.pgp