I wrote this mostly so that I won't forget my thoughts. Most likely it won't be implemented anytime soon.
I haven't thought through all the locking issues. It would require a distributed lock manager in any case. I think there are already several existing ones, so one of them could probably be used for Dovecot.
Each mail server would consult a master server to get the namespace configuration and the location of shared/public mailboxes. If the data is stored in some central SQL server, there could just as well be multiple master servers (or each server could even be its own master).
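A minimal sketch of what that lookup could look like from the imap process's point of view. The structure and the shared_mailbox_lookup() name are made up for illustration; this isn't an existing Dovecot API:

/* Ask the master (or a central SQL database) where a shared/public
   mailbox lives: open it locally, or start proxying to another host. */
#include <stdbool.h>

struct shared_mailbox_location {
    bool is_local;          /* stored on this server */
    const char *host;       /* destination server when not local */
    const char *location;   /* e.g. "maildir:/var/public/example" when local */
};

/* Implemented against the master server, or directly against a central
   SQL database, in which case every server can act as its own master. */
int shared_mailbox_lookup(const char *username, const char *mailbox,
                          struct shared_mailbox_location *loc_r);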
Whenever trying to open a shared mailbox that doesn't exist on the current server, the imap process would automatically create a new connection to the destination server and start proxying the data. The proxy would be a fairly dumb one, which just forwards the client's commands to the remote server and doesn't even try to understand the replies. So the admin will have to be careful to use only servers that have all the capabilities announced by the original server.
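A minimal sketch of the forwarding loop such a dumb proxy could use. This is hypothetical code, not Dovecot's actual proxy; it just shovels bytes in both directions and does only minimal error handling:

/* Copy bytes between the client connection and the remote imap server
   without parsing anything. */
#include <sys/types.h>
#include <poll.h>
#include <unistd.h>

static int copy_data(int from_fd, int to_fd)
{
    char buf[4096];
    ssize_t ret = read(from_fd, buf, sizeof(buf));

    if (ret <= 0)
        return -1; /* disconnected or error */
    return write(to_fd, buf, ret) == ret ? 0 : -1;
}

void proxy_run(int client_fd, int remote_fd)
{
    struct pollfd fds[2] = {
        { .fd = client_fd, .events = POLLIN },
        { .fd = remote_fd, .events = POLLIN },
    };

    for (;;) {
        if (poll(fds, 2, -1) < 0)
            break;
        if ((fds[0].revents & (POLLIN | POLLHUP)) != 0 &&
            copy_data(client_fd, remote_fd) < 0)
            break;
        if ((fds[1].revents & (POLLIN | POLLHUP)) != 0 &&
            copy_data(remote_fd, client_fd) < 0)
            break;
    }
}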
So, then the replication. Each namespace could be configured to be available either by proxying or by replication:
namespace public {
  prefix = example/
  location = proxy:imap.example.com
}
namespace public {
  prefix = example/
  replicated_to = host1.example.com host2.example.com
  # If multi-master, we'll need to use global locking. If not, the
  # replicated_to servers contain only slaves.
  multi_master = yes
  location = maildir:~/Maildir
}
The way this would work is that each imap process connects via a UNIX socket to a replication process, which gets input from all the imap processes. The replication process keeps track of how much has been sent to each of the other servers and also the acks it has received from them. So if the network dies temporarily, it'll start writing the changes to disk, until it reaches some limit after which a complete resync will be necessary.
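A sketch of the per-server bookkeeping this replication process would need: how far the change stream has been sent, how far the peer has acknowledged, and how much has been spooled to disk while the peer was unreachable. The structure names and the buffer limit are assumptions, not existing Dovecot code:

#include <stdbool.h>
#include <stdint.h>

#define REPL_DISK_BUFFER_MAX (64ULL * 1024 * 1024) /* example limit */

struct repl_peer {
    const char *hostname;
    uint64_t sent_offset;    /* last byte of the change stream sent */
    uint64_t acked_offset;   /* last byte the peer has acknowledged */
    uint64_t disk_buffered;  /* bytes spooled to disk while peer is down */
    bool need_full_resync;   /* buffer limit exceeded, resync required */
};

/* Called when new changes arrive but the peer's connection is down. */
void repl_peer_buffer(struct repl_peer *peer, uint64_t change_size)
{
    peer->disk_buffered += change_size;
    if (peer->disk_buffered > REPL_DISK_BUFFER_MAX) {
        /* Too far behind: drop the buffer and do a full resync
           when the peer comes back. */
        peer->disk_buffered = 0;
        peer->need_full_resync = true;
    }
}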
Dovecot already writes all the changes to a mailbox first to a transaction log file and only afterwards updates the actual mailbox. The contents of new mails aren't stored there though.
The replication could work simply by sending the transaction logs' contents to the replication process, which passes them on to the other servers, which finally sync their local mailboxes based on that data. Since Dovecot is already able to sync mailboxes based on the transaction log's contents, this should be pretty easy to implement.
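One possible shape for the record an imap process could send over the UNIX socket ("here's a piece of this mailbox's transaction log"). The field layout is purely illustrative:

#include <stdint.h>

struct repl_log_record {
    char username[64];
    char mailbox_guid[32];   /* identifies the mailbox */
    uint32_t uid_validity;
    uint32_t log_file_seq;   /* sender's local transaction log sequence */
    uint64_t log_file_offset;
    uint32_t data_size;      /* followed by data_size bytes of log data */
};

The receiving server would feed the log data into its existing sync code, the same way it already syncs a mailbox from its local transaction log.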
Of course the new mails' contents also have to be sent. This could be given a lower priority than the transaction traffic, so that each server always has a very up-to-date view of the mailbox metadata, but not necessarily the contents of all the mails.
If a server finds itself in a situation where it doesn't have some specific mail, it'll send a request to the replication process to fetch it ASAP from another server. The reply will then take the highest priority in the queue.
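A small sketch of how those outgoing queue priorities could be arranged: replies to "fetch this mail ASAP" requests first, then transaction log traffic, then bulk mail contents. Hypothetical code, not Dovecot's:

#include <stddef.h>

enum repl_priority {
    REPL_PRI_FETCH_REPLY = 0,  /* mail another server is waiting for */
    REPL_PRI_TRANSACTIONS,     /* mailbox metadata changes */
    REPL_PRI_MAIL_CONTENTS,    /* bulk contents of new mails */
    REPL_PRI_COUNT
};

struct repl_queue_item {
    struct repl_queue_item *next;
    const void *data;
    size_t size;
};

struct repl_queue {
    /* one FIFO per priority; always drain the lowest index first */
    struct repl_queue_item *head[REPL_PRI_COUNT];
    struct repl_queue_item *tail[REPL_PRI_COUNT];
};

struct repl_queue_item *repl_queue_next(struct repl_queue *q)
{
    for (int i = 0; i < REPL_PRI_COUNT; i++) {
        if (q->head[i] != NULL) {
            struct repl_queue_item *item = q->head[i];
            q->head[i] = item->next;
            if (q->head[i] == NULL)
                q->tail[i] = NULL;
            return item;
        }
    }
    return NULL;
}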
Normally when opening or reading a mailbox there's really no need to do any global locking. The server will show the mailbox's state as whatever it has last received from the replication process.
Whenever modifying the mailbox, a global lock is required. Here there's also the problem that if the network is down between two servers, they may both think they hold the global lock. If they made any conflicting changes, those will have to be resolved when the servers see each other again.
Dovecot's transactions in general aren't order-dependent, so most conflicts can be resolved by having the replication master decide in which order the transactions were applied and then sending them back to all of the servers in that order.
Adding new messages is a problem however. UIDs must always be growing and they must be unique. I think Cyrus solved this by having 64-bit global UIDs which really uniquely identify the messages (i.e. the first 32 bits contain a server-specific UID), plus the 32-bit IMAP UIDs. If there are any conflicts with the IMAP UIDs, the messages are given new UIDs (which invalidates clients' cache for those messages, but that can't be helped).
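A sketch of how that scheme could look: a 64-bit globally unique ID (server ID in the high 32 bits, that server's own counter in the low 32 bits) plus the 32-bit IMAP UID, which gets reassigned if two partitioned servers handed out the same one. Illustrative only, not how Cyrus or Dovecot actually implements it:

#include <stdint.h>
#include <stdbool.h>

struct repl_msg {
    uint64_t global_uid;  /* (server_id << 32) | server_uid */
    uint32_t imap_uid;    /* what IMAP clients see; must keep growing */
};

static inline uint64_t global_uid_make(uint32_t server_id, uint32_t server_uid)
{
    return ((uint64_t)server_id << 32) | server_uid;
}

/* If the message's IMAP UID conflicts with one we already have, give it
   a new one. Clients' cached data for that message is invalidated, but
   that can't be helped. */
void resolve_imap_uid(struct repl_msg *msg, bool conflicts,
                      uint32_t *next_imap_uid)
{
    if (conflicts || msg->imap_uid < *next_imap_uid)
        msg->imap_uid = (*next_imap_uid)++;
    else
        *next_imap_uid = msg->imap_uid + 1;
}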
Resyncing works basically by requesting a list of all the mailboxes, opening each one of them and requesting new transactions from the log files with a <UID validity>, <log file sequence>, <log file offset> triplet. If the UID validity has changed or the log file no longer exists, a full resync needs to be done. Also if the log file contains a lot of changes, it's probably faster to do a full resync. A full resync would then be just sending the whole dovecot.index file.
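A sketch of the per-mailbox resync request and the decision whether to answer with an incremental log range or the whole dovecot.index. The structures and thresholds are assumptions for illustration:

#include <stdbool.h>
#include <stdint.h>

struct resync_request {
    uint32_t uid_validity;
    uint32_t log_file_seq;
    uint64_t log_file_offset;
};

enum resync_reply_type {
    RESYNC_INCREMENTAL,  /* transactions after the given offset */
    RESYNC_FULL          /* full dovecot.index follows */
};

enum resync_reply_type
resync_decide(const struct resync_request *req,
              uint32_t cur_uid_validity,
              uint32_t oldest_log_seq, uint64_t log_size_after_offset,
              uint64_t full_index_size)
{
    if (req->uid_validity != cur_uid_validity ||
        req->log_file_seq < oldest_log_seq)
        return RESYNC_FULL;
    /* If the log has more changes than the index itself, a full
       resync is probably faster. */
    if (log_size_after_offset > full_index_size)
        return RESYNC_FULL;
    return RESYNC_INCREMENTAL;
}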
Conflict resolution could also require a full resync if transaction logs were lost. That would again work by the master receiving everyone's dovecot.index files, deciding what to do with them and sending the modified versions back to everyone.
Each server has different transaction logs, so the log file sequences can't be used directly between servers. The replication process will probably have to keep track of every server's log seq/offset. Also, when full dovecot.index files are received, they'll have to be updated in some way.
Then there's the problem of how to implement the receiving side. If the user already has a process connected to the replication process, it could be instructed to update the mailboxes. But since this doesn't happen all that often, there also needs to be some replication-writer process that writes all the received transactions to the mailboxes.
Now, when exactly should these replication-writer processes be created? If all users have the same UNIX user ID, there could be a pool of processes always running and doing the writing. Although this worries me a bit, since the same process would be handling different users' data.
A more secure way, and the only way if users have different UNIX UIDs, is to launch a separate writer process for each user. It probably isn't a great idea to do that immediately when data is received for a user; instead the data should be queued for a while, and once enough has been received or some timeout has occurred, the writer process would be executed.
There should probably be a shared in-memory cache with a memory limit of some megabytes, after which the data is written to disk, up to a configurable amount. Only after that is full, or the timeout for the user is reached, is the writer process executed and the data sent to it.
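A minimal sketch of that flush decision, with made-up per-user limits; the actual thresholds would of course be configurable:

#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define USER_FLUSH_BYTES   (256 * 1024)  /* per-user byte threshold */
#define USER_FLUSH_TIMEOUT 60            /* seconds */

struct user_repl_buffer {
    const char *username;
    size_t buffered_bytes;  /* in the shared memory cache (or on disk) */
    time_t first_change;    /* when the oldest unflushed change arrived */
};

/* When this returns true, launch a writer process running as the user's
   own UNIX UID and feed it the buffered transactions. */
bool user_needs_flush(const struct user_repl_buffer *buf, time_t now)
{
    if (buf->buffered_bytes == 0)
        return false;
    if (buf->buffered_bytes >= USER_FLUSH_BYTES)
        return true;
    return now - buf->first_change >= USER_FLUSH_TIMEOUT;
}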
All this replication stuff could also be used to implement a remote mail storage (i.e. smarter proxying). The received indexes/transactions could simply be kept in memory, and any mails that aren't already cached in memory could just be requested from the master whenever needed. I'm not sure there's much point in doing this over dumb proxying though.