[Dovecot] Replication plans

Mon May 21 03:58:29 EEST 2007

Timo Sirainen writes:

> Master keeps all the changes in memory until slave has replied that it
> has committed the changes. If the memory buffer gets too large (1MB?)

Does this mean that in case of a crash all that would be lost?
I think the cache should be smaller.  

> because the slave is handling the input too slowly or because it's
> completely dead, the master starts writing the buffer to a file. Once
> the slave is again responding the changes are read from the file and
> finally the file gets deleted.

Good.

> If the file gets too large (10MB?) it's deleted and slave will require a
> resync.

I don't agree.
A large mailstore with gigabytes of mail would benefit from having those
10MB synced instead of restarting from scratch.

> Master always keeps track of "user/mailbox -> last transaction
> sequence" in memory. When the slave comes back up and tells the master
> its last committed sequence, this allows the master to resync only those
> mailboxes that had changed.

I think a user configurable option to decide how large the sync files can 
grow to would be most flexible.
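
To make the idea concrete, here is a rough sketch in Python (only for
illustration, not a proposal for the actual code). The names ChangeBuffer,
max_memory_bytes and max_file_bytes are made up by me and are not existing
Dovecot settings. It keeps changes in memory up to one limit, spills to a
file after that, and only gives up and flags a resync once the
admin-configured file limit is exceeded.

```python
import os
import tempfile


class ChangeBuffer:
    """Holds changes the slave has not acknowledged yet (hypothetical sketch)."""

    def __init__(self, max_memory_bytes=1024 * 1024,
                 max_file_bytes=10 * 1024 * 1024):
        self.max_memory_bytes = max_memory_bytes
        self.max_file_bytes = max_file_bytes  # the user-configurable limit
        self.memory = []         # pending changes kept in RAM
        self.memory_size = 0
        self.spill_path = None   # set once we start writing to disk
        self.spill_size = 0
        self.needs_resync = False

    def add(self, change: bytes):
        if self.needs_resync:
            return  # already gave up; the slave will have to resync
        fits_in_memory = self.memory_size + len(change) <= self.max_memory_bytes
        if self.spill_path is None and fits_in_memory:
            self.memory.append(change)
            self.memory_size += len(change)
            return
        # Once spilling has started, everything goes to the file to keep order.
        self._spill(change)

    def _spill(self, change: bytes):
        if self.spill_path is None:
            fd, self.spill_path = tempfile.mkstemp(prefix="replication-")
            os.close(fd)
        if self.spill_size + len(change) > self.max_file_bytes:
            # Over the configured limit: drop everything and require a resync.
            os.remove(self.spill_path)
            self.spill_path = None
            self.spill_size = 0
            self.memory.clear()
            self.memory_size = 0
            self.needs_resync = True
            return
        with open(self.spill_path, "ab") as f:
            f.write(change)
        self.spill_size += len(change)
```

A site with a huge mailstore could then set max_file_bytes to something
much larger than 10MB and avoid the full resync in most outages.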

> the whole slave. Another way would be to just mark that one user or
> mailbox as dirty and try to resync it once in a while.

That sounds better.
A full resync can be very time consuming with a large and busy mailstore.
Not only the full amount of data needs to be synced, but new changes too. 

> queues. The communication protocol would be binary

Because? Performance? Wouldn't that make debugging more difficult?

> dovecot-replication process would need read/write access to all users'
> mailboxes. So either it would run as root or it would need to have at
> least group-permission rights to all mailboxes. A bit more complex
> solution would be to use multiple processes each running with their own
> UIDs, but I think I won't implement this yet.

For now pick the easiest approach to get this first version out.
This will allow testers to have something to stress test. Better to have 
some basics out.. get feedback.. than to try to go after a more complex 
approach; unless you believe the complex approach is the ultimate long term 
best method.

> But it should be possible to split users into multiple slaves (still one
> slave/user). The most configurable way to do this would be to have
> userdb return the slave host.

Why not just have 1 slave process per slave machine?

> This is the most important thing to get right, and also the most complex
> one. Besides replicating mails that are being saved via Dovecot, I think
> also externally saved mails should be replicated when they're first
> seen. This is somewhat related to doing an initial sync to a slave.

Why not go with a pure log replication scheme?
This way you basically have 3 processes.

1- The normal, currently existing programs, changed to write a log.
2- A master replication process which listens for clients requesting
changes.
3- The slave processes that request information and write it to the slave
machines.

With this approach you can basically break it down into logical units of 
code which can be tested and debugged. Also helps when you need to worry 
about security and the level at which each component needs to work.
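
The three pieces above could be sketched roughly like this in Python. All
class and method names here are my own invention for illustration, and
everything lives in one process in memory, which a real implementation
obviously would not do.

```python
class ReplicationLog:
    """Role 1: the existing programs append every change to a log."""

    def __init__(self):
        self.entries = []  # list of (sequence, change), oldest first

    def append(self, change):
        seq = len(self.entries) + 1
        self.entries.append((seq, change))
        return seq


class Master:
    """Role 2: answers slave requests by reading the log."""

    def __init__(self, log):
        self.log = log

    def changes_after(self, last_seq):
        return [entry for entry in self.log.entries if entry[0] > last_seq]


class Slave:
    """Role 3: asks the master for new changes and applies them locally."""

    def __init__(self, master):
        self.master = master
        self.last_seq = 0   # last sequence this slave has committed
        self.applied = []

    def poll(self):
        for seq, change in self.master.changes_after(self.last_seq):
            self.applied.append(change)
            self.last_seq = seq


log = ReplicationLog()
master = Master(log)
slave = Slave(master)

log.append("save INBOX uid=1")
log.append("save INBOX uid=2")
slave.poll()          # slave catches up to sequence 2
log.append("expunge INBOX uid=1")
slave.poll()          # only the new entry is transferred
```

Each class can be tested on its own, and only the slave side needs write
access to the slave's mailboxes.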

> The biggest problem with saving is how to robustly handle master
> crashes. If you're just pushing changes from master to slave and the
> master dies, it's entirely possible that some of the new messages that
> were already saved in master didn't get through to slave.

With my suggested method that should, in theory, never happen.
A message doesn't get accepted unless the log gets written (if replication 
is on).

If the master dies, when it gets restarted it should be able to continue.   
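
A minimal sketch of what I mean by "log first, then accept", again in
Python and again with made-up names (DurableLog, accept_message). It
assumes one text record per line and that a change never contains a
newline. The record is forced to disk before the message is accepted, and
on restart the log is scanned to find where to continue.

```python
import os


class DurableLog:
    """Append-only log file that survives a master restart (hypothetical)."""

    def __init__(self, path):
        self.path = path
        self.last_seq = 0
        if os.path.exists(path):
            with open(path, "rb+") as f:
                data = f.read()
                # End of the last complete record; a crash in the middle of
                # a write can leave a partial final line, which we drop.
                good = data.rfind(b"\n") + 1
                f.truncate(good)
                lines = data[:good].splitlines()
                if lines:
                    self.last_seq = int(lines[-1].split(b" ", 1)[0])

    def append(self, change: str) -> int:
        seq = self.last_seq + 1
        with open(self.path, "ab") as f:
            f.write(f"{seq} {change}\n".encode("utf-8"))
            f.flush()
            os.fsync(f.fileno())  # make sure the record is on disk
        self.last_seq = seq
        return seq


def accept_message(log: DurableLog, change: str) -> int:
    """Only report success to the client after the log write has succeeded."""
    return log.append(change)
```

If the master crashes between the log write and the reply to the client,
the worst case is a message that is logged but not acknowledged, which the
slave can still pick up after the restart.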

>   - If save/copy is aborted, tell the slave to decrease the UID counter
> by the number of aborted messages.

Are you planning to have a single slave? Or did you plan to allow multiple
slaves? If you allow multiple slaves, you will need to keep track of which
point in the log each slave has reached. An easier approach is to have a
time-based setting for how long the master keeps its logs.
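
Something along these lines, as a Python sketch. TimedLog and
max_age_seconds are hypothetical names of mine. The master does not track
any slave. It just throws away entries older than the configured age, and
a slave that asks for something already thrown away is told to resync.

```python
import time


class TimedLog:
    """Log with time-based retention instead of per-slave bookkeeping."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self.entries = []   # (sequence, timestamp, change), oldest first
        self.next_seq = 1

    def append(self, change, now=None):
        now = time.time() if now is None else now
        self.entries.append((self.next_seq, now, change))
        self.next_seq += 1

    def prune(self, now=None):
        """Drop entries older than max_age_seconds."""
        now = time.time() if now is None else now
        cutoff = now - self.max_age_seconds
        self.entries = [e for e in self.entries if e[1] >= cutoff]

    def changes_after(self, last_seq):
        """Return changes newer than last_seq, or None if a resync is needed."""
        oldest_kept = self.entries[0][0] if self.entries else self.next_seq
        if last_seq + 1 < oldest_kept:
            return None  # the slave fell behind the retention window
        return [(seq, change) for seq, _, change in self.entries
                if seq > last_seq]
```

The now parameter is only there so the behaviour can be tested without
waiting for real time to pass.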

> Solution here would again be that before EXPUNGE notifications are sent
> to client we'll wait for reply from slave that it had also processed the
> expunge.

From all your descriptions it sounds as if you are trying to do synchronous
replication. What I suggested is basically to use asynchronous replication.
I think synchronous replication is not only much more difficult to
implement, but also much more difficult to debug and keep working as the
code changes.

> Master/multi-slave
> ------------------
> 
> Once the master/slave is working, support for multiple slaves could be
> added.

With the log shipping method I suggested, multi-slave support should not
take much more coding.

In theory you could put more of the burden on the slaves: each slave asks
for everything from its last received transaction ID onward, so the master
does not need to know anything extra to handle multiple slaves.

> After master/multi-slave is working, we're nearly ready for a full
> multi-master operation

I think it will be clearer to see what needs to be done after you have 
master-slave working. I have never tried to implement a replication system, 
but I think that the only way to have a reliable multi-master system is to 
have synchronous replication across ALL nodes.

This increases communication and locking significantly. The locking alone 
will likely be a choke point.