[Dovecot] Replication plans

Timo Sirainen tss at iki.fi
Tue May 22 01:02:01 EEST 2007


On Sun, 2007-05-20 at 20:58 -0400, Francisco Reyes wrote:
> Timo Sirainen writes:
> 
> > Master keeps all the changes in memory until slave has replied that it
> > has committed the changes. If the memory buffer gets too large (1MB?)
> 
> Does this mean that in case of a crash all that would be lost?
> I think the cache should be smaller.  

Well, there are two possibilities:

a) Accept that the replication process can lose changes, and require a
full resync when it gets back up. This of course means that all users'
mailboxes need to be scanned, which can be slow.

b) Write everything immediately to disk (and possibly fsync()) and
require the actual writer process to wait until the replicator has done
this before replying to the client that the operation succeeded.
Probably not worth it for flag changes, but for other changes it could
be a good idea.
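
For b) I'm thinking of something along these lines. This is only a
rough sketch with made-up names and no error details, not actual
Dovecot code:

/* Sketch of option b): the change record is written and fsync()ed to
   the replication log before the client is told that the operation
   succeeded.  Illustrative only. */
#include <stddef.h>
#include <unistd.h>

static int replication_write_durable(int log_fd, const void *record,
                                     size_t size)
{
    const char *p = record;

    while (size > 0) {
        ssize_t ret = write(log_fd, p, size);

        if (ret < 0)
            return -1; /* (EINTR handling omitted for brevity) */
        p += ret;
        size -= (size_t)ret;
    }
    /* only after this returns may we reply "OK" to the client */
    return fsync(log_fd);
}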

> > If the file gets too large (10MB?) it's deleted and slave will require a
> > resync.
> 
> Don't agree.
> A large mailstore with Gigabytes worth of mail would benefit from having 
> 10MB synced... instead of re-starting from scratch.

By a resync I mean that dovecot.index and any newer changes from
dovecot.index.log need to be sent to the slave, which can then figure
out what messages it's missing and request them from the master (or
something similar). So I didn't mean that all existing messages would
be sent.

But again this would mean that the above is done for all mailboxes.
There could of course be some ways to make this faster, such as having
a global modification counter stored for each mailbox and resyncing
only those mailboxes where the counter is higher than the last value
that the slave saw. I guess the modification counter could be a simple
mtime timestamp of the dovecot.index.log file :)
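
Roughly like this (a sketch only; last_synced would be whatever the
slave reports when it reconnects):

/* Sketch of using dovecot.index.log's mtime as the modification
   counter: resync only mailboxes whose log changed after the slave's
   last sync.  Illustrative only, not actual Dovecot code. */
#include <sys/stat.h>
#include <time.h>

static int mailbox_needs_resync(const char *index_log_path,
                                time_t last_synced)
{
    struct stat st;

    if (stat(index_log_path, &st) < 0)
        return 1; /* can't tell, so resync to be safe */
    return st.st_mtime > last_synced;
}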

> >Master always keeps track of "user/mailbox -> last transaction
> > sequence" in memory. When the slave comes back up and tells the master
> > its last committed sequence, this allows the master to resync only those
> > mailboxes that had changed.
> 
> I think a user configurable option to decide how large the sync files can 
> grow to would be most flexible.

Sure. The 1MB/10MB values were just guesses as to what the defaults
might be.

> > queues. The communication protocol would be binary
> 
> Because? Performance? Wouldn't that make debugging more difficult?

Because then flag changes, expunges and, at least partially, appends
(and maybe something else) could be handled using the exact same format
as the dovecot.index.log file uses. The slave could simply append such
transactions to dovecot.index.log without even trying to parse their
contents.
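
The slave side could then be almost trivially simple, something like
the sketch below. The 4-byte length framing is made up for this
example; the payload would be whatever format dovecot.index.log uses:

/* Sketch of the slave relaying a binary transaction straight into
   dovecot.index.log without parsing it.  Not a real Dovecot protocol. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int relay_transaction(FILE *master_conn, FILE *index_log)
{
    uint32_t size;
    char *buf;

    if (fread(&size, sizeof(size), 1, master_conn) != 1)
        return -1;
    buf = malloc(size);
    if (buf == NULL || fread(buf, 1, size, master_conn) != size) {
        free(buf);
        return -1;
    }
    /* the record is already in log format, so just append it */
    if (fwrite(buf, 1, size, index_log) != size) {
        free(buf);
        return -1;
    }
    free(buf);
    return 0;
}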

> > But it should be possible to split users into multiple slaves (still one
> > slave/user). The most configurable way to do this would be to have
> > userdb return the slave host.
> 
> Why not just have 1 slave process per slave machine?

You mean why there shouldn't be just one master/slave pair, or why one
slave process couldn't handle data from multiple masters?

For the former, because people may want to use the same computer as
the master for some users and as a slave for others. If there were a
single master/slave pair and the master failed, some poor server's load
would double. By splitting the users across multiple slaves, the
servers that take over now share the load.

For the latter, I'm not sure. It could be possible if the
configuration is the same for all the masters it's serving as a slave
for. Then it could even be better that way. I think both ways could be
supported just as easily (run everything under one dovecot vs. run
multiple dovecots on different IPs/ports).
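
To illustrate what I mean by userdb returning the slave host:
conceptually it's just one more per-user field that the lookup hands
back. The field name and the hardcoded table below are hypothetical,
not a real userdb backend:

/* Toy illustration of "userdb returns the slave host". */
#include <stddef.h>
#include <string.h>

struct user_entry {
    const char *user;
    const char *home;
    const char *replica_host; /* hypothetical userdb extra field */
};

static const struct user_entry users[] = {
    { "alice", "/home/alice", "slave1.example.com" },
    { "bob",   "/home/bob",   "slave2.example.com" },
};

static const char *lookup_replica_host(const char *user)
{
    for (size_t i = 0; i < sizeof(users) / sizeof(users[0]); i++) {
        if (strcmp(users[i].user, user) == 0)
            return users[i].replica_host;
    }
    return NULL; /* no replication configured for this user */
}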

> > This is the most important thing to get right, and also the most complex
> > one. Besides replicating mails that are being saved via Dovecot, I think
> > also externally saved mails should be replicated when they're first
> > seen. This is somewhat related to doing an initial sync to a slave.
> 
> Why not go with a pure log replication scheme?
> This way you basically have 3 processes.
> 
> 1- The normal, currently existing programs. Add logs to the process
> 2- A Master replication process which listens for clients requesting for 
> info.
> 3- The slave processes that request information and write it to the slave 
> machines.
> 
> With this approach you can basically break it down into logical units of 
> code which can be tested and debugged. Also helps when you need to worry 
> about security and the level at which each component needs to work.

I'm not completely sure what you mean by these. Is it basically the
same as what I said, except that imap/deliver would simply send the
changes without any waiting?

> > The biggest problem with saving is how to robustly handle master
> > crashes. If you're just pushing changes from master to slave and the
> > master dies, it's entirely possible that some of the new messages that
> > were already saved in master didn't get through to slave.
> 
> With my suggested method that, in theory, would never happen.
> A message doesn't get accepted unless the log gets written (if replication 
> is on).
> If the master dies, when it gets restarted it should be able to continue.   

But isn't the point of master/slave that the slave takes over if the
master dies? If you switch the slave to be the new master, it doesn't
matter whether the logs were written to the master's disk. Sure, the
message could come back when the master is brought back up (assuming it
didn't die completely), but until then your IMAP clients might see
messages getting lost or existing UIDs being reused for new mails,
which can cause all kinds of breakage.

> Are you planning to have a single slave? Or did you plan to allow multiple 
> slaves? If allowing multiple slaves you will need to keep track at which 
> point in the log each slave is. An easier approach is to have a setting 
> based on time for how long to allow the master to keep logs.

I don't understand what you mean. Sure, the logs could time out at
some point (or shouldn't there be some size limits anyway?), but you'd
still need to keep track of what the different slaves have seen.
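
By keeping track I basically mean a per-slave "last committed
sequence" table: log entries older than the smallest of those values
can be dropped. A sketch only, with made-up names:

/* Sketch of per-slave bookkeeping in a multi-slave setup. */
#include <stdint.h>

struct slave_state {
    uint64_t last_committed_seq;
};

static uint64_t log_discard_point(const struct slave_state *slaves,
                                  unsigned int count)
{
    uint64_t min_seq = UINT64_MAX;

    for (unsigned int i = 0; i < count; i++) {
        /* a disconnected slave would also hold entries back here,
           until some timeout expires and it's dropped from the table */
        if (slaves[i].last_committed_seq < min_seq)
            min_seq = slaves[i].last_committed_seq;
    }
    /* everything up to and including min_seq is safe to drop */
    return min_seq == UINT64_MAX ? 0 : min_seq;
}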

> > Solution here would again be that before EXPUNGE notifications are sent
> > to client we'll wait for reply from slave that it had also processed the
> > expunge.
> 
> From all your descriptions it sounds as if you are trying to do synchronous 
> replication. What I suggested is basically to use asynchronous replication.
> I think synchronous replication is not only much more difficult, but also 
> much more difficult to debug and maintain in working order over changes.

Right. And I think the benefits of doing it synchronously outweigh the
extra difficulties. As I mentioned at the beginning of the mail:

Since the whole point of master-slave replication would be to get a
reliable service, I'll want to make the replication as reliable as
possible. It would be easier to implement much simpler master-slave
replication, but in error conditions that would almost guarantee that
some messages get lost. I want to avoid that.

By "much simpler replication" I mean asynchronous replication. Perhaps
asynchronous could be an option also if it seems that synchronous
replication adds too much latency (especially if your replication is
simply for an offsite backup), but I'd want synchronous to be the
recommended method.
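
For the expunge case mentioned earlier, the synchronous flow would
look roughly like the sketch below. It's self-contained with stub
functions; none of these are real Dovecot functions:

/* Sketch of the synchronous expunge path: the slave must acknowledge
   the expunge before the IMAP client sees the untagged EXPUNGE reply. */
#include <stdint.h>
#include <stdio.h>

static int send_expunge_to_slave(uint32_t seq)
{
    printf("-> slave: expunge seq %u\n", (unsigned int)seq);
    return 0;
}

static int wait_for_slave_ack(void)
{
    /* in reality: block on the replication socket until the slave
       confirms it has committed the expunge */
    return 0;
}

static void client_send(const char *line)
{
    printf("-> client: %s\n", line);
}

static int expunge_synced(uint32_t seq)
{
    char line[64];

    if (send_expunge_to_slave(seq) < 0 || wait_for_slave_ack() < 0)
        return -1; /* the client is told nothing until the slave has it */

    snprintf(line, sizeof(line), "* %u EXPUNGE", (unsigned int)seq);
    client_send(line);
    return 0;
}

int main(void)
{
    return expunge_synced(42) < 0 ? 1 : 0;
}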

> > After master/multi-slave is working, we're nearly ready for a full
> > multi-master operation
> 
> I think it will be clearer to see what needs to be done after you have 
> master-slave working. 

Sure. I wasn't planning on implementing multi-slave or multi-master
before master/slave is fully working and stress testing shows that no
mails get lost even if the master is killed every few seconds (with
each crash causing master and slave to switch roles randomly).

> I have never tried to implement a replication system, 
> but I think that the only way to have a reliable multi-master system is to 
> have synchronous replication across ALL nodes.

That would make it the safest, but I think it's enough to have just
one synchronous slave update. Once the slaves have figured out among
themselves who becomes the next master, that server would gather all
the updates it doesn't yet know about from the other slaves.

(Assuming the multi-master was implemented so that the holder of a
mailbox's global lock is the master for that mailbox and the others
are slaves.)

> This increases communication and locking significantly. The locking alone 
> will likely be a choke point. 

My plan would require the locking only when the mailbox is being
updated and the global lock isn't already owned by the server. If you
want to keep different servers from constantly stealing the lock from
each other, there are various ways to make sure that a mailbox normally
isn't modified from more than one server.

I don't think this will be a big problem even if multiple servers are
modifying the same mailbox, but it depends entirely on the extra latency
caused by the global locking. I don't know what the latency will be
until it can be tested, but I don't think it should be much more than
what a simple ping would give over the same network.
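
The locking would look roughly like this. A sketch only, with made-up
names; request_lock_from_holder() stands in for the network round-trip
(so roughly one ping of added latency when the lock has to move):

/* Sketch of the lazy global mailbox lock: take it only for updates,
   and only if this server doesn't already hold it. */
#include <stdio.h>
#include <string.h>

struct global_lock {
    char owner[64]; /* which server currently holds the mailbox lock */
};

static int request_lock_from_holder(struct global_lock *lock,
                                    const char *me)
{
    /* one network round-trip to the current holder */
    printf("asking %s to hand over the lock\n", lock->owner);
    snprintf(lock->owner, sizeof(lock->owner), "%s", me);
    return 0;
}

static int lock_for_update(struct global_lock *lock, const char *me)
{
    if (strcmp(lock->owner, me) == 0)
        return 0; /* already ours: no network traffic, no extra latency */
    return request_lock_from_holder(lock, me);
}

int main(void)
{
    struct global_lock lock = { .owner = "server2" };

    lock_for_update(&lock, "server1"); /* steals the lock once */
    lock_for_update(&lock, "server1"); /* now a purely local no-op */
    return 0;
}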