On Sun, 2007-05-20 at 20:58 -0400, Francisco Reyes wrote:
Timo Sirainen writes:
Master keeps all the changes in memory until the slave has replied that it has committed the changes. If the memory buffer gets too large (1MB?) it's written to a file.
Does this mean that in case of a crash all that would be lost? I think the cache should be smaller.
Well, there are two possibilities:
a) Accept that the replication process can lose changes, and require a full resync when it comes back up. This of course means that all users' mailboxes need to be scanned, which can be slow.
b) Write everything immediately to disk (and possibly fsync()) and require the actual writer process to wait until the replicator has done this before replying to the client that the operation succeeded. Probably not worth it for flag changes, but for other changes it could be a good idea.
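To make option (b) concrete, here's a minimal sketch of the write path, assuming a plain journal file on the master (the function name and structure are illustrative, not actual Dovecot code):

    #include <unistd.h>
    #include <stddef.h>

    /* Option (b): persist the change before acknowledging the client.
       Returns 0 on success, -1 on error. */
    int replication_log_write(int log_fd, const void *change, size_t size)
    {
        const char *p = change;

        while (size > 0) {
            ssize_t ret = write(log_fd, p, size);
            if (ret < 0)
                return -1;
            p += ret;
            size -= (size_t)ret;
        }
        /* fsync() so a master crash can't lose a change the client
           was already told succeeded. This is the expensive part -
           probably too expensive for flag changes. */
        return fsync(log_fd) < 0 ? -1 : 0;
    }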
If the file gets too large (10MB?) it's deleted and the slave will require a resync.
I don't agree. A large mailstore with gigabytes' worth of mail would benefit from having the 10MB synced instead of restarting from scratch.
By a resync I mean that dovecot.index and the newer changes from dovecot.index.log need to be sent to the slave, which can then figure out which messages it's missing and request them from the master (or something similar). So I didn't mean that all existing messages would be sent.
But again this would mean that the above is done for all mailboxes. There could of course be ways to make this faster, such as storing a global modification counter for each mailbox and resyncing only those mailboxes whose counter is higher than the last value the slave saw. I guess the modification counter could simply be the mtime timestamp of the dovecot.index.log file :)
The master always keeps track of a "user/mailbox -> last transaction sequence" mapping in memory. When the slave comes back up and tells the master its last committed sequence, the master can resync only those mailboxes that have changed since.
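Roughly, that bookkeeping could look like this (hypothetical types, not actual Dovecot code; in practice the states would live in a hash table keyed by user/mailbox):

    #include <stdint.h>
    #include <sys/stat.h>
    #include <time.h>

    /* Per-mailbox state kept in the master's memory. */
    struct mailbox_repl_state {
        const char *user;
        const char *mailbox;
        uint64_t last_tx_seq;   /* last transaction sequence produced */
    };

    /* The slave reports the last sequence it committed when it
       reconnects; only mailboxes with newer changes need a resync. */
    int mailbox_needs_resync(const struct mailbox_repl_state *state,
                             uint64_t slave_committed_seq)
    {
        return state->last_tx_seq > slave_committed_seq;
    }

    /* The "global modification counter" could simply be the mtime
       of the mailbox's dovecot.index.log, as mentioned above. */
    time_t mailbox_mod_counter(const char *index_log_path)
    {
        struct stat st;

        return stat(index_log_path, &st) == 0 ? st.st_mtime : (time_t)0;
    }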
I think a user-configurable option deciding how large the sync files can grow would be most flexible.
Sure. The 1MB/10MB were just guesses as to what good defaults might be.
[...] queues. The communication protocol would be binary
Because? Performance? Wouldn't that make debugging more difficult?
Because then flag changes, expunges and partial appends (and maybe something else) could be handled using the exact same format as the dovecot.index.log file uses. The slave could simply append such transactions to dovecot.index.log without even trying to parse their contents.
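For example, the framing could be as simple as this (the header layout here is hypothetical; the payload would be a real dovecot.index.log transaction, carried verbatim):

    #include <stdint.h>

    /* Hypothetical wire framing: the payload is a raw
       dovecot.index.log transaction that the slave appends as-is,
       without parsing its contents. */
    enum repl_tx_type {
        REPL_TX_FLAG_CHANGE = 1,
        REPL_TX_EXPUNGE     = 2,
        REPL_TX_APPEND      = 3
    };

    struct repl_tx_header {
        uint32_t type;   /* enum repl_tx_type */
        uint32_t size;   /* number of payload bytes that follow */
        uint64_t seq;    /* master's transaction sequence number */
    };
    /* ...followed by 'size' bytes of raw transaction data */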
But it should be possible to split users into multiple slaves (still one slave/user). The most configurable way to do this would be to have userdb return the slave host.
Why not just have 1 slave process per slave machine?
You mean why there shouldn't be just one master/slave pair, or why one slave process couldn't handle data from multiple masters?
For the former: because people are using the same computer as master for some users and slave for others. If there were a single master/slave pair and the master failed, some poor server's load would double. By splitting the users across multiple slaves, the remaining servers share the load.
For the latter, I'm not sure. It could be possible if the configuration is the same for all the masters it's serving as a slave for. Then it could even be better that way. I think both ways could be supported just as easily (run everything under one Dovecot vs. run multiple Dovecots on different IPs/ports).
This is the most important thing to get right, and also the most complex one. Besides replicating mails that are being saved via Dovecot, I think also externally saved mails should be replicated when they're first seen. This is somewhat related to doing an initial sync to a slave.
Why not go with a pure log replication scheme? This way you basically have 3 processes:
1- The normal, currently existing programs, with log writing added.
2- A master replication process which listens for clients requesting info.
3- The slave processes that request information and write it to the slave machines.
With this approach you can break it down into logical units of code which can be tested and debugged separately. It also helps when you need to worry about security and the privilege level at which each component needs to run.
I'm not completely sure what you mean by these. Basically the same as what I said, except just have imap/deliver simply send the changes without any waiting?
The biggest problem with saving is how to robustly handle master crashes. If you're just pushing changes from the master to the slave and the master dies, it's entirely possible that some of the new messages that were already saved on the master never got through to the slave.
With my suggested method that would, in theory, never happen. A message doesn't get accepted unless the log gets written (if replication is on). If the master dies, it should be able to continue from the log when it's restarted.
But isn't the point of master/slave that the slave takes over if the master dies? If you switch the slave to be the new master, it doesn't matter that the logs were written to the master's disk. Sure, the messages could come back when the master is brought back up (assuming it didn't die completely), but until then your IMAP clients might see messages getting lost or existing UIDs being reused for new mails, which can cause all kinds of breakage.
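To illustrate the window I mean, here's the asynchronous push in a simplified sketch (stub types and functions, purely for illustration):

    #include <stddef.h>

    /* Stubs, purely for illustration. */
    struct client { int fd; };
    static void mailbox_save(const void *msg, size_t size) { (void)msg; (void)size; }
    static void client_reply_ok(struct client *c) { (void)c; }
    static void replication_send(const void *msg, size_t size) { (void)msg; (void)size; }

    /* Asynchronous push: the client is told the save succeeded
       before the slave has the message. */
    static void save_message_async(struct client *client,
                                   const void *msg, size_t size)
    {
        mailbox_save(msg, size);     /* saved on the master */
        client_reply_ok(client);     /* client already sees success */
        /* If the master dies right here, the slave never receives
           the message, yet the client was told it was saved. */
        replication_send(msg, size);
    }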
Are you planning to have a single slave, or did you plan to allow multiple slaves? If you allow multiple slaves you will need to keep track of the point in the log each slave has reached. An easier approach is to have a time-based setting for how long the master keeps its logs.
I don't understand what you mean. Sure, the logs could time out at some point (or shouldn't there be size limits anyway?), but you'd still need to keep track of what the different slaves have seen.
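That tracking could be as simple as one committed offset per slave, with the log truncatable only up to the slowest slave (an illustrative sketch, not actual code):

    #include <stdint.h>

    /* One entry per slave: how far into the replication log that
       slave has confirmed. */
    struct slave_state {
        const char *host;
        uint64_t committed_offset;
    };

    /* The log can only be truncated up to the slowest slave's
       position - a timeout or size limit alone isn't enough. */
    uint64_t log_truncate_point(const struct slave_state *slaves,
                                unsigned int count)
    {
        uint64_t min = UINT64_MAX;
        unsigned int i;

        for (i = 0; i < count; i++) {
            if (slaves[i].committed_offset < min)
                min = slaves[i].committed_offset;
        }
        return count == 0 ? 0 : min;
    }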
The solution here would again be that before EXPUNGE notifications are sent to the client, we wait for a reply from the slave that it has also processed the expunge.
From all your descriptions it sounds as if you are trying to do synchronous replication. What I suggested is basically asynchronous replication. I think synchronous replication is not only much more difficult to implement, but also much more difficult to debug and keep in working order as the code changes.
Right. And I think the benefits of doing it synchronously outweigh the extra difficulty. As I mentioned at the beginning of the mail:
Since the whole point of master-slave replication would be to get a reliable service, I'll want to make the replication as reliable as possible. It would be easier to implement much simpler master-slave replication, but in error conditions that would almost guarantee that some messages get lost. I want to avoid that.
By "much simpler replication" I mean asynchronous replication. Perhaps asynchronous could be an option also if it seems that synchronous replication adds too much latency (especially if your replication is simply for an offsite backup), but I'd want synchronous to be the recommended method.
After master/multi-slave is working, we're nearly ready for full multi-master operation.
I think it will be clearer to see what needs to be done after you have master-slave working.
Sure. I wasn't planning on implementing multi-slave or multi-master before master/slave is fully working and stress testing shows that no mails get lost even if the master is killed every few seconds (with each crash causing master and slave to switch roles randomly).
I have never tried to implement a replication system, but I think the only way to have a reliable multi-master system is to have synchronous replication across ALL nodes.
That would make it the safest, but I think it's enough to have just one synchronous slave update. Once the slaves figure out among themselves who the next master is, the new master would gather all the updates it doesn't yet know about from the other slaves.
(Assuming multi-master was implemented so that the holder of the global mailbox lock is the master for that mailbox and the others are slaves.)
This increases communication and locking significantly. The locking alone will likely be a choke point.
My plan would require locking only when the mailbox is being updated and the global lock isn't already owned by the server. If you want to keep different servers from constantly stealing the lock from each other, use other means to make sure that a mailbox normally isn't modified from more than one server.
I don't think this will be a big problem even if multiple servers are modifying the same mailbox, but it depends entirely on the extra latency caused by the global locking. I won't know the latency until it can be tested, but I don't think it should be much more than a simple ping over the same network.
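As a sketch of the locking I have in mind (names and protocol entirely hypothetical):

    #include <stdbool.h>

    /* Hypothetical global mailbox lock: whichever server holds it
       acts as master for that mailbox, the others as slaves. */
    struct global_lock { int owner_server_id; };

    /* Stand-in for the real network protocol. */
    static bool request_lock_over_network(struct global_lock *lock, int self_id)
    {
        lock->owner_server_id = self_id;
        return true;
    }

    static bool try_acquire_mailbox_lock(struct global_lock *lock, int self_id)
    {
        if (lock->owner_server_id == self_id)
            return true;    /* already ours: no extra latency at all */
        /* Otherwise one round-trip to request (or steal) the lock -
           roughly the cost of a ping over the same network. */
        return request_lock_over_network(lock, self_id);
    }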