[Dovecot] Replication plans
Several companies have been interested in getting replication support for Dovecot. It looks like I could begin implementing it in a few months after some other changes. So here's how I'm planning on doing it:
What needs to be replicated:
- Saving new mails (IMAP, deliver)
- Copying existing mails
- Expunges
- Flag and keyword changes
- Mailbox creations, deletions and renames
- Subscription list
- IMAP extension changes, such as ACLs
I'll first talk only about the master/slave configuration, and later about master/multi-slave and multi-master setups.
Comments?
Basic design
Since the whole point of master-slave replication would be to get a reliable service, I'll want to make the replication as reliable as possible. It would be easier to implement much simpler master-slave replication, but in error conditions that would almost guarantee that some messages get lost. I want to avoid that.
Both slave and master would be running a dovecot-replication process. They would talk to each other over a TCP connection. The master would feed changes (transactions) to the slave, and the slave would keep replying with how far it has committed the changes. So there would be an increasing sequence number sent with each transaction.
Master keeps all the changes in memory until the slave has replied that it has committed the changes. If the memory buffer gets too large (1MB?) because the slave is handling the input too slowly or because it's completely dead, the master starts writing the buffer to a file. Once the slave is responding again, the changes are read from the file and finally the file gets deleted.
If the file gets too large (10MB?) it's deleted and slave will require a resync. Master always keeps track of "user/mailbox -> last transaction sequence" in memory. When the slave comes back up and tells the master its last committed sequence, this allows the master to resync only those mailboxes that had changed. If this mapping doesn't exist (master was restarted), all the users' mailboxes need to be resynced.
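As a rough illustration of the buffering idea, here's a Python sketch (not Dovecot code; the class name, limits and method names are all invented for this example) of how the master side could spool unacknowledged transactions and track what needs resyncing:

# Illustrative sketch only -- not Dovecot code. Names and limits are invented.
import os

MEM_LIMIT = 1024 * 1024         # ~1MB in-memory buffer
FILE_LIMIT = 10 * 1024 * 1024   # ~10MB spool file before giving up

class ReplicationMaster:
    def __init__(self, spool_path):
        self.spool_path = spool_path
        self.seq = 0                    # increasing transaction sequence
        self.pending = []               # (seq, data) not yet committed by slave
        self.pending_size = 0
        self.last_seq_per_mailbox = {}  # "user/mailbox" -> last sequence sent
        self.need_full_resync = False

    def send_transaction(self, mailbox, data):
        # data is the raw transaction record (bytes)
        self.seq += 1
        self.pending.append((self.seq, data))
        self.pending_size += len(data)
        self.last_seq_per_mailbox[mailbox] = self.seq
        if self.pending_size > MEM_LIMIT:
            self._spill_to_file()
        return self.seq

    def _spill_to_file(self):
        # slave is slow or dead: move the buffered transactions to disk
        with open(self.spool_path, "ab") as f:
            for seq, data in self.pending:
                f.write(data)           # a real spool would store seq too
        self.pending, self.pending_size = [], 0
        if os.path.getsize(self.spool_path) > FILE_LIMIT:
            os.unlink(self.spool_path)  # give up; slave will need a resync
            self.need_full_resync = True

    def slave_committed(self, committed_seq):
        # drop everything the slave has already committed
        self.pending = [(s, d) for s, d in self.pending if s > committed_seq]
        self.pending_size = sum(len(d) for _, d in self.pending)

    def mailboxes_to_resync(self, slave_last_seq):
        # when the slave reconnects, only mailboxes changed after its last
        # committed sequence need resyncing
        return [m for m, s in self.last_seq_per_mailbox.items()
                if s > slave_last_seq]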
The resyncing will also have to deal with new mailboxes, ACL changes and such. I'm not completely sure yet how the resyncing will work. Probably the only important thing here is that it should resolve any message UID conflicts (even though they shouldn't happen, see below) by comparing either the messages' MD5 checksums, or with maildir it would be enough to just compare the maildir filenames.
It's possible that the slave doesn't die completely but only some transactions fail. This shouldn't normally happen, so the easiest way to handle it would be to try a full resync, and if that still fails stop the whole slave. Another way would be to just mark that one user or mailbox as dirty and try to resync it once in a while.
imap, pop3 and deliver would connect to dovecot-replication either using UNIX sockets or a POSIX message queue. A queue could be faster, but sockets would allow better security checks. Or I suppose it would be possible to send some kind of a small authentication token in each message with queues. The communication protocol would be binary and the existing dovecot.index.log format would be used as much as possible (making it simple for the slave to append changes to mailboxes' dovecot.index.log).
dovecot-replication process would need read/write access to all users' mailboxes. So either it would run as root or it would need to have at least group-permission rights to all mailboxes. A bit more complex solution would be to use multiple processes each running with their own UIDs, but I think I won't implement this yet.
The replication would require support for shared mailboxes. That way the mailboxes could be accessed using the configured shared namespaces, for example "shared/<username>/INBOX". (LMTP server could be implemented in a similar way BTW)
Configuration
I'm not yet sure how this should work. The simplest possibility would be just to configure a single slave and everything would be replicated there.
But it should be possible to split users across multiple slaves (still one slave per user). The most configurable way to do this would be to have userdb return the slave host. Hmm. Actually that could work pretty well if the userdb returned the slave's name and each slave had a separate socket/queue that dovecot-replication listens on. I'm not sure if they could all still be handled by a single dovecot-replication process or if there should be a separate one for each of them. The configuration would then be something like this (with a small sketch of the username mapping after the example):
# slave for usernames beginning with letters A..K
replication users_a_k {
  host = slave1.example.com
  # other options?
}
# L..Z and others
replication users_l_z_rest {
  host = slave2.example.com
}
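To illustrate the userdb idea mentioned above (purely a sketch; the function is hypothetical), the lookup would essentially map each username to one of those replication sections:

# Illustrative only: map a username to one of the replication sections in
# the example configuration above.

def replication_section(username):
    first = username[0].lower()
    if "a" <= first <= "k":
        return "users_a_k"
    return "users_l_z_rest"

print(replication_section("alice"))    # users_a_k
print(replication_section("mallory"))  # users_l_z_rest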
Saving
This is the most important thing to get right, and also the most complex one. Besides replicating mails that are being saved via Dovecot, I think externally saved mails should also be replicated when they're first seen. This is somewhat related to doing an initial sync to a slave.
The biggest problem with saving is how to robustly handle master crashes. If you're just pushing changes from master to slave and the master dies, it's entirely possible that some of the new messages that were already saved in the master didn't get through to the slave. So when you switch to using the slave, those messages are gone. The biggest problem with this is that when saving new mails, they'll be assigned UIDs that were already used by the original master. If a locally caching client had already seen the UID from the original master, it'll think the message is the original one instead of the new one. So the newly added message pretty much gets lost, and the user could even expunge the message (thinking it's the original one) without ever seeing it.
This UID conflict could be avoided by making the master tell the slave whenever it's saving new mails. The latency of the master-slave communication could be avoided with a bit of a trick (sketched in code after the list below):
- When beginning to save/copy a mail, tell slave to increase UID counter by one. Slave replies with the new highest-seen-UID counter.
- UID allocation is done after all the mails are saved/copied within a transaction. At this stage wait until slave has replied that it has reserved UIDs at least as high as we're currently allocating (other processes could have caused it to get even higher).
- If save/copy is aborted, tell the slave to decrease the UID counter by the number of aborted messages.
- UID counter desync shouldn't happen, but if it does, handle that somehow as well.
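Here's a minimal sketch of that counter exchange (illustrative Python, not Dovecot code; the slave's counter is simulated in-process, and all class and method names are invented):

# Illustrative sketch only -- not Dovecot code. The slave's UID counter is
# simulated in-process; in reality it would sit behind a TCP connection.

class SlaveUidCounter:
    """Per-mailbox counter kept on the slave side."""
    def __init__(self, highest_seen_uid=0):
        self.highest_seen_uid = highest_seen_uid

    def reserve(self, count=1):
        self.highest_seen_uid += count
        return self.highest_seen_uid      # reply with new highest-seen UID

    def release(self, count):
        self.highest_seen_uid -= count    # aborted save/copy gives UIDs back


class MasterMailbox:
    def __init__(self, slave_counter, next_uid=1):
        self.slave = slave_counter
        self.next_uid = next_uid

    def save_transaction(self, mails):
        # 1) while each mail is being written, tell the slave to bump its counter
        for _ in mails:
            self.slave.reserve(1)
        # 2) allocate the real UIDs only after all mails are saved ...
        uids = list(range(self.next_uid, self.next_uid + len(mails)))
        self.next_uid += len(mails)
        # ... and check (in reality: wait) that the slave has reserved at
        # least as high as what we allocated
        assert self.slave.highest_seen_uid >= uids[-1]
        return uids

    def abort_transaction(self, aborted_count):
        # 3) aborted messages return their reservations
        self.slave.release(aborted_count)


# usage: two mails saved in one transaction get UIDs 1 and 2, and the slave's
# counter is guaranteed to be at least 2 before they're committed
slave = SlaveUidCounter()
print(MasterMailbox(slave).save_transaction(["mail1", "mail2"]))  # [1, 2]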
Then the UID allocation latency depends on how long it takes to write the message, and where the message comes from. With IMAP APPEND there's probably no wait required, unless the APPEND command and the whole message arrive in one IP packet. With deliver and copying it might have to wait a while.
So the UID allocation should work with minimal possible latency. It could use a separate TCP connection to slave and maybe a separate high priority process in the slave that would handle only UID allocations.
The above will make sure that conflicting UIDs aren't used, but it doesn't make sure that messages themselves aren't lost. If the master's hard disk failed completely and it can't be resynced anymore, the messages saved there are lost. This could be avoided by making the saving wait until the message is saved to slave:
- save mail to disk, and at the same time also send it to slave
- allocate UID(s) and tell the slave what they were, wait for a "mail saved" reply from the slave before committing the messages to the mailbox permanently
This would however add more latency than the simpler UID allocation counter. I don't know if it should be optional which way is used. Perhaps I'll implement the full save-sync first, and if it seems too slow I'll optionally implement the UID allocation counter as well.
Save messages sent by imap/deliver would contain at least:
- user name (unless using UNIX sockets, then it's already known)
- mailbox name
- UIDVALIDITY
4a) UID
4b) global UID, filename, offset and size of the message
4a) would be used with the UID allocation counter behavior. The message would be sent after the message was saved and UIDs were allocated.
4b) would be used with the wait-slave-to-save behavior. The message would be sent immediately whenever new data was received for the message, so possibly multiple times for each message. The last block would then contain size=0. After saving all the mails in a transaction, there would be a separate "allocate UIDs x..y for global UIDs a, b, c, d, etc." message, to which the slave replies and the master forwards the reply to imap/deliver.
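To make 4a/4b a bit more concrete, here's a sketch of the two message kinds (illustrative Python dictionaries only; in reality these would be binary records, and all field names are invented):

# Purely illustrative message layouts -- not the actual protocol.

def save_message_4a(user, mailbox, uidvalidity, uid):
    # 4a) UID-allocation-counter mode: sent once, after the mail has been
    #     saved and its UID allocated
    return {"user": user, "mailbox": mailbox,
            "uidvalidity": uidvalidity, "uid": uid}

def save_block_4b(user, mailbox, uidvalidity, global_uid, filename,
                  offset, size):
    # 4b) wait-for-slave mode: sent repeatedly as message data arrives;
    #     a final block with size=0 marks the end of the message
    return {"user": user, "mailbox": mailbox, "uidvalidity": uidvalidity,
            "global_uid": global_uid, "filename": filename,
            "offset": offset, "size": size}

def allocate_uids_4b(first_uid, last_uid, global_uids):
    # sent once per transaction after all mails are saved; the slave's
    # reply is forwarded back to imap/deliver before the commit
    return {"uids": (first_uid, last_uid), "global_uids": global_uids}

# usage: stream a 7000-byte mail in two blocks, then allocate UID 42 for it
blocks = [save_block_4b("jane", "INBOX", 1179400000, "g1", "msg.tmp", 0, 4096),
          save_block_4b("jane", "INBOX", 1179400000, "g1", "msg.tmp", 4096, 2904),
          save_block_4b("jane", "INBOX", 1179400000, "g1", "msg.tmp", 7000, 0)]
alloc = allocate_uids_4b(42, 42, ["g1"])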
If the slave is down, or if a timeout (a few seconds?) occurs while waiting for the reply from the slave, imap/deliver could just go on saving the message. If this is done and the master also dies soon afterwards, there can be UID conflicts. But since this should be a rare error condition, it's unlikely that the master will die soon after a misbehaving slave.
Copying
Copying has similar UID conflict problems as saving. The main difference is that there's no need to transfer message contents, which makes it faster to handle copies reliably.
Copying could be handled with the UID allocation counter approach described above. Since the message already exists on the server, there's no way that the message gets lost. So it would be possible to allow master and slave to go into desync, because the user would still find the message in the original mailbox.
A move operation is done with a "copy + mark \deleted + expunge" combination. This could be problematic if the master had managed to do all that before the slave saw the COPY at all. Since the message isn't yet copied to the new mailbox in the slave, clients wouldn't see it there, but locally caching clients might not see the message in the original mailbox either, because they had already received an EXPUNGE notification for it. So this should be avoided in some way.
Perhaps it would be better to just treat copies the same way as saves: require a notification from the slave before replying to the client with the OK message. Alternatively the wait could be done at the EXPUNGE command, as described below.
Expunges
Losing expunge transactions wouldn't be too bad, because the worst that can happen is that the user needs to expunge the message again. Except that having an expunged message suddenly appear again in the middle of the mailbox violates the IMAP protocol. I don't know if/how that will break clients.
The solution here would again be to wait, before EXPUNGE notifications are sent to the client, for a reply from the slave saying that it has also processed the expunge.
Flag and keyword changes
It most likely doesn't matter if some of these are lost. If someone really cares about them, it would be possible to again do the slave-waiting, either for all the changes or just when some specific flags/keywords change.
Other changes
Mailbox creations, deletions, renames and subscription changes could all be handled in a similar way to flag/keyword changes.
Handling changes implemented by IMAP extensions such as ACL is another thing. It's probably not worth the trouble to implement a special replication handler for each plugin. These changes will most likely happen somewhat rarely though, so they could be implemented by sending the whole IMAP command to the slave and having it execute the command using a regular imap process. The commands could be marked with some "replication" flag so Dovecot knows that they need to be sent to the slave.
Although instead of executing a new imap process for each user (probably each command), it could be possible to use one imap process and just change the mailbox name in the command. So instead of e.g. "setacl mailbox (whatever)" the IMAP process would execute "setacl shared/username/mailbox (whatever)". That would require telling Dovecot which parameter contains the mailbox name. Should be pretty easy to implement, unless some IMAP extensions want multiple mailbox names or the position isn't static. I can't think of any such extensions right now though.
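For instance (just an illustration, not actual Dovecot code), the rewrite could be as simple as:

# Illustrative sketch: rewrite the mailbox argument of a replicated IMAP
# command into the shared namespace. Assumes the plugin has declared which
# argument position holds the mailbox name.

def rewrite_for_slave(command_args, mailbox_arg_index, username):
    args = list(command_args)
    args[mailbox_arg_index] = "shared/%s/%s" % (username, args[mailbox_arg_index])
    return args

# "setacl mailbox (whatever)" becomes "setacl shared/username/mailbox (whatever)"
print(rewrite_for_slave(["setacl", "mailbox", "(whatever)"], 1, "username"))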
Master/multi-slave
Once the master/slave is working, support for multiple slaves could be added. The main thing to be done here is to have the slaves select which one of them is the new master, and then they'll have to sync the mailboxes with each other, because each of them might have seen some messages that the others hadn't. This is assuming that save/etc. operations that are waiting for a reply from a slave would wait for only one reply, from the first slave that happens to respond. The number of slave replies to wait for could also be configurable, in case you're worried about a simultaneous master+slave crash.
Multi-master
After master/multi-slave is working, we're nearly ready for full multi-master operation. The master farm will need global locking. Perhaps there's an existing implementation that could be used. In any case there needs to be a way to keep track of which master owns which locks.
The easiest way to implement multi-master is to assume that a single user isn't normally accessed from different servers at the same time. This wouldn't be a requirement, but it would make the performance better. For most users you could handle this by making sure that the load balancer directs all connections from the same IP to the same server. Or you could use a Dovecot proxy that transfers the connections.
So, once the user is accessed only from one server, the operation becomes pretty much like master/multi-slave. You'll just need to be able to dynamically change the current master server for the user. This can be done by grabbing a global per-user lock before modifying the mailbox. The lock wouldn't have to be released immediately though. If no other server needs the lock, there's no point in locking/unlocking constantly. Instead the lock server could request the current lock owner to release the lock whenever another server wants it.
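Here's a toy in-process model of that deferred-release idea (illustrative Python only; a real lock server would of course be a separate networked service, and all names here are invented):

# Toy in-process model of the per-user lock with deferred release.

class UserLockServer:
    def __init__(self):
        self.owner = {}        # user -> server currently holding the lock
        self.release_cb = {}   # user -> callback asking the owner to let go

    def acquire(self, user, server, on_release_request):
        current = self.owner.get(user)
        if current is None or current == server:
            self.owner[user] = server
            self.release_cb[user] = on_release_request
            return True
        # someone else holds it: ask them to release, caller retries later
        self.release_cb[user](user)
        return False

    def release(self, user, server):
        if self.owner.get(user) == server:
            del self.owner[user]
            del self.release_cb[user]

# server1 keeps the lock across many operations; it only gives it up when
# server2 actually asks for it
locks = UserLockServer()
locks.acquire("jane", "server1", lambda u: locks.release(u, "server1"))
locks.acquire("jane", "server2", lambda u: None)  # triggers server1's release
print(locks.acquire("jane", "server2", lambda u: None))  # True on retry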
Hi Timo,
MySQL gets around the problem of multiple masters allocating the same primary key, by giving each server its own address range (e.g. first machine uses 1,5,9,13 next one uses 2,6,10,14,...). Would this work for UIDs?
Jonathan.
Jonathan wrote:
Hi Timo,
MySQL gets around the problem of multiple masters allocating the same primary key, by giving each server its own address range (e.g. first machine uses 1,5,9,13 next one uses 2,6,10,14,...). Would this work for UIDs?
UIDs have to be sequential. Causes some problems.
On Thu, 2007-05-17 at 09:23 -0600, Aredridel wrote:
Jonathan wrote:
Hi Timo,
MySQL gets around the problem of multiple masters allocating the same primary key, by giving each server its own address range (e.g. first machine uses 1,5,9,13 next one uses 2,6,10,14,...). Would this work for UIDs?
UIDs have to be sequential. Causes some problems.
Right. The IMAP requirement for UIDs to be growing is the main problem with IMAP replication.
My first reaction is that I've already got replication by running dovecot on AFS ;)
But that's currently not *really* replicated. The real question I guess is why not use a cluster/distributed/san filesystem like AFS, GFS, Lustre, GPFS to handle the actual data, and specify that replication only works for maildir or other single file per message mailbox formats.
This puts about half the replication problem on the filesystem, and I would hope keeps the dovecot code a lot simpler. The hard part for dovecot is understanding the locking semantics of various filesystems, and doing locking in a way that doesn't totally trash performance.
I also tend to have imap clients open on multiple machines, so the assumption that a user's mailbox will only be accessed from 1 IP is probably a bad one.
I'd suggest that multi-master replication be implemented on a per-mailbox basis, so that for any mailbox, the first dovecot instance that gets a connection for that mailbox checks with a 'dovelock' process that can either backend to a shared filesystem, or implement its own TCP-based lock manager. It would then get a master lock on that mailbox. If a second dovecot gets a connection to a mailbox with a master lock, it would act as a proxy and redirect all writes to the master. In the case of a shared filesystem, all reads can still come straight from the local filesystem. In the case of AFS, this might hit the directory lookups kind of hard, but I think would still be a big win because of local caching of all the maildir files, which for the most part are read-only.
In the event that a master crashes or stops responding, the dovelock process would need a mechanism to revoke a master lock.. this might require some filesystem-specific glue to make sure that once a lock is revoked, the dovecot process that had its lock revoked can't write to the filesystem anymore. I can think of at least one way to do this with AFS by having the AFS server invalidate authentication tokens for the process/server that the lock was revoked from.
Once the lock is revoked, the remaining process can grab it and go on happily with life. Once the dead/crashed process comes back, it just has to have its filesystem go out and update to the latest files.
I like the idea of dovecot having a built-in proxy/redirect ability so a cluster can be done with plain old round-robin DNS. I also think that in most cases, if there's a dead server in the round-robin DNS pool, most clients will retry automatically, or the user will get an error, click okay, and try again. Having designated proxy or redirectors in the middle just makes things complicated and hard to debug.
--
Troy Benjegerdes 'da hozer' hozer@hozed.org
Someone asked me why I work on this free (http://www.fsf.org/philosophy/) software stuff and not get a real job. Charles Schulz had the best answer:
"Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life." -- Charles Shultz
On Thu, 2007-05-17 at 10:04 -0500, Troy Benjegerdes wrote:
But that's currently not *really* replicated. The real question I guess is why not use a cluster/distributed/san filesystem like AFS, GFS, Lustre, GPFS to handle the actual data, and specify that replication only works for maildir or other single file per message mailbox formats.
This already works, but people don't seem to want to use it. There are apparently some stability, performance, complexity and whatever problems. And if you're planning on replicating to a remote location far away they're really slow. But I haven't personally tried any of them, so all I can say is that people are interested in getting replication that doesn't require clustered filesystems, so I'll build it for them. :)
Then there are also these special users, such as me. :) I'm now running my own Dovecot server on the same server that receives my emails. I want to use my laptop from different locations, so I want to keep that working. But then again my desktop computer is faster than the remote server, so I'd want to run another Dovecot in the desktop which replicates with the remote server.
Then there are also people who would want to run Dovecot on their laptop and have it synchronize with the main server whenever network connection is available.
I also tend to have imap clients open on multiple machines, so the assumption that a user's mailbox will only be accessed from 1 IP is probably a bad one.
Yes, I know. But usually there's only one client that's actively modifying the mailbox. The readers don't need global locking because they're not changing anything. Except \Recent flag updates..
If a second dovecot gets a connection to a mailbox with a master lock, it would act as a proxy and redirect all writes to the master.
Actually this was my original plan, but then I decided that lock transfer is probably more efficient than proxying. But I'll add support for proxying in any case for accessing shared mailboxes in other servers, so if you prefer proxying it'll be easy to support too.
In the event that a master crashes or stops responding, the dovelock process would need a mechanism to revoke a master lock.. this might require some filesystem-specific glue to make sure that once a lock is revoked the dovecot process that had it's lock revoked can't write to the filesystem anymore. I can think of at least one way to do this with AFS by having the AFS server invalidate authentication tokens for the process/server that the lock was revoked from.
I was hoping to find an existing global lock manager. I think some exist. If not, then I'll have to spend some more time thinking about it. Anyway, not everyone wants to use AFS, so I can't rely on it :)
On Thu, 17 May 2007 19:17:25 +0300 Timo Sirainen tss@iki.fi wrote:
On Thu, 2007-05-17 at 10:04 -0500, Troy Benjegerdes wrote:
But that's currently not *really* replicated. The real question I guess is why not use a cluster/distributed/san filesystem like AFS, GFS, Lustre, GPFS to handle the actual data, and specify that replication only works for maildir or other single file per message mailbox formats.
This already works, but people don't seem to want to use it. There are apparently some stability, performance, complexity and whatever problems. And if you're planning on replicating to a remote location far away they're really slow. But I haven't personally tried any of them, so all I can say is that people are interested in getting replication that doesn't require clustered filesystems, so I'll build it for them. :)
I for one would rather pay you for not re-inventing the wheel, but if people with actual access to funds are willing to pay you for this then I guess "take the money and run" is the thing to do. :-p
Yes, all these FS based approaches currently have one or more of the issues Timo lists. The question of course is, will a replicated dovecot be less complex, slow, etc. For people with money, there are enterprise level replicated file systems and/or hardware replicated SANs (remote locations, too). For people w/o those funds there are the above approaches (which despite all their shortcomings can work, right now) and of course one frequently overlooked but perfectly fitting solution, DRBD. For the ZFS fanbois, there are ways to make it clustered/replicated as well (some storageworks add-on or geom tricks).
The point (for me) would be to not just replicate IMAP (never mind that most of our users use POP, me preferring not to use the dovecot LDA, etc), but ALL of the services/infrastructure that make up a mail system. Which leads quickly back to HA/cluster/SAN/DRBD for me.
I also tend to have imap clients open on multiple machines, so the assumption that a user's mailbox will only be accessed from 1 IP is probably a bad one.
Yes, I know. But usually there's only one client that's actively modifying the mailbox. The readers don't need global locking because they're not changing anything. Except \Recent flag updates..
We have multiple ACTIVE clients accessing the same mailbox all the time. Company role accounts love that kind of setup. ^_^
Regards,
Christian
Christian Balzer Network/Systems Engineer NOC chibi@gol.com Global OnLine Japan/Fusion Network Services http://www.gol.com/
On Fri, May 18, 2007 at 11:41:46AM +0900, Christian Balzer wrote:
On Thu, 17 May 2007 19:17:25 +0300 Timo Sirainen tss@iki.fi wrote:
On Thu, 2007-05-17 at 10:04 -0500, Troy Benjegerdes wrote:
But that's currently not *really* replicated. The real question I guess is why not use a cluster/distributed/san filesystem like AFS, GFS, Lustre, GPFS to handle the actual data, and specify that replication only works for maildir or other single file per message mailbox formats.
This already works, but people don't seem to want to use it. There are apparently some stability, performance, complexity and whatever problems. And if you're planning on replicating to a remote location far away they're really slow. But I haven't personally tried any of them, so all I can say is that people are interested in getting replication that doesn't require clustered filesystems, so I'll build it for them. :)
I for one would rather pay you for not re-inventing the wheel, but if people with actual access to funds are willing to pay you for this then I guess "take the money and run" is the thing to do. :-p
I'm going to throw out a warning that it's my feeling that replication has ended many otherwise worthwhile projects. Once you go down that rabbit hole, you end up finding out the hard way that you just can't avoid the stability, performance, complexity, and whatever problems.
If you take the money and run, just be aware of the complexity and customer expectations you are getting yourself into ;)
Yes, all these FS based approaches currently have one or more of the issues Timo lists. The question of course is, will a replicated dovecot be less complex, slow, etc. For people with money, there are enterprise level replicated file systems and/or hardware replicated SANs (remote locations, too). For people w/o those funds there are the above approaches (which despite all their shortcomings can work, right now) and of course one frequently overlooked but perfectly fitting solution, DRBD. For the ZFS fanbois, there are ways to make it clustered/replicated as well (some storageworks add-on or geom tricks).
The point (for me) would be to not just replicate IMAP (never mind that most of our users use POP, me preferring not to use the dovecot LDA, etc), but ALL of the services/infrastructure that make up a mail system. Which leads quickly back to HA/cluster/SAN/DRBD for me.
I've found myself pretty much in the same "all roads lead to the filesystem" situation. I don't want to replicate just imap, I want to replicate the build directory with my source code, my email, and my MP3 files.
So maybe the right thing to do here is have dovecot do the locking and proxying bit, and initially use librsync for the actual replication. The rsync bit could be replaced with plugins for various filesystems.
On 18.5.2007, at 5.41, Christian Balzer wrote:
Yes, all these FS based approaches currently have one or more of the issues Timo lists. The question of course is, will a replicated dovecot be less complex, slow, etc.
The good thing is at least that Dovecot won't be any more complex if you're not using replication. :) I think pretty much all of the replication can be handled with a plugin and this dovecot-replication binary.
I also tend to have imap clients open on multiple machines, so the assumption that a user's mailbox will only be accessed from 1 IP is probably a bad one.
Yes, I know. But usually there's only one client that's actively modifying the mailbox. The readers don't need global locking because they're not changing anything. Except \Recent flag updates..
We have multiple ACTIVE clients accessing the same mailbox all the time. Company role accounts love that kind of setup. ^_^
I think moving a global lock from one computer to another could still be faster than all-the-time proxying. But this of course depends on the locking implementation and perhaps other things I haven't considered yet.
Timo Sirainen writes:
Then there are also people who would want to run Dovecot on their laptop and have it synchronize with the main server whenever network connection is available.
YES!
I had not thought of that, but that would be killer.. although that would be multi-master which I think would be really difficult to implement.. :-(
I was hoping to find an existing global lock manager. I think there exist some. If not, then I'll have to spend some more time thinking about it. Anyway everyone don't want to use AFS, so I can't rely on it :)
It's not that people don't want to.. it's that sometimes they can't. For example where I work we are keeping a close eye on Gluster, but it currently does not work on FreeBSD. Some of the other distributed filesystems are either not as mature on FreeBSD or are non-existent.
Troy Benjegerdes writes:
But that's currently not *really* replicated. The real question I guess is why not use a cluster/distributed/san filesystem like AFS, GFS,
Because those distributed filesystems may be more difficult to set up, more difficult to maintain, and may be less portable than a Dovecot solution.
For example Gluster sounds really like a great distributed filesystem, but it currently does not work in FreeBSD.
If the company I work for wanted to use Gluster we would need to either learn Linux or hire someone to setup and maintain the linux boxes for us.
I'd suggest that multi-master replication be implemented on a per-mailbox basis
I suggest we forget about multi-master for now. :-) Some of us would rather see something sooner rather than later.
I like the idea of dovecot having a built-in proxy/redirect ability so a cluster can be done with plain old round-robin DNS.
Round-robin DNS is ok for a small setup, but not good enough for larger setups.
most cases, if there's a dead server in the round-robin DNS pool, most clients will retry automatically, or the user will get an error
And you will get dozens of calls of users asking what is going on, and why are things slow, etc. etc... and there goes half your morning/afternoon.
Even with a small TTL, Outlook is so flaky that often after a brief problem it needs to be restarted.
Where I work we may have a 15 minute problem that takes us 2 hours of calls to handle.. in large part because of Outlook, or people just wanting to know what happened.. and expecting a call back with an explanation.
Failover needs to be seamless and without error. Either have a proxy from dovecot in front or a load balancer.
Timo Sirainen writes:
Master keeps all the changes in memory until slave has replied that it has committed the changes. If the memory buffer gets too large (1MB?)
Does this mean that in case of a crash all that would be lost? I think the cache should be smaller.
because the slave is handling the input too slowly or because it's completely dead, the master starts writing the buffer to a file. Once the slave is again responding the changes are read from the file and finally the file gets deleted.
Good.
If the file gets too large (10MB?) it's deleted and slave will require a resync.
Don't agree. A large mailstore with Gigabytes worth of mail would benefit from having 10MB synced... instead of re-starting from scratch.
Master always keeps track of "user/mailbox -> last transaction sequence" in memory. When the slave comes back up and tells the master its last committed sequence, this allows the master to resync only those mailboxes that had changed.
I think a user configurable option to decide how large the sync files can grow to would be most flexible.
the whole slave. Another way would be to just mark that one user or mailbox as dirty and try to resync it once in a while.
That sounds better. A full resync can be very time consuming with a large and busy mailstore. Not only the full amount of data needs to be synced, but new changes too.
queues. The communication protocol would be binary
Because? Performance? Wouldn't that make debugging more difficult?
dovecot-replication process would need read/write access to all users' mailboxes. So either it would run as root or it would need to have at least group-permission rights to all mailboxes. A bit more complex solution would be to use multiple processes each running with their own UIDs, but I think I won't implement this yet.
For now pick the easiest approach to get this first version out. This will allow testers to have something to stress test. Better to have some basics out.. get feedback.. than to try to go after a more complex approach; unless you believe the complex approach is the ultimate long term best method.
But it should be possible to split users into multiple slaves (still one slave/user). The most configurable way to do this would be to have userdb return the slave host.
Why not just have 1 slave process per slave machine?
This is the most important thing to get right, and also the most complex one. Besides replicating mails that are being saved via Dovecot, I think also externally saved mails should be replicated when they're first seen. This is somewhat related to doing an initial sync to a slave.
Why not go with a pure log replication scheme? This way you basically have 3 processes:
1- The normal, currently existing programs, with logging added to them.
2- A master replication process which listens for clients requesting info.
3- The slave processes that request information and write it to the slave machines.
With this approach you can basically break it down into logical units of code which can be tested and debugged. Also helps when you need to worry about security and the level at which each component needs to work.
The biggest problem with saving is how to robustly handle master crashes. If you're just pushing changes from master to slave and the master dies, it's entirely possible that some of the new messages that were already saved in master didn't get through to slave.
With my suggested method that would, in theory, never happen. A message doesn't get accepted unless the log gets written (if replication is on).
If the master dies, when it gets restarted it should be able to continue.
- If save/copy is aborted, tell the slave to decrease the UID counter by the number of aborted messages.
Are you planning to have a single slave? Or did you plan to allow multiple slaves? If allowing multiple slaves you will need to keep track of which point in the log each slave is at. An easier approach is to have a setting based on time for how long the master keeps the logs.
Solution here would again be that before EXPUNGE notifications are sent to client we'll wait for reply from slave that it had also processed the expunge.
From all your descriptions it sounds as if you are trying to do synchronous replication. What I suggested is basically to use asynchronous replication. I think synchronous replication is not only much more difficult to implement, but also much more difficult to debug and keep working correctly as the code changes.
Master/multi-slave
Once the master/slave is working, support for multiple slaves could be added.
With the log shipping method I suggested multi-slave should not be much more coding to do.
In theory you could put more of the burden on the slaves by having them ask for everything after the last transaction ID they got, so the master will not need to keep any extra state to handle multiple slaves.
After master/multi-slave is working, we're nearly ready for a full multi-master operation
I think it will be clearer to see what needs to be done after you have master-slave working. I have never tried to implement a replication system, but I think that the only way to have a reliable multi-master system is to have synchronous replication across ALL nodes.
This increases communication and locking significantly. The locking alone will likely be a choke point.
On Sun, 2007-05-20 at 20:58 -0400, Francisco Reyes wrote:
Timo Sirainen writes:
Master keeps all the changes in memory until slave has replied that it has committed the changes. If the memory buffer gets too large (1MB?)
Does this mean that in case of a crash all that would be lost? I think the cache should be smaller.
Well, there are two possibilities:
a) Accept that replication process can lose changes, and require a full resync when it gets back up. This of course means that all users' mailboxes need to be scanned, which can be slow.
b) Write everything immediately to disk (and possibly fsync()) and require the actual writer process to wait until replicator has done this before replying to client that the operation succeeded. Probably not worth it for flag changes, but for others it could be a good idea.
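A tiny sketch of option b) (illustrative Python only; the function name is invented):

# Illustrative sketch of option b): persist the change (and fsync) before
# the writer process is told the operation succeeded.
import os

def persist_before_ack(spool_fd, record_bytes):
    os.write(spool_fd, record_bytes)
    os.fsync(spool_fd)   # make it survive a replication-process crash
    return "ok"          # only now reply to imap/deliver/pop3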
If the file gets too large (10MB?) it's deleted and slave will require a resync.
Don't agree. A large mailstore with Gigabytes worth of mail would benefit from having 10MB synced... instead of re-starting from scratch.
By a resync I mean that dovecot.index and newer changes from dovecot.index.log need to be sent to the slave, which can then figure out what messages it's missing and request them from the master (or something similar). So I didn't mean that all existing messages would be sent.
But again this would mean that the above is done for all mailboxes. There could of course be ways to make this faster, such as having a global modification counter stored for each mailbox and resyncing only those mailboxes where the counter is higher than the last value that the slave saw. I guess the modification counter could be a simple mtime timestamp of the dovecot.index.log file :)
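A rough sketch of that mtime-based selection (illustrative Python; the data structures passed in are invented for the example):

# Illustrative sketch: pick mailboxes to resync by comparing the mtime of
# each mailbox's dovecot.index.log against the timestamp the slave last
# confirmed.
import os

def mailboxes_needing_resync(index_log_paths, slave_last_seen):
    """index_log_paths: {"user/mailbox": "/path/to/dovecot.index.log"}
    slave_last_seen:   {"user/mailbox": unix timestamp}"""
    stale = []
    for mailbox, path in index_log_paths.items():
        mtime = os.stat(path).st_mtime
        if mtime > slave_last_seen.get(mailbox, 0):
            stale.append(mailbox)   # changed since the slave last saw it
    return stale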
Master always keeps track of "user/mailbox -> last transaction sequence" in memory. When the slave comes back up and tells the master its last committed sequence, this allows the master to resync only those mailboxes that had changed.
I think a user configurable option to decide how large the sync files can grow to would be most flexible.
Sure. My 1MB/10MB values were just guesses at what the defaults might be.
queues. The communication protocol would be binary
Because? Performance? Wouldn't that make debugging more difficult?
Because then flag changes, expunges and partly also appends (and maybe something else) could be handled using the exact same format as the dovecot.index.log file uses. The slave could simply append such transactions to dovecot.index.log without even trying to parse their contents.
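On the slave side that keeps the apply step trivial, roughly like this (illustrative sketch only; the function name is invented):

# Because the wire format matches dovecot.index.log records, flag changes /
# expunges can be appended to the mailbox's log as-is, without parsing them.

def apply_raw_transaction(index_log_path, record_bytes):
    with open(index_log_path, "ab") as f:
        f.write(record_bytes)   # the record already is a valid log entry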
But it should be possible to split users into multiple slaves (still one slave/user). The most configurable way to do this would be to have userdb return the slave host.
Why not just have 1 slave process per slave machine?
You mean why there shouldn't be just one master/slave pair, or why one slave process couldn't handle data from multiple masters?
For the former, because people are using the same computer to be master for some users, and slave for others. If there was a single master/slave pair and the master failed, some poor server's load would double. With splitting the users to multiple slaves, the new servers will now share the load.
For the latter, I'm not sure. It could be possible if the configuration is the same for all the masters that it's serving as slave for. Then it could even be better that way. I think both ways could be supported just as easily (run everything under one dovecot vs. run multiple dovecots in different IPs/ports).
This is the most important thing to get right, and also the most complex one. Besides replicating mails that are being saved via Dovecot, I think also externally saved mails should be replicated when they're first seen. This is somewhat related to doing an initial sync to a slave.
Why not go with a pure log replication scheme? this way you basically have 3 processes.
1- The normal, currently existing programs, with logging added to them.
2- A master replication process which listens for clients requesting info.
3- The slave processes that request information and write it to the slave machines.
With this approach you can basically break it down into logical units of code which can be tested and debugged. Also helps when you need to worry about security and the level at which each component needs to work.
I'm not completely sure what you mean by these. Basically the same as what I said, except just have imap/deliver simply send the changes without any waiting?
The biggest problem with saving is how to robustly handle master crashes. If you're just pushing changes from master to slave and the master dies, it's entirely possible that some of the new messages that were already saved in master didn't get through to slave.
With my suggested method that would, in theory, never happen. A message doesn't get accepted unless the log gets written (if replication is on). If the master dies, when it gets restarted it should be able to continue.
But isn't the point of the master/slave that the slave would switch on if the master dies? If you switch slave to be the new master, it doesn't matter if the logs were written to master's disk. Sure the message could come back when the master is again brought back (assuming it didn't completely die), but until then your IMAP clients might see messages getting lost or existing UIDs being used for new mails, which can cause all kinds of breakages.
Are you planning to have a single slave? Or did you plan to allow multiple slaves? If allowing multiple slaves you will need to keep track at which point in the log each slave is. An easier approach is to have a setting based on time for how long to allow the master to keep logs.
I don't understand what you mean. Sure the logs could timeout at some point (or shouldn't there be some size limits anyway?), but you'd still need to keep track of what different slaves have seen.
Solution here would again be that before EXPUNGE notifications are sent to client we'll wait for reply from slave that it had also processed the expunge.
From all your descriptions it sounds as if you are trying to do synchronous replication. What I suggested is basically to use asynchronous replication. I think synchronous replication is not only much more difficult to implement, but also much more difficult to debug and keep working correctly as the code changes.
Right. And I think the benefits of doing it synchronously outweigh the extra difficulties. As I mentioned in the beginning of the mail:
Since the whole point of master-slave replication would be to get a reliable service, I'll want to make the replication as reliable as possible. It would be easier to implement much simpler master-slave replication, but in error conditions that would almost guarantee that some messages get lost. I want to avoid that.
By "much simpler replication" I mean asynchronous replication. Perhaps asynchronous could be an option also if it seems that synchronous replication adds too much latency (especially if your replication is simply for an offsite backup), but I'd want synchronous to be the recommended method.
After master/multi-slave is working, we're nearly ready for a full multi-master operation
I think it will be clearer to see what needs to be done after you have master-slave working.
Sure. I wasn't planning on implementing multi-slave or multi-master before the master/slave was fully working and stress testing showing that no mails get lost even if master is killed every few seconds (and each crash causing master/slave to switch roles randomly).
I have never tried to implement a replication system, but I think that the only way to have a reliable multi-master system is to have synchronous replication across ALL nodes.
That would make it the safest, but I think it's enough to have just one synchronous slave update. Once the slaves then figure out together which one is the next master, the new master would gather all the updates it doesn't yet know about from the other slaves.
(Assuming the multi-master was implemented so that global mailbox holder is the master for the mailbox and the others are slaves.)
This increases communication and locking significantly. The locking alone will likely be a choke point.
My plan would require the locking only when the mailbox is being updated and the global lock isn't already owned by the server. If you want to avoid different servers constantly stealing the lock from each other, use other means to make sure that the mailbox normally isn't modified from more than one server.
I don't think this will be a big problem even if multiple servers are modifying the same mailbox, but it depends entirely on the extra latency caused by the global locking. I don't know what the latency will be until it can be tested, but I don't think it should be much more than what a simple ping would give over the same network.
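To make the idea concrete, here's a minimal Python sketch of that "lock only on update, keep it once owned" behaviour. The lock-service API (try_lock) and all names are invented for illustration; this is not actual Dovecot code.

    class GlobalMailboxLock:
        def __init__(self, cluster, server_id):
            self.cluster = cluster      # some shared lock service (hypothetical)
            self.server_id = server_id
            self.owned = set()          # mailboxes whose global lock we already hold

        def acquire_for_update(self, mailbox):
            if mailbox in self.owned:   # already ours: no network round trip at all
                return True
            if self.cluster.try_lock(mailbox, owner=self.server_id):
                self.owned.add(mailbox) # keep the lock until another server steals it
                return True
            return False                # another server currently owns this mailbox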
Timo Sirainen writes:
Why not go with a pure log replication scheme? This way you basically have 3 processes:
1. The normal, currently existing programs, with logging of the changes added.
2. A master replication process which listens for clients requesting info.
3. The slave processes that request the information and write it to the slave machines.
With this approach you can break it down into logical units of code which can be tested and debugged separately. It also helps when you need to worry about security and the level of privilege each component needs to work at.
I'm not completely sure what you mean by these. Basically the same as what I said, except just have imap/deliver simply send the changes without any waiting?
I am basically suggesting to log all the changes to a log(s) and have a separate program handle passing on the information to the slaves.
But isn't the point of master/slave that the slave takes over if the master dies?
Yes.
If you switch the slave to be the new master, it doesn't matter whether the logs were written to the master's disk. Sure, the message could come back when the master is brought back up again
I was thinking that once a master dies, the next time it comes back it would be a slave and no longer the master. This would, unfortunately, imply that any transactions that were not copied over would be lost.
I don't understand what you mean. Sure, the logs could time out at some point (or shouldn't there be some size limits anyway?), but you'd still need to keep track of what the different slaves have seen.
I was thinking that it would be the slave's job to ask for data. Here's an example with pseudo transaction numbers.
The master gets transactions 1 through 100. The slave(s) start from scratch and ask for transactions starting at 1. Say one slave, let's call it A, dies at transaction 50, and another slave, say B, continues all the way to 100.
More transactions come in and now we are up to 150. Slave B will ask for anything after 100. When slave A comes back, it will ask for anything after 50.
The master simply gets a request for transactions after a given number, so it doesn't need to keep track of the status of each slave.
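For illustration, a small Python sketch of this pull model (all names invented): the master just appends to a sequence-numbered log and answers "give me everything after N", so it never has to remember per-slave state.

    class TransactionLog:
        def __init__(self):
            self.entries = []                    # list of (seq, change) tuples

        def append(self, change):
            seq = len(self.entries) + 1
            self.entries.append((seq, change))
            return seq

        def after(self, last_seen):
            # everything a slave still needs, given its last committed sequence
            return [e for e in self.entries if e[0] > last_seen]

    log = TransactionLog()
    for n in range(150):
        log.append("change-%d" % (n + 1))

    slave_a = log.after(50)    # slave A died at 50, catches up with 51..150
    slave_b = log.after(100)   # slave B was at 100, catches up with 101..150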
possible. It would be easier to implement much simpler master-slave replication, but in error conditions that would almost guarantee that some messages get lost. I want to avoid that.
If you think synchronous replication is doable, go for it. :-) It is a tradeoff of reliability vs. speed, especially as the number of slaves grows and the amount of communication grows with it.
Sure. I wasn't planning on implementing multi-slave or multi-master before the master/slave setup is fully working and stress testing shows that no mails get lost even if the master is killed every few seconds (with each crash causing master and slave to switch roles randomly).
Great idea.
On Mon, 21 May 2007, Francisco Reyes wrote:
I am basically suggesting to log all the changes to a log(s) and have a separate program handle passing on the information to the slaves.
That's the old OpenLDAP way. However, you surprisingly still end up with differences in the database now and then; therefore, some way to do a 100% sync (à la rsync) would be good.
Steffen Kaiser
On Tue, 22 May 2007, Timo Sirainen wrote:
a) Accept that replication process can lose changes, and require a full resync when it gets back up. This of course means that all users' mailboxes need to be scanned, which can be slow.
If you generate (in idle time, or on admin request) a unique ID of the mail, say an MD5 hash, and use something like "IMAP-UID/MD5-hash" pairs as the sync data, it shouldn't be that problematic. Changes to the message data are detected via the ID, the keyword file as plain text, and the per-message keywords via "IMAP-UID/keyword-letters" pairs. Even for large mailboxes, a diff should be possible in reasonable time. However, in multi-master environments you'll need to track expunges, so you know how to resolve conflicts when server A has a message with UID U, but server B does not:
a) server B might simply not have received it yet, or b) a user had already deleted the message on server B.
The same goes for UID clashes. If the MD5 hash differs for a particular UID on two servers, it might be caused by a reset of the UIDs, right?
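As a rough illustration of the "IMAP-UID/MD5-hash" sync data, here's a Python sketch (mailbox layout and names invented) that builds the per-message hashes and diffs two copies of a mailbox:

    import hashlib

    def sync_data(mailbox):
        # mailbox: dict mapping IMAP UID -> raw message bytes (illustrative layout)
        return {uid: hashlib.md5(body).hexdigest() for uid, body in mailbox.items()}

    def diff(master, slave):
        m, s = sync_data(master), sync_data(slave)
        missing_on_slave = sorted(set(m) - set(s))
        missing_on_master = sorted(set(s) - set(m))
        # same UID but different contents: a UID reset or a real conflict
        conflicts = sorted(uid for uid in set(m) & set(s) if m[uid] != s[uid])
        return missing_on_slave, missing_on_master, conflicts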
But again this would mean that the above is done for all mailboxes.
Why not check a "diff" state beforehand?
replication, but in error conditions that would almost guarantee that some messages get lost. I want to avoid that.
But in this case you'll need to wait until replication to the slave(s) is done. That's not possible in the laptop scenario, is it?
By "much simpler replication" I mean asynchronous replication. Perhaps asynchronous could be an option also if it seems that synchronous replication adds too much latency (especially if your replication is simply for an offsite backup), but I'd want synchronous to be the recommended method.
Hmm, wouldn't it be better to have every transaction synchronous by default, then? Though your approach of defaulting to async when a slave is not responding makes this requirement a bit strange. Either-or?
Wouldn't it be better to offer two systems right from the beginning: a synchronous one, where everything is in sync, including keywords and ACLs, and an asynchronous one?
That reminds me of something:
The master sends all changes to a slave, but the slave stores them for later replay. Normally both are in sync and no problem can occur. Now, either in idle time or when a particular mailbox is accessed, the slave applies the changes to the mailbox.
This way you have all the changes on all slaves in case the master's hard disk fails, but the latency is cut down to the plain network transfer. The async mode would "simply" spool the changes locally and hand them to the slave at a later time.
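A tiny Python sketch of that spool-and-replay idea (names invented, not Dovecot code): the slave acknowledges as soon as a change is spooled and only applies it when idle or when the mailbox is opened.

    import collections

    class SpoolingSlave:
        def __init__(self):
            self.spool = collections.defaultdict(list)  # mailbox -> pending changes

        def receive(self, mailbox, change):
            self.spool[mailbox].append(change)          # cheap: just record it
            return "ACK"                                # master can continue immediately

        def replay(self, mailbox, apply_change):
            for change in self.spool.pop(mailbox, []):  # on idle time or on access
                apply_change(mailbox, change)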
Bye,
Steffen Kaiser
This increases communication and locking significantly. The locking alone will likely be a choke point.
My plan would require the locking only when the mailbox is being updated and the global lock isn't already owned by the server. If you want to avoid different servers constantly stealing the lock from each other, use other means to make sure that the mailbox normally isn't modified from more than one server.
I don't think this will be a big problem even if multiple servers are modifying the same mailbox, but it depends entirely on the extra latency caused by the global locking. I don't know what the latency will be until it can be tested, but I don't think it should be much more than what a simple ping would give over the same network.
Best case, when all the nodes and the network are up, locking latency shouldn't be much longer than, say, twice the RTT. But what really matters, and causes all the nasty bugs that even single-master replication systems have to deal with, is the *worst case* latency. So everything is going along fine, and then due to a surge in incoming spam, one of your switches starts dropping 2% of the packets, and the server holding a lock starts taking 50ms instead of 1ms to respond to an incoming packet.
Now your previous lock latency of 1ms could easily extend into seconds if a couple of responses to lock requests don't get through. And your 16 node imap cluster is now 8 times slower than a single server, instead of 8 times faster ;)
The nasty part about this for IMAP is that we can't ever have a UID be handed out without *confirming* that it's been replicated to another server before sending out the packet. Otherwise you can get into the situation where node A sends out a new UID to a client over its public NIC while, in the meantime, its internal NIC melted so the update never got propagated; nodes B, C, and D decide "oops, node A is dead, we are stealing his lock", B takes over the lock and allocates the same UID to a different message, and now the CEO didn't get that notice from the SEC to save all his emails.
Once you decide you want replication, you pretty much have to go all the way to synchronous replication, and now you have a learning curve and complexity issue that's going to be there whether it's dovecot replication, or a cluster filesystem that's doing the dirty work for you.
--
Troy Benjegerdes 'da hozer' hozer@hozed.org
Someone asked me why I work on this free (http://www.fsf.org/philosophy/) software stuff and don't get a real job. Charles Shultz had the best answer:
"Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life." -- Charles Shultz
On Tue, 2007-05-22 at 09:58 -0500, Troy Benjegerdes wrote:
Best case, when all the nodes and the network are up, locking latency shouldn't be much longer than, say, twice the RTT. But what really matters, and causes all the nasty bugs that even single-master replication systems have to deal with, is the *worst case* latency. So everything is going along fine, and then due to a surge in incoming spam, one of your switches starts dropping 2% of the packets, and the server holding a lock starts taking 50ms instead of 1ms to respond to an incoming packet.
Now your previous lock latency of 1ms could easily extend into seconds if a couple of responses to lock requests don't get through. And your 16 node imap cluster is now 8 times slower than a single server, instead of 8 times faster ;)
If you're so worried about that, you could create another internal network just for replication :)
The nasty part about this for IMAP is that we can't ever have a UID be handed out without *confirming* that it's been replicated to another server before sending out the packet. Otherwise you can get into the situation where node A sends out a new UID to a client over its public NIC while, in the meantime, its internal NIC melted so the update never got propagated; nodes B, C, and D decide "oops, node A is dead, we are stealing his lock", B takes over the lock and allocates the same UID to a different message, and now the CEO didn't get that notice from the SEC to save all his emails.
When the servers sync up again they'll notice the duplicated UID and both of the emails will be assigned a new UID to fix the situation. This conflict handling will have to be done in any case.
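Roughly, that conflict handling could look like this Python sketch (storage model invented for illustration): when the same UID maps to different messages on two servers, both messages are given fresh UIDs and end up in both copies of the mailbox.

    def resolve_uid_conflicts(box_a, box_b, next_uid):
        # box_a/box_b: dict mapping UID -> message contents; returns the next free UID
        for uid in sorted(set(box_a) & set(box_b)):
            if box_a[uid] == box_b[uid]:
                continue                                  # same message, no conflict
            msg_a, msg_b = box_a.pop(uid), box_b.pop(uid) # drop the clashing UID
            for msg in (msg_a, msg_b):                    # both mails get new UIDs
                box_a[next_uid] = msg                     # and exist in both mailboxes
                box_b[next_uid] = msg
                next_uid += 1
        return next_uid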
On Tue, Jun 05, 2007 at 09:56:29PM +0300, Timo Sirainen wrote:
On Tue, 2007-05-22 at 09:58 -0500, Troy Benjegerdes wrote:
Best case, when all the nodes and the network are up, locking latency shouldn't be much longer than, say, twice the RTT. But what really matters, and causes all the nasty bugs that even single-master replication systems have to deal with, is the *worst case* latency. So everything is going along fine, and then due to a surge in incoming spam, one of your switches starts dropping 2% of the packets, and the server holding a lock starts taking 50ms instead of 1ms to respond to an incoming packet.
Now your previous lock latency of 1ms could easily extend into seconds if a couple of responses to lock requests don't get through. And your 16 node imap cluster is now 8 times slower than a single server, instead of 8 times faster ;)
If you're so worried about that, you could create another internal network just for replication :)
Things are worse if the internal network for replication is the one that started having errors ;) Your machine is accessible to the world, but you can't reliably communicate to get a lock.
The nasty part about this for IMAP is that we can't ever have a UID be handed out without *confirming* that it's been replicated to another server before sending out the packet. Otherwise you can get into the situation where node A sends out a new UID to a client over its public NIC while, in the meantime, its internal NIC melted so the update never got propagated; nodes B, C, and D decide "oops, node A is dead, we are stealing his lock", B takes over the lock and allocates the same UID to a different message, and now the CEO didn't get that notice from the SEC to save all his emails.
When the servers sync up again they'll notice the duplicated UID and both of the emails will be assigned a new UID to fix the situation. This conflict handling will have to be done in any case.
That sounds like a pretty clean solution, and makes a lot of the things that make replication hard go away.
On Thu, 17 May 2007, Timo Sirainen wrote:
Hello,
OpenLDAP uses another strategy, which is more robust, i.e. it needs less fragile interaction between the servers.
OpenLDAP stores each transaction in a replication log file after it has been processed locally. The repl file is frequently read by another daemon (slurpd) and forwarded to the slaves. If the forward to one particular slave fails, the transaction is placed into a host-specific rejection log file. OpenLDAP relies on the fact that any modification (update, add new, delete) can be expressed in "command" syntax; hence the slave speaks the same protocol as the master.
The biggest advantage is that the transaction has already succeeded on the master and is only replayed to the slaves. So when pushing a message to a slave, you need not fiddle with decreasing UIDs, for instance, in order to perform a partial sync of a known-good-state mailbox. And the transaction is saved in the replay log file in case the master process/host crashes.
I think that if the replication log is replayed quickly - e.g. by "tailing" the file - you can effectively separate the problem of non-responding slaves (and re-replay for slaves that come up later) from the quasi-immediate updates of the slaves. Also, because one replay agent per slave can be used, all interaction with a slave is sequential. You wrote something about avoiding files; what about making the repl log file a socket, so the frontend deals with the IMAP client and forwards the request to the replayer, and is therefore not affected by possibly bad network conditions towards the slaves.
You cannot have OpenLDAP's advantage of using the same protocol (IMAP) towards the slaves, because of some restrictions: you want to have a 100% replica, as I understand it, hence the UIDs et al. need to be equal. So you will probably need to extend the IMAP protocol with:
"APPEND/STORE with UID".
The message will be spooled with the same UID on the slave. As you wrote, it SHOULD NOT happen that the slave fails, but if the operation is impossible due to some wicked out-of-sync state, the slave reports back and requests a full resync. The replay agent would then drop any pending changes for that specific host and mailbox and sync the whole mailbox with the client, probably using something like rsync?
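Just to sketch what an "APPEND with UID" on the slave side might look like (wire format and names invented; not an actual protocol): the master dictates the UID, and the slave asks for a full resync if it can't honour it.

    class ReplicationSlave:
        def __init__(self, store):
            self.store = store                       # mailbox name -> {uid: message}

        def append_with_uid(self, mailbox, uid, message):
            box = self.store.setdefault(mailbox, {})
            if uid in box:                           # wicked out-of-sync state
                return "RESYNC"                      # master drops pending changes and
            box[uid] = message                       # resyncs the whole mailbox instead
            return "OK"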
BTW: It would be good if resyncs could also be initiated on admin request ;-)
For the dial-up situation you mentioned (a laptop with its own server), the replay agent would store any changes until the slave comes up, probably by the slave contacting the master Dovecot process and issuing something like SMTP's ETRN.
When the probability is low that the same mailbox is accessed on different hosts (for shared folders multiple accesses are likely), this method should even work well in multi-master situations. You'd have to run replay agents on all the servers then.
To get the UID issues right when one mailbox is in use on different hosts, you thought about locks. But is this necessary?
If only the UIDs are the problem, then with a method to "mark" a UID as taken across multiple masters, all masters will have the same UID level, not necessarily with the message data already associated with it. Meaning:
Master A is about to APPEND a new message to mailbox M, so it sends all other masters the info "want to take UID U". If the UID is already taken by another master B, B replies "UID taken"; then the mailboxes are out of sync and need a full resync. If a master B receives a request for UID U that it has also requested for itself, masters A and B are ranked, e.g. by IP address, and master B replies either "you may take it" or "I want to take it". In the first case master B re-issues its own request for another UID U2 and marks UID U as taken; otherwise master B marks UID U as taken by itself in mailbox M.
If master A gets the "OK for UID U", it finally allocates it, accepts the message from the IMAP/SMTP client, and places the message into the replay log file.
When master B later gets the transaction "STORE message as UID U" for a UID it has marked as taken but has no message for yet, it accepts the transaction.
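A minimal Python sketch of that UID election (message names and transport are invented): each master answers a peer's "want to take UID U" based on what it has already taken, what it has requested itself, and a fixed ranking such as the IP address.

    def handle_uid_request(my_addr, peer_addr, uid, my_pending, taken):
        # Called on master B when a peer (master A) asks to take `uid` in some mailbox.
        if uid in taken:
            return "UID TAKEN"               # mailboxes are out of sync: full resync
        if uid in my_pending:                # we asked for the same UID ourselves
            if my_addr < peer_addr:          # rank e.g. by IP address: lower wins
                return "I WANT TO TAKE IT"   # the peer must retry with another UID
            my_pending.discard(uid)          # we lost: we'll re-issue with UID U2
        taken.add(uid)                       # reserve the UID for the peer
        return "YOU MAY TAKE IT"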
doesn't make sure that messages themselves aren't lost. If the master's hard disk failed completely and it can't be resynced anymore, the messages saved there are lost. This could be avoided by making the saving wait until the message is saved to slave:
- save mail to disk, and at the same time also send it to slave
- allocate UID(s) and tell to slave what they were, wait for "mail saved" reply from slave before committing the messages to mailbox permanently
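For illustration, a minimal sketch of the synchronous save handshake quoted above (all APIs here are hypothetical, not Dovecot's internals): the UID allocation only becomes permanent once the slave confirms it has saved the mail.

    def save_message(local_store, slave, mailbox, message):
        local_store.write_tmp(mailbox, message)        # save to disk...
        slave.send_mail(mailbox, message)              # ...and send to the slave

        uid = local_store.allocate_uid(mailbox)
        slave.send_uid(mailbox, uid)                   # tell the slave which UID it got

        if slave.wait_for("mail saved", timeout=30):   # block until the slave commits
            local_store.commit(mailbox, uid)           # only now is the save permanent
            return uid
        raise RuntimeError("slave did not confirm; save was not committed")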
Well, this assumes that everything is working perfectly. Surviving a hard disk failure should not be Dovecot's problem, but the underlying storage's (i.e. RAID, SAN), IMHO.
If you want to wait for each transaction until all slaves have given their OK, you'll have problems with the "slave server on a laptop" scenario. Then you'd need to perform a full sync each time.
BTW: There is something called DLM (Distributed Lock Manager); I don't know if this is what you are looking for.
Bye,
Steffen Kaiser
Hi list,
OpenLDAP uses another strategy, which is more robust, i.e. it needs less fragile interaction between the servers.
We have been thinking about replication for a long time. The requirement is to have a backup computing center in a distant location, so replication has to work over a WAN connection (latency!) and must be able to recover from failures. With this in mind we came to the conclusion that the strategy OpenLDAP uses would be the best to adopt, and that it would not be too difficult to implement (we even started experiments which showed it would be feasible). BTW, Oracle's replication mechanism (DataGuard) also works in a similar way, i.e. by transferring the transaction logs to the backup and replaying them there.
Cheers, Jörg
participants (8)
- Aredridel
- Christian Balzer
- Francisco Reyes
- J.Wendland@scan-plus.de
- Jonathan
- Steffen Kaiser
- Timo Sirainen
- Troy Benjegerdes