Re: [Dovecot] Replication plans

6 Jun 2007


On Tue, Jun 05, 2007 at 09:56:29PM +0300, Timo Sirainen wrote:
...
On Tue, 2007-05-22 at 09:58 -0500, Troy Benjegerdes wrote:
...
Best case, when all the nodes and the network are up, locking latency
shouldn't be much longer than, say, twice the RTT. But what really
matters, and what causes all the nasty bugs that even single-master
replication systems have to deal with, is the *worst case* latency. So
everything is going along fine, and then due to a surge in incoming
spam, one of your switches starts dropping 2% of the packets, and the
server holding a lock starts taking 50ms instead of 1ms to respond to an
incoming packet.
Now your previous lock latency of 1ms could easily stretch into seconds if
a couple of responses to lock requests don't get through. And your 16-node
IMAP cluster is now 8 times slower than a single server, instead of
8 times faster ;)
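To put rough numbers on the worst case, here is a back-of-the-envelope model (a sketch, not anything Dovecot implements). It assumes a lock is one request/response exchange, each packet is dropped independently with probability `loss`, and a lost packet costs a full retry `timeout` before the request is resent; the 1 s timeout below is an assumed value, and the function names are hypothetical.

```python
def expected_lock_latency(rtt: float, loss: float, timeout: float) -> float:
    """Mean time to acquire a lock with one request/response exchange.

    Assumes independent packet loss and that a lost request or reply
    costs a full `timeout` before the next attempt.
    """
    success = (1.0 - loss) ** 2  # both the request and the reply must arrive
    expected_failures = (1.0 - success) / success  # mean failures before a success (geometric)
    return rtt + expected_failures * timeout


def prob_wait_at_least(loss: float, retries: int) -> float:
    """Probability that the first `retries` attempts all fail."""
    success = (1.0 - loss) ** 2
    return (1.0 - success) ** retries


if __name__ == "__main__":
    print(f"healthy:  {expected_lock_latency(0.001, 0.0, 1.0) * 1000:.1f} ms")
    print(f"degraded: {expected_lock_latency(0.050, 0.02, 1.0) * 1000:.1f} ms")
    print(f"P(>= 2 timeouts): {prob_wait_at_least(0.02, 2):.5f}")
```

With 2% loss, 50ms responses and a 1 s timeout, the mean only rises to about 91ms, but roughly 1 in 640 lock acquisitions sits through two or more full timeouts (over 2 seconds). On a busy cluster taking many locks per second, that tail is hit constantly.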

If you're so worried about that, you could create another internal
network just for replication :)

Things are worse if the internal network for replication is the one that
started having errors ;) Your machine is accessible to the world, but
you can't reliably communicate to get a lock.
...
...
The nasty part about this for IMAP is that we can't ever hand out a UID
without *confirming* that it's been replicated to another server before
sending out the packet. Otherwise you can get into a situation where
node A sends a new UID to a client out its public NIC while its internal
NIC has melted, so the update never gets propagated. Nodes B, C, and D
decide "oops, node A is dead, we are stealing his lock", B takes over the
lock and allocates the same UID to a different message, and now the CEO
didn't get that notice from the SEC to save all his emails.
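A minimal sketch of the "confirm before replying" rule, assuming a synchronous `replicate()` call that returns True only when the peer has acknowledged the write before some timeout. `Replica`, `Node`, `assign_uid`, `ReplicationError` and `min_acks` are hypothetical names for illustration, not Dovecot code. The UID counter is advanced before replicating, so a UID that may already have reached a peer is never reused; IMAP only requires UIDs to be strictly ascending, not contiguous, so the gap this leaves is harmless.

```python
class ReplicationError(Exception):
    """Raised when too few peers acknowledged a new UID."""


class Replica:
    """Hypothetical in-memory stand-in for a peer node."""

    def __init__(self, reachable: bool = True):
        self.reachable = reachable
        self.uids: dict[tuple[str, int], str] = {}

    def replicate(self, mailbox: str, uid: int, message_id: str) -> bool:
        # True only if the peer stored the mapping and acknowledged it.
        if not self.reachable:
            return False  # e.g. the internal NIC melted
        self.uids[(mailbox, uid)] = message_id
        return True


class Node:
    def __init__(self, replicas: list[Replica], min_acks: int = 1):
        self.replicas = replicas
        self.min_acks = min_acks
        self.next_uid: dict[str, int] = {}

    def assign_uid(self, mailbox: str, message_id: str) -> int:
        uid = self.next_uid.get(mailbox, 1)
        # Consume the UID up front: even if confirmation fails, a peer may
        # already have seen it, so it must never be handed out again.
        self.next_uid[mailbox] = uid + 1
        acks = sum(r.replicate(mailbox, uid, message_id) for r in self.replicas)
        if acks < self.min_acks:
            raise ReplicationError(f"UID {uid} confirmed by {acks}/{self.min_acks} peers")
        return uid  # only now is it safe to send the UID to the client
```

With `min_acks=1`, node A in the scenario above would raise instead of answering the client once its internal NIC is gone, trading availability for never exposing an unconfirmed UID.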

When the servers sync up again they'll notice the duplicated UID and
both of the emails will be assigned a new UID to fix the situation. This
conflict handling will have to be done in any case.
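A sketch of that conflict handling, assuming each node's view of a mailbox is a plain `{uid: message_id}` map (a hypothetical representation; `merge_mailboxes` is an illustrative name). The two conflicting messages are sorted so every node computes the same result independently, and both get fresh UIDs above the highest one seen on either side. A real server would take the next UID from its stored counter rather than from the highest UID still present, since expunged messages leave gaps.

```python
def merge_mailboxes(a: dict[int, str], b: dict[int, str]) -> dict[int, str]:
    """Merge two replicas' UID maps, re-numbering duplicated UIDs."""
    all_uids = set(a) | set(b)
    merged: dict[int, str] = {}
    conflicts: list[str] = []
    for uid in sorted(all_uids):
        in_a, in_b = a.get(uid), b.get(uid)
        if in_a is not None and in_b is not None and in_a != in_b:
            # Same UID, different messages: neither may keep it.
            conflicts.extend(sorted((in_a, in_b)))  # deterministic order on every node
        else:
            merged[uid] = in_a if in_a is not None else in_b
    # New UIDs must be higher than anything either side has handed out.
    next_uid = max(all_uids, default=0) + 1
    for message_id in conflicts:
        merged[next_uid] = message_id
        next_uid += 1
    return merged
```

Because the result does not depend on which side runs the merge, nodes A and B converge on the same UIDs without another round of locking.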

That sounds like a pretty clean solution, and makes a lot of the things
that make replication hard go away.