[Dovecot] restarting director
Hi all, is anyone else having problems restarting the director? Every time I bring down one of the director servers, reboot it, or just restart it for whatever reason, I'm seeing all kinds of problems. Dovecot almost always gives me this error:
Jan 20 22:49:55 imapdirector3 dovecot: director: Error: Director 194.109.26.173:444/right disconnected before handshake finished
It seems the directors can't agree on forming a ring anymore, and this may be leading to problems with clients. I mostly have to resort to bringing down all directors and restarting them all at once, which is not really a workable solution. As an example, last night for a few hours we were getting complaints from customers about being disconnected, and the only obvious error in the log was the one above. It started after one of my colleagues had to restart a director because of some changes in the syslog daemon. After I restarted all directors within a few seconds of each other, all complaints disappeared.
Timo, I know I've asked similar questions before, but the answer just eludes me.
If I have 3 director servers and need to take one down and restart it, what is the proper method to reconnect the ring? In practice, I can't seem to work it out, and I mostly end up with the above error until I just restart them all. Not fun with 20,000 clients connected.
Cor
On Fri, 2011-01-21 at 13:42 -0400, Cor Bosman wrote:
> Hi all, is anyone else having problems restarting the director? Every time I bring down one of the director servers, reboot it, or just restart it for whatever reason, I'm seeing all kinds of problems. Dovecot almost always gives me this error:
> Jan 20 22:49:55 imapdirector3 dovecot: director: Error: Director 194.109.26.173:444/right disconnected before handshake finished
I'm not sure if that itself is a problem..
> It seems the directors can't agree on forming a ring anymore, and this may be leading to problems with clients. I mostly have to resort to bringing down all directors and restarting them all at once, which is not really a workable solution. As an example, last night for a few hours we were getting complaints from customers about being disconnected, and the only obvious error in the log was the one above. It started after one of my colleagues had to restart a director because of some changes in the syslog daemon. After I restarted all directors within a few seconds of each other, all complaints disappeared.
I can take a look at it, but it would help if you were able to reproduce the problem. I'm still lagging a lot behind in emails (=bugfixes)..
On Fri, Jan 21, 2011 at 08:00:08PM +0200, Timo Sirainen wrote:
> On Fri, 2011-01-21 at 19:59 +0200, Timo Sirainen wrote:
> > I can take a look at it, but it would help if you were able to reproduce the problem.
> More clearly: Reliably reproduce this in a test setup :)
Timo & Cor, did you guys ever nail this down? We're looking at migrating to a director config soon, but I'd like to see this resolved first. Anything we can do to help?
-K
On Mon, 2011-02-07 at 17:18 -0800, Kelsey Cummings wrote:
If you're still interested and you can (at least sometimes) reproduce this error:
See if this helps at all:
http://hg.dovecot.org/dovecot-2.0/rev/2b7af3a16521
If not, apply http://hg.dovecot.org/dovecot-2.0/rev/e9139f74c451 and set:
service director {
  executable = director -D
}
Then gather the error/warning/debug logs from all directors around the time when it's not working correctly. Be sure that the error and debug messages go to the same log file so that the message ordering is preserved.
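If the logs end up in one file per director instead, a small script can interleave them by timestamp before they are sent along. The following is only a rough sketch, not part of Dovecot: the names parse_line and merge_logs are made up for illustration, the year has to be supplied because classic syslog timestamps omit it, and it assumes English month abbreviations and reasonably synchronized clocks on all directors. Syslog timestamps have one-second resolution, so the ordering of lines from different hosts within the same second stays ambiguous.

```python
import re
from datetime import datetime

# Matches classic syslog lines such as:
#   Jan 20 22:49:55 imapdirector3 dovecot: director: Error: ...
SYSLOG_RE = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) (.*)$")


def parse_line(line, year):
    """Return (timestamp, host, message), or None if the line does not match."""
    m = SYSLOG_RE.match(line.rstrip("\n"))
    if m is None:
        return None
    stamp, host, message = m.groups()
    # The timestamp has no year, so the caller-supplied one is prepended.
    ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return ts, host, message


def merge_logs(logs, year):
    """Merge several lists of log lines into one list sorted by timestamp.

    sorted() is stable, so lines that share a timestamp keep the order in
    which they were read (file by file, top to bottom).
    """
    entries = []
    for lines in logs:
        for line in lines:
            parsed = parse_line(line, year)
            if parsed is not None:
                entries.append(parsed)
    return sorted(entries, key=lambda entry: entry[0])
```

To use it, read each director's log with readlines(), pass the resulting lists to merge_logs together with the year, and print the tuples in order. Lines that do not look like syslog output are silently skipped, which may or may not be what you want.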
Participants (3):
- Cor Bosman
- Kelsey Cummings
- Timo Sirainen