Re: Auth process sometimes stop responding after upgrade
On Friday, 7 September 2018 at 09:39:23 CEST, Aki Tuomi wrote:
Do you see anything else in logs?
Just the usual proxy messages, like these:
Sep 6 14:45:41 imap-front13 dovecot: imap-login: proxy(alessandro@soldinicarrelli.it): disconnecting 5.169.233.224 (Disconnected by server(0s idle, in=451, out=3556)): user=<alessandro@soldinicarrelli.it>, method=PLAIN, rip=5.169.233.224, lip=80.76.81.92, TLS, session=<uqVq2TJ1VbkFqeng>
Sep 6 14:45:51 imap-front13 dovecot: pop3-login: proxy(prg@pedrini.it): started proxying to 192.168.1.144:110: user=<prg@pedrini.it>, method=DIGEST-MD5, rip=5.34.191.10, lip=80.76.81.92, TLS, session=<WqhRRDN1qw0FIr8K>
The only suspect thing is this:
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.142 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.143 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.219 vhost count changed from 100 to 0
Nothing on the other system logs (e.g. kernel, daemon, syslog, messages ....).
-- *Simone Lazzaris* *Qcom S.p.A.*
On 7 Sep 2018, at 11.00, Simone Lazzaris <s.lazzaris@interactive.eu> wrote:
The only suspect thing is this:
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.142 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.143 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.219 vhost count changed from 100 to 0
Nothing on the other system logs (e.g. kernel, daemon, syslog, messages ....).
Any idea what is changing the vhost count on the backends? Do you have some script running that possibly does change the vhost count for some triggering event?
Sami
On Friday, 7 September 2018 at 10:06:00 CEST, Sami Ketola wrote:
On 7 Sep 2018, at 11.00, Simone Lazzaris <s.lazzaris@interactive.eu> wrote:
The only suspect thing is this:
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.142 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.143 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.219 vhost count changed from 100 to 0
Nothing on the other system logs (e.g. kernel, daemon, syslog, messages ...).

Any idea what is changing the vhost count on the backends? Do you have some script running that possibly does change the vhost count for some triggering event?
Sami
Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is, the backends are NOT unresponsive: they are working as usual.
*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazzaris@qcom.it[1] | www.qcom.it[2]
LinkedIn[3] | Facebook[4]

[1] mailto:simone.lazzaris@qcom.it
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa
On 7 Sep 2018, at 11.25, Simone Lazzaris <simone.lazzaris@qcom.it> wrote:

Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is, the backends are NOT unresponsive: they are working as usual.
If it's this one https://github.com/brandond/poolmon/blob/master/poolmon you are running, it is an old version; the latest version is more compatible with recent Dovecot releases.
The current version in git correctly uses HOST-DOWN and HOST-FLUSH instead of modifying the vhost count.
But even then I still consider poolmon a bit too aggressive in marking hosts down: it issues HOST-DOWN and HOST-FLUSH already after the first scan failure. Maybe there should be some grace period, waiting for a few failed polls within a certain period before acting, in case there is some temporary overload or networking failure.
Sami
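[Editor's note] Sami's grace-period idea can be sketched in a few lines of C. This is a hypothetical illustration, not poolmon's actual code: the threshold, struct, and function names below are invented for the example. The point is simply to only report a backend as down after several consecutive failed polls.

```c
#include <stdbool.h>

/* Hypothetical grace-period logic, NOT poolmon's actual code: only
 * report a backend as down after several consecutive failed polls,
 * so one transient timeout does not trigger HOST-DOWN/HOST-FLUSH. */
#define FAILURES_BEFORE_DOWN 3

struct backend_state {
	int consecutive_failures;
	bool marked_down;
};

/* Record one poll result. Returns true exactly when the failure
 * threshold is first crossed, i.e. when HOST-DOWN should be issued. */
static bool poll_result(struct backend_state *b, bool poll_ok)
{
	if (poll_ok) {
		/* Any success resets the failure streak. */
		b->consecutive_failures = 0;
		b->marked_down = false;
		return false;
	}
	if (++b->consecutive_failures >= FAILURES_BEFORE_DOWN &&
	    !b->marked_down) {
		b->marked_down = true;
		return true;
	}
	return false;
}
```

With three required failures and a poll interval of, say, 10 seconds, a backend would have to stay unreachable for roughly half a minute before being flushed, which covers the temporary-overload case Sami mentions.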
On Friday, 7 September 2018 at 11:20:49 CEST, Sami Ketola wrote:
On 7 Sep 2018, at 11.25, Simone Lazzaris <simone.lazzaris@qcom.it> wrote:

Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is, the backends are NOT unresponsive: they are working as usual.

If it's this one https://github.com/brandond/poolmon/blob/master/poolmon you are running, it is an old version; the latest version is more compatible with recent Dovecot releases.
The current version in git correctly uses HOST-DOWN and HOST-FLUSH instead of modifying the vhost count.
Interesting, I'll surely upgrade the script in the next few days. Thanks for the hint.
But the script is surely not the ultimate cause of the failures: the backends (and the script itself) have been untouched, and working, for many moons.
The only modified entity is dovecot on the frontends.
And even if some backends (3 out of 8, in this very case) are marked as failed, authentication on the frontends should still work, shouldn't it?
I've tried to strace the auth process during the last failure, and this is what I've got:
Process 2539 attached - interrupt to quit
gettimeofday({1536308480, 998803}, NULL) = 0
epoll_wait(15,
After about 60 seconds, I aborted the strace and restarted Dovecot to avoid upsetting customers. Searching for file descriptor #15 in /proc/nnnn/fd I found "anon_inode:[eventpoll]".
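[Editor's note] The /proc lookup described above can also be done programmatically: on Linux, readlink() on /proc/<pid>/fd/<n> yields the descriptor's target, e.g. "anon_inode:[eventpoll]" for an epoll instance or "socket:[inode]" for a socket. A small sketch (the helper name fd_target is invented for this example):

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Resolve what a file descriptor of another (or this) process points
 * at by reading the /proc/<pid>/fd/<fd> symlink. Returns the length
 * of the target string, or -1 on error. */
static ssize_t fd_target(pid_t pid, int fd, char *buf, size_t buflen)
{
	char path[64];
	ssize_t n;

	snprintf(path, sizeof(path), "/proc/%d/fd/%d", (int)pid, fd);
	n = readlink(path, buf, buflen - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';  /* readlink() does not NUL-terminate */
	return n;
}
```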
Try tracing the auth-worker process instead, as sql uses a worker.

---
Aki Tuomi
Dovecot Oy

-------- Original message --------
From: Simone Lazzaris <s.lazzaris@interactive.eu>
Date: 07/09/2018 12:59 (GMT+02:00)
To: Sami Ketola <sami.ketola@dovecot.fi>
Cc: Dovecot Mailing List <dovecot@dovecot.org>, Aki Tuomi <aki.tuomi@dovecot.fi>
Subject: Re: Auth process sometimes stop responding after upgrade
On Friday, 7 September 2018 at 11:20:49 CEST, Sami Ketola wrote:
On 7 Sep 2018, at 11.25, Simone Lazzaris <simone.lazzaris@qcom.it> wrote:

Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is, the backends are NOT unresponsive: they are working as usual.

If it's this one https://github.com/brandond/poolmon/blob/master/poolmon you are running, it is an old version; the latest version is more compatible with recent Dovecot releases.
The current version in git correctly uses HOST-DOWN and HOST-FLUSH instead of modifying the vhost count.
Interesting, I'll surely upgrade the script in the next few days. Thanks for the hint.

But the script is surely not the ultimate cause of the failures: the backends (and the script itself) have been untouched, and working, for many moons. The only modified entity is Dovecot on the frontends.

And even if some backends (3 out of 8, in this very case) are marked as failed, authentication on the frontends should still work, shouldn't it?

I've tried to strace the auth process during the last failure, and this is what I've got:

Process 2539 attached - interrupt to quit
gettimeofday({1536308480, 998803}, NULL) = 0
epoll_wait(15,

After about 60 seconds, I aborted the strace and restarted Dovecot to avoid upsetting customers. Searching for file descriptor #15 in /proc/nnnn/fd I found "anon_inode:[eventpoll]".
--
Simone Lazzaris
Datacenter Manager

Qcom S.p.A.
Via Roggia Vignola, 9 | 24047 Treviglio (BG)
T +39036347905 | D +3903631970352 | M +393938111237
simone.lazzaris@qcom.it | www.qcom.it
Some more information: the issue has just occurred, again on an instance without the "service_count = 0" configuration directive on pop3-login.
I've observed that while the issue is occurring, the director process goes to 100% CPU. I've straced the process. It is seemingly looping:
...
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
... (the same ADD/DEL pair repeats indefinitely) ...
FD 13 is "anon_inode:[eventpoll]"
Killing (with -9) the director process, without stopping or restarting the service, restores the correct behaviour.
*Simone Lazzaris* *Qcom S.p.A.* simone.lazzaris@qcom.it[1] | www.qcom.it[2]
- LinkedIn[3]* | *Facebook[4]* [5]
[1] mailto:simone.lazzaris@qcom.it [2] https://www.qcom.it [3] https://www.linkedin.com/company/qcom-spa [4] http://www.facebook.com/qcomspa [5] https://www.qcom.it/includes/email-banner.gif
On 7 Sep 2018, at 16.50, Simone Lazzaris <s.lazzaris@interactive.eu> wrote:
Some more information: the issue has just occurred, again on an instance without the "service_count = 0" configuration directive on pop3-login.
I've observed that while the issue is occurring, the director process goes to 100% CPU. I've straced the process. It is seemingly looping:
...
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
Nothing else but these epoll_ctl() calls? So it's gone to some loop where it keeps calling io_add() and io_remove().
FD 13 is "anon_inode:[eventpoll]"
What about fd 78? I guess some socket.
Could you also try two more things when it happens again:
ltrace -tt -e '*' -o ltrace.log -p <pid>

(My guess this isn't going to be very useful, but just in case it might be..)
gdb -p <pid>
bt full
quit
Preferably install dovecot-dbg package also so the gdb backtrace output will be better.
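[Editor's note] For illustration: the ADD/DEL pairs in the strace excerpts are exactly what repeatedly arming and disarming the same fd on one epoll instance looks like at the syscall level. The standalone sketch below is not Dovecot code; in Dovecot each pair would come from one io_add()/io_remove() cycle in the ioloop-epoll backend, and a loop that keeps doing this without ever blocking in epoll_wait() spins at 100% CPU.

```c
#include <sys/epoll.h>

/* Re-arm and disarm the same fd over and over on one epoll instance.
 * Each iteration emits the EPOLL_CTL_ADD + EPOLL_CTL_DEL pair seen in
 * the strace output. Returns 0 on success, -1 if any epoll_ctl fails. */
static int spin_add_del(int epfd, int fd, int iterations)
{
	struct epoll_event ev = {
		.events = EPOLLIN | EPOLLPRI | EPOLLERR | EPOLLHUP,
		.data = { .fd = fd },
	};

	for (int i = 0; i < iterations; i++) {
		if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
			return -1;
		if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) < 0)
			return -1;
	}
	return 0;
}
```

Running this under strace reproduces the alternating ADD/DEL trace; adding the same fd twice without a DEL in between would instead fail with EEXIST.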
On 7 Sep 2018, at 19.43, Timo Sirainen <tss@iki.fi> wrote:
On 7 Sep 2018, at 16.50, Simone Lazzaris <s.lazzaris@interactive.eu> wrote:
Some more information: the issue has just occurred, again on an instance without the "service_count = 0" configuration directive on pop3-login.
I've observed that while the issue is occurring, the director process goes to 100% CPU. I've straced the process. It is seemingly looping:
...
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
Nothing else but these epoll_ctl() calls? So it's gone to some loop where it keeps calling io_add() and io_remove().
I'm guessing it's because of doveadm command handling issues, since there's some weirdness in the code. Although I couldn't figure out exactly why it would go into an infinite loop there. But attached is a patch that may fix it, if you're able to test. We haven't noticed such infinite looping in other installations or in automated director stress tests, though.
FD 13 is "anon_inode:[eventpoll]"
What about fd 78? I guess some socket.
Could you also try two more things when it happens again:
ltrace -tt -e '*' -o ltrace.log -p <pid>

(My guess this isn't going to be very useful, but just in case it might be..)
gdb -p <pid>
bt full
quit
Preferably install dovecot-dbg package also so the gdb backtrace output will be better.
These would still be useful to verify whether I'm even on the right track.
On Friday, 7 September 2018 at 19:11:24 CEST, Timo Sirainen wrote:
On 7 Sep 2018, at 19.43, Timo Sirainen <tss@iki.fi> wrote:
On 7 Sep 2018, at 16.50, Simone Lazzaris <s.lazzaris@interactive.eu> wrote:
Some more information: the issue has just occurred, again on an instance without the "service_count = 0" configuration directive on pop3-login.
I've observed that while the issue is occurring, the director process goes to 100% CPU. I've straced the process. It is seemingly looping:
...
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0

Nothing else but these epoll_ctl() calls? So it's gone to some loop where it keeps calling io_add() and io_remove().

I'm guessing it's because of doveadm command handling issues, since there's some weirdness in the code. Although I couldn't figure out exactly why it would go into an infinite loop there. But attached is a patch that may fix it, if you're able to test. We haven't noticed such infinite looping in other installations or in automated director stress tests, though.

FD 13 is "anon_inode:[eventpoll]"
What about fd 78? I guess some socket.
Could you also try two more things when it happens again:
ltrace -tt -e '*' -o ltrace.log -p <pid>

(My guess this isn't going to be very useful, but just in case it might be..)
gdb -p <pid>
bt full
quit
Preferably install dovecot-dbg package also so the gdb backtrace output will be better.

These would still be useful to verify whether I'm even on the right track.
Timo, unfortunately the patch doesn't compile; I've moved the declaration of "conn" one line up to make it work.
In other words, the function becomes like this:

void doveadm_connections_ring_synced(void)
{
	struct doveadm_connection *conn =
		doveadm_ring_sync_pending_connections;

	while (doveadm_ring_sync_pending_connections != NULL &&
	       conn->dir->ring_synced) {
		doveadm_connection_ring_sync_callback_t *callback =
			conn->ring_sync_callback;

		conn->ring_sync_callback = NULL;
		doveadm_connection_ring_sync_list_move(conn);
		doveadm_connection_set_io(conn);
		io_set_pending(conn->io);
		callback(conn);
	}
	if (conn->dir->ring_synced)
		doveadm_connections_continue_reset_cmds();
}
I've tried the patched version but without luck: the director keeps crashing:
Sep 8 14:10:56 imapfront2 dovecot: director: Connecting to 212.183.164.163:9090 (as 212.183.164.162): Initial connection
Sep 8 14:10:56 imapfront2 dovecot: director: Incoming connection from director 212.183.164.161/in
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.163:9090/out): Handshake finished in 0.002 secs (bytes in=61, bytes out=473, 0+0 USERs received, last input 0.000 s ago, last output 0.000 s ago, connected 0.002 s ago, 207 peak output buffer size, 0.000 CPU secs since connected)
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.142 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.143 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.144 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.145 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.216 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.217 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.218 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.219 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Handshake finished in 0.005 secs (bytes in=67951, bytes out=56, 1788+0 USERs received, last input 0.000 s ago, last output 0.005 s ago, connected 0.005 s ago, 0 peak output buffer size, 0.004 CPU secs since connected)
Sep 8 14:10:56 imapfront2 dovecot: pop3-login: Warning: Auth connection closed with 1 pending requests (max 0 secs, pid=22703, EOF)
Sep 8 14:10:56 imapfront2 dovecot: auth: Warning: auth client 22703 disconnected with 1 pending requests: EOF
Sep 8 14:10:56 imapfront2 dovecot: director: Fatal: master: service(director): child 23037 killed with signal 11 (core dumps disabled)
Sep 8 14:10:56 imapfront2 dovecot: pop3-login: Disconnected (auth process communication failure): user=<>, method=DIGEST-MD5, rip=95.253.200.113, lip=80.76.81.122, session=<CLkuBFt1icNf/chx>
Sep 8 14:10:56 imapfront2 dovecot: pop3-login: proxy(paola@cosmocarta.it): started proxying to 192.168.1.142:110: user=<paola@cosmocarta.it>, method=PLAIN, rip=85.32.56.86, lip=80.76.81.122, session=<KbkuBFt1NXVVIDhW>
Sep 8 14:10:56 imapfront2 dovecot: director: Connecting to 212.183.164.163:9090 (as 212.183.164.162): Initial connection
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.163:9090/out): Handshake finished in 0.000 secs (bytes in=61, bytes out=383, 0+0 USERs received, last input 0.000 s ago, last output 0.000 s ago, connected 0.000 s ago, 207 peak output buffer size, 0.000 CPU secs since connected)
Sep 8 14:10:56 imapfront2 dovecot: director: Incoming connection from director 212.183.164.161/in
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.142 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.143 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.144 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.145 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.216 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.217 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.218 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.219 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Handshake finished in 0.005 secs (bytes in=67921, bytes out=56, 1788+0 USERs received, last input 0.000 s ago, last output 0.005 s ago, connected 0.005 s ago, 0 peak output buffer size, 0.004 CPU secs since connected)
I'm thinking of downgrading to version 2.2.30.2, which I use on another cluster without issues.
-- Simone Lazzaris Qcom SpA
On Friday, 7 September 2018 at 18:43:36 CEST, Timo Sirainen wrote:
Hi Timo; it happened again, this time on a "high-performance" instance (i.e., WITH service_count=0).
On 7 Sep 2018, at 16.50, Simone Lazzaris <s.lazzaris@interactive.eu> wrote:
Some more information: the issue has just occurred, again on an instance without the "service_count = 0" configuration directive on pop3-login.

Could you also try two more things when it happens again:
ltrace -tt -e '*' -o ltrace.log -p <pid>

(My guess this isn't going to be very useful, but just in case it might be..)
Done; unfortunately, ltrace.log is empty.
gdb -p <pid>
bt full
quit
Here it is:
root@imapfront2:/usr/local/src/dovecot-2.2.36/src# gdb -p 31635
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 31635
Reading symbols from /usr/local/libexec/dovecot/director...done.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/i386-linux-gnu/i686/cmov/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/libc.so.6
Reading symbols from /lib/i386-linux-gnu/i686/cmov/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/libdl.so.2
Reading symbols from /lib/i386-linux-gnu/i686/cmov/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/librt.so.1
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/i686/cmov/libthread_db.so.1".
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
0xb76fa428 in __kernel_vsyscall ()
(gdb) bt full
#0  0xb76fa428 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb752929e in epoll_ctl () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
No symbol table info available.
#2  0xb7678aa0 in io_loop_handle_add (io=io@entry=0x815db90) at ioloop-epoll.c:106
        ctx = 0x8144e98
        list = 0x814556c
        event = {events = 27, data = {ptr = 0x815c160, fd = 135643488, u32 = 135643488, u64 = 135643488}}
        op = 1
        first = <optimized out>
#3  0xb7676a7d in io_add_file (fd=fd@entry=27, condition=condition@entry=IO_READ, source_filename=source_filename@entry=0x805f7f4 "doveadm-connection.c", source_linenum=source_linenum@entry=1070, callback=callback@entry=0x8057e10 <doveadm_connection_input>, context=context@entry=0x8166398) at ioloop.c:64
        io = 0x815db90
        __FUNCTION__ = "io_add_file"
#4  0xb7676bef in io_add (fd=27, condition=condition@entry=IO_READ, source_filename=source_filename@entry=0x805f7f4 "doveadm-connection.c", source_linenum=source_linenum@entry=1070, callback=callback@entry=0x8057e10 <doveadm_connection_input>, context=context@entry=0x8166398) at ioloop.c:87
        io = <optimized out>
        __FUNCTION__ = "io_add"
#5  0x08057135 in doveadm_connection_set_io (conn=conn@entry=0x8166398) at doveadm-connection.c:1070
No locals.
#6  0x08059528 in doveadm_connections_ring_synced () at doveadm-connection.c:1171
        conn = 0x8166398
        callback = 0x8057f90 <doveadm_connection_cmd_run_synced>
#7  0xb76777b7 in io_loop_handle_timeouts_real (ioloop=0x8138678) at ioloop.c:568
        timeout = 0x8165a38
        item = 0x8165a38
        tv = {tv_sec = 0, tv_usec = 0}
        tv_call = {tv_sec = 1536404874, tv_usec = 999375}
        t_id = 3
#8  io_loop_handle_timeouts (ioloop=ioloop@entry=0x8138678) at ioloop.c:581
        _data_stack_cur_id = 2
#9  0xb7678dd2 in io_loop_handler_run_internal (ioloop=ioloop@entry=0x8138678) at ioloop-epoll.c:196
        ctx = 0x8144e98
        events = 0x0
        event = 0x8138678
        list = <optimized out>
        io = <optimized out>
        tv = {tv_sec = 0, tv_usec = 0}
        events_count = 15
        msecs = 0
        ret = 0
        i = <optimized out>
        j = <optimized out>
        call = <optimized out>
        __FUNCTION__ = "io_loop_handler_run_internal"
#10 0xb7677496 in io_loop_handler_run (ioloop=ioloop@entry=0x8138678) at ioloop.c:649
No locals.
#11 0xb7677658 in io_loop_run (ioloop=0x8138678) at ioloop.c:624
        __FUNCTION__ = "io_loop_run"
#12 0xb75f045e in master_service_run (service=0x81385a8, callback=callback@entry=0x804d360 <client_connected>) at master-service.c:719
No locals.
#13 0x0804cf5e in main (argc=1, argv=0x8138300) at main.c:366
        set_roots = {0x805f680, 0x0}
        test_port = 0
        error = <optimized out>
        debug = false
        c = <optimized out>
(gdb) quit
A debugging session is active.
Inferior 1 [process 31635] will be detached.
Quit anyway? (y or n) y
Preferably install dovecot-dbg package also so the gdb backtrace output will be better.
Well, I've compiled Dovecot from source, so I don't have a package. I'll try your patch on a server or two and see if it solves the issue for them.
-- Simone Lazzaris Qcom SpA
On Friday, 7 September 2018 at 10:06:00 CEST, Sami Ketola wrote:
On 7 Sep 2018, at 11.00, Simone Lazzaris <s.lazzaris@interactive.eu> wrote:
The only suspect thing is this:
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.142 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.143 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.219 vhost count changed from 100 to 0
Nothing on the other system logs (e.g. kernel, daemon, syslog, messages ...).

Any idea what is changing the vhost count on the backends? Do you have some script running that possibly does change the vhost count for some triggering event?
Sami
Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is, the backends are NOT unresponsive: they are working as usual.
To be honest, the configuration that keeps failing (again, right now) has imap-login not configured as I've told you. The configuration you saw was an attempt to overcome the problem, in which I've set pop3-login like this:
service pop3-login {
  executable = imap-login director
  service_count = 0
  vsz_limit = 128 M
}
But the machines that fails are like this:
service pop3-login {
  executable = pop3-login director
}
I've got 4 VMs configured in the ring. The one with service_count=0 has (so far) never failed; the other 3 have failed multiple times (every few hours).
-- *Simone Lazzaris* *Qcom S.p.A.*
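[Editor's note] For context on why the service_count observation matters: in Dovecot, service_count is the number of connections a login process handles before it exits, and 0 means unlimited (the "high-performance" mode in which processes are long-lived and reused). A sketch of the two setups being compared, assuming otherwise default service settings; the two blocks are alternatives, not meant to coexist in one config:

```
# Default-style setup (short-lived login processes), as on the
# 3 failing VMs:
service pop3-login {
  executable = pop3-login director
}

# "High-performance" setup (one long-lived process serves many
# connections), as on the one VM that never failed:
service pop3-login {
  executable = pop3-login director
  service_count = 0
  vsz_limit = 128 M
}
```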
participants (5)
- Aki Tuomi
- Sami Ketola
- Simone Lazzaris
- Simone Lazzaris
- Timo Sirainen