Re: Auth process sometimes stop responding after upgrade
On Friday 7 September 2018 09:39:23 CEST, Aki Tuomi wrote:
Do you see anything else in logs?
Just the usual proxy messages, like these:
Sep 6 14:45:41 imap-front13 dovecot: imap-login: proxy(alessandro@soldinicarrelli.it): disconnecting 5.169.233.224 (Disconnected by server(0s idle, in=451, out=3556)): user=alessandro@soldinicarrelli.it, method=PLAIN, rip=5.169.233.224, lip=80.76.81.92, TLS, session=<uqVq2TJ1VbkFqeng>
Sep 6 14:45:51 imap-front13 dovecot: pop3-login: proxy(prg@pedrini.it): started proxying to 192.168.1.144:110: user=prg@pedrini.it, method=DIGEST-MD5, rip=5.34.191.10, lip=80.76.81.92, TLS, session=<WqhRRDN1qw0FIr8K>
The only suspect thing is this:
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.142 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.143 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.219 vhost count changed from 100 to 0
Nothing on the other system logs (e.g. kernel, daemon, syslog, messages ....).
-- Simone Lazzaris, Qcom S.p.A.
On 7 Sep 2018, at 11.00, Simone Lazzaris s.lazzaris@interactive.eu wrote:
The only suspect thing is this:
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.142 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.143 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.219 vhost count changed from 100 to 0
Nothing on the other system logs (e.g. kernel, daemon, syslog, messages ....).
Any idea what is changing the vhost count on the backends? Do you have some script running that possibly does change the vhost count for some triggering event?
Sami
On Friday 7 September 2018 10:06:00 CEST, Sami Ketola wrote:
On 7 Sep 2018, at 11.00, Simone Lazzaris s.lazzaris@interactive.eu wrote:
The only suspect thing is this:
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.142 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.143 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.219 vhost count changed from 100 to 0
Nothing on the other system logs (e.g. kernel, daemon, syslog, messages ....).
Any idea what is changing the vhost count on the backends? Do you have some script running that possibly does change the vhost count for some triggering event?
Sami
Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is, the backends are NOT unresponsive, they are working as usual.
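(For reference, the host list and vhost counts can be inspected and restored by hand with doveadm on a director node; this is just a sketch, with the host address taken from the log above:)

  # show backends, their vhost counts and current user counts
  doveadm director status

  # put a backend's vhost count back if something dropped it to 0
  doveadm director update 192.168.1.142 100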
Simone Lazzaris, Qcom S.p.A. | simone.lazzaris@qcom.it | www.qcom.it
On 7 Sep 2018, at 11.25, Simone Lazzaris simone.lazzaris@qcom.it wrote:
Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is, the backends are NOT unresponsive, they are working as usual.
If it's this one, https://github.com/brandond/poolmon/blob/master/poolmon, you are running an old version of it, as the latest version is more compatible with recent Dovecot releases.
The current version in git correctly uses HOST-DOWN and HOST-FLUSH instead of modifying the vhost count.
But even then I still consider poolmon a bit too aggressive in marking hosts down. It does HOST-DOWN and HOST-FLUSH already after the first scan failure. Maybe there should be some grace period, waiting for a few failed polls within a certain period before doing it, in case there is some temporary overload or networking failure.
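(For comparison, the doveadm equivalents of those admin commands look roughly like this; exact availability depends on the Dovecot version:)

  # take a backend out of rotation without touching its vhost count
  doveadm director down 192.168.1.142

  # forget the user assignments it currently holds
  doveadm director flush 192.168.1.142

  # put it back once it responds again
  doveadm director up 192.168.1.142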
Sami
On Friday 7 September 2018 11:20:49 CEST, Sami Ketola wrote:
On 7 Sep 2018, at 11.25, Simone Lazzaris simone.lazzaris@qcom.it wrote:
Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is, the backends are NOT unresponsive, they are working as usual.
If it's this one, https://github.com/brandond/poolmon/blob/master/poolmon, you are running an old version of it, as the latest version is more compatible with recent Dovecot releases.
The current version in git correctly uses HOST-DOWN and HOST-FLUSH instead of modifying the vhost count.
Interesting, I'll surely upgrade the script in the next few days. Thanks for the hint.
But the script is surely not the ultimate cause of the failures: the backends (and the script itself) have been untouched, and working, for many moons.
The only modified entity is dovecot on the frontends.
And even if some backends (3 out of 8, in this very case) are marked as failed, authentication on the frontends should still work, shouldn't it?
I've tried to strace the auth process during the last failure, and this is what I've got:
Process 2539 attached - interrupt to quit
gettimeofday({1536308480, 998803}, NULL) = 0
epoll_wait(15,
After about 60 seconds, I've aborted the strace and restarted dovecot to avoid upsetting customers. Searching for file descriptor #15 in /proc/nnnn/fd, I found "anon_inode:[eventpoll]"
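(For the record, resolving the descriptor can be done like this, with 2539 being the auth pid from the strace above:)

  # what does file descriptor 15 point to?
  ls -l /proc/2539/fd/15

  # or list every open descriptor of the process with its target
  lsof -p 2539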
Simone Lazzaris, Qcom S.p.A. | simone.lazzaris@qcom.it | www.qcom.it
Try tracing the auth-worker process instead, as SQL uses a worker process.
---
Aki Tuomi
Dovecot Oy

-------- Original message --------
From: Simone Lazzaris s.lazzaris@interactive.eu
Date: 07/09/2018 12:59 (GMT+02:00)
To: Sami Ketola sami.ketola@dovecot.fi
Cc: Dovecot Mailing List dovecot@dovecot.org, Aki Tuomi aki.tuomi@dovecot.fi
Subject: Re: Auth process sometimes stop responding after upgrade
On Friday 7 September 2018 11:20:49 CEST, Sami Ketola wrote:
On 7 Sep 2018, at 11.25, Simone Lazzaris simone.lazzaris@qcom.it wrote:
Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is, the backends are NOT unresponsive, they are working as usual.
If it's this one, https://github.com/brandond/poolmon/blob/master/poolmon, you are running an old version of it, as the latest version is more compatible with recent Dovecot releases.
The current version in git correctly uses HOST-DOWN and HOST-FLUSH instead of modifying the vhost count.
Interesting, I'll surely upgrade the script in the next few days. Thanks for the hint.
But the script is surely not the ultimate cause of the failures: the backends (and the script itself) have been untouched, and working, for many moons. The only modified entity is dovecot on the frontends.
And even if some backends (3 out of 8, in this very case) are marked as failed, authentication on the frontends should still work, shouldn't it?
I've tried to strace the auth process during the last failure, and this is what I've got:
Process 2539 attached - interrupt to quit
gettimeofday({1536308480, 998803}, NULL) = 0
epoll_wait(15,
After about 60 seconds, I've aborted the strace and restarted dovecot to avoid upsetting customers. Searching for file descriptor #15 in /proc/nnnn/fd, I found "anon_inode:[eventpoll]"
-- Simone Lazzaris, Datacenter Manager
Qcom S.p.A. Via Roggia Vignola, 9 | 24047 Treviglio (BG) T +39036347905 | D +3903631970352| M +393938111237 simone.lazzaris@qcom.it | www.qcom.it
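(Following Aki's suggestion above, a rough way to find and attach to an auth worker; the process titles below are just what they usually look like in ps and may differ per setup:)

  # auth workers normally show up alongside the main dovecot/auth process
  ps ax | grep '[d]ovecot/auth'

  # attach to the worker pid reported above
  strace -tt -p <worker_pid>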
Some more information: the issue has just occurred, again on an instance without the "service_count = 0" configuration directive on pop3-login.
I've observed that while the issue is occurring, the director process goes to 100% CPU. I've straced the process. It is seemingly looping:
...
...
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
[the same EPOLL_CTL_ADD / EPOLL_CTL_DEL pair repeats over and over]
...
...
...
FD 13 is "anon_inode:[eventpoll]"
Killing (with -9) the director process, without stopping or restarting the service, restores the correct behaviour.
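(For completeness, the workaround boils down to something like this; the dovecot master respawns the director child on its own:)

  # find the looping director child
  ps ax | grep '[d]ovecot/director'

  # kill only that child; dovecot's master restarts it automatically
  kill -9 <director_pid>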
Simone Lazzaris, Qcom S.p.A. | simone.lazzaris@qcom.it | www.qcom.it
On 7 Sep 2018, at 16.50, Simone Lazzaris s.lazzaris@interactive.eu wrote:
Some more information: the issue has just occurred, again on an instance without the "service_count = 0" configuration directive on pop3-login.
I've observed that while the issue is occurring, the director process goes to 100% CPU. I've straced the process. It is seemingly looping:
...
...
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
Nothing else but these epoll_ctl() calls? So it's gone to some loop where it keeps calling io_add() and io_remove().
FD 13 is "anon_inode:[eventpoll]"
What about fd 78? I guess some socket.
Could you also try two more things when it happens again:
ltrace -tt -e '*' -o ltrace.log -p <pid>
(My guess is this isn't going to be very useful, but just in case it might be..)
gdb -p <pid>
bt full
quit
Preferably install dovecot-dbg package also so the gdb backtrace output will be better.
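(If an interactive gdb session is inconvenient on a production frontend, the same backtrace can be grabbed non-interactively, for example:)

  gdb -batch -ex 'bt full' -p <pid> > gdb-bt.log 2>&1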
On 7 Sep 2018, at 19.43, Timo Sirainen tss@iki.fi wrote:
On 7 Sep 2018, at 16.50, Simone Lazzaris s.lazzaris@interactive.eu wrote:
Some more information: the issue has just occurred, again on an instance without the "service_count = 0" configuration directive on pop3-login.
I've observed that while the issue is occurring, the director process goes to 100% CPU. I've straced the process. It is seemingly looping:
...
...
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
Nothing else but these epoll_ctl() calls? So it's gone to some loop where it keeps calling io_add() and io_remove().
I'm guessing it's because of doveadm command handling issues, since there's some weirdness in the code. Although I couldn't figure out exactly why it would go into an infinite loop there. But attached is a patch that may fix it, if you're able to test. We haven't noticed such infinite looping in other installations or automated director stress tests though..
FD 13 is "anon_inode:[eventpoll]"
What about fd 78? I guess some socket.
Could you also try two more things when it happens again:
ltrace -tt -e '*' -o ltrace.log -p <pid>
(My guess is this isn't going to be very useful, but just in case it might be..)
gdb -p <pid>
bt full
quit
Preferably install dovecot-dbg package also so the gdb backtrace output will be better.
These would still be useful to verify whether I'm even on the right track.
On Friday 7 September 2018 19:11:24 CEST, Timo Sirainen wrote:
On 7 Sep 2018, at 19.43, Timo Sirainen tss@iki.fi wrote:
On 7 Sep 2018, at 16.50, Simone Lazzaris s.lazzaris@interactive.eu wrote:
Some more information: the issue has just occurred, again on an instance without the "service_count = 0" configuration directive on pop3-login.
I've observed that while the issue is occurring, the director process goes to 100% CPU. I've straced the process. It is seemingly looping:
...
...
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0
epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0
Nothing else but these epoll_ctl() calls? So it's gone to some loop where it keeps calling io_add() and io_remove(). I'm guessing it's because of doveadm command handling issues, since there's some weirdness in the code. Although I couldn't figure out exactly why it would go into an infinite loop there. But attached is a patch that may fix it, if you're able to test. We haven't noticed such infinite looping in other installations or automated director stress tests though..
FD 13 is "anon_inode:[eventpoll]"
What about fd 78? I guess some socket.
Could you also try two more things when it happens again:
ltrace -tt -e '*' -o ltrace.log -p <pid>
(My guess is this isn't going to be very useful, but just in case it might be..)
gdb -p <pid>
bt full
quit
Preferably install dovecot-dbg package also so the gdb backtrace output will be better.
These would still be useful to verify whether I'm even on the right track.
Timo, unfortunately the patch doesn't compile; I've moved the declaration of "conn" one line up to make it work.
In other words, the function becomes like this:

void doveadm_connections_ring_synced(void)
{
    struct doveadm_connection *conn = doveadm_ring_sync_pending_connections;

    while (doveadm_ring_sync_pending_connections != NULL &&
           conn->dir->ring_synced) {
        doveadm_connection_ring_sync_callback_t *callback =
            conn->ring_sync_callback;

        conn->ring_sync_callback = NULL;
        doveadm_connection_ring_sync_list_move(conn);
        doveadm_connection_set_io(conn);
        io_set_pending(conn->io);
        callback(conn);
    }
    if (conn->dir->ring_synced)
        doveadm_connections_continue_reset_cmds();
}
I've tried the patched version but without luck: the director keeps crashing:
Sep 8 14:10:56 imapfront2 dovecot: director: Connecting to 212.183.164.163:9090 (as 212.183.164.162): Initial connection
Sep 8 14:10:56 imapfront2 dovecot: director: Incoming connection from director 212.183.164.161/in
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.163:9090/out): Handshake finished in 0.002 secs (bytes in=61, bytes out=473, 0+0 USERs received, last input 0.000 s ago, last output 0.000 s ago, connected 0.002 s ago, 207 peak output buffer size, 0.000 CPU secs since connected)
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.142 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.143 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.144 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.145 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.216 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.217 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.218 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Host 192.168.1.219 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): Handshake finished in 0.005 secs (bytes in=67951, bytes out=56, 1788+0 USERs received, last input 0.000 s ago, last output 0.005 s ago, connected 0.005 s ago, 0 peak output buffer size, 0.004 CPU secs since connected)
Sep 8 14:10:56 imapfront2 dovecot: pop3-login: Warning: Auth connection closed with 1 pending requests (max 0 secs, pid=22703, EOF)
Sep 8 14:10:56 imapfront2 dovecot: auth: Warning: auth client 22703 disconnected with 1 pending requests: EOF
Sep 8 14:10:56 imapfront2 dovecot: director: Fatal: master: service(director): child 23037 killed with signal 11 (core dumps disabled)
Sep 8 14:10:56 imapfront2 dovecot: pop3-login: Disconnected (auth process communication failure): user=<>, method=DIGEST-MD5, rip=95.253.200.113, lip=80.76.81.122, session=
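(Since the log says core dumps are disabled, it may be worth enabling them before the next crash so the SIGSEGV can be inspected; roughly, and assuming /var/core exists and is writable:)

  # allow cores from processes that changed uid (dovecot drops privileges)
  sysctl -w fs.suid_dumpable=2

  # write cores somewhere predictable
  sysctl -w kernel.core_pattern=/var/core/core.%e.%p

  # raise the limit in the shell or init script that starts dovecot
  ulimit -c unlimited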
I'm thinking of downgrading to version 2.2.30.2, which I use on another cluster without issues.
-- Simone Lazzaris Qcom SpA
On Friday 7 September 2018 18:43:36 CEST, Timo Sirainen wrote:
Hi Timo; it happened again, this time on a "high-performance" instance (i.e., WITH service_count=0).
On 7 Sep 2018, at 16.50, Simone Lazzaris s.lazzaris@interactive.eu wrote:
Some more information: the issue has just occurred, again on an instance without the "service_count = 0" configuration directive on pop3-login.
Could you also try two more things when it happens again:
ltrace -tt -e '*' -o ltrace.log -p <pid>
(My guess is this isn't going to be very useful, but just in case it might be..)
Done; unfortunately, ltrace.log is empty.
gdb -p <pid>
bt full
quit
Here it is:
root@imapfront2:/usr/local/src/dovecot-2.2.36/src# gdb -p 31635
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Attaching to process 31635
Reading symbols from /usr/local/libexec/dovecot/director...done.
Reading symbols from /usr/local/lib/dovecot/libdovecot.so.0...done.
Loaded symbols for /usr/local/lib/dovecot/libdovecot.so.0
Reading symbols from /lib/i386-linux-gnu/i686/cmov/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/libc.so.6
Reading symbols from /lib/i386-linux-gnu/i686/cmov/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/libdl.so.2
Reading symbols from /lib/i386-linux-gnu/i686/cmov/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/librt.so.1
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/i386-linux-gnu/i686/cmov/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/i686/cmov/libthread_db.so.1".
Loaded symbols for /lib/i386-linux-gnu/i686/cmov/libpthread.so.0
0xb76fa428 in __kernel_vsyscall ()
(gdb) bt full
#0 0xb76fa428 in __kernel_vsyscall ()
No symbol table info available.
#1 0xb752929e in epoll_ctl () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
No symbol table info available.
#2 0xb7678aa0 in io_loop_handle_add (io=io@entry=0x815db90) at ioloop-epoll.c:106
        ctx = 0x8144e98
        list = 0x814556c
        event = {events = 27, data = {ptr = 0x815c160, fd = 135643488, u32 = 135643488, u64 = 135643488}}
        op = 1
        first = <optimized out>
#3 0xb7676a7d in io_add_file (fd=fd@entry=27, condition=condition@entry=IO_READ, source_filename=source_filename@entry=0x805f7f4 "doveadm-connection.c", source_linenum=source_linenum@entry=1070, callback=callback@entry=0x8057e10
Inferior 1 [process 31635] will be detached.
Quit anyway? (y or n) y
Preferably install dovecot-dbg package also so the gdb backtrace output will be better.
Well, I've compiled dovecot from the source, so I don't have a package. I'll try your patch on a server or two, and see if it solves the issues for them.
-- Simone Lazzaris Qcom SpA
On Friday 7 September 2018 10:06:00 CEST, Sami Ketola wrote:
On 7 Sep 2018, at 11.00, Simone Lazzaris s.lazzaris@interactive.eu wrote:
The only suspect thing is this:
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.142 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.143 vhost count changed from 100 to 0
Sep 6 14:45:41 imap-front13 dovecot: director: doveadm: Host 192.168.1.219 vhost count changed from 100 to 0
Nothing on the other system logs (e.g. kernel, daemon, syslog, messages ....).
Any idea what is changing the vhost count on the backends? Do you have some script running that possibly does change the vhost count for some triggering event?
Sami
Actually, I have a poolmon script running that should drop the vhost count for unresponsive backends; the strange thing is, the backends are NOT unresponsive, they are working as usual.
To be honest, the configuration that keeps failing (again, right now) does not have imap-login configured as I've told you. The configuration you saw was an attempt to overcome the problem, in which I've set pop3-login like this:
service pop3-login {
  executable = imap-login director
  service_count = 0
  vsz_limit = 128 M
}
But the machines that fail are like this:
service pop3-login {
  executable = pop3-login director
}
I've got 4 VMs configured in the ring. The one with service_count=0 has (so far) never failed; the other 3 have failed multiple times (every few hours).
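(To double-check which mode each frontend is actually running, the effective settings can be dumped like this; adjust the grep context to taste:)

  doveconf -n | grep -B1 -A6 'pop3-login'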
-- Simone Lazzaris, Qcom S.p.A.
participants (5)
- Aki Tuomi
- Sami Ketola
- Simone Lazzaris
- Simone Lazzaris
- Timo Sirainen