In data venerdì 7 settembre 2018 19:11:24 CEST, Timo Sirainen ha scritto:
On 7 Sep 2018, at 19.43, Timo Sirainen tss@iki.fi wrote:
On 7 Sep 2018, at 16.50, Simone Lazzaris
mailto:s.lazzaris@interactive.eu> wrote: Some more information: the issue has just occurred, again on an instance without the "service_count = 0" configuration directive on pop3-login.
I've observed that while the issue is occurring, the director process goes 100% CPU. I've straced the process. It is seemingly looping:
... ... epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}}) = 0> Nothing else but these epoll_ctl() calls? So it's gone to some loop where it keeps calling io_add() and io_remove(). I'm guessing it's because of doveadm command handling issues, since there's some weirdness in the code. Although I couldn't figure out exactly why it would go to infinite loop there. But attached a patch that may fix it, if you're able to test. We haven't noticed such infinite looping in other installations or automated director stresstests though.. FD 13 is "anon_inode:[eventpoll]"
What about fd 78? I guess some socket.
Could you also try two more things when it happens again:
ltrace -tt -e '*' -o ltrace.log -p <pid> (My guess this isn't going to be very useful, but just in case it might be..)
gdb -p <pid> bt full quit
Preferably install dovecot-dbg package also so the gdb backtrace output will be better. These would still be useful to verify whether I'm even on the right track.
Timo, unfortunately the patch doesn't compile; I've moved the declaration of "conn" one line up to make it work.
In other words, the function becomes like this: void doveadm_connections_ring_synced(void) { struct doveadm_connection *conn = doveadm_ring_sync_pending_connections; while (doveadm_ring_sync_pending_connections != NULL && conn->dir->ring_synced) { doveadm_connection_ring_sync_callback_t *callback = conn->ring_sync_callback;
conn->ring_sync_callback = NULL;
doveadm_connection_ring_sync_list_move(conn);
doveadm_connection_set_io(conn);
io_set_pending(conn->io);
callback(conn);
}
if (conn->dir->ring_synced)
doveadm_connections_continue_reset_cmds();
}
I've tried the patched version but without luck: the director keeps crashing:
Sep 8 14:10:56 imapfront2 dovecot: director: Connecting to
212.183.164.163:9090 (as 212.183.164.162): Initial connection
Sep 8 14:10:56 imapfront2 dovecot: director: Incoming connection from
director 212.183.164.161/in
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.163:9090/
out): Handshake finished in 0.002 secs (bytes in=61, bytes out=473, 0+0 USERs
received, last input 0.000 s ago, last output 0.000 s ago, connected 0.002 s
ago, 207 peak output buffer size, 0.000 CPU secs since connected)
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in):
Host 192.168.1.142 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in):
Host 192.168.1.143 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in):
Host 192.168.1.144 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in):
Host 192.168.1.145 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in):
Host 192.168.1.216 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in):
Host 192.168.1.217 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in):
Host 192.168.1.218 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in):
Host 192.168.1.219 vhost count changed from 100 to 100
Sep 8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in):
Handshake finished in 0.005 secs (bytes in=67951, bytes out=56, 1788+0 USERs
received, last input 0.000 s ago, last output 0.005 s ago, connected 0.005 s
ago, 0 peak output buffer size, 0.004 CPU secs since connected)
Sep 8 14:10:56 imapfront2 dovecot: pop3-login: Warning: Auth connection
closed with 1 pending requests (max 0 secs, pid=22703, EOF)
Sep 8 14:10:56 imapfront2 dovecot: auth: Warning: auth client 22703
disconnected with 1 pending requests: EOF
Sep 8 14:10:56 imapfront2 dovecot: director: Fatal: master:
service(director): child 23037 killed with signal 11 (core dumps disabled)
Sep 8 14:10:56 imapfront2 dovecot: pop3-login: Disconnected (auth process
communication failure): user=<>, method=DIGEST-MD5, rip=95.253.200.113,
lip=80.76.81.122, session=
I'm thinking to downgrade to version 2.2.30.2, which I use on another cluster without issues.
-- Simone Lazzaris Qcom SpA