On 26.04.2018 11:15, Jan-Pieter Cornet wrote:
Occasionally, all of our directors get stuck in an endless loop. Likely triggered by some network event, but I haven't been able to isolate exactly what triggers it, and am unable to reproduce at the moment. When this happens, it makes the dovecot/director process on all directors unresponsive, making login impossible.
Restarting dovecot "fixes" it.
Here's the info collected from the stuck process. I hope that a dovecot dev could use this to either fix a bug in the async IO code or point to a possible external source that could have triggered this. Would it help to compile dovecot with -g and run a version with debugging info, to be able to inspect more when the crash happens again?
"strace" showed this endless loop:
09:27:52.837960 epoll_ctl(14, EPOLL_CTL_DEL, 213, 0x7ffe8e642cdc) = 0 09:27:52.837993 epoll_ctl(14, EPOLL_CTL_ADD, 213, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=3317538000, u64=94320799358160}}) = 0 09:27:52.838035 epoll_ctl(14, EPOLL_CTL_DEL, 213, 0x7ffe8e642cdc) = 0 09:27:52.838070 epoll_ctl(14, EPOLL_CTL_ADD, 213, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=3317538000, u64=94320799358160}}) = 0
"ltrace" showed the same loop as:
10:38:01.773867 timeout_add(0x7530, 0x55c8c351df75, 965, 0x55c8c3515a00) = 0x55c8c5c658b0 10:38:01.774060 timeout_remove(0x55c8c5acf7a8, 0, 0xffffffff, 0) = 0 10:38:01.774239 io_add(213, 1, 0x55c8c351df75, 1070) = 0x55c8c5c658b0 10:38:01.774429 io_set_pending(0x55c8c5c658b0, 1, 0x55c8c54b8aa0, 0x7fb4ab11b06a) = 0x55c8c54b8aa0 10:38:01.774598 p_strarray_dup(0x7fb4ab703660, 0x55c8c5786e00, 5, 0x55c8c3723740) = 0x55c8c56d7530 10:38:01.774771 io_remove(0x55c8c5acf788, 0x55c8c3516d60, 5, 48) = 0 10:38:01.774957 timeout_add(0x7530, 0x55c8c351df75, 965, 0x55c8c3515a00) = 0x55c8c5c658b0 10:38:01.775129 timeout_remove(0x55c8c5acf7a8, 0, 0xffffffff, 0) = 0 10:38:01.775301 io_add(213, 1, 0x55c8c351df75, 1070) = 0x55c8c5c658b0 10:38:01.775492 io_set_pending(0x55c8c5c658b0, 1, 0x55c8c54b8aa0, 0x7fb4ab11b06a) = 0x55c8c54b8aa0 10:38:01.775662 p_strarray_dup(0x7fb4ab703660, 0x55c8c56d7530, 5, 0x55c8c3723740) = 0x55c8c5786e00 10:38:01.775833 io_remove(0x55c8c5acf788, 0x55c8c3516d60, 5, 48) = 0
That fd 213 is the next-highest FD opened, and is bound to /var/run/dovecot/director-admin. In fact the last few FDs are all connections to director-admin. Last few lines of "lsof" say:
director 16109 dovecot 207u unix 0xffff940d13c93800 0t0 1106341075 type=STREAM director 16109 dovecot 208u unix 0xffff9409f5280000 0t0 1106337217 /var/run/dovecot/director-userdb type=STREAM director 16109 dovecot 209u unix 0xffff940d580fa000 0t0 1106337218 type=STREAM director 16109 dovecot 210u unix 0xffff940ac0f61000 0t0 1106343439 /var/run/dovecot/director-userdb type=STREAM director 16109 dovecot 211u unix 0xffff940bed105400 0t0 1106343440 type=STREAM director 16109 dovecot 212u unix 0xffff940b76d31c00 0t0 1106342408 /var/run/dovecot/director-admin type=STREAM director 16109 dovecot 213u unix 0xffff940ab241d000 0t0 1106340510 /var/run/dovecot/director-admin type=STREAM director 16109 dovecot 214u unix 0xffff9409c317ec00 0t0 1106340511 /var/run/dovecot/director-admin type=STREAM
I attached with "gdb" and did "bt" at two different moments, resulting in:
(gdb) bt #0 0x00007fb4ab11b06a in epoll_ctl () from target:/lib/x86_64-linux-gnu/libc.so.6 #1 0x00007fb4ab48964d in io_loop_handle_add () from target:/usr/lib/dovecot/libdovecot.so.0 #2 0x00007fb4ab4876b9 in ?? () from target:/usr/lib/dovecot/libdovecot.so.0 #3 0x00007fb4ab48776d in io_add () from target:/usr/lib/dovecot/libdovecot.so.0 #4 0x000055c8c35179d3 in doveadm_connections_ring_synced () #5 0x00007fb4ab488166 in io_loop_handle_timeouts () from target:/usr/lib/dovecot/libdovecot.so.0 #6 0x00007fb4ab489963 in io_loop_handler_run_internal () from target:/usr/lib/dovecot/libdovecot.so.0 #7 0x00007fb4ab48839c in io_loop_handler_run () from target:/usr/lib/dovecot/libdovecot.so.0 #8 0x00007fb4ab488548 in io_loop_run () from target:/usr/lib/dovecot/libdovecot.so.0 #9 0x00007fb4ab40c0a3 in master_service_run () from target:/usr/lib/dovecot/libdovecot.so.0 #10 0x000055c8c350b46d in main () (gdb) c Continuing. ^C Program received signal SIGINT, Interrupt. 0x00007fb4ab11b06a in epoll_ctl () from target:/lib/x86_64-linux-gnu/libc.so.6 (gdb) bt #0 0x00007fb4ab11b06a in epoll_ctl () from target:/lib/x86_64-linux-gnu/libc.so.6 #1 0x00007fb4ab489873 in io_loop_handle_remove () from target:/usr/lib/dovecot/libdovecot.so.0 #2 0x00007fb4ab487adf in ?? () from target:/usr/lib/dovecot/libdovecot.so.0 #3 0x000055c8c35158b7 in ?? () #4 0x000055c8c3516b4a in ?? () #5 0x000055c8c3516d78 in ?? () #6 0x000055c8c35179e4 in doveadm_connections_ring_synced () #7 0x00007fb4ab488166 in io_loop_handle_timeouts () from target:/usr/lib/dovecot/libdovecot.so.0 #8 0x00007fb4ab489963 in io_loop_handler_run_internal () from target:/usr/lib/dovecot/libdovecot.so.0 #9 0x00007fb4ab48839c in io_loop_handler_run () from target:/usr/lib/dovecot/libdovecot.so.0 #10 0x00007fb4ab488548 in io_loop_run () from target:/usr/lib/dovecot/libdovecot.so.0 #11 0x00007fb4ab40c0a3 in master_service_run () from target:/usr/lib/dovecot/libdovecot.so.0 #12 0x000055c8c350b46d in main ()
For completeness, doveconf -n output attached.
Hi!
Thank you for reporting this, we'll look into it.
Aki