director stuck in inifite loop on 2.2.35
Aki Tuomi
aki.tuomi at dovecot.fi
Thu Apr 26 11:35:59 EEST 2018
On 26.04.2018 11:15, Jan-Pieter Cornet wrote:
> Occasionally, all of our directors get stuck in an endless loop.
> Likely triggered by some network event, but I haven't been able to
> isolate exactly what triggers it, and am unable to reproduce at the
> moment. When this happens, it makes the dovecot/director process on
> all directors unresponsive, making login impossible.
>
> Restarting dovecot "fixes" it.
>
> Here's the info collected from the stuck process. I hope that a
> dovecot dev could use this to either fix a bug in the async IO code or
> point to a possible external source that could have triggered this.
> Would it help to compile dovecot with -g and run a version with
> debugging info, to be able to inspect more when the crash happens again?
>
> "strace" showed this endless loop:
>
> 09:27:52.837960 epoll_ctl(14, EPOLL_CTL_DEL, 213, 0x7ffe8e642cdc) = 0
> 09:27:52.837993 epoll_ctl(14, EPOLL_CTL_ADD, 213,
> {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=3317538000,
> u64=94320799358160}}) = 0
> 09:27:52.838035 epoll_ctl(14, EPOLL_CTL_DEL, 213, 0x7ffe8e642cdc) = 0
> 09:27:52.838070 epoll_ctl(14, EPOLL_CTL_ADD, 213,
> {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=3317538000,
> u64=94320799358160}}) = 0
>
> "ltrace" showed the same loop as:
>
> 10:38:01.773867 timeout_add(0x7530, 0x55c8c351df75, 965,
> 0x55c8c3515a00) = 0x55c8c5c658b0
> 10:38:01.774060 timeout_remove(0x55c8c5acf7a8, 0, 0xffffffff, 0) = 0
> 10:38:01.774239 io_add(213, 1, 0x55c8c351df75, 1070) =
> 0x55c8c5c658b0
> 10:38:01.774429 io_set_pending(0x55c8c5c658b0, 1, 0x55c8c54b8aa0,
> 0x7fb4ab11b06a) = 0x55c8c54b8aa0
> 10:38:01.774598 p_strarray_dup(0x7fb4ab703660, 0x55c8c5786e00, 5,
> 0x55c8c3723740) = 0x55c8c56d7530
> 10:38:01.774771 io_remove(0x55c8c5acf788, 0x55c8c3516d60, 5, 48) = 0
> 10:38:01.774957 timeout_add(0x7530, 0x55c8c351df75, 965,
> 0x55c8c3515a00) = 0x55c8c5c658b0
> 10:38:01.775129 timeout_remove(0x55c8c5acf7a8, 0, 0xffffffff, 0) = 0
> 10:38:01.775301 io_add(213, 1, 0x55c8c351df75, 1070) =
> 0x55c8c5c658b0
> 10:38:01.775492 io_set_pending(0x55c8c5c658b0, 1, 0x55c8c54b8aa0,
> 0x7fb4ab11b06a) = 0x55c8c54b8aa0
> 10:38:01.775662 p_strarray_dup(0x7fb4ab703660, 0x55c8c56d7530, 5,
> 0x55c8c3723740) = 0x55c8c5786e00
> 10:38:01.775833 io_remove(0x55c8c5acf788, 0x55c8c3516d60, 5, 48) = 0
>
> That fd 213 is the next-highest FD opened, and is bound to
> /var/run/dovecot/director-admin. In fact the last few FDs are all
> connections to director-admin. Last few lines of "lsof" say:
>
> director 16109 dovecot 207u unix 0xffff940d13c93800 0t0
> 1106341075 type=STREAM
> director 16109 dovecot 208u unix 0xffff9409f5280000 0t0
> 1106337217 /var/run/dovecot/director-userdb type=STREAM
> director 16109 dovecot 209u unix 0xffff940d580fa000 0t0
> 1106337218 type=STREAM
> director 16109 dovecot 210u unix 0xffff940ac0f61000 0t0
> 1106343439 /var/run/dovecot/director-userdb type=STREAM
> director 16109 dovecot 211u unix 0xffff940bed105400 0t0
> 1106343440 type=STREAM
> director 16109 dovecot 212u unix 0xffff940b76d31c00 0t0
> 1106342408 /var/run/dovecot/director-admin type=STREAM
> director 16109 dovecot 213u unix 0xffff940ab241d000 0t0
> 1106340510 /var/run/dovecot/director-admin type=STREAM
> director 16109 dovecot 214u unix 0xffff9409c317ec00 0t0
> 1106340511 /var/run/dovecot/director-admin type=STREAM
>
> I attached with "gdb" and did "bt" at two different moments, resulting
> in:
>
> (gdb) bt
> #0 0x00007fb4ab11b06a in epoll_ctl () from
> target:/lib/x86_64-linux-gnu/libc.so.6
> #1 0x00007fb4ab48964d in io_loop_handle_add () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #2 0x00007fb4ab4876b9 in ?? () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #3 0x00007fb4ab48776d in io_add () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #4 0x000055c8c35179d3 in doveadm_connections_ring_synced ()
> #5 0x00007fb4ab488166 in io_loop_handle_timeouts () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #6 0x00007fb4ab489963 in io_loop_handler_run_internal ()
> from target:/usr/lib/dovecot/libdovecot.so.0
> #7 0x00007fb4ab48839c in io_loop_handler_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #8 0x00007fb4ab488548 in io_loop_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #9 0x00007fb4ab40c0a3 in master_service_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #10 0x000055c8c350b46d in main ()
> (gdb) c
> Continuing.
> ^C
> Program received signal SIGINT, Interrupt.
> 0x00007fb4ab11b06a in epoll_ctl () from
> target:/lib/x86_64-linux-gnu/libc.so.6
> (gdb) bt
> #0 0x00007fb4ab11b06a in epoll_ctl () from
> target:/lib/x86_64-linux-gnu/libc.so.6
> #1 0x00007fb4ab489873 in io_loop_handle_remove () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #2 0x00007fb4ab487adf in ?? () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #3 0x000055c8c35158b7 in ?? ()
> #4 0x000055c8c3516b4a in ?? ()
> #5 0x000055c8c3516d78 in ?? ()
> #6 0x000055c8c35179e4 in doveadm_connections_ring_synced ()
> #7 0x00007fb4ab488166 in io_loop_handle_timeouts () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #8 0x00007fb4ab489963 in io_loop_handler_run_internal ()
> from target:/usr/lib/dovecot/libdovecot.so.0
> #9 0x00007fb4ab48839c in io_loop_handler_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #10 0x00007fb4ab488548 in io_loop_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #11 0x00007fb4ab40c0a3 in master_service_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #12 0x000055c8c350b46d in main ()
>
> For completeness, doveconf -n output attached.
>
Hi!
Thank you for reporting this, we'll look into it.
Aki
More information about the dovecot
mailing list