director stuck in inifite loop on 2.2.35

Aki Tuomi aki.tuomi at dovecot.fi
Thu Apr 26 11:35:59 EEST 2018



On 26.04.2018 11:15, Jan-Pieter Cornet wrote:
> Occasionally, all of our directors get stuck in an endless loop.
> Likely triggered by some network event, but I haven't been able to
> isolate exactly what triggers it, and am unable to reproduce at the
> moment. When this happens, it makes the dovecot/director process on
> all directors unresponsive, making login impossible.
>
> Restarting dovecot "fixes" it.
>
> Here's the info collected from the stuck process. I hope that a
> dovecot dev could use this to either fix a bug in the async IO code or
> point to a possible external source that could have triggered this.
> Would it help to compile dovecot with -g and run a version with
> debugging info, to be able to inspect more when the crash happens again?
>
> "strace" showed this endless loop:
>
> 09:27:52.837960 epoll_ctl(14, EPOLL_CTL_DEL, 213, 0x7ffe8e642cdc) = 0
> 09:27:52.837993 epoll_ctl(14, EPOLL_CTL_ADD, 213,
> {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=3317538000,
> u64=94320799358160}}) = 0
> 09:27:52.838035 epoll_ctl(14, EPOLL_CTL_DEL, 213, 0x7ffe8e642cdc) = 0
> 09:27:52.838070 epoll_ctl(14, EPOLL_CTL_ADD, 213,
> {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=3317538000,
> u64=94320799358160}}) = 0
>
> "ltrace" showed the same loop as:
>
> 10:38:01.773867 timeout_add(0x7530, 0x55c8c351df75, 965,
> 0x55c8c3515a00) = 0x55c8c5c658b0
> 10:38:01.774060 timeout_remove(0x55c8c5acf7a8, 0, 0xffffffff, 0) = 0
> 10:38:01.774239 io_add(213, 1, 0x55c8c351df75, 1070)         =
> 0x55c8c5c658b0
> 10:38:01.774429 io_set_pending(0x55c8c5c658b0, 1, 0x55c8c54b8aa0,
> 0x7fb4ab11b06a) = 0x55c8c54b8aa0
> 10:38:01.774598 p_strarray_dup(0x7fb4ab703660, 0x55c8c5786e00, 5,
> 0x55c8c3723740) = 0x55c8c56d7530
> 10:38:01.774771 io_remove(0x55c8c5acf788, 0x55c8c3516d60, 5, 48) = 0
> 10:38:01.774957 timeout_add(0x7530, 0x55c8c351df75, 965,
> 0x55c8c3515a00) = 0x55c8c5c658b0
> 10:38:01.775129 timeout_remove(0x55c8c5acf7a8, 0, 0xffffffff, 0) = 0
> 10:38:01.775301 io_add(213, 1, 0x55c8c351df75, 1070)         =
> 0x55c8c5c658b0
> 10:38:01.775492 io_set_pending(0x55c8c5c658b0, 1, 0x55c8c54b8aa0,
> 0x7fb4ab11b06a) = 0x55c8c54b8aa0
> 10:38:01.775662 p_strarray_dup(0x7fb4ab703660, 0x55c8c56d7530, 5,
> 0x55c8c3723740) = 0x55c8c5786e00
> 10:38:01.775833 io_remove(0x55c8c5acf788, 0x55c8c3516d60, 5, 48) = 0
>
> That fd 213 is the next-highest FD opened, and is bound to
> /var/run/dovecot/director-admin. In fact the last few FDs are all
> connections to director-admin. Last few lines of "lsof" say:
>
> director 16109 dovecot  207u     unix 0xffff940d13c93800      0t0
> 1106341075 type=STREAM
> director 16109 dovecot  208u     unix 0xffff9409f5280000      0t0
> 1106337217 /var/run/dovecot/director-userdb type=STREAM
> director 16109 dovecot  209u     unix 0xffff940d580fa000      0t0
> 1106337218 type=STREAM
> director 16109 dovecot  210u     unix 0xffff940ac0f61000      0t0
> 1106343439 /var/run/dovecot/director-userdb type=STREAM
> director 16109 dovecot  211u     unix 0xffff940bed105400      0t0
> 1106343440 type=STREAM
> director 16109 dovecot  212u     unix 0xffff940b76d31c00      0t0
> 1106342408 /var/run/dovecot/director-admin type=STREAM
> director 16109 dovecot  213u     unix 0xffff940ab241d000      0t0
> 1106340510 /var/run/dovecot/director-admin type=STREAM
> director 16109 dovecot  214u     unix 0xffff9409c317ec00      0t0
> 1106340511 /var/run/dovecot/director-admin type=STREAM
>
> I attached with "gdb" and did "bt" at two different moments, resulting
> in:
>
> (gdb) bt
> #0  0x00007fb4ab11b06a in epoll_ctl () from
> target:/lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007fb4ab48964d in io_loop_handle_add () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #2  0x00007fb4ab4876b9 in ?? () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #3  0x00007fb4ab48776d in io_add () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #4  0x000055c8c35179d3 in doveadm_connections_ring_synced ()
> #5  0x00007fb4ab488166 in io_loop_handle_timeouts () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #6  0x00007fb4ab489963 in io_loop_handler_run_internal ()
>    from target:/usr/lib/dovecot/libdovecot.so.0
> #7  0x00007fb4ab48839c in io_loop_handler_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #8  0x00007fb4ab488548 in io_loop_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #9  0x00007fb4ab40c0a3 in master_service_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #10 0x000055c8c350b46d in main ()
> (gdb) c
> Continuing.
> ^C
> Program received signal SIGINT, Interrupt.
> 0x00007fb4ab11b06a in epoll_ctl () from
> target:/lib/x86_64-linux-gnu/libc.so.6
> (gdb) bt
> #0  0x00007fb4ab11b06a in epoll_ctl () from
> target:/lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007fb4ab489873 in io_loop_handle_remove () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #2  0x00007fb4ab487adf in ?? () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #3  0x000055c8c35158b7 in ?? ()
> #4  0x000055c8c3516b4a in ?? ()
> #5  0x000055c8c3516d78 in ?? ()
> #6  0x000055c8c35179e4 in doveadm_connections_ring_synced ()
> #7  0x00007fb4ab488166 in io_loop_handle_timeouts () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #8  0x00007fb4ab489963 in io_loop_handler_run_internal ()
>    from target:/usr/lib/dovecot/libdovecot.so.0
> #9  0x00007fb4ab48839c in io_loop_handler_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #10 0x00007fb4ab488548 in io_loop_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #11 0x00007fb4ab40c0a3 in master_service_run () from
> target:/usr/lib/dovecot/libdovecot.so.0
> #12 0x000055c8c350b46d in main ()
>
> For completeness, doveconf -n output attached.
>

Hi!

Thank you for reporting this, we'll look into it.

Aki



More information about the dovecot mailing list