director stuck in inifite loop on 2.2.35

Jan-Pieter Cornet johnpc at xs4all.net
Thu Apr 26 11:15:30 EEST 2018


Occasionally, all of our directors get stuck in an endless loop. Likely triggered by some network event, but I haven't been able to isolate exactly what triggers it, and am unable to reproduce at the moment. When this happens, it makes the dovecot/director process on all directors unresponsive, making login impossible.

Restarting dovecot "fixes" it.

Here's the info collected from the stuck process. I hope that a dovecot dev could use this to either fix a bug in the async IO code or point to a possible external source that could have triggered this. Would it help to compile dovecot with -g and run a version with debugging info, to be able to inspect more when the crash happens again?

"strace" showed this endless loop:

09:27:52.837960 epoll_ctl(14, EPOLL_CTL_DEL, 213, 0x7ffe8e642cdc) = 0
09:27:52.837993 epoll_ctl(14, EPOLL_CTL_ADD, 213, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=3317538000, u64=94320799358160}}) = 0
09:27:52.838035 epoll_ctl(14, EPOLL_CTL_DEL, 213, 0x7ffe8e642cdc) = 0
09:27:52.838070 epoll_ctl(14, EPOLL_CTL_ADD, 213, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=3317538000, u64=94320799358160}}) = 0

"ltrace" showed the same loop as:

10:38:01.773867 timeout_add(0x7530, 0x55c8c351df75, 965, 0x55c8c3515a00) = 0x55c8c5c658b0
10:38:01.774060 timeout_remove(0x55c8c5acf7a8, 0, 0xffffffff, 0) = 0
10:38:01.774239 io_add(213, 1, 0x55c8c351df75, 1070)         = 0x55c8c5c658b0
10:38:01.774429 io_set_pending(0x55c8c5c658b0, 1, 0x55c8c54b8aa0, 0x7fb4ab11b06a) = 0x55c8c54b8aa0
10:38:01.774598 p_strarray_dup(0x7fb4ab703660, 0x55c8c5786e00, 5, 0x55c8c3723740) = 0x55c8c56d7530
10:38:01.774771 io_remove(0x55c8c5acf788, 0x55c8c3516d60, 5, 48) = 0
10:38:01.774957 timeout_add(0x7530, 0x55c8c351df75, 965, 0x55c8c3515a00) = 0x55c8c5c658b0
10:38:01.775129 timeout_remove(0x55c8c5acf7a8, 0, 0xffffffff, 0) = 0
10:38:01.775301 io_add(213, 1, 0x55c8c351df75, 1070)         = 0x55c8c5c658b0
10:38:01.775492 io_set_pending(0x55c8c5c658b0, 1, 0x55c8c54b8aa0, 0x7fb4ab11b06a) = 0x55c8c54b8aa0
10:38:01.775662 p_strarray_dup(0x7fb4ab703660, 0x55c8c56d7530, 5, 0x55c8c3723740) = 0x55c8c5786e00
10:38:01.775833 io_remove(0x55c8c5acf788, 0x55c8c3516d60, 5, 48) = 0

That fd 213 is the next-highest FD opened, and is bound to /var/run/dovecot/director-admin. In fact the last few FDs are all connections to director-admin. Last few lines of "lsof" say:

director 16109 dovecot  207u     unix 0xffff940d13c93800      0t0 1106341075 type=STREAM
director 16109 dovecot  208u     unix 0xffff9409f5280000      0t0 1106337217 /var/run/dovecot/director-userdb type=STREAM
director 16109 dovecot  209u     unix 0xffff940d580fa000      0t0 1106337218 type=STREAM
director 16109 dovecot  210u     unix 0xffff940ac0f61000      0t0 1106343439 /var/run/dovecot/director-userdb type=STREAM
director 16109 dovecot  211u     unix 0xffff940bed105400      0t0 1106343440 type=STREAM
director 16109 dovecot  212u     unix 0xffff940b76d31c00      0t0 1106342408 /var/run/dovecot/director-admin type=STREAM
director 16109 dovecot  213u     unix 0xffff940ab241d000      0t0 1106340510 /var/run/dovecot/director-admin type=STREAM
director 16109 dovecot  214u     unix 0xffff9409c317ec00      0t0 1106340511 /var/run/dovecot/director-admin type=STREAM

I attached with "gdb" and did "bt" at two different moments, resulting in:

(gdb) bt
#0  0x00007fb4ab11b06a in epoll_ctl () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fb4ab48964d in io_loop_handle_add () from target:/usr/lib/dovecot/libdovecot.so.0
#2  0x00007fb4ab4876b9 in ?? () from target:/usr/lib/dovecot/libdovecot.so.0
#3  0x00007fb4ab48776d in io_add () from target:/usr/lib/dovecot/libdovecot.so.0
#4  0x000055c8c35179d3 in doveadm_connections_ring_synced ()
#5  0x00007fb4ab488166 in io_loop_handle_timeouts () from target:/usr/lib/dovecot/libdovecot.so.0
#6  0x00007fb4ab489963 in io_loop_handler_run_internal ()
    from target:/usr/lib/dovecot/libdovecot.so.0
#7  0x00007fb4ab48839c in io_loop_handler_run () from target:/usr/lib/dovecot/libdovecot.so.0
#8  0x00007fb4ab488548 in io_loop_run () from target:/usr/lib/dovecot/libdovecot.so.0
#9  0x00007fb4ab40c0a3 in master_service_run () from target:/usr/lib/dovecot/libdovecot.so.0
#10 0x000055c8c350b46d in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007fb4ab11b06a in epoll_ctl () from target:/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fb4ab11b06a in epoll_ctl () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fb4ab489873 in io_loop_handle_remove () from target:/usr/lib/dovecot/libdovecot.so.0
#2  0x00007fb4ab487adf in ?? () from target:/usr/lib/dovecot/libdovecot.so.0
#3  0x000055c8c35158b7 in ?? ()
#4  0x000055c8c3516b4a in ?? ()
#5  0x000055c8c3516d78 in ?? ()
#6  0x000055c8c35179e4 in doveadm_connections_ring_synced ()
#7  0x00007fb4ab488166 in io_loop_handle_timeouts () from target:/usr/lib/dovecot/libdovecot.so.0
#8  0x00007fb4ab489963 in io_loop_handler_run_internal ()
    from target:/usr/lib/dovecot/libdovecot.so.0
#9  0x00007fb4ab48839c in io_loop_handler_run () from target:/usr/lib/dovecot/libdovecot.so.0
#10 0x00007fb4ab488548 in io_loop_run () from target:/usr/lib/dovecot/libdovecot.so.0
#11 0x00007fb4ab40c0a3 in master_service_run () from target:/usr/lib/dovecot/libdovecot.so.0
#12 0x000055c8c350b46d in main ()

For completeness, doveconf -n output attached.

-- 
Jan-Pieter Cornet <johnpc at xs4all.net>
Systeembeheer XS4ALL Internet bv
www.xs4all.nl
-------------- next part --------------
A non-text attachment was scrubbed...
Name: doveconf-n.txt.gz
Type: application/x-gzip
Size: 1663 bytes
Desc: not available
URL: <https://dovecot.org/pipermail/dovecot/attachments/20180426/94a336d0/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 963 bytes
Desc: OpenPGP digital signature
URL: <https://dovecot.org/pipermail/dovecot/attachments/20180426/94a336d0/attachment.sig>


More information about the dovecot mailing list