dovecot replication crashing
Hi,
I'm a but clueless, having issues with replication. doveadm dsync -u hans
works.
But using the following replication setup, I see coredumps.
Where to go next?
Interestingly not for all users. (For testing purposes I've only 2 users. One having about 20 messages: here even the replication works, but the 2nd user (having about 14k messages) fails. The failure seems to happen immediatly after starting the replication attempt).
I saw the replication crashing using Dovecot packages from the current Debian distro. In order to debug this, I'm now using 2.3.20, built from Git.
The replicator config boils down to this (complete config is attached)
replication_full_sync_interval = 90d
replication_max_conns = 16
mail_plugins = $mail_plugins notify replication
plugin {
mail_replica = tcps:smtp-mz.example.com:9090
}
service replicator {
process_min_avail = 1
unix_listener replicator-doveadm {
mode = 0666
}
}
service aggregator {
fifo_listener replication-notify-fifo {
user = dovecot
mode = 0666
}
unix_listener replication-notify {
user = dovecot
mode = 0666
}
}
The stacktrace I get looks like this:
Stack trace of thread 729589: #0 0x00007f9e4f2c8ce1 __GI_raise (libc.so.6 + 0x38ce1) #1 0x00007f9e4f2b2537 __GI_abort (libc.so.6 + 0x22537) #2 0x00007f9e4f5fe8c6 default_fatal_finish (libdovecot.so.0 + 0x558c6) #3 0x00007f9e4f6ab601 i_internal_fatal_handler (libdovecot.so.0 + 0x102601) #4 0x00007f9e4f5fe589 i_panic (libdovecot.so.0 + 0x55589) #5 0x00007f9e4f5fe99e fd_set_nonblock (libdovecot.so.0 + 0x5599e) #6 0x00005564cbe6a08d cmd_dsync_ibc_stream_init (doveadm-server + 0x3008d) #7 0x00005564cbe6b772 cmd_dsync_run (doveadm-server + 0x31772) #8 0x00005564cbe6d284 doveadm_mail_next_user (doveadm-server + 0x33284) #9 0x00005564cbe6e4ba doveadm_mail_cmd_exec (doveadm-server + 0x344ba) #10 0x00005564cbe7eb71 doveadm_cmd_run_ver2 (doveadm-server + 0x44b71) #11 0x00005564cbe8300a doveadm_cmd_server_run_ver2 (doveadm-server + 0x4900a) #12 0x00007f9e4f6c1799 io_loop_call_io (libdovecot.so.0 + 0x118799) #13 0x00007f9e4f6c2e82 io_loop_handler_run_internal (libdovecot.so.0 + 0x119e82) #14 0x00007f9e4f6c1840 io_loop_handler_run (libdovecot.so.0 + 0x118840) #15 0x00007f9e4f6c1a00 io_loop_run (libdovecot.so.0 + 0x118a00) #16 0x00007f9e4f6343a3 master_service_run (libdovecot.so.0 + 0x8b3a3) #17 0x00005564cbe5d9a2 main (doveadm-server + 0x239a2) #18 0x00007f9e4f2b3d0a __libc_start_main (libc.so.6 + 0x23d0a) #19 0x00005564cbe5da2a _start (doveadm-server + 0x23a2a)
2023 Mar 28 00:09:28 clone doveadm(hans)<729589><TdSzNhcUImT1IQsAKQCViw>: Fatal: master: service(doveadm): child 729589 killed with signal 6 (core dumped)
I'm a bit clueless now, even using the coredump and gdb.
-- Heiko
Hi all,
Our Dovecot replication crashes because an assertion in dovecot-core/src/lib/fd-util.c fails:
#6 0x00007fd5ac89b08e in fd_set_nonblock (fd=-1, nonblock=true) at fd-util.c:102 102 i_assert(fd > -1);
This only happens if dovecot starts the replication by itself, i.e. not when using “doveadm sync -d -u USER”
Chasing the contents of fd in the stack: #7 0x0000560045173e21 in cmd_dsync_ibc_stream_init (ctx=0x560045966d48, name=0x560045986e15 "smtp-mz.chiorino.com:9090", temp_prefix=0x56004590dbe8 "/tmp/dovecot.doveadm.") at doveadm-dsync.c:567 567 fd_set_nonblock(ctx->fd_in, TRUE);
The struct member ctx->fd_in is already -1 at this point.
#8 0x0000560045174662 in cmd_dsync_run (_ctx=0x560045966d48, user=0x5600459a1748) at doveadm-dsync.c:706 706 ibc = cmd_dsync_ibc_stream_init(ctx, ctx->remote_name,
The variable ctx gets defined in this function. The struct member ctx->fd_in is -1.
This is as much (hopefully useful) info I can give without guidance.
dovecot-sysreport attached.
Full backtrace:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f1df1eb8537 in __GI_abort () at abort.c:79
#2 0x00007f1df22ff57f in default_fatal_finish (type=LOG_TYPE_PANIC, status=0) at failures.c:465
#3 0x00007f1df22ff5e6 in fatal_handler_real (ctx=0x7fff02d24160, format=0x7f1df23970a8 "file %s: line %d (%s): assertion failed: (%s)",
args=0x7fff02d24190) at failures.c:477
#4 0x00007f1df2300871 in i_internal_fatal_handler (ctx=0x7fff02d24160, format=0x7f1df23970a8 "file %s: line %d (%s): assertion failed: (%s)",
args=0x7fff02d24190) at failures.c:879
#5 0x00007f1df22ff8ba in i_panic (format=0x7f1df23970a8 "file %s: line %d (%s): assertion failed: (%s)") at failures.c:530
#6 0x00007f1df230108e in fd_set_nonblock (fd=-1, nonblock=true) at fd-util.c:102
#7 0x0000560045173e21 in cmd_dsync_ibc_stream_init (ctx=0x560045966d48, name=0x560045986e15 "smtp-mz.chiorino.com:9090",
temp_prefix=0x56004590dbe8 "/tmp/dovecot.doveadm.") at doveadm-dsync.c:567
#8 0x0000560045174662 in cmd_dsync_run (_ctx=0x560045966d48, user=0x5600459a1748) at doveadm-dsync.c:706
#9 0x00005600451780bb in doveadm_mail_next_user (ctx=0x560045966d48, error_r=0x7fff02d24538) at doveadm-mail.c:472
#10 0x00005600451782b9 in doveadm_mail_single_user (ctx=0x560045966d48, error_r=0x7fff02d24538) at doveadm-mail.c:506
#11 0x0000560045178974 in doveadm_mail_cmd_exec (ctx=0x560045966d48, wildcard_user=0x0) at doveadm-mail.c:667
#12 0x00005600451795bc in doveadm_cmd_ver2_to_mail_cmd_wrapper (cctx=0x7fff02d24720) at doveadm-mail.c:996
#13 0x0000560045190108 in doveadm_cmd_run_ver2 (argc=8, argv=0x56004590b378, cctx=0x7fff02d24720) at doveadm-cmd.c:465
#14 0x000056004519682c in doveadm_cmd_server_run_ver2 (conn=0x56004595e9f8, argc=8, argv=0x56004590b378, cctx=0x7fff02d24720)
at client-connection-tcp.c:187
#15 0x0000560045196920 in doveadm_cmd_handle (conn=0x56004595e9f8, cmd_name=0x56004590b349 "sync", argc=8, argv=0x56004590b378,
cctx=0x7fff02d24720) at client-connection-tcp.c:214
#16 0x0000560045196bdf in client_handle_command (conn=0x56004595e9f8, args=0x56004590b368) at client-connection-tcp.c:284
#17 0x00005600451972ee in client_connection_tcp_input (conn=0x56004595e9f8) at client-connection-tcp.c:436
#18 0x00007f1df2326e4f in io_loop_call_io (io=0x560045969a10) at ioloop.c:737
#19 0x00007f1df232a470 in io_loop_handler_run_internal (ioloop=0x560045913e80) at ioloop-epoll.c:222
#20 0x00007f1df23270d6 in io_loop_handler_run (ioloop=0x560045913e80) at ioloop.c:789
#21 0x00007f1df2326fb1 in io_loop_run (ioloop=0x560045913e80) at ioloop.c:762
#22 0x00007f1df22407dc in master_service_run (service=0x560045913cd0, callback=0x56004519b447
dovecot --version (self-compiled): 2.3.20 (80a5ac675d) Debian GNU/Linux 11 (bullseye) x86_64 Mail storage on BTRFS, also tested on ext4.
Heiko already asked about this a while ago, but there was no feedback. We still have this problem and now I want to try to escalate this to a bug. What do you think?
Best regards, Moritz
-- [schlittermann – internet & unix support ] [support@schlittermann.de | https://schlittermann.de/ ] [Tel(Fax): +49 351 802998-1 ] https://schlittermann.de/keys/gpg/mo@schlittermann.de/pubkey.asc
On 16/05/2023 16:24 EEST Moritz Orbach via dovecot dovecot@dovecot.org wrote:
Hi all,
Our Dovecot replication crashes because an assertion in dovecot-core/src/lib/fd-util.c fails:
#6 0x00007fd5ac89b08e in fd_set_nonblock (fd=-1, nonblock=true) at fd-util.c:102 102 i_assert(fd > -1);
This only happens if dovecot starts the replication by itself, i.e. not when using “doveadm sync -d -u USER”
Chasing the contents of fd in the stack: #7 0x0000560045173e21 in cmd_dsync_ibc_stream_init (ctx=0x560045966d48, name=0x560045986e15 "smtp-mz.chiorino.com:9090", temp_prefix=0x56004590dbe8 "/tmp/dovecot.doveadm.") at doveadm-dsync.c:567 567 fd_set_nonblock(ctx->fd_in, TRUE);
The struct member ctx->fd_in is already -1 at this point.
#8 0x0000560045174662 in cmd_dsync_run (_ctx=0x560045966d48, user=0x5600459a1748) at doveadm-dsync.c:706 706 ibc = cmd_dsync_ibc_stream_init(ctx, ctx->remote_name,
The variable ctx gets defined in this function. The struct member ctx->fd_in is -1.
This is as much (hopefully useful) info I can give without guidance.
dovecot-sysreport attached.
The sysreport is missing the core file. Log excerpt would've been nice too, in addition to the assert error. Can you provide these?
I'm wondering if this happens because connection fails to smtp-mz.chiorino.com:9090. Can you test that?
Aki
Am Di, 16.05.2023 20:10 Uhr schrieb Aki Tuomi via dovecot:
On 16/05/2023 16:24 EEST Moritz Orbach via dovecot dovecot@dovecot.org wrote: #6 0x00007fd5ac89b08e in fd_set_nonblock (fd=-1, nonblock=true) at fd-util.c:102 102 i_assert(fd > -1);
The sysreport is missing the core file. Log excerpt would've been nice too, in addition to the assert error. Can you provide these?
Sure, attached. But the core file is from a self-compiled binary and the logs didn't show more than the assertion. I reproduced it with log_debug = (event=*) and there was some more output (journalctl.txt).
I'm wondering if this happens because connection fails to smtp-mz.chiorino.com:9090. Can you test that?
- I looked at tcpdump on the target server and the connection succeeds.
- If I stop dovecot on smtp-mz (provoking connection refused) the crash doesn't occur.
- Manual replication works.
Best regards, Moritz
-- [schlittermann – internet & unix support ] [support@schlittermann.de | https://schlittermann.de/ ] [Tel(Fax): +49 351 802998-1 ] https://schlittermann.de/keys/gpg/mo@schlittermann.de/pubkey.asc
On 17/05/2023 09:56 EEST Moritz Orbach via dovecot dovecot@dovecot.org wrote:
Am Di, 16.05.2023 20:10 Uhr schrieb Aki Tuomi via dovecot:
On 16/05/2023 16:24 EEST Moritz Orbach via dovecot dovecot@dovecot.org wrote: #6 0x00007fd5ac89b08e in fd_set_nonblock (fd=-1, nonblock=true) at fd-util.c:102 102 i_assert(fd > -1);
The sysreport is missing the core file. Log excerpt would've been nice too, in addition to the assert error. Can you provide these?
Sure, attached. But the core file is from a self-compiled binary and the logs didn't show more than the assertion. I reproduced it with log_debug = (event=*) and there was some more output (journalctl.txt).
I'm wondering if this happens because connection fails to smtp-mz.chiorino.com:9090. Can you test that?
- I looked at tcpdump on the target server and the connection succeeds.
- If I stop dovecot on smtp-mz (provoking connection refused) the crash doesn't occur.
- Manual replication works.
Best regards, Moritz
--
Hi!
Do you have noreplicate
used anywhere? It seems this could potentially cause the crash.
Aki
Hi Aki,
Do you have
noreplicate
used anywhere? It seems this could potentially cause the crash. Yes, noreplicate was set. But removing it (*) doesn't change anything. The stack trace is identical.
*)
- removed default_fields = noreplicate:protected=true in conf.d/auth-system.conf.ext
- removed userdb_noreplicate=false in users
- doveconf | grep norepl # empty
- grep -r norepl /etc/dovecot/ # everything commented out
-- [schlittermann – internet & unix support ] [support@schlittermann.de | https://schlittermann.de/ ] [Tel(Fax): +49 351 802998-1 ] https://schlittermann.de/keys/gpg/mo@schlittermann.de/pubkey.asc
Hi Aki,
I'm willing to spend some time on this. Is there something more I could do to help find the cause?
Best regards, Moritz
-- [schlittermann – internet & unix support ] [support@schlittermann.de | https://schlittermann.de/ ] [Tel(Fax): +49 351 802998-1 ] https://schlittermann.de/keys/gpg/mo@schlittermann.de/pubkey.asc
On 24/05/2023 17:23 EEST Moritz Orbach via dovecot dovecot@dovecot.org wrote:
Hi Aki,
I'm willing to spend some time on this. Is there something more I could do to help find the cause?
Best regards, Moritz
--
I would need some way to reproduce this, as the crash happens way too late. I can try looking around the core file next week to see if it gives any good insight.
Aki
participants (3)
-
Aki Tuomi
-
Heiko Schlittermann
-
Moritz Orbach