Dovecot (>= 2.3.15) is not properly replicating Expunge commands running in a containerised environment
Hello *,
migrating a mailbox setup for several tens of thousands of customers to Docker containers running on Debian VMs, we are currently observing Expunge commands not being replicated properly between two Dovecot containers.
Delivering a new message into someones' inbox is replicated instantly via dsync (using TLS). However, expunging a message from the very same inbox via
doveadm expunge -u [users' email address] mailbox '*' header Message-ID [MID of the message in question]
is not replicated, since a
doveadm fetch -u [users' email address] text header Message-ID [MID of the message in question]
still returns the mail on the replica. Sometimes, a
doveadm replicator replicate [users' email address]
fixes this, but mostly, it does not. While doing all of this, these messages are logged by Dovecot:
doveadm: Info: copy from INBOX: box=INBOX, uid=13, msgid=<[MID of the message in question]>, size=1793 doveadm: Info: expunge: box=INBOX, uid=12, msgid=<[MID of the message in question]>, size=1793 doveadm: Error: dsync(x): I/O has stalled, no activity for 600 seconds (last sent=handshake, last recv=handshake) doveadm: Error: Timeout during state=slave_recv_last_common doveadm: Error: dsync(x): I/O has stalled, no activity for 600 seconds (last sent=mailbox_delete, last recv=mailbox_delete) doveadm: Error: Timeout during state=slave_recv_mailbox doveadm: Error: dsync(SSL x): I/O has stalled, no activity for 600 seconds (version not received) doveadm: Error: Timeout during state=slave_recv_handshake
Unfortunately, these errors do not seem to be reproducible very well - the underlying behaviour however is. Setting
replication_sync_timeout = 2s
, as mentioned in https://doc.dovecot.org/settings/plugin/replication-plugin/ (the "s" seems to be bogus, though, as Dovecot complains about an invalid timeframe given) results in these messages:
doveadm(x): Error: doveadm client disconnected before handshake: Connection closed replicator: Panic: data stack: Out of memory when allocating 268435496 bytes replicator: Error: Raw backtrace: /usr/local/lib/dovecot/libdovecot.so.0(backtrace_append+0x3d) [0x7ff8f55343dd] -> /usr/local/lib/dovecot/libdovecot.so.0(backtrace_get+0x1e) [0x7ff8f55344fe] -> /usr/local/lib/dovecot/libdovecot.so.0(+0xfcc6b) [0x7ff8f5540c6b] -> /usr/local/lib/dovecot/libdovecot.so.0(+0xfcd01) [0x7ff8f5540d01] -> /usr/local/lib/dovecot/libdovecot.so.0(+0x533e7) [0x7ff8f54973e7] -> /usr/local/lib/dovecot/libdovecot.so.0(+0x52dcb) [0x7ff8f5496dcb] -> /usr/local/lib/dovecot/libdovecot.so.0(+0xf6625) [0x7ff8f553a625] -> /usr/local/lib/dovecot/libdovecot.so.0(+0x11c248) [0x7ff8f5560248] -> /usr/local/lib/dovecot/libdovecot.so.0(+0xf2655) [0x7ff8f5536655] -> /usr/local/lib/dovecot/libdovecot.so.0(buffer_write+0x67) [0x7ff8f5536867] -> dovecot/replicator(replicator_queue_push+0xe9) [0x5619b1b81a39] -> dovecot/replicator(+0x61f8) [0x5619b1b811f8] -> dovecot/replicator(+0x58ef) [0x5619b1b808ef] -> dovecot/replicator(+0x5ab2) [0x5619b1b80ab2] -> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_call_io+0x69) [0x7ff8f5556b99] -> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_handler_run_internal+0x131) [0x7ff8f5558101] -> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_handler_run+0x4c) [0x7ff8f5556c3c] -> /usr/local/lib/dovecot/libdovecot.so.0(io_loop_run+0x40) [0x7ff8f5556db0] -> /usr/local/lib/dovecot/libdovecot.so.0(master_service_run+0x13) [0x7ff8f54cb7b3] -> dovecot/replicator(main+0x18f) [0x5619b1b7fc4f] -> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7ff8f52a309b] -> dovecot/replicator(_start+0x2a) [0x5619b1b7fd0a] replicator: Fatal: master: service(replicator): child 175 killed with signal 6 (core not dumped - https://dovecot.org/bugreport.html#coredumps - set /proc/sys/fs/suid_dumpable to 2)
Not having observed this behaviour on physical or virtual Linux machines, we are pretty much out of ideas on what to do. https://dovecot.org/pipermail/dovecot/2019-November/117467.html sounds like being a related issue.
Thank you in advance for any hints and replies.
root@x:/# doveconf -n # 2.3.15 (0503334ab1): /usr/local/etc/dovecot/dovecot.conf # Pigeonhole version 0.5.15 (e6a84e31) # OS: Linux 4.9.0-16-amd64 x86_64 Debian 10.10 # Hostname: x auth_cache_negative_ttl = 1 mins auth_cache_size = 10 k auth_master_user_separator = "#" auth_mechanisms = plain login auth_verbose_passwords = sha1 doveadm_password = # hidden, use -P to show it doveadm_port = 12345 first_valid_uid = 8 imap_hibernate_timeout = 1 mins imap_id_log = * lmtp_save_to_detail_mailbox = yes log_path = /dev/stderr log_timestamp = login_log_format_elements = user=<%u> method=%m rip=%r lip=%l mpid=%e %c session=<%{session}> %k mail_access_groups = dovecot mail_attribute_dict = file:%h/Maildir/dovecot-attributes mail_gid = mail mail_location = maildir:~/Maildir/ mail_plugins = acl quota mail_log notify replication mail_uid = mail managesieve_notify_capability = mailto managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy include variables body enotify environment mailbox date index ihave duplicate mime foreverypart extracttext vacation-seconds passdb { args = /etc/dovecot/conf.d/ldap-searches/ldap.conf driver = ldap } plugin { acl = vfile acl_defaults_from_inbox = yes acl_shared_dict = file:/data/acl_shared_dict/acl_shared_dict mail_replica = tcps:x sieve_extensions = +vacation-seconds sieve_vacation_min_period = 1m } pop3_uidl_format = %f postmaster_address = x protocols = imap lmtp pop3 sieve replication_dsync_parameters = -d -N -l 60 -U service aggregator { fifo_listener replication-notify-fifo { group = mail mode = 0666 user = mail } unix_listener replication-notify { group = mail mode = 0666 user = mail } } service auth-worker { user = $default_internal_user } service auth { unix_listener /socket/dovecot/x/sasl-server { group = postfix mode = 0660 user = postfix } } service doveadm { inet_listener { port = 12345 ssl = yes } } service imap-hibernate { unix_listener imap-hibernate { user = mail } } service imap { unix_listener imap-master { group = dovecot mode = 0660 } } service lmtp { inet_listener { port = 24 ssl = yes } unix_listener /socket/dovecot/x/lmtp-server { mode = 0600 user = postfix } } service replicator { process_min_avail = 1 unix_listener replicator-doveadm { group = mail mode = 0600 user = mail } } ssl_cert =
Am 05.08.21 um 13:03 schrieb A. Schulze:
Hello,
nobody else seeing such issues? Any hints are highly appreciated.
Andreas
doveadm: Info: copy from INBOX: box=INBOX, uid=13, msgid=<[MID of the message in question]>, size=1793 doveadm: Info: expunge: box=INBOX, uid=12, msgid=<[MID of the message in question]>, size=1793 doveadm: Error: dsync(x): I/O has stalled, no activity for 600 seconds (last sent=handshake, last recv=handshake) doveadm: Error: Timeout during state=slave_recv_last_common doveadm: Error: dsync(x): I/O has stalled, no activity for 600 seconds (last sent=mailbox_delete, last recv=mailbox_delete) doveadm: Error: Timeout during state=slave_recv_mailbox doveadm: Error: dsync(SSL x): I/O has stalled, no activity for 600 seconds (version not received) doveadm: Error: Timeout during state=slave_recv_handshake
participants (1)
-
A. Schulze