Index Corruption Problem with new VM Host - But Only With Replication Enabled
Hi,
I've recently migrated my two VMs across from Linode (who use KVM) onto a local VPS service (which also uses KVM). Since doing so I have started to see some strange problems with Dovecot relating to indexes and replication.
I have copied the configuration files across from the old host to the new host. The software is the same - as this is Gentoo, everything was rebuilt and installed fresh, but with the same options (USE flags). Even the Linux kernel is the same version with exactly the same options (as is Dovecot). The filesystem is the same ext4 with the same options too.
Here is what is logged:
Feb 18 15:20:41 tornado.reub.net dovecot: imap(reuben)<20031><d0yxIEPY3ROfxG3u>: Error: Mailbox INBOX: /home/reuben/Maildir/dovecot.index reset, view is now inconsistent
Feb 18 15:20:41 tornado.reub.net dovecot: imap(reuben)<20031><d0yxIEPY3ROfxG3u>: Disconnected: IMAP session state is inconsistent, please relogin. in=146 out=1398 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=0 body_bytes=0
The trigger is moving or deleting an email. Just reading seems to be OK, but as soon as you move or delete an email this event occurs and is logged. The move fails on the first attempt (the error is logged and the client refreshes), but if attempted again it succeeds.
This error causes clients to be thrown off and their view refreshed, which seems to require a redownload of everything. It is extremely disruptive to client access.
The odd thing is that as soon as I disable replication by commenting out the mail_replica line, *all* of the index corruption messages and symptoms go away completely and I don't see this problem occur. As soon as I re-add it they come back. So it appears that replication at least is triggering the problem.
Replication is imap1 -> imap2, and on the other host imap2 -> imap1.
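In config terms that is just the mail_replica plugin setting pointing each way, roughly like this (a sketch only; imap1.reub.net is my shorthand for the first host's name, and the port matches the plaintext doveadm listener in the config further down):

# on imap1 (sketch)
plugin {
  mail_replica = tcp:imap2.reub.net:4814
}

# on imap2 (sketch, hostname assumed)
plugin {
  mail_replica = tcp:imap1.reub.net:4814
}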
The odd thing is that this never was a problem while the system was on Linode.
I have removed all files on the remote and let Dovecot re-replicate everything, but this didn't help.
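For the record, triggering that re-replication looks roughly like this (a sketch rather than my exact command history; I believe -f requests a full resync and '*' matches every user):

# queue a full resync for all users and watch the queue drain
doveadm replicator replicate -f '*'
doveadm replicator status '*'

# or force a full dsync for a single user against the replica
doveadm sync -u reuben -f tcp:imap2.reub.net:4814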
I can go back to the VPS host and talk to them about it, as I have an open ticket, but from their perspective it looks like an application-level problem, so there's not a lot for them to go on, especially given everything else is working OK. Could the underlying host be playing a role in this? It seems unlikely, given that turning replication on or off causes the problem to appear or disappear.
All clients are connecting to only the master of the two replication partners. Many have multiple concurrent sessions (e.g. my phone and Thunderbird at the same time), but all are connecting to only one instance at a time. The second instance is only there for a failover scenario (update a DNS entry to fail over).
There is no NFS or any sort of sharing set up on the VMs. They are dedicated virtual disks.
Can anyone give me any ideas how to troubleshoot this further or things to try?
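In case it prompts any ideas, this is the sort of thing I can run from this end (a sketch only; the user and mailbox names are just examples, and I'm not sure these are the most useful commands):

# replication state for one user
doveadm replicator status reuben

# compare mailbox GUID/UIDVALIDITY between master and replica
doveadm mailbox status -u reuben "messages uidvalidity guid" INBOX

# rebuild the index for a mailbox if a client gets wedged
doveadm force-resync -u reuben INBOX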
Some system details:
# 2.3.18 (9dd8408c18): /etc/dovecot/dovecot.conf
# Pigeonhole version 0.5.18 (0bc28b32)
# OS: Linux 5.16.10-gentoo x86_64 Gentoo Base System release 2.8
# Hostname: tornado.reub.net
auth_mechanisms = plain login
auth_username_format = %Ln
doveadm_password = # hidden, use -P to show it
first_valid_uid = 1000
imap_client_workarounds = tb-lsub-flags tb-extra-mailbox-sep
last_valid_uid = 1099
login_log_format_elements = user=<%u> auth-method=%m remote=%r local=%l %c %k
mail_attribute_dict = file:%h/Maildir/dovecot-attributes
mail_location = maildir:~/Maildir
mail_plugins = notify replication
managesieve_notify_capability = mailto
managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy include variables body enotify environment mailbox date index ihave duplicate mime foreverypart extracttext
namespace inbox {
  inbox = yes
  location =
  mailbox Drafts {
    special_use = \Drafts
  }
  mailbox Junk {
    special_use = \Junk
  }
  mailbox Sent {
    special_use = \Sent
  }
  mailbox "Sent Messages" {
    special_use = \Sent
  }
  mailbox Trash {
    special_use = \Trash
  }
  prefix =
}
passdb {
  args = failure_show_msg=yes %s
  driver = pam
}
plugin {
  mail_replica = tcp:imap2.reub.net:4814
  replication_full_sync_interval = 2 hours
  sieve = file:~/sieve;active=~/.dovecot.sieve
}
postmaster_address = postmaster@reub.net
protocols = imap lmtp sieve submission sieve
recipient_delimiter = -
service aggregator {
  fifo_listener replication-notify-fifo {
    mode = 0666
    user = root
  }
  unix_listener replication-notify {
    mode = 0666
    user = root
  }
}
service auth {
  unix_listener /var/spool/postfix/private/auth {
    group = postfix
    mode = 0666
    user = postfix
  }
  unix_listener auth-userdb {
    mode = 0777
  }
}
service doveadm {
  inet_listener {
    address = 2404:9400:2264:7200::143
    port = 4813
    ssl = yes
  }
  inet_listener {
    address = 2404:9400:2264:7200::143
    port = 4814
    ssl = no
  }
  user = root
}
service imap-login {
  inet_listener imap {
    port = 143
  }
}
service lmtp {
  inet_listener lmtp {
    address = ::1
    port = 24
  }
  unix_listener /var/spool/postfix/private/dovecot-lmtp {
    group = postfix
    mode = 0660
    user = postfix
  }
}
service managesieve-login {
  inet_listener sieve {
    address = 127.0.0.1
    port = 4190
  }
}
service replicator {
  process_min_avail = 1
  unix_listener replicator-doveadm {
    mode = 0666
  }
}
service submission-login {
  inet_listener submission {
    address = 103.4.234.81 2404:9400:2264:7200::587
    port = 587
  }
}
ssl_cert =
Thanks, Reuben
Following up to my original mail:
On 18/02/2022 3:59 pm, Reuben Farrelly wrote:
Hi,
I've recently migrated my two VMs across from Linode (who use KVM) onto a local VPS service (which also uses KVM). Since doing so I have started to see some strange problems with Dovecot relating to indexes and replication.
I have copied the configuration files across from the old host to the new host. The software is the same - as this is Gentoo, everything was rebuilt and installed fresh, but with the same options (USE flags). Even the Linux kernel is the same version with exactly the same options (as is Dovecot). The filesystem is the same ext4 with the same options too.
No one responded here (is anyone helping on this list anymore?), but after many hours I found that the problem was to do with replication on the far-end host, and not anything to do with either the new VPS or the existing Dovecot or Linux config.
It turns out that if there is an existing Maildir/ in the user's directory on the remote replica, the initial sync from the master fails. It may fail early on in the sync or at the end of the initial replication, but either way it fails and the user ends up with a mailbox in a half-synced state. Even if the remote Maildirs are completely empty, as mine were, it fails - it is the mere presence of the Maildir/ directory on the remote that breaks the sync. Typically new users have a new and empty Maildir (copied from /etc/skel), so it fails for them by default.
Once I deleted the Maildir/ from each remote user's home directory (including the entire contents of any half-synced replica), Dovecot created a new Maildir and everything was able to sync through to completion for all users.
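So the per-user workaround on the replica side amounts to something like this (a sketch only; testuser is an example account, and it assumes nothing else is touching that user's mail on the replica at the time):

# on imap2 (the replica): remove the skeleton/half-synced Maildir entirely
rm -rf /home/testuser/Maildir

# then push a clean full sync across from the master
doveadm sync -u testuser -f tcp:imap2.reub.net:4814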
To reproduce this: create a new user with an empty Maildir (Maildir/new, Maildir/cur and Maildir/tmp), and then trigger the sync manually with debugging enabled: doveadm -v -D sync -u username -f tcp:imap2.reub.net:4814
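Creating that empty skeleton Maildir by hand is just (a sketch; testuser is a throwaway account, and this mimics what a copy from /etc/skel gives you):

# run on both master and replica before triggering the sync
mkdir -p /home/testuser/Maildir/{new,cur,tmp}
chown -R testuser:testuser /home/testuser/Maildir
chmod -R 0700 /home/testuser/Maildir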
Here, with a completely empty and brand-new Maildir/ on both the master and the remote replica, we can see it already fails:
tornado ~ # doveadm -v -D sync -u testuser -f tcp:imap2.reub.net:4814
Debug: Loading modules from directory: /usr/lib64/dovecot/doveadm
Debug: Skipping module doveadm_acl_plugin, because dlopen() failed:
/usr/lib64/dovecot/doveadm/lib10_doveadm_acl_plugin.so: undefined
symbol: acl_user_module (this is usually intentional, so just ignore
this message)
Debug: Skipping module doveadm_quota_plugin, because dlopen() failed:
/usr/lib64/dovecot/doveadm/lib10_doveadm_quota_plugin.so: undefined
symbol: quota_user_module (this is usually intentional, so just ignore
this message)
Debug: Module loaded:
/usr/lib64/dovecot/doveadm/lib10_doveadm_sieve_plugin.so
Debug: Skipping module doveadm_fts_plugin, because dlopen() failed:
/usr/lib64/dovecot/doveadm/lib20_doveadm_fts_plugin.so: undefined
symbol: fts_user_get_language_list (this is usually intentional, so just
ignore this message)
Debug: Skipping module doveadm_mail_crypt_plugin, because dlopen()
failed: /usr/lib64/dovecot/doveadm/libdoveadm_mail_crypt_plugin.so:
undefined symbol: mail_crypt_box_get_pvt_digests (this is usually
intentional, so just ignore this message)
Feb 21 16:31:51 Debug: Loading modules from directory: /usr/lib64/dovecot
Feb 21 16:31:51 Debug: Module loaded:
/usr/lib64/dovecot/lib15_notify_plugin.so
Feb 21 16:31:51 Debug: Module loaded:
/usr/lib64/dovecot/lib20_replication_plugin.so
Feb 21 16:31:51 Debug: Loading modules from directory:
/usr/lib64/dovecot/doveadm
Feb 21 16:31:51 Debug: Skipping module doveadm_acl_plugin, because
dlopen() failed: /usr/lib64/dovecot/doveadm/lib10_doveadm_acl_plugin.so:
undefined symbol: acl_user_module (this is usually intentional, so just
ignore this message)
Feb 21 16:31:51 Debug: Skipping module doveadm_quota_plugin, because
dlopen() failed:
/usr/lib64/dovecot/doveadm/lib10_doveadm_quota_plugin.so: undefined
symbol: quota_user_module (this is usually intentional, so just ignore
this message)
Feb 21 16:31:51 Debug: Skipping module doveadm_fts_plugin, because
dlopen() failed: /usr/lib64/dovecot/doveadm/lib20_doveadm_fts_plugin.so:
undefined symbol: fts_user_get_language_list (this is usually
intentional, so just ignore this message)
Feb 21 16:31:51 Debug: Skipping module doveadm_mail_crypt_plugin,
because dlopen() failed:
/usr/lib64/dovecot/doveadm/libdoveadm_mail_crypt_plugin.so: undefined
symbol: mail_crypt_box_get_pvt_digests (this is usually intentional, so
just ignore this message)
Feb 21 16:31:51 doveadm(testuser)<11258><>: Debug: auth-master: userdb
lookup(testuser): Started userdb lookup
Feb 21 16:31:51 doveadm(testuser)<11258><>: Debug: auth-master: conn
unix:/run/dovecot/auth-userdb: Connecting
Feb 21 16:31:51 doveadm(testuser)<11258><>: Debug: auth-master: conn
unix:/run/dovecot/auth-userdb (pid=3070,uid=0): Client connected (fd=9)
Feb 21 16:31:51 doveadm(testuser)<11258><>: Debug: auth-master: userdb
lookup(testuser): auth USER input: testuser system_groups_user=testuser
uid=1006 gid=1006 home=/home/testuser
Feb 21 16:31:51 doveadm(testuser)<11258><>: Debug: auth-master: userdb
lookup(testuser): Finished userdb lookup (username=testuser
system_groups_user=testuser uid=1006 gid=1006 home=/home/testuser)
Feb 21 16:31:51 doveadm(testuser)<11258>: Debug: Effective uid=1006, gid=1006, home=/home/testuser
Feb 21 16:31:51 doveadm(testuser)<11258>: Debug: Namespace inbox: type=private, prefix=, sep=, inbox=yes, hidden=no, list=yes, subscriptions=yes location=maildir:~/Maildir
Feb 21 16:31:51 doveadm(testuser)<11258>: Debug: maildir++: root=/home/testuser/Maildir, index=, indexpvt=, control=, inbox=/home/testuser/Maildir, alt=
Feb 21 16:31:51 doveadm(testuser): Debug: brain M: Namespace has location maildir:~/Maildir
Feb 21 16:31:51 doveadm(testuser): Debug: Mailbox INBOX: Couldn't open mailbox in list index: Mailbox not found
Feb 21 16:31:51 doveadm(testuser): Debug: Mailbox INBOX: Mailbox opened
Feb 21 16:31:51 doveadm(testuser): Debug: Namespace : Using permissions from /home/testuser/Maildir: mode=0700 gid=default
Feb 21 16:31:51 doveadm(testuser): Debug: Mailbox INBOX: Mailbox opened
Feb 21 16:31:51 doveadm(testuser)<11258>: Debug: brain M: Local mailbox tree: INBOX guid=c0e4ea33c7231362fa2b0000ffcd8856 uid_validity=1645421511 uid_next=1 subs=no last_change=0 last_subs=0
Feb 21 16:31:51 doveadm(testuser)<11258>: Debug: brain M: Remote mailbox tree: INBOX guid=c04b3636c72313620f770000794c72b2 uid_validity=1645421511 uid_next=1 subs=no last_change=0 last_subs=0
Feb 21 16:31:51 doveadm(testuser)<11258>: Debug: brain M: Mailbox INBOX: local=c0e4ea33c7231362fa2b0000ffcd8856/0/1, remote=c04b3636c72313620f770000794c72b2/0/1: GUIDs conflict - will be merged later
Feb 21 16:31:51 doveadm(testuser)<11258>: Debug: Mailbox INBOX: Mailbox opened
Feb 21 16:31:51 doveadm(testuser)<11258>: Debug: Mailbox INBOX: Mailbox opened
Feb 21 16:31:51 doveadm(testuser)<11258>: Debug: Mailbox INBOX: Mailbox opened
Feb 21 16:31:51 doveadm(testuser)<11258>: Debug: brain M: Change during sync: Remote lost mailbox GUID c04b3636c72313620f770000794c72b2 (maybe it was just deleted?)
Feb 21 16:31:52 doveadm(testuser)<11258>: Warning: Mailbox changes caused a desync. You may want to run dsync again: Remote lost mailbox GUID c04b3636c72313620f770000794c72b2 (maybe it was just deleted?)
Feb 21 16:31:52 doveadm(11258): Debug: auth-master: conn unix:/run/dovecot/auth-userdb (pid=3070,uid=0): Disconnected: Connection closed (fd=9)
tornado ~ #
That warning about mailbox changes causing a desync is what screws up the client (in this case testuser) if they happen to be connected to the master at the time of the attempt. That is why client access was impacted as soon as remote replication was enabled.
If this runs via the normal Dovecot replication (rather than being triggered manually), it results in severe performance problems and disconnections for any clients connected to the master replica, pretty much any time they try to move or delete anything.
I saw a couple of old posts over the years where other people found the same thing, so I know I'm not the first person to run into this problem, and I suspect there are a lot more who have hit it without realising what was going on.
What do I need to do to get this fixed properly? Bailing out midway through replication and disconnecting clients (and not self-healing) surely can't be the intended behaviour, so I'd consider this a pretty nasty bug.
Reuben
Something about this is a little bit ominous.
There's a new type of "architecture" unrolling with a certain flavor, and it is becoming, by and by, irremediably complex. I'm not really sure where the stopping or turning point is, or perhaps there are other "tools" for memory leak detection and static code analysis that could in theory help find bugs like this.
Assuming the bug is in Dovecot and not in the Linux kernel or the underlying KVM virtualization container.
I was using a KVM that got hacked, and I'm having better luck with CentOS on OpenVZ at the provider on a very small scale system, but certain critical security bits and pieces are going missing in action.
-- Sent from my Android device with K-9 Mail. Please excuse my brevity.
participants (2):
- justina colmena ~biz
- Reuben Farrelly