Re: Nasty Bug: Re: Index Corruption Problem with new VM Host - But Only With Replication Enabled
justina colmena ~biz
justina at colmena.biz
Mon Feb 21 06:25:33 UTC 2022
Something about this a little bit ominous.
There's a new type of "architecture" unrolling with a certain flavor, and it is becoming, by and by, irremediably complex. I'm not really sure where the stopping or turning point is, or perhaps there are other "tools" for memory leak detection and static code analysis that could in theory help find bugs like this.
Assuming the bug is in Dovecot and not in the Linux kernel or the underlying KVM virtualization container.
I was using a KVM that got hacked, and I'm having better luck with CentOS on OpenVZ at the provider on a very small scale system, but certain critical security bits and pieces are going missing in action.
On February 20, 2022 8:39:13 PM AKST, Reuben Farrelly <reuben-dovecot at reub.net> wrote:
>Following up to my original mail:
>
>On 18/02/2022 3:59 pm, Reuben Farrelly wrote:
>> Hi,
>>
>> I've recently migrated my two VMs across from Linode (who use KVM)
>> onto a local VPS service (which also uses KVM). Since doing so I have
>> started to see some strange problems with Dovecot relating to indexes
>> and replication.
>>
>> I have copied the configuration files across from old host to new
>> host. The kernel is the same - as this is Gentoo everything was
>> rebuilt and installed from fresh, but with the same options (use
>> flags). Even the Linux kernel is the same version with the exact same
>> options (as is Dovecot). The filesystem is the same EXT4 with the
>> same options too.
>>
>No one responded from here (is anyone helping on this list anymore?) but
>after many hours I found out the problem was to do with replication on
>the far end host, and not anything to do with either the new VPS or the
>existing dovecot or linux config.
>
>It turns out that if there is an existing Maildir/ in the user's
>directory on the remote replica, the initial sync from the master
>fails. It may fail early on in the sync, or at the end of the initial
>replication but either way it fails and the user ends up with a mailbox
>in a half sync'd state. Even if the remote Maildir is completely empty
>as mine were, it fails - it is the mere presence of the Maildir/
>directory on the remote breaks the sync. Typically new users have a new
>and empty Maildir (copied from /etc/skel) so it fails for them by default.
>
>Once I deleted the Maildir/ from the remote user's home directory and
>the entire contents of a half replica, then dovecot created a new
>Maildir and everything was able to sync through on all users to completion.
>
>To reproduce this: create a new user with an empty (Maildir/new
>Maildir/cur and Maildir/tmp) and then trigger the sync with debug
>manually: doveadm -v -D sync -u username -f tcp:imap2.reub.net:4814
>
>Here - with a completely empty and brand new Maildir/ on both master and
>remote replica we can see it already fails:
>
>tornado ~ # doveadm -v -D sync -u testuser -f tcp:imap2.reub.net:4814
>Debug: Loading modules from directory: /usr/lib64/dovecot/doveadm
>Debug: Skipping module doveadm_acl_plugin, because dlopen() failed:
>/usr/lib64/dovecot/doveadm/lib10_doveadm_acl_plugin.so: undefined
>symbol: acl_user_module (this is usually intentional, so just ignore
>this message)
>Debug: Skipping module doveadm_quota_plugin, because dlopen() failed:
>/usr/lib64/dovecot/doveadm/lib10_doveadm_quota_plugin.so: undefined
>symbol: quota_user_module (this is usually intentional, so just ignore
>this message)
>Debug: Module loaded:
>/usr/lib64/dovecot/doveadm/lib10_doveadm_sieve_plugin.so
>Debug: Skipping module doveadm_fts_plugin, because dlopen() failed:
>/usr/lib64/dovecot/doveadm/lib20_doveadm_fts_plugin.so: undefined
>symbol: fts_user_get_language_list (this is usually intentional, so just
>ignore this message)
>Debug: Skipping module doveadm_mail_crypt_plugin, because dlopen()
>failed: /usr/lib64/dovecot/doveadm/libdoveadm_mail_crypt_plugin.so:
>undefined symbol: mail_crypt_box_get_pvt_digests (this is usually
>intentional, so just ignore this message)
>Feb 21 16:31:51 Debug: Loading modules from directory: /usr/lib64/dovecot
>Feb 21 16:31:51 Debug: Module loaded:
>/usr/lib64/dovecot/lib15_notify_plugin.so
>Feb 21 16:31:51 Debug: Module loaded:
>/usr/lib64/dovecot/lib20_replication_plugin.so
>Feb 21 16:31:51 Debug: Loading modules from directory:
>/usr/lib64/dovecot/doveadm
>Feb 21 16:31:51 Debug: Skipping module doveadm_acl_plugin, because
>dlopen() failed: /usr/lib64/dovecot/doveadm/lib10_doveadm_acl_plugin.so:
>undefined symbol: acl_user_module (this is usually intentional, so just
>ignore this message)
>Feb 21 16:31:51 Debug: Skipping module doveadm_quota_plugin, because
>dlopen() failed:
>/usr/lib64/dovecot/doveadm/lib10_doveadm_quota_plugin.so: undefined
>symbol: quota_user_module (this is usually intentional, so just ignore
>this message)
>Feb 21 16:31:51 Debug: Skipping module doveadm_fts_plugin, because
>dlopen() failed: /usr/lib64/dovecot/doveadm/lib20_doveadm_fts_plugin.so:
>undefined symbol: fts_user_get_language_list (this is usually
>intentional, so just ignore this message)
>Feb 21 16:31:51 Debug: Skipping module doveadm_mail_crypt_plugin,
>because dlopen() failed:
>/usr/lib64/dovecot/doveadm/libdoveadm_mail_crypt_plugin.so: undefined
>symbol: mail_crypt_box_get_pvt_digests (this is usually intentional, so
>just ignore this message)
>Feb 21 16:31:51 doveadm(testuser)<11258><>: Debug: auth-master: userdb
>lookup(testuser): Started userdb lookup
>Feb 21 16:31:51 doveadm(testuser)<11258><>: Debug: auth-master: conn
>unix:/run/dovecot/auth-userdb: Connecting
>Feb 21 16:31:51 doveadm(testuser)<11258><>: Debug: auth-master: conn
>unix:/run/dovecot/auth-userdb (pid=3070,uid=0): Client connected (fd=9)
>Feb 21 16:31:51 doveadm(testuser)<11258><>: Debug: auth-master: userdb
>lookup(testuser): auth USER input: testuser system_groups_user=testuser
>uid=1006 gid=1006 home=/home/testuser
>Feb 21 16:31:51 doveadm(testuser)<11258><>: Debug: auth-master: userdb
>lookup(testuser): Finished userdb lookup (username=testuser
>system_groups_user=testuser uid=1006 gid=1006 home=/home/testuser)
>Feb 21 16:31:51 doveadm(testuser)<11258><znOmIscjE2L6KwAA/82IVg>: Debug:
>Effective uid=1006, gid=1006, home=/home/testuser
>Feb 21 16:31:51 doveadm(testuser)<11258><znOmIscjE2L6KwAA/82IVg>: Debug:
>Namespace inbox: type=private, prefix=, sep=, inbox=yes, hidden=no,
>list=yes, subscriptions=yes location=maildir:~/Maildir
>Feb 21 16:31:51 doveadm(testuser)<11258><znOmIscjE2L6KwAA/82IVg>: Debug:
>maildir++: root=/home/testuser/Maildir, index=, indexpvt=, control=,
>inbox=/home/testuser/Maildir, alt=
>Feb 21 16:31:51 doveadm(testuser): Debug: brain M: Namespace has
>location maildir:~/Maildir
>Feb 21 16:31:51 doveadm(testuser): Debug: Mailbox INBOX: Couldn't open
>mailbox in list index: Mailbox not found
>Feb 21 16:31:51 doveadm(testuser): Debug: Mailbox INBOX: Mailbox opened
>Feb 21 16:31:51 doveadm(testuser): Debug: Namespace : Using permissions
>from /home/testuser/Maildir: mode=0700 gid=default
>Feb 21 16:31:51 doveadm(testuser): Debug: Mailbox INBOX: Mailbox opened
>Feb 21 16:31:51 doveadm(testuser)<11258><znOmIscjE2L6KwAA/82IVg>: Debug:
>brain M: Local mailbox tree: INBOX guid=c0e4ea33c7231362fa2b0000ffcd8856
>uid_validity=1645421511 uid_next=1 subs=no last_change=0 last_subs=0
>Feb 21 16:31:51 doveadm(testuser)<11258><znOmIscjE2L6KwAA/82IVg>: Debug:
>brain M: Remote mailbox tree: INBOX
>guid=c04b3636c72313620f770000794c72b2 uid_validity=1645421511 uid_next=1
>subs=no last_change=0 last_subs=0
>Feb 21 16:31:51 doveadm(testuser)<11258><znOmIscjE2L6KwAA/82IVg>: Debug:
>brain M: Mailbox INBOX: local=c0e4ea33c7231362fa2b0000ffcd8856/0/1,
>remote=c04b3636c72313620f770000794c72b2/0/1: GUIDs conflict - will be
>merged later
>Feb 21 16:31:51 doveadm(testuser)<11258><znOmIscjE2L6KwAA/82IVg>: Debug:
>Mailbox INBOX: Mailbox opened
>Feb 21 16:31:51 doveadm(testuser)<11258><znOmIscjE2L6KwAA/82IVg>: Debug:
>Mailbox INBOX: Mailbox opened
>Feb 21 16:31:51 doveadm(testuser)<11258><znOmIscjE2L6KwAA/82IVg>: Debug:
>Mailbox INBOX: Mailbox opened
>Feb 21 16:31:51 doveadm(testuser)<11258><znOmIscjE2L6KwAA/82IVg>: Debug:
>brain M: Change during sync: Remote lost mailbox GUID
>c04b3636c72313620f770000794c72b2 (maybe it was just deleted?)
>Feb 21 16:31:52 doveadm(testuser)<11258><znOmIscjE2L6KwAA/82IVg>:
>Warning: Mailbox changes caused a desync. You may want to run dsync
>again: Remote lost mailbox GUID c04b3636c72313620f770000794c72b2 (maybe
>it was just deleted?)
>Feb 21 16:31:52 doveadm(11258): Debug: auth-master: conn
>unix:/run/dovecot/auth-userdb (pid=3070,uid=0): Disconnected: Connection
>closed (fd=9)
>
>tornado ~ #
>
>That message about the Mailbox changing causing a desync is what screws
>up the client (in this case testuser) if they happen to be connected to
>the mater at the time of the attempt. That is why client access was
>impacted as soon as remote replication was enabled.
>
>If this is run via normal dovecot sync (and not triggered) it results in
>severe performance problems and disconnections for any clients connected
>to the master replica pretty much anytime they try and move/delete anything.
>
>I saw a couple of old posts over the years where other people have found
>the same thing so I know I'm not the first person to run into this
>problem and I suspect there are a lot more who have run into this
>problem but not realised what was going on.
>
>What do I need to do to get this fixed properly? Bailing out mid way
>through replicating and disconnecting clients (and not self-healing)
>surely can't be the intended behaviour so I'd consider this to be a
>pretty nasty bug.
>
>Reuben
>
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://dovecot.org/pipermail/dovecot/attachments/20220220/568e2a12/attachment-0001.htm>
More information about the dovecot
mailing list