dsync with vurtial folder leads to mess of msg ID
Greetings,
I have a setup with two dovecot servers and dsync replication between them.
On both of them I have a virtual folders, forexample All with config:
mx1# cat /etc/dovecot/virtual/All/dovecot-virtual * -Trash -Trash/* -Junk -Junk/* all
All virtual folders are excluded from replication because it leads to broken replication with errors like: virtual.All can't be created. So, I have.
replication_dsync_parameters = -d -l 30 -U -x virtual.
Unfortently, after a couple of weeks I've noticed that msg id for the same message on the different servers aren't match.
For example:
mx1# doveadm fetch -u kirill@korins.ky 'uid' mailbox virtual.All header Message-ID 'macports/macports-ports/pull/22284/c1901369454@github.com'
uid: 157786
mx1# doveadm fetch -u kirill@korins.ky 'uid' mailbox INBOX header Message-ID 'macports/macports-ports/pull/22284/c1901369454@github.com'
uid: 5561
mx1#
vs
mx2# doveadm fetch -u kirill@korins.ky 'uid' mailbox virtual.All header Message-ID 'macports/macports-ports/pull/22284/c1901369454@github.com'
uid: 157766
mx2# doveadm fetch -u kirill@korins.ky 'uid' mailbox INBOX header Message-ID 'macports/macports-ports/pull/22284/c1901369454@github.com'
uid: 5561
mx2#
as you may see msgid at INBOX are matched, but it isn't true for virtual.All, here 157786 vs 157766
So, here the question: how can I sync msg id inside virtual folder on this setup?
-- wbr, Kirill
I'd like to add that allowing to save messages via virtual folder to some folder by change it defintion to
- !Archive -Trash -Trash/* -Junk -Junk/* all
doesn't help and syncronization fails as usual:
Jan 20 15:34:06 mx1 dovecot: doveadm(kirill@korins.ky)<75563><VshZA97Zq2UrJwEAeeMcAQ>: Error: Can't create mailbox virtual.All: Can't create virtual mailboxes
And to share my setup of virtual folder which is quite simple:
mail_plugins = $mail_plugins virtual
namespace virtual { separator = . prefix = virtual.
location = virtual:/etc/dovecot/virtual:INDEX=~/virtual
mailbox All {
special_use = \All
comment = All messages without Trash and Junk
}
mailbox Flagged {
special_use = \Flagged
comment = All flagged messages without Trash and Junk
}
}
-- wbr, Kirill
On 20/01/2024 17:00 EET Kirill A. Korinsky kirill@korins.ky wrote:
I'd like to add that allowing to save messages via virtual folder to some folder by change it defintion to
- !Archive -Trash -Trash/* -Junk -Junk/* all
doesn't help and syncronization fails as usual:
Jan 20 15:34:06 mx1 dovecot: doveadm(kirill@korins.ky)<75563><VshZA97Zq2UrJwEAeeMcAQ>: Error: Can't create mailbox virtual.All: Can't create virtual mailboxes
And to share my setup of virtual folder which is quite simple:
mail_plugins = $mail_plugins virtual
namespace virtual { separator = . prefix = virtual.
location = virtual:/etc/dovecot/virtual:INDEX=~/virtual mailbox All { special_use = \All comment = All messages without Trash and Junk } mailbox Flagged { special_use = \Flagged comment = All flagged messages without Trash and Junk }
}
-- wbr, Kirill
Can you send output of doveconf -n?
Aki
On Sun, 21 Jan 2024 16:34:44 +0100, Aki Tuomi wrote:
Can you send output of doveconf -n?
Sure, here it is:
# 2.3.20 (80a5ac675d): /etc/dovecot/dovecot.conf
# Pigeonhole version 0.5.20 (149edcf2)
# OS: OpenBSD 7.4 amd64
# Hostname: mx1.catap.net
default_vsz_limit = 10 G
doveadm_password = # hidden, use -P to show it
doveadm_port = 12345
first_valid_uid = 1000
mail_attribute_dict = file:%h/Maildir/dovecot-attributes
mail_gid = vmail
mail_location = maildir:~/Maildir
mail_plugins = notify replication fts fts_flatcurve virtual
mail_uid = vmail
managesieve_notify_capability = mailto
managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy include variables body enotify environment mailbox date index ihave duplicate mime foreverypart extracttext
mbox_write_locks = fcntl
mmap_disable = yes
namespace inbox {
inbox = yes
location =
mailbox Archive {
auto = subscribe
special_use = \Archive
}
mailbox Drafts {
auto = subscribe
special_use = \Drafts
}
mailbox Junk {
auto = subscribe
autoexpunge = 60 days
special_use = \Junk
}
mailbox Sent {
auto = subscribe
special_use = \Sent
}
mailbox Trash {
auto = subscribe
autoexpunge = 60 days
special_use = \Trash
}
prefix =
separator = .
}
namespace virtual {
location = virtual:/etc/dovecot/virtual:INDEX=~/virtual
mailbox All {
comment = All messages without Trash and Junk
special_use = \All
}
mailbox Flagged {
comment = All flagged messages without Trash and Junk
special_use = \Flagged
}
prefix = virtual.
separator = .
}
passdb {
args = username_format=%u /etc/dovecot/users
driver = passwd-file
}
plugin {
fts = flatcurve
fts_filters = normalizer-icu snowball stopwords
fts_filters_en = lowercase snowball english-possessive stopwords
fts_flatcurve_substring_search = yes
fts_languages = en de ru es
fts_tokenizer_generic = algorithm=simple
fts_tokenizers = generic email-address
mail_replica = tcps:mx2.catap.net
sieve = file:~/sieve;active=~/.dovecot.sieve
sieve_before = /etc/dovecot/sieve_before.d/
}
protocols = imap lmtp sieve
replication_dsync_parameters = -d -l 30 -U -x virtual.
service aggregator {
fifo_listener replication-notify-fifo {
user = vmail
}
unix_listener replication-notify {
user = vmail
}
}
service doveadm {
inet_listener {
port = 12345
ssl = yes
}
}
service imap-login {
inet_listener imap {
port = 143
}
inet_listener imaps {
port = 993
ssl = yes
}
}
service lmtp {
unix_listener lmtp {
mode = 0666
}
}
service managesieve-login {
inet_listener sieve {
port = 4190
}
}
service replicator {
process_min_avail = 1
unix_listener replicator-doveadm {
mode = 0600
user = vmail
}
}
ssl_cert =
-- wbr, Kirill
On Sun, 21 Jan 2024 17:31:37 +0100, Kirill A. Korinsky wrote:
mail_replica = tcps:mx2.catap.net
I've discovered that using TCP+SSL replication leads to errors in logs:
Error: Couldn't lock ../.dovecot-sync.lock: fcntl(../.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid XXX)
I can't say that mails weren't replicated. It was, but it was quite unstable.
So, now I've switched to simple TCP replication, and recreated all virtual folders.
So far so good.
-- wbr, Kirill
On Fri, 26 Jan 2024 01:44:06 +0100, Kirill A. Korinsky wrote:
So far so good.
And here we go again
mx1# doveadm fetch -u kirill@korins.ky 'uid' mailbox virtual.All | grep '^uid:' | tail -n 20 uid: 144044 uid: 144045 uid: 144046 uid: 144047 uid: 144048 uid: 144049 uid: 144050 uid: 144051 uid: 144052 uid: 144053 uid: 144055 uid: 144059 uid: 144062 uid: 144073 uid: 144074 uid: 144075 uid: 144076 uid: 144077 uid: 144078 uid: 144079 mx1#
vs
mx2# doveadm fetch -u kirill@korins.ky 'uid' mailbox virtual.All | grep '^uid:' | tail -n 20 uid: 144044 uid: 144045 uid: 144046 uid: 144047 uid: 144048 uid: 144049 uid: 144050 uid: 144051 uid: 144052 uid: 144053 uid: 144054 uid: 144057 uid: 144060 uid: 144071 uid: 144072 uid: 144073 uid: 144074 uid: 144075 uid: 144076 uid: 144077 mx2#
notable to say that message 144055 / 144054 was BCC of my message to the list where I said that so far so good.
BTW when I go to the logs, I haven't found anything that can explain it.
Any suggestion how can I debug it future?
-- wbr, Kirill
Greetings,
I'd like to report that switch from TCP based replication to SSH based seems to solve this issue.
SSH-based replication last for almost 24h. After I've switched to it I had removed all virtual folders on all users, that triggres re-creating it.
Since then, I can't reproduce it.
mx1# doveadm fetch -u kirill@korins.ky 'uid' mailbox virtual.All | grep '^uid:' | tail -n 5
uid: 144747
uid: 144748
uid: 144749
uid: 144750
uid: 144751
mx1#
and
mx2# doveadm fetch -u kirill@korins.ky 'uid' mailbox virtual.All | grep '^uid:' | tail -n 5
uid: 144747
uid: 144748
uid: 144749
uid: 144750
uid: 144751
mx2#
If error appears back, I'll be back to this thread.
-- wbr, Kirill
Greetings,
As promised: I'm back.
After processing about 740 mails, it is again.
Via SSH based replication it isn't so dramatic, but exists.
See:
mx1# doveadm fetch -u kirill@korins.ky 'uid' mailbox virtual.All | grep '^uid:' | tail -n 5 uid: 145485 uid: 145486 uid: 145487 uid: 145488 uid: 145489 mx1# doveadm fetch -u kirill@korins.ky hdr.Message-ID mailbox virtual.All uid 145489 hdr.message-id: CADguC0GTcKCTzWPDwF4XYu63oZzjZyCQmiUzgsUOeoz6gOp9pA@mail.gmail.com mx1#
vs
mx2# doveadm fetch -u kirill@korins.ky 'uid' mailbox virtual.All | grep '^uid:' | tail -n 5 uid: 145487 uid: 145488 uid: 145489 uid: 145490 uid: 145491 mx2# doveadm fetch -u kirill@korins.ky hdr.Message-ID mailbox virtual.All uid 145491 hdr.message-id: CADguC0GTcKCTzWPDwF4XYu63oZzjZyCQmiUzgsUOeoz6gOp9pA@mail.gmail.com mx2# doveadm fetch -u kirill@korins.ky hdr.Message-ID mailbox virtual.All uid 145489 hdr.message-id: 87edd58shf.wl-kirill@korins.ky mx2#
So, by some digging I discovered the mails which triggers an issue:
mx1# doveadm fetch -u kirill@korins.ky 'uid' mailbox virtual.All | grep '^uid: 14484' uid: 14484 uid: 144840 uid: 144841 uid: 144845 uid: 144848 uid: 144849 mx1# doveadm fetch -u kirill@korins.ky hdr mailbox virtual.All uid 144841 | md5 d65a22e12bd0c72d321a78d2efd3383b mx1# doveadm fetch -u kirill@korins.ky hdr mailbox virtual.All uid 144840 | md5 25a8b53ecfd513cf9031dae0e9000d92 mx1#
vs
mx2# doveadm fetch -u kirill@korins.ky 'uid' mailbox virtual.All | grep '^uid: 14484' uid: 14484 uid: 144840 uid: 144842 uid: 144846 uid: 144849 mx2# doveadm fetch -u kirill@korins.ky hdr mailbox virtual.All uid 144842 | md5 d65a22e12bd0c72d321a78d2efd3383b mx2# doveadm fetch -u kirill@korins.ky hdr mailbox virtual.All uid 144840 | md5 25a8b53ecfd513cf9031dae0e9000d92 mx2#
this email 144841 / 144842 were delivered to mx1 at 17 Feb 2024 17:52:18 which allows me do check logs:
mx1# zgrep '17:52:[123]' /var/log/maillog.3.gz Feb 17 17:52:15 mx1 smtpd[52094]: d7da603e4213f8ff smtp connected address=199.185.178.25 host=mail.openbsd.org Feb 17 17:52:15 mx1 smtpd[39991]: dnsbl: d7da603e4213f8ff DNS error 2 on cbl.abuseat.org Feb 17 17:52:17 mx1 smtpd[52094]: d7da603e4213f8ff smtp tls ciphers=TLSv1.3:TLS_AES_256_GCM_SHA384:256 Feb 17 17:52:18 mx1 smtpd[52094]: d7da603e4213f8ff smtp message msgid=d8e3748c size=4552 nrcpt=1 proto=ESMTP Feb 17 17:52:18 mx1 smtpd[52094]: d7da603e4213f8ff smtp envelope evpid=d8e3748cc73371ec from=owner-ports+M141210@openbsd.org to=kirill@korins.ky Feb 17 17:52:19 mx1 dovecot: lmtp(15471): Connect from local Feb 17 17:52:19 mx1 dovecot: lmtp(kirill@korins.ky)<15471><h2dQAEPk0GVvPAAAeeMcAQ>: sieve: msgid=4bc76d19-00f2-b60b-a87f-315adfd085bb@grey.my.domain: stored mail into mailbox 'INBOX' Feb 17 17:52:19 mx1 dovecot: lmtp(15471): Disconnect from local: Logged out (state=READY) Feb 17 17:52:19 mx1 smtpd[52094]: d7da603f60cff302 mda delivery evpid=d8e3748cc73371ec from=owner-ports+M141210@openbsd.org to=kirill@korins.ky rcpt=kirill@korins.ky user=vmail delay=1s result=Ok stat=Delivered Feb 17 17:52:19 mx1 smtpd[52094]: d7da603e4213f8ff smtp disconnected reason=quit mx1#
nothing, let check mx2:
mx2# zgrep '17:52:[123]' /var/log/maillog.3.gz Feb 17 17:52:17 mx2 smtpd[3556]: 99938cb221b4cb0b smtp connected address=199.185.178.25 host=mail.openbsd.org Feb 17 17:52:20 mx2 smtpd[3556]: 99938cb221b4cb0b smtp tls ciphers=TLSv1.3:TLS_AES_256_GCM_SHA384:256 Feb 17 17:52:30 mx2 smtpd[3556]: 99938cb221b4cb0b smtp disconnected reason=quit mx2#
nothing as well.
I do have full sync each 15 minutes, let check any dsync logs on bigger interval (an hour):
mx1# zgrep '17 1[78]:' /var/log/maillog.3.gz | grep dsync mx1#
and
mx2# zgrep '17 1[78]:' /var/log/maillog.3.gz | grep dsync mx2#
Also hasn't got anything.
It seems that SSH based replication is better than TCP, and both of at least works that I can't say about TCPS.
But it may introduce a kind of collision inside virtual boxes.
So, to future investigate an issue I've started to run dsync-server as: doveadm -Dv dsync-server, I hoep that it allows to catch this.
And as usual re-created vurtial boxes.
-- wbr, Kirill
Greetings,
Rigt now both of servers works with debug log of dsync and produces ~1G logs per day.
Seems that such verbose output decreases probability of issue.
For a week no more desync of uid in virtual boxes.
mx1# doveadm fetch -u kirill@korins.ky 'uid' mailbox virtual.All | grep '^uid:' | tail -n 5 uid: 145708 uid: 145709 uid: 145711 uid: 145713 uid: 145714 mx1# doveadm fetch -u kirill@korins.ky hdr mailbox virtual.All uid 145714 | md5 859986e6c659f8df9bab5efa8002ee8e mx1# doveadm fetch -u kirill@korins.ky hdr mailbox virtual.All uid 145713 | md5 ae1e087ff70a00cb7df739d4dda9b933 mx1# doveadm fetch -u kirill@korins.ky hdr mailbox virtual.All uid 145711 | md5 b70ad38a0ba4b1e8087f93e3c5cd438f
vs
mx2# doveadm fetch -u kirill@korins.ky 'uid' mailbox virtual.All | grep '^uid:' | tail -n 5 uid: 145708 uid: 145709 uid: 145711 uid: 145713 uid: 145714 mx2# doveadm fetch -u kirill@korins.ky hdr mailbox virtual.All uid 145714 | md5 859986e6c659f8df9bab5efa8002ee8e mx2# doveadm fetch -u kirill@korins.ky hdr mailbox virtual.All uid 145713 | md5 ae1e087ff70a00cb7df739d4dda9b933 mx2# doveadm fetch -u kirill@korins.ky hdr mailbox virtual.All uid 145711 | md5 b70ad38a0ba4b1e8087f93e3c5cd438f
As the next step I've reduced logs to verbose, let's see how it goes.
-- wbr, Kirill
Greetings,
On Wed, 28 Feb 2024 22:15:55 +0100, Kirill A. Korinsky wrote:
As the next step I've reduced logs to verbose, let's see how it goes.
Reducing verbosity re-triggered an issue.
Meanwhile, as a lucky guess I've increased the number of open files on all users which is used by dovecot as:
mx1$ cat /etc/login.conf.d/*
_dovecot:
:openfiles=1024:
:tc=daemon:
dsync:
:openfiles=1024:
:tc=daemon:
vmail:
:openfiles=1024:
:tc=daemon:
mx1$
My user has a lot of IMAP folders, and if I run doveadm fetch from shell with default limits (open files 128 files) it puts a lot of errors like:
doveadm(kirill@korins.ky): Error: fetch(guid) failed for box=virtual.All uid=5097: Internal error occurred. Refer to server log for more information. [2024-04-18 14:39:15] (for backend mailbox Archive.Texunatech) doveadm(kirill@korins.ky): Error: Mailbox Archive.Texunatech: open(/var/vmail/kirill@korins.ky/Maildir/.SomeFolder/dovecot-uidlist) failed: Too many open files
So, somewhere here I made a guess errors are swallow, but it exists.
So, I've moved future by this way.
- Shutdown smtpd and dovect.
- Disabled smtpd from autostart.
- Disabled replication on both servers.
- Update logins.conf to increase limits.
- Remove all virtual folders from all users.
- Reboot server.
- Wait couple of minuts (like 10) and enable replication.
- Wait one full sync cycle.
- Enable smtpd and start it.
After this I do have consisten output:
mx1# doveadm fetch "uid guid" -u kirill@korins.ky mailbox virtual.All | md5 8978c5e506bf7bcf769232d48b964e00 mx1#
and
mx2# doveadm fetch "uid guid" -u kirill@korins.ky mailbox virtual.All | md5 8978c5e506bf7bcf769232d48b964e00 mx2#
Anyway, that setup works for a day or two and I can't say that it really help because I need to update machines to OpenBSD 7.5 to update OpenSMTPD to fix another issue. This update included install of dovecot to 2.3.21 (47349e2482).
So, if issue do not reappear in couple of weeks, I can't say that help: increase of limits or update of dovecot.
Anyway, I'll be back.
-- wbr, Kirill
Greetings,
Seems that I had catched a proof that it is defently bug in dsync.
On Thu, 18 Apr 2024 16:59:23 +0200, Kirill A. Korinsky wrote:
Anyway, I'll be back.
As usual after I sent an update here, the issue is comeback.
mx1# doveadm fetch "uid guid" -u kirill@korins.ky mailbox virtual.All | tail -n 30
uid: 147696 guid: 1713451338.M678759P24080.mx2.catap.net,S=5182,W=5299
uid: 147699 guid: 1713452322.M905815P78454.mx2.catap.net,S=6036,W=6143
uid: 147700 guid: 1713452365.M594673P95241.mx1.catap.net,S=3958,W=4066
uid: 147701 guid: 1713452405.M535379P78454.mx2.catap.net,S=3351,W=3431
uid: 147705 guid: 1713453074.M691924P5954.mx1.catap.net,S=6333,W=6437
uid: 147706 guid: 1713453359.M591908P38204.mx1.catap.net,S=54792,W=56096
uid: 147707 guid: 1713453469.M178617P70020.mx2.catap.net,S=2406,W=2453
uid: 147708 guid: 1713452587.M931589P64530.mx1.catap.net,S=5044,W=5161
uid: 147709 guid: 1713453576.M23051P30161.mx1.catap.net,S=2044,W=2090
uid: 147710 guid: 1713452952.M729802P32094.mx1.catap.net,S=4873,W=5017 mx1#
vs
mx2# doveadm fetch "uid guid" -u kirill@korins.ky mailbox virtual.All | tail -n 30
uid: 147696 guid: 1713451338.M678759P24080.mx2.catap.net,S=5182,W=5299
uid: 147699 guid: 1713452365.M594673P95241.mx1.catap.net,S=3958,W=4066
uid: 147701 guid: 1713452322.M905815P78454.mx2.catap.net,S=6036,W=6143
uid: 147702 guid: 1713452405.M535379P78454.mx2.catap.net,S=3351,W=3431
uid: 147707 guid: 1713452587.M931589P64530.mx1.catap.net,S=5044,W=5161
uid: 147708 guid: 1713452952.M729802P32094.mx1.catap.net,S=4873,W=5017
uid: 147709 guid: 1713453074.M691924P5954.mx1.catap.net,S=6333,W=6437
uid: 147710 guid: 1713453359.M591908P38204.mx1.catap.net,S=54792,W=56096
uid: 147711 guid: 1713453469.M178617P70020.mx2.catap.net,S=2406,W=2453
uid: 147712 guid: 1713453576.M23051P30161.mx1.catap.net,S=2044,W=2090 mx2#
as you may notice differences at begining of output:
mx1: uid: 147696 guid: 1713451338.M678759P24080.mx2.catap.net,S=5182,W=5299
uid: 147699 guid: 1713452322.M905815P78454.mx2.catap.net,S=6036,W=6143
uid: 147700 guid: 1713452365.M594673P95241.mx1.catap.net,S=3958,W=4066
mx2: uid: 147696 guid: 1713451338.M678759P24080.mx2.catap.net,S=5182,W=5299
uid: 147699 guid: 1713452365.M594673P95241.mx1.catap.net,S=3958,W=4066
uid: 147701 guid: 1713452322.M905815P78454.mx2.catap.net,S=6036,W=6143
and here the collistion in email with id 147699 which is different on different servers.
Funny things that 1713452365.M594673P95241.mx1.catap.net,S=3958,W=4066 is my email to the mail list.
Let see the logs for 16:58 and 16:59 on both servers.
mx1:
Apr 18 16:58:36 mx1 dovecot: imap-login: Login: user=k@catap.net, method=PLAIN, rip=90.187.54.213, lip=162.55.82.72, mpid=33482, TLS, session=
mx2:
Apr 18 16:58:34 mx2 dovecot: imap(kirill@korins.ky)<22911><gixTEGAWhupauzbV>: Disconnected: Logged out in=383 out=7513 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=0 body_bytes=0
Apr 18 16:58:35 mx2 dovecot: imap(kirill@korins.ky)<24960><3jWvEGAWjupauzbV>: Disconnected: Logged out in=305 out=1496 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=0 body_bytes=0
Apr 18 16:58:35 mx2 dovecot: imap(kirill@korins.ky)<65049><6Xf9LmAWwepauzbV>: Disconnected: Logged out in=5277 out=606180 deleted=0 expunged=0 trashed=0 hdr_count=50 hdr_bytes=119949 body_count=52 body_bytes=460135
Apr 18 16:58:35 mx2 dovecot: imap(k@kcry.pt)<47696><+wdTEGAWiOpauzbV>: Disconnected: Logged out in=319 out=2153 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=0 body_bytes=0
Apr 18 16:58:35 mx2 dovecot: imap(k@kcry.pt)<67677><JqCWEGAWjOpauzbV>: Disconnected: Logged out in=438 out=1590 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=0 body_bytes=0
Apr 18 16:58:35 mx2 dovecot: imap(k@catap.de)<3774>
All of this allows to reconstruct the workflow which leads to an issue:
- mx2 had a new mail.
- mx2 registered that mail as 147699 inside virtual folder.
- by some reason this email wasn't replicated to mx1 for at least 30-40 seconds. Maybe it runs full-scan on all users, I have no idea.
- mx1 had a new mail as well.
- mx1 resitered that mail as 147699 inside virtual folder.
- mx1's mail was replicated to mx2 as 147700
- mx2's mail was replicated back to mx1 as 147701
-- wbr, Kirill
Greeting,
On Thu, 18 Apr 2024 17:51:06 +0200, Kirill A. Korinsky wrote:
All of this allows to reconstruct the workflow which leads to an issue:
- mx2 had a new mail.
- mx2 registered that mail as 147699 inside virtual folder.
- by some reason this email wasn't replicated to mx1 for at least 30-40 seconds. Maybe it runs full-scan on all users, I have no idea.
- mx1 had a new mail as well.
- mx1 resitered that mail as 147699 inside virtual folder.
- mx1's mail was replicated to mx2 as 147700
- mx2's mail was replicated back to mx1 as 147701
I was wrong. I run a series of expirements and achived a strategy which allows to reproduce this issue.
I've used the following definition of virtual folder all:
- !Archive -Trash -Trash/* -Junk -Junk/* all
I also have two servers: mx1 and mx2.
So, to reproduce an issue with dsync msgid inside virtual folder I need:
- Connect to mx1 via imap
- Connect to mx2 via smtp
- Initialize sending a mail via smtp
- In parallel move some mail from Inbox to Virtual.All
With quite big probability I broke the order.
When I enable the debug log, it slow down things because dovecot and smtpd uses the same /var/log/maillog on my system.
Another way to avoid such condition, is switch off multi core on machine.
Without parallerism I can't reproduce the issue.
-- wbr, Kirill
Greeting,
Since last status update I've switched my workflow from active using virtual.All into virtual.Archive folder.
After the switch I've recreated virtual.All on both servers.
Folder definition:
mx1$ cat /etc/dovecot/virtual/Archive/dovecot-virtual
!Archive
Archive*
all
mx1$ cat /etc/dovecot/virtual/All/dovecot-virtual
*
!Archive
-Trash*
-Junk*
all
mx1$
an output for doveadm fetch "uid guid" -u kirill@korins.ky mailbox FOLDER | md5
Archive mx1: 176fdedf6fe641b5580b5deeb3244539 mx2: 176fdedf6fe641b5580b5deeb3244539 All mx1: 4c1fcd6349bee770f2fbb8dc1bbe42a2 mx2: 9af53aa8fdbfadf5d1d5cf01daef5617
Excluding INBOX from virtual folder seems that allows to avoid the issue.
-- wbr, Kirill
Greetings,
On Sun, 21 Apr 2024 21:52:41 +0100, Kirill A. Korinsky kirill@korins.ky wrote:
Excluding INBOX from virtual folder seems that allows to avoid the issue.
I'd like to confirm that excluding INBOX from virtual folder indeed allows to avoid that issue.
Any suggestion how can I help to debug that issue future?
-- wbr, Kirill
Greetings,
On Sun, 21 Apr 2024 22:52:41 +0200, Kirill A. Korinsky kirill@korins.ky wrote:
Excluding INBOX from virtual folder seems that allows to avoid the issue.
And today it had exploded inside virtual.Archive folder.
Inside logs regarding an email which triggered an issue I have:
Sep 23 17:44:50 mx2 smtpd[84651]: b66d1fb2fe2594ce smtp tls ciphers=TLSv1.3:TLS_AES_256_GCM_SHA384:256 Sep 23 17:44:50 mx2 smtpd[84651]: b66d1fb2fe2594ce smtp message msgid=606a9956 size=7650 nrcpt=1 proto=ESMTP Sep 23 17:44:50 mx2 smtpd[84651]: b66d1fb2fe2594ce smtp envelope evpid=606a995604f4b3dc from=owner-ports+M146137@openbsd.org to=kirill@korins.ky Sep 23 17:44:50 mx2 dovecot: lmtp(79867): Connect from local Sep 23 17:44:50 mx2 dovecot: lmtp(kirill@korins.ky)<79867><kAmoLPKM8Wb7NwEAU0lCkg>: sieve: msgid=5e1eb67d940cfa13@mx1.catap.net: stored mail into mailbox 'INBOX' Sep 23 17:44:50 mx2 dovecot: lmtp(79867): Disconnect from local: Logged out (state=READY) Sep 23 17:44:50 mx2 smtpd[84651]: b66d1fb3159607a1 mda delivery evpid=606a995604f4b3dc from=owner-ports+M146137@openbsd.org to=kirill@korins.ky rcpt=kirill@korins.ky user=vmail delay=0s result=Ok stat=Delivered Sep 23 17:45:00 mx2 smtpd[84651]: b66d1fb2fe2594ce smtp disconnected reason=quit
This is the only incomming email inside the cluster at this time.
Meanwhile mx1 has uid 158528 for this email, and mx2 uid has 158529.
If I recall right about this time I had moved email from Archive to Inbox, and I was connected to mx1 that time.
I also discover this commit https://github.com/dovecot/core/commit/3001c1410e72440e48c285890dd31625b7e12...
As naive approach I had backported it to 2.3.21.1 let see how it goes.
-- wbr, Kirill
On Tue, 24 Sep 2024 00:20:58 +0200, Kirill A. Korinsky kirill@korins.ky wrote:
I also discover this commit https://github.com/dovecot/core/commit/3001c1410e72440e48c285890dd31625b7e12...
As naive approach I had backported it to 2.3.21.1 let see how it goes.
It doesn't help.
-- wbr, Kirill
Well,
Using replication_sync_timeout = 60 or 15 leads to crash, 10 doesn't.
It is Dovecot v2.3.21.1 (d492236fa0) on OpenBSD.
Inside mail logs:
Sep 24 15:58:20 mx2 dovecot: lmtp(70545): Connect from local Sep 24 15:58:25 mx2 dovecot: replicator: Panic: data stack: Out of memory when allocating 17179869224 bytes Sep 24 15:58:30 mx2 dovecot: lmtp(k@catap.net)<70545><4IQPAXzF8maREwEAU0lCkg>: Warning: replication(k@catap.net): Sync failure: Sep 24 15:58:30 mx2 dovecot: lmtp(k@catap.net)<70545><4IQPAXzF8maREwEAU0lCkg>: Warning: replication(k@catap.net): Remote sent invalid input: - Sep 24 15:58:30 mx2 dovecot: lmtp(k@catap.net)<70545><4IQPAXzF8maREwEAU0lCkg>: sieve: msgid=2bb5ed0ad8291eed@mx2.catap.net: stored mail into mailbox 'INBOX' Sep 24 15:58:30 mx2 dovecot: replicator: Fatal: master: service(replicator): child 21532 killed with signal 6 (core dumped) Sep 24 15:58:30 mx2 dovecot: lmtp(70545): Disconnect from local: Logged out (state=READY)
and stacktrace:
(gdb) bt f
#0 thrkill () at /tmp/-:2
No locals.
#1 0xfc8e9aa8d08a5c82 in ?? ()
No symbol table info available.
#2 0x00000c7c2b59121b in _libc_abort () at /usr/src/lib/libc/stdlib/abort.c:51
sa = {__sigaction_u = {__sa_handler = 0x0, __sa_sigaction = 0x0}, sa_mask = 4294967295, sa_flags = 0}
mask = 4294967263
#3 0x00000c7bec3b6015 in default_fatal_finish (type=LOG_TYPE_PANIC, status=0) at failures.c:465
recursed = 0
backtrace = <optimized out>
#4 0x00000c7bec3b4487 in fatal_handler_real (ctx=0x7bec0b423480, format=<optimized out>, args=<optimized out>) at failures.c:477
status = 0
#5 0x00000c7bec3b53c5 in i_internal_fatal_handler (ctx=0x0, format=0x6
-- wbr, Kirill
On Tue, 24 Sep 2024 16:05:14 +0200, Kirill A. Korinsky kirill@korins.ky wrote:
#9 0x00000c7bec3e1054 in pool_data_stack_realloc (pool=<optimized out>, mem=0xc7d0cfeb028, old_size=4294967296, new_size=8589934592) at mempool-datastack.c:173
Here:
- 4294967296 is 0x100000000
- 8589934592 is 0x200000000
I have many core files which always contains that value.
I don't belive that such round number is broken stack or similar issue.
-- wbr, Kirill
Folks,
finally, I was able to track the issue and fix it.
Here the PR: https://github.com/dovecot/core/pull/223
With this replication_sync_timeout = 10 wroks without any crash for more than of 7 hours on my cluster. Before it crashes each processed email.
Thus, this fix allows to switch to TCP+SSL replication.
Seems that this small fix makes replication quite stable. I hope that you can review and merge it into 2.3 branch, and consider to recover replicator in 2.4.
Aki?
-- wbr, Kirill
participants (2)
-
Aki Tuomi
-
Kirill A. Korinsky