Re: 2.3.1 Replication is throwing scary errors
Hi,
Message: 2 Date: Mon, 2 Apr 2018 22:06:07 +0200 From: Michael Grimm trashcan@ellael.org To: Dovecot Mailing List dovecot@dovecot.org Subject: 2.3.1 Replication is throwing scary errors Message-ID: 29998016-D62F-4348-93D1-613B13DA90DB@ellael.org Content-Type: text/plain; charset=utf-8
Hi
[This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct servers.]
I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by error messages at server1 (and vice versa at server2) as follows:
| Apr 2 17:12:18
server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled,
no activity for 600 seconds (last sent=mail_change, last recv=mail_change (EOL)) | Apr 2 17:12:18server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails
(send=changes recv=mail_requests) [?] | Apr 2 18:59:03server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled,
no activity for 600 seconds (last sent=mail, last recv=mail (EOL)) | Apr 2 18:59:03server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails
(send=mails recv=recv_last_common)I cannot see in my personal account any missing replications, *but* I haven't tested this thoroughly enough. I do have customers being serviced at these productive servers, *thus* I'm back to 2.2.35 until I do understand or have learned what is going on.
Any ideas/feedback?
FYI: I haven't seen such errors before. Replication has been working for years now, without any glitches at all.
Regards, Michael
It's not just you. This issue hit me recently, and it was impacting replication noticeably. I am following git master-2.3 .
Here's a last known reasonably good point in the tree where things worked quite well:
EGIT_REPO_URI="https://github.com/dovecot/core.git" EGIT_BRANCH="master-2.3" EGIT_COMMIT="d9a1a7cbec19f4c6a47add47688351f8c3a0e372"
So something after that (which could have gone into 2.3.1) has caused this.
There is also a second issue of a long standing race with replication occurring somewhere whereby if a mail comes in, is written to disk, is replicated and then deleted in short succession, it will reappear again to the MUA. I suspect the mail is being replicated back from the remote. A few people have reported it over the years but it's not reliable or consistent, so it has never been fixed.
And lastly there has been an ongoing but seemingly minor issue relating to locking timing out after 30s particularly on the remote host that is being replicated to. I rarely see the problem on my local disk where almost all of the mail comes in, it's almost always occurring on the replicate/remote system. For me it seems very unlikely that on an unloaded/idle VPS there are legitimate problems obtaining a lock in under 30s. This is with the default locking configuration. This problem started happening a lot more after the breakage in (1) above.
These replication issues are similar, and could possibly be related.
My system is Gentoo Linux keeping up with the latest kernels, and on an EXT4 FS. I am using TCPS based replication. My remote replicate is also on Gentoo Linux with EXT4 but on a Linode VPS (around 250ms latency away).
I know in a later post you've said that you don't think it has anything to do with dovecot-2.3.1, so I'd be interested to know what really is the cause in that case.
Reuben
Hi,
There is also a second issue of a long standing race with replication occurring somewhere whereby if a mail comes in, is written to disk, is replicated and then deleted in short succession, it will reappear again to the MUA. I suspect the mail is being replicated back from the remote. A few people have reported it over the years but it's not reliable or consistent, so it has never been fixed.
sounds like my replication issue which is reproducible on 2.2.35 and does not occur on 2.2.33.2, so I assume something in the replication code has changed between these two versions.
dsync is copying the mail before expunge in this situation (no sieve filters involved):
(mail received on mx2a.example.com and delivered via dsync/ssh to mx2b.example.com, then expunged via pop3 on mx2b.example.com -> copy/duplicate) Mar 26 15:35:58 mx2b.example.com dovecot[3825]: pop3(poptest@example.com): expunge: box=INBOX, uid=23, msgid=ACD482CD-443E-479A-A9E9-67549E186E35@example.com, size=1210, subject=test 1535 Mar 26 15:35:58 mx2b.example.com dovecot[3825]: doveadm: Error: dsync-remote(poptest@example.com): Info: copy from INBOX: box=INBOX, uid=24, msgid=ACD482CD-443E-479A-A9E9-67549E186E35@example.com, size=1210, subject=test 1535 Mar 26 15:35:58 mx2b.example.com dovecot[3825]: doveadm: Error: dsync-remote(poptest@example.com): Info: expunge: box=INBOX, uid=23, msgid=ACD482CD-443E-479A-A9E9-67549E186E35@example.com, size=1210, subject=test 1535
For more details see mail from 2018-03-27 / Duplicate mails on pop3 expunge with dsync replication on 2.2.35 (2.2.33.2 works)
Best regards, Gerald
On 4 Apr 2018, at 01:34, Reuben Farrelly reuben-dovecot@reub.net wrote:
Hi,
Message: 2 Date: Mon, 2 Apr 2018 22:06:07 +0200 From: Michael Grimm trashcan@ellael.org To: Dovecot Mailing List dovecot@dovecot.org Subject: 2.3.1 Replication is throwing scary errors Message-ID: 29998016-D62F-4348-93D1-613B13DA90DB@ellael.org Content-Type: text/plain; charset=utf-8 Hi [This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct servers.] I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by error messages at server1 (and vice versa at server2) as follows: | Apr 2 17:12:18
server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled,
no activity for 600 seconds (last sent=mail_change, last recv=mail_change (EOL)) | Apr 2 17:12:18server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails
(send=changes recv=mail_requests) [?] | Apr 2 18:59:03server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled,
no activity for 600 seconds (last sent=mail, last recv=mail (EOL)) | Apr 2 18:59:03server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails
(send=mails recv=recv_last_common) I cannot see in my personal account any missing replications, *but* I haven't tested this thoroughly enough. I do have customers being serviced at these productive servers, *thus* I'm back to 2.2.35 until I do understand or have learned what is going on. Any ideas/feedback? FYI: I haven't seen such errors before. Replication has been working for years now, without any glitches at all. Regards, MichaelIt's not just you. This issue hit me recently, and it was impacting replication noticeably. I am following git master-2.3 .
I am seeing the same as Michael Grimm also on FreeBSD-11. You’ll also notice in doveadm replicator status ‘*’ that the failed flag is raised for those users and that there are processes just hanging forever when those logs start to appear:
<user> 45949 0.0 0.0 47888 13276 - I 20:20 0:00.10 doveadm-server: [<user> Verwijderde items send:mail_requests recv:changes] (doveadm-server) <user2> 45964 0.0 0.0 49860 11608 - I 20:20 0:00.05 doveadm-server: [IP6 <user2> INBOX import:1/3] (doveadm-server) <user3> 45965 0.0 0.1 58256 19820 - I 20:20 0:00.11 doveadm-server: [IP6 <user3> INBOX import:16/18] (doveadm-server) <user4> 46480 0.0 0.0 53536 16288 - I 20:22 0:00.08 doveadm-server: [IP6 <user4> INBOX import:4/6] (doveadm-server) <user5> 46745 0.0 0.0 51496 14184 - I 20:22 0:00.07 doveadm-server: [IP6 <user5> INBOX import:5/6] (doveadm-server)
I also reverted to 2.2.35 because I started to get complaints from my users that mail was missing.
Cheers Remko
Reuben Farrelly wrote:
From: Michael Grimm trashcan@ellael.org
[This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct servers.] I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by error messages at server1 (and vice versa at server2) as follows: | Apr 2 17:12:18
server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled,
no activity for 600 seconds (last sent=mail_change, last recv=mail_change (EOL)) | Apr 2 17:12:18server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails
(send=changes recv=mail_requests) [?] | Apr 2 18:59:03server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled,
no activity for 600 seconds (last sent=mail, last recv=mail (EOL)) | Apr 2 18:59:03server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails
(send=mails recv=recv_last_common) I cannot see in my personal account any missing replications, *but* I haven't tested this thoroughly enough. I do have customers being serviced at these productive servers, *thus* I'm back to 2.2.35 until I do understand or have learned what is going on.
In my reply to this statement of mine I mentioned that I have seen those timeouts quite some times during the past year. Thus, I upgraded to 2.3.1 again, and boom: after some hours I ended up in hanging processes [1] like (see Remko's mail in addition) ...
doveadm-server: [IP4/6 <USER1> SOME/MAILBOX import:0/0] (doveadm-server)
… at server2 paired with a file like …
-rw------- 1 vmail dovecot uarch 0 Apr 3 16:52 /home/to/USER1/.dovecot-sync.lock
Corresponding logfile entries at server2 are like …
Apr 3 17:10:49
fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds
(WRITE lock held by pid 51110)
[1] Even stopping dovecot will not end those processes. One has to manually kill those before restarting dovecot.
After one day of testing 2.3.1 with a couple of those episodes of locking/timeout, and now missing mails depending with server your MUA will connect to, I went back to 2.2.35. After two days at that version I never had such an episode again.
It's not just you. This issue hit me recently, and it was impacting replication noticeably. I am following git master-2.3 . [...] There is also a second issue of a long standing race with replication occurring somewhere whereby if a mail comes in, is written to disk, is replicated and then deleted in short succession, it will reappear again to the MUA. I suspect the mail is being replicated back from the remote. A few people have reported it over the years but it's not reliable or consistent, so it has never been fixed. And lastly there has been an ongoing but seemingly minor issue relating to locking timing out after 30s particularly on the remote host that is being replicated to. I rarely see the problem on my local disk where almost all of the mail comes in, it's almost always occurring on the replicate/remote system.
It might be time to describe our setups in order to possibly find common grounds that might trigger this issue you describe and Rimko and myself ran into as well.
Servers: Cloud Instances (both identical), around 25ms latency apart. Intel Core Processor (Haswell, no TSX) (3092.91-MHz K8-class CPU) Both servers are connected via IPsec/racoon tunnels OS: FreeBSD 11.1-STABLE (both servers) Filesystem: ZFS MTA: postfix 3.4-20180401 (postfix delivers via dovecot's LMTP) IMAP: dovecot running in FreeBSD jails (issues with 2.3.1, fine with 2.2.35) Replication: unsecured tcp / master-master MUA: mainly iOS or macOS mail.app, rarely roundcube
I believe it is worthwhile to mention here that I run a poor man's fail-over approach (round-robin DNS) as follows:
DNS: mail.server.tld resolves to one IP4 and one IP6 address of each server, thus 4 IP addresses in total
According it's MX priority one server (server1) will receive most mail, thus dovecot will mostly replicate mail from server1 to server2. Server2 is the one showing that deadlocking issues you see in your setup as well.
But connecting MUAs will hop quite frequently between server1 and server2, and sometimes will connect to both servers simultaneously, even mixing IP4 and IP6, because MUA do not follow MX priorities (IIRC). Normally I would believe that this shouldn't be an issue for dovecot, but let me ask dovecot's developers:
Can those simultaneous connects and modifications of \SEEN et al. be a reason for my issues regarding deadlocking?
For me it seems very unlikely that on an unloaded/idle VPS there are legitimate problems obtaining a lock in under 30s. This is with the default locking configuration. This problem started happening a lot more after the breakage in (1) above. These replication issues are similar, and could possibly be related.
Yes, I do think so as well.
My system is Gentoo Linux keeping up with the latest kernels, and on an EXT4 FS. I am using TCPS based replication. My remote replicate is also on Gentoo Linux with EXT4 but on a Linode VPS (around 250ms latency away).
Ok, your and my setup differs to some extend. What about your MUAs?
I know in a later post you've said that you don't think it has anything to do with dovecot-2.3.1, so I'd be interested to know what really is the cause in that case.
Thanks for bringing this issue up again, and I am eager to hear from the developers what they have to say. Now at least three of us users are reporting about "failing" replications in 2.3.1.
Thanks and regards, Michael
participants (4)
-
Gerald Galster
-
Michael Grimm
-
Remko Lodder
-
Reuben Farrelly