Hi,
Message: 2
Date: Mon, 2 Apr 2018 22:06:07 +0200
From: Michael Grimm trashcan@ellael.org
To: Dovecot Mailing List dovecot@dovecot.org
Subject: 2.3.1 Replication is throwing scary errors
Message-ID: 29998016-D62F-4348-93D1-613B13DA90DB@ellael.org
Content-Type: text/plain; charset=utf-8
Hi
[This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct servers.]
I upgraded from 2.2.35 to 2.3.1 today, and I am now being pounded by error messages at server1 (and vice versa at server2) as follows:
| Apr 2 17:12:18 server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, no activity for 600 seconds (last sent=mail_change, last recv=mail_change (EOL))
| Apr 2 17:12:18 server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails (send=changes recv=mail_requests)
[...]
| Apr 2 18:59:03 server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, no activity for 600 seconds (last sent=mail, last recv=mail (EOL))
| Apr 2 18:59:03 server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails (send=mails recv=recv_last_common)
I cannot see any missing replications in my personal account, *but* I haven't tested this thoroughly enough. I have customers being served by these production servers, *thus* I'm back to 2.2.35 until I understand what is going on.
Any ideas/feedback?
FYI: I haven't seen such errors before. Replication has been working for years now, without any glitches at all.
Regards, Michael
It's not just you. This issue hit me recently, and it was impacting replication noticeably. I am following the git master-2.3 branch.
Here's a last known reasonably good point in the tree where things worked quite well:
EGIT_REPO_URI="https://github.com/dovecot/core.git"
EGIT_BRANCH="master-2.3"
EGIT_COMMIT="d9a1a7cbec19f4c6a47add47688351f8c3a0e372"
So something after that (which could have gone into 2.3.1) has caused this.
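To compare builds, it may help to tally how often the stall error shows up per remote host. Here is a minimal sketch in Python, assuming the syslog line format quoted above; the function name `count_stalls` and the regular expression are my own illustration and are not part of Dovecot:

```python
import re
from collections import Counter

# Pattern derived from the sample log lines quoted above (an assumption,
# not taken from Dovecot's source). It captures the remote host named in
# dsync(...) and the number of idle seconds reported.
STALL_RE = re.compile(
    r"dsync\((?P<remote>[^)]+)\): I/O has stalled,\s+"
    r"no activity for (?P<secs>\d+) seconds"
)


def count_stalls(lines):
    """Return a Counter of dsync stall errors keyed by remote host."""
    counts = Counter()
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            counts[match.group("remote")] += 1
    return counts
```

It accepts any iterable of log lines, so an open log file can be passed directly (the location of the mail log depends on the system's syslog setup). Running it against logs from before and after a given commit should show whether the stalls started with that change.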
There is also a second issue: a long-standing race somewhere in replication whereby a mail that comes in, is written to disk, is replicated and is then deleted in quick succession will reappear to the MUA. I suspect the mail is being replicated back from the remote. A few people have reported it over the years, but it does not reproduce reliably or consistently, so it has never been fixed.
Lastly, there has been an ongoing but seemingly minor issue with locks timing out after 30s, particularly on the remote host that is being replicated to. I rarely see the problem on my local disk, where almost all of the mail comes in; it almost always occurs on the replica/remote system. It seems very unlikely to me that an unloaded, idle VPS has legitimate problems obtaining a lock in under 30s. This is with the default locking configuration. This problem started happening a lot more after the breakage described in the first issue above.
These replication issues are similar, and could possibly be related.
My system is Gentoo Linux, kept up to date with the latest kernels, on an EXT4 filesystem. I am using TCPS-based replication. My remote replica is also on Gentoo Linux with EXT4, but on a Linode VPS (around 250ms latency away).
I know in a later post you've said that you don't think it has anything to do with dovecot-2.3.1, so I'd be interested to know what really is the cause in that case.
Reuben