2.3.1 Replication is throwing scary errors
Reuben Farrelly
reuben-dovecot at reub.net
Wed Apr 4 02:34:55 EEST 2018
Hi,
> Date: Mon, 2 Apr 2018 22:06:07 +0200
> From: Michael Grimm <trashcan at ellael.org>
> To: Dovecot Mailing List <dovecot at dovecot.org>
> Subject: 2.3.1 Replication is throwing scary errors
> Message-ID: <29998016-D62F-4348-93D1-613B13DA90DB at ellael.org>
> Content-Type: text/plain; charset=utf-8
>
> Hi
>
> [This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct servers.]
>
> I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by error messages at server1 (and vice versa at server2) as follows:
>
> | Apr 2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \
> no activity for 600 seconds (last sent=mail_change, last recv=mail_change (EOL))
> | Apr 2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \
> (send=changes recv=mail_requests)
> [...]
> | Apr 2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \
> no activity for 600 seconds (last sent=mail, last recv=mail (EOL))
> | Apr 2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \
> (send=mails recv=recv_last_common)
>
> I cannot see in my personal account any missing replications, *but* I haven't tested this thoroughly enough. I do have customers being serviced at these productive servers, *thus* I'm back to 2.2.35 until I do understand or have learned what is going on.
>
> Any ideas/feedback?
>
> FYI: I haven't seen such errors before. Replication has been working for years now, without any glitches at all.
>
> Regards,
> Michael
It's not just you. This issue hit me recently, and it was noticeably
impacting replication. I am following git master-2.3.
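To put a number on "noticeably", one option is to tally the dsync timeouts per sync state from the mail log. This is only a rough sketch: the function name is made up for illustration, and the log file path is an assumption you would replace with wherever your syslog writes mail errors. It matches only the "Timeout during state=" text seen in the quoted errors.

```shell
# Hypothetical helper (not a Dovecot tool): count dsync timeouts per state.
# $1 is the log file to scan; its location is an assumption about your setup.
tally_dsync_timeouts() {
    logfile="$1"
    # Pick out the "Timeout during state=..." fragments quoted above,
    # strip everything up to the state name, then count each state.
    grep -o 'Timeout during state=[a-z_]*' "$logfile" \
        | sed 's/.*state=//' \
        | sort | uniq -c
}
```

Called as `tally_dsync_timeouts /var/log/maillog` (path assumed), it prints one line per state with its count, which makes it easier to compare a good and a bad build over the same period.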
Here's a last known reasonably good point in the tree where things
worked quite well:
EGIT_REPO_URI="https://github.com/dovecot/core.git"
EGIT_BRANCH="master-2.3"
EGIT_COMMIT="d9a1a7cbec19f4c6a47add47688351f8c3a0e372"
So something after that (which could have gone into 2.3.1) has caused this.
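For anyone who wants to narrow this down, a bisect between that commit and the branch tip is one way to go about it. The following is only a sketch: the repository, branch and commit are the ones given above, while the function name and the checkout directory are made up for illustration. It assumes the current tip of master-2.3 shows the stalls.

```shell
# Values taken from the known-good point listed above.
EGIT_REPO_URI="https://github.com/dovecot/core.git"
EGIT_BRANCH="master-2.3"
EGIT_COMMIT="d9a1a7cbec19f4c6a47add47688351f8c3a0e372"

# Hypothetical helper (not part of Dovecot or Portage): clone the branch and
# start a bisect between the known-good commit and the branch tip.
# Note: this changes the caller's working directory into the checkout.
start_dovecot_bisect() {
    workdir="${1:-dovecot-core}"   # assumed checkout directory name
    git clone --branch "$EGIT_BRANCH" "$EGIT_REPO_URI" "$workdir" || return 1
    cd "$workdir" || return 1
    git bisect start
    git bisect bad "$EGIT_BRANCH"     # assumption: the branch tip shows the stalls
    git bisect good "$EGIT_COMMIT"    # last point where things worked quite well
}
```

After running `start_dovecot_bisect`, each step means building and testing the checked-out revision, then marking it with `git bisect good` or `git bisect bad` until git names the first bad commit. Since the stall only shows up after 600 seconds of inactivity, each step may need to run for a while before it can be judged.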
There is also a second issue: a long-standing race in replication
whereby, if a mail comes in, is written to disk, is replicated and is
then deleted in quick succession, it reappears to the MUA. I suspect
the mail is being replicated back from the remote. A few people have
reported it over the years, but it does not occur reliably or
consistently, so it has never been fixed.
Lastly, there has been an ongoing but seemingly minor issue with locks
timing out after 30s, particularly on the remote host that is being
replicated to. I rarely see the problem on my local disk, where almost
all of the mail comes in; it almost always occurs on the
replica/remote system.
To me it seems very unlikely that an unloaded/idle VPS has legitimate
problems obtaining a lock in under 30s. This is with the default
locking configuration. This problem started happening a lot more after
the first issue above (the breakage after the known-good commit).
These replication issues are similar, and could possibly be related.
My system is Gentoo Linux, kept up to date with the latest kernels, on
an EXT4 filesystem. I am using TCPS-based replication. My remote
replica is also on Gentoo Linux with EXT4, but on a Linode VPS (around
250ms latency away).
I know in a later post you've said that you don't think it has anything
to do with dovecot-2.3.1, so I'd be interested to know what really is
the cause in that case.
Reuben