2.3.1 Replication is throwing scary errors

Reuben Farrelly reuben-dovecot at reub.net
Wed Apr 4 02:34:55 EEST 2018


Hi,

> Date: Mon, 2 Apr 2018 22:06:07 +0200
> From: Michael Grimm <trashcan at ellael.org>
> To: Dovecot Mailing List <dovecot at dovecot.org>
> Subject: 2.3.1 Replication is throwing scary errors
> Message-ID: <29998016-D62F-4348-93D1-613B13DA90DB at ellael.org>
> Content-Type: text/plain;	charset=utf-8
> 
> Hi
> 
> [This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct servers.]
> 
> I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by error messages at server1 (and vice versa at server2) as follows:
> 
> 	| Apr  2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \
> 		no activity for 600 seconds (last sent=mail_change, last recv=mail_change (EOL))
> 	| Apr  2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \
> 		(send=changes recv=mail_requests)
> 	[...]
> 	| Apr  2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \
> 		no activity for 600 seconds (last sent=mail, last recv=mail (EOL))
> 	| Apr  2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \
> 		(send=mails recv=recv_last_common)
> 
> I cannot see in my personal account any missing replications, *but* I haven't tested this thoroughly enough. I do have customers being serviced at these productive servers, *thus* I'm back to 2.2.35 until I do understand or have learned what is going on.
> 
> Any ideas/feedback?
> 
> FYI: I haven't seen such errors before. Replication has been working for years now, without any glitches at all.
> 
> Regards,
> Michael

It's not just you.  This issue hit me recently as well, and it was 
noticeably affecting replication.  I am following the git master-2.3 
branch.

Here is the last point in the tree that I know worked reasonably 
well:

EGIT_REPO_URI="https://github.com/dovecot/core.git"
EGIT_BRANCH="master-2.3"
EGIT_COMMIT="d9a1a7cbec19f4c6a47add47688351f8c3a0e372"

So a change made after that commit (one that may have gone into 2.3.1) 
has caused this.
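
For anyone not building through a Gentoo ebuild, the same tree can be 
reproduced with plain git.  A minimal sketch, assuming git is 
installed; checkout_commit is a hypothetical helper name, not part of 
Dovecot or Gentoo:

```shell
# Sketch: clone a branch and detach HEAD at one specific commit.
# checkout_commit is a hypothetical helper, not a Dovecot or Gentoo tool.
checkout_commit() {
    repo_uri=$1   # e.g. https://github.com/dovecot/core.git
    branch=$2     # e.g. master-2.3
    commit=$3     # e.g. d9a1a7cbec19f4c6a47add47688351f8c3a0e372
    dest=$4       # directory to clone into (must not exist yet)

    git clone --quiet --branch "$branch" "$repo_uri" "$dest" || return 1
    # Detach HEAD at the requested commit so the tree matches it exactly.
    git -C "$dest" checkout --quiet --detach "$commit" || return 1
    # Print the commit that is now checked out, as a sanity check.
    git -C "$dest" rev-parse HEAD
}
```

Called with the three values above plus a target directory of your 
choice, it should leave a tree at the known-good commit, which can then 
be compared against the 2.3.1 release.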

There is also a second issue: a long-standing race in replication.  If 
a mail comes in, is written to disk, is replicated and is then deleted 
in quick succession, it reappears in the MUA.  I suspect the mail is 
being replicated back from the remote.  A few people have reported it 
over the years, but it does not reproduce reliably or consistently, so 
it has never been fixed.

And lastly, there has been an ongoing but seemingly minor issue with 
locking timing out after 30s, particularly on the remote host that is 
being replicated to.  I rarely see the problem on my local disk, where 
almost all of the mail comes in; it almost always occurs on the remote 
replica.

It seems very unlikely to me that an unloaded/idle VPS has legitimate 
problems obtaining a lock in under 30s.  This is with the default 
locking configuration.  This problem started happening a lot more 
often after the first breakage described above (the stalled dsync I/O).

These replication issues are similar, and could possibly be related.
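
To compare how often the stalls occur on different builds, it may help 
to tally the dsync timeout lines by state.  A rough sketch, assuming 
the log format quoted above with each entry on a single line (the 
backslash line wraps in the quoted mail are taken to be email 
formatting only); count_dsync_timeouts is a hypothetical helper name:

```shell
# Sketch: count "Timeout during state=..." errors per state/send/recv.
# count_dsync_timeouts is a hypothetical helper, not a Dovecot tool.
# Reads log lines on stdin, prints "<count> <state> <send> <recv>".
count_dsync_timeouts() {
    # Keep only the timeout lines and reduce them to "state send recv".
    sed -n 's/.*Error: Timeout during state=\([^ ]*\) (send=\([^ ]*\) recv=\([^)]*\)).*/\1 \2 \3/p' |
        sort |
        uniq -c |
        # Normalise the leading padding that uniq -c adds.
        awk '{ $1 = $1; print }'
}
```

Feeding it the mail log from each server (the log path depends on the 
local syslog setup) gives one line per distinct state, which makes it 
easier to see whether a given build is better or worse.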

My system runs Gentoo Linux, kept up to date with the latest kernels, 
on an EXT4 filesystem.  I am using TCPS-based replication.  My remote 
replica also runs Gentoo Linux with EXT4, but on a Linode VPS (around 
250ms latency away).

I know you said in a later post that you don't think it has anything 
to do with dovecot-2.3.1; if so, I'd be interested to know what the 
cause really is.

Reuben

