2.3.1 Replication is throwing scary errors

Fri Apr 6 16:04:35 EEST 2018

Reuben Farrelly wrote:
> From: Michael Grimm <trashcan at ellael.org>

>> [This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct servers.]
>> I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by error messages at server1 (and vice versa at server2) as follows:
>> 	| Apr  2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \
>> 		no activity for 600 seconds (last sent=mail_change, last recv=mail_change (EOL))
>> 	| Apr  2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \
>> 		(send=changes recv=mail_requests)
>> 	[?]
>> 	| Apr  2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \
>> 		no activity for 600 seconds (last sent=mail, last recv=mail (EOL))
>> 	| Apr  2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \
>> 		(send=mails recv=recv_last_common)
>> I cannot see in my personal account any missing replications, *but* I haven't tested this thoroughly enough. I do have customers being serviced at these productive servers, *thus* I'm back to 2.2.35 until I do understand or have learned what is going on.

In my reply to this statement of mine I mentioned that I have seen those timeouts quite some times during the past year. Thus, I upgraded to 2.3.1 again, and boom: after some hours I ended up in hanging processes [1] like (see Remko's mail in addition) ...

	doveadm-server: [IP4/6 <USER1> SOME/MAILBOX import:0/0] (doveadm-server)

… at server2 paired with a file like …

	-rw------- 1 vmail dovecot uarch 0 Apr 3 16:52 /home/to/USER1/.dovecot-sync.lock 

Corresponding logfile entries at server2 are like …

  Apr  3 17:10:49 <mail.err> server2.lan dovecot: doveadm: Error: Couldn't lock /home/to/USER1/.dovecot-sync.lock: \
  fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds \
  (WRITE lock held by pid 51110)

[1] Even stopping dovecot will not end those processes. One has to manually kill those before restarting dovecot.

After one day of testing 2.3.1 with a couple of those episodes of locking/timeout, and now missing mails depending with server your MUA will connect to, I went back to 2.2.35. After two days at that version I never had such an episode again.

> It's not just you.  This issue hit me recently, and it was impacting
> replication noticeably.  I am following git master-2.3 .
[...]
> There is also a second issue of a long standing race with replication
> occurring somewhere whereby if a mail comes in, is written to disk, is
> replicated and then deleted in short succession, it will reappear
> again to the MUA.  I suspect the mail is being replicated back from
> the remote.  A few people have reported it over the years but it's not
> reliable or consistent, so it has never been fixed.
> And lastly there has been an ongoing but seemingly minor issue
> relating to locking timing out after 30s particularly on the remote
> host that is being replicated to.  I rarely see the problem on my
> local disk where almost all of the mail comes in, it's almost always
> occurring on the replicate/remote system.

It might be time to describe our setups in order to possibly find common grounds that might trigger this issue you describe and Rimko and myself ran into as well.

Servers:	Cloud Instances (both identical), around 25ms latency apart.
		Intel Core Processor (Haswell, no TSX) (3092.91-MHz K8-class CPU)
		Both servers are connected via IPsec/racoon tunnels
OS:		FreeBSD 11.1-STABLE (both servers)
Filesystem:	ZFS
MTA:		postfix 3.4-20180401 (postfix delivers via dovecot's LMTP)
IMAP:		dovecot running in FreeBSD jails (issues with 2.3.1, fine with 2.2.35)
Replication:	unsecured tcp / master-master
MUA:		mainly iOS or macOS mail.app, rarely roundcube

I believe it is worthwhile to mention here that I run a poor man's fail-over approach (round-robin DNS) as follows:

DNS:		mail.server.tld resolves to one IP4 and one IP6 address of each server, thus 4 IP addresses in total

According it's MX priority one server (server1) will receive most mail, thus dovecot will mostly replicate mail from server1 to server2. Server2 is the one showing that deadlocking issues you see in your setup as well.

But connecting MUAs will hop quite frequently between server1 and server2, and sometimes will connect to both servers simultaneously, even mixing IP4 and IP6, because MUA do not follow MX priorities (IIRC). Normally I would believe that this shouldn't be an issue for dovecot, but let me ask dovecot's developers: 

Can those simultaneous connects and modifications of \SEEN et al. be a reason for my issues regarding deadlocking?

> For me it seems very unlikely that on an unloaded/idle VPS there are
> legitimate problems obtaining a lock in under 30s.  This is with the
> default locking configuration.  This problem started happening a lot
> more after the breakage in (1) above.
> These replication issues are similar, and could possibly be related.

Yes, I do think so as well.

> My system is Gentoo Linux keeping up with the latest kernels, and on
> an EXT4 FS.  I am using TCPS based replication.  My remote replicate
> is also on Gentoo Linux with EXT4 but on a Linode VPS (around 250ms
> latency away).

Ok, your and my setup differs to some extend. What about your MUAs?

> I know in a later post you've said that you don't think it has
> anything to do with dovecot-2.3.1, so I'd be interested to know what
> really is the cause in that case.

Thanks for bringing this issue up again, and I am eager to hear from the developers what they have to say. Now at least three of us users are reporting about "failing" replications in 2.3.1.

Thanks and regards,
Michael