2.3.1 Replication is throwing scary errors

Michael Grimm trashcan at ellael.org
Sun May 6 22:34:04 EEST 2018


Hi Reuben,

thank you for digging into this issue reported by some users, now.

I wonder if there are users out there using master/master replication with 2.3.1 dovecot that do *NOT* observe the errors mentioned in this thread? Please report, please. And, pleas have a closer look for 

It would be interesting to know if these failures are seen by a subset of users, only.

And please have a look for processes like:
	doveadm-server: [IP4 <user> INBOX import:1/3] (doveadm-server)

These processes will "survive" a dovecot reboot ...

With kind regards,
Michael

> On 6. May 2018, at 16:11, Reuben Farrelly <reuben-dovecot at reub.net> wrote:
> 
> Ah yes.  The hash I put below is wrong but the commit message I quoted was right. This is the last known working good commit, which was on Mar 6:
> 
> https://github.com/dovecot/core/commit/ff5d02182573b1d4306c2a8c36605c98f217ef3b
> "doveadm: Unref header search context after use
> Fixes memory leak, found by valgrind"
> 
> I've subsequently been able to determine that the bad commit was one of the dozens which were committed in a big batch on Mar 12, 2018.  I definitely see the problem happening when running with commit 6418419ec282c887b67469dbe3f541fc4873f7f0 (last one in that batch).
> 
> In other words it's one of the commits between those two hashes that is the problem.   Just need to work out which one...........
> 
> The problem is also seen with -git as of a day or so ago.  I'm tracking master-2.3 so I imagine I'm picking up all the 2.3 fixes as they go straight in (including the rc fixes).  So this doesn't appear to be fixed yet.
> 
> Reuben
> 
> 
> 
> On 6/05/2018 11:41 pm, Aki Tuomi wrote:
>> Seems you got the hash wrong...
>> You probably mean 4a1e2e8e85ad5d4ab43496a5f3cd6b7ff6ac5955, which fixes a memory leak, but does not release sync lock.
>> Interesting if this commit fixes the issue.
>> Looking at the sync code, where the lock is made / released, there is only one commit recently, which is 8077d714e11388a294f1583e706152396972acce and that one creates home directory if missing.
>> Would be interesting if you could test with 2.3.2rc1, which was released few days ago, whether this issue goes away?
>> Aki
>>> On 06 May 2018 at 15:21 Reuben Farrelly <reuben-dovecot at reub.net> wrote:
>>> 
>>> 
>>> Hi Andy,
>>> 
>>> Funny you say that - I've been testing this very problem for some time
>>> today to narrow down the commits where I think the problem started
>>> occurring.
>>> 
>>> So far I have been able to roll forward to commit from git:
>>> 74ef8506dcb946acc42722683e49fdcb964eed19
>>> 
>>>  >> "doveadm: Unref header search context after use" .  So far I have
>>> been running on that code for a few hours and I'm not yet seeing the
>>> problem, which indicates the issue is probably subsequent to this.
>>> 
>>> I'm of the view that the problem started occurring somewhere between
>>> that commit above on 6 March and the final 2.3.1 release on 27 March.
>>> There are a lot of changes between those two dates, and testing to
>>> narrow down to one specific commit is extremely time consuming.
>>> 
>>> It would be really helpful if even one of the Dovecot developers could
>>> comment on this and give us some direction (or candidate commits to
>>> investigate) or just let us know if the root cause has been found.  This
>>> bug is a showstopper and has stopped me tracking master-2.3 for over 3
>>> months now, as I can't test later builds or even upgrade to the 2.3.1
>>> release while replication is so broken.
>>> 
>>> Reuben
>>> 
>>> 
>>> 
>>>> Message: 1
>>>> Date: Sun, 6 May 2018 13:21:57 +1000
>>>> From: Andy Weal <andy at bizemail.com.au>
>>>> To: dovecot at dovecot.org
>>>> Subject: 2.3.1 Replication is throwing scary errors
>>>> Message-ID: <f4274128-350b-9a74-a344-8e6caeffa112 at bizemail.com.au>
>>>> Content-Type: text/plain; charset="utf-8"; Format="flowed"
>>>> 
>>>> Hi all,
>>>> 
>>>> New to the mailing lists but have joined up because of above */2.3.1
>>>> Replication is throwing scary errors
>>>> 
>>>> 
>>>> /*Brief system configuration
>>>>   ??? MX1 - Main
>>>>   ??? ??? Freebsd 11.1-Release-p9
>>>>   ??? ??? Hosted on a Vultr VM in Sydney AU
>>>>   ??? ??? MTA = Postfix 3.4-20180401
>>>>   ??? ??? Dovecot = 2.3.1
>>>>   ??? ??? File system = ufs
>>>>   ??? MX2 - Backup
>>>>   ??? ??? Freebsd 11.1-Release-p9
>>>>   ??? ???? Running on bare metal - no VM or jails
>>>>   ??? ??? MTA = Postfix 3.4-20180401
>>>>   ??? ??? Dovecot = 2.3.1
>>>>   ??? ??? File system = ufs ( on SSD)
>>>> 
>>>> */
>>>> /*Brief sequence of events
>>>> 
>>>>   1.  ??? apx 10 days back upgraded both mx1 and mx2 to dovecot 2.3.1_2
>>>>      from 2.3.0 ? (service dovecot stop, portmaster upgrade, service
>>>>      dovecot start)
>>>>   2.  ??? both systems ran ok with no errors for 10 days.
>>>>   3.  ??? Last night I shutdown mx2 and restarted it a few hours later
>>>>   4.  ??? within minutes i was getting the following types of errors on mx2
>>>> 
>>>>   ???????? May 06 12:56:29 doveadm: Error: Couldn't lock
>>>> /var/mail/vhosts/example.net/user1/.dovecot-sync.lock:
>>>> fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock,
>>>> F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by
>>>> pid 1960)
>>>> 
>>>>   ? Before i venture down the rabbit hole of fault finding and excess
>>>> coffee consumption I was wondering if any of you had any updates on the
>>>> problems discussed below.
>>>> 
>>>> 
>>>> Cheers for now,
>>>> Andy
>>>> 
>>>> 
>>>> 
>>>> Hi,
>>>> 
>>>> [Formatting is a bit rough, replying from a trimmed digest email]
>>>> 
>>>>> /Message: 1 />/Date: Fri, 6 Apr 2018 15:04:35 +0200 />/From: Michael Grimm <trashcan at ellael.org
>>>> <https://dovecot.org/mailman/listinfo/dovecot>> />/To: Dovecot Mailing List <dovecot at dovecot.org
>>>> <https://dovecot.org/mailman/listinfo/dovecot>> />/Subject: Re: 2.3.1 Replication is throwing scary errors />/Message-ID: <E7E7A927-68F8-443F-BA59-E66CED8FE878 at ellael.org
>>>> <https://dovecot.org/mailman/listinfo/dovecot>> />/Content-Type: text/plain; charset=utf-8 />//>/Reuben Farrelly wrote: />>/From: Michael Grimm <trashcan at ellael.org
>>>> <https://dovecot.org/mailman/listinfo/dovecot>> />//>>>/[This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at
>>>> distinct servers.] />>>/I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by
>>>> error messages at server1 (and vice versa at server2) as follows: />>>/| Apr 2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error:
>>>> dsync(server2.lan): I/O has stalled, \ />>>/no activity for 600 seconds (last sent=mail_change, last
>>>> recv=mail_change (EOL)) />>>/| Apr 2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: Timeout
>>>> during state=sync_mails \ />>>/(send=changes recv=mail_requests) />>>/[?] />>>/| Apr 2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error:
>>>> dsync(server2.lan): I/O has stalled, \ />>>/no activity for 600 seconds (last sent=mail, last recv=mail (EOL)) />>>/| Apr 2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: Timeout
>>>> during state=sync_mails \ />>>/(send=mails recv=recv_last_common) />>>/I cannot see in my personal account any missing replications, *but* I
>>>> haven't tested this thoroughly enough. I do have customers being
>>>> serviced at these productive servers, *thus* I'm back to 2.2.35 until I
>>>> do understand or have learned what is going on. />//>/In my reply to this statement of mine I mentioned that I have seen those
>>>> timeouts quite some times during the past year. Thus, I upgraded to
>>>> 2.3.1 again, and boom: after some hours I ended up in hanging processes
>>>> [1] like (see Remko's mail in addition) ... />//>/doveadm-server: [IP4/6 <USER1> SOME/MAILBOX import:0/0] (doveadm-server) />//>/? at server2 paired with a file like ? />//>/-rw------- 1 vmail dovecot uarch 0 Apr 3 16:52
>>>> /home/to/USER1/.dovecot-sync.lock />//>/Corresponding logfile entries at server2 are like ? />//>/Apr 3 17:10:49 <mail.err> server2.lan dovecot: doveadm: Error: Couldn't
>>>> lock /home/to/USER1/.dovecot-sync.lock: \ />/fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking
>>>> failed: Timed out after 30 seconds \ />/(WRITE lock held by pid 51110) />//>/[1] Even stopping dovecot will not end those processes. One has to
>>>> manually kill those before restarting dovecot. />//>/After one day of testing 2.3.1 with a couple of those episodes of
>>>> locking/timeout, and now missing mails depending with server your MUA
>>>> will connect to, I went back to 2.2.35. After two days at that version I
>>>> never had such an episode again. />//>>/It's not just you. This issue hit me recently, and it was impacting />>/replication noticeably. I am following git master-2.3 . />/[...] />>/There is also a second issue of a long standing race with replication />>/occurring somewhere whereby if a mail comes in, is written to disk, is />>/replicated and then deleted in short succession, it will reappear />>/again to the MUA. I suspect the mail is being replicated back from />>/the remote. A few people have reported it over the years but it's not />>/reliable or consistent, so it has never been fixed. />>/And lastly there has been an ongoing but seemingly minor issue />>/relating to locking timing out after 30s particularly on the remote />>/host that is being replicated to. I rarely see the problem on my />>/local disk where almost all of the mail comes in, it's almost always />>/occurring on the replicate/remote system. />//>/It might be time to describe our setups in order to possibly find co
>>>>   mmon
>>>> grounds that might trigger this issue you describe and Rimko and myself
>>>> ran into as well. />//>/Servers: Cloud Instances (both identical), around 25ms latency apart. />/Intel Core Processor (Haswell, no TSX) (3092.91-MHz K8-class CPU) />/Both servers are connected via IPsec/racoon tunnels />/OS: FreeBSD 11.1-STABLE (both servers) />/Filesystem: ZFS />/MTA: postfix 3.4-20180401 (postfix delivers via dovecot's LMTP) />/IMAP: dovecot running in FreeBSD jails (issues with 2.3.1, fine with
>>>> 2.2.35) />/Replication: unsecured tcp / master-master />/MUA: mainly iOS or macOS mail.app, rarely roundcube /
>>>> For me:
>>>> 
>>>> Servers:	Main: VM on a VMWare ESXi local system (light
>>>> 		load), local SSD disks (no NFS)
>>>> 		Redundant: Linode VM in Singapore, around 250ms away
>>>> 		Also no NFS.  Linode use SSDs for IO.
>>>> 		There is native IPv6 connectivity between both VMs.
>>>> 		As I am using TCPs I don't have a VPN between them -
>>>> 		just raw IPv6 end to end.
>>>> OS:		Gentoo Linux x86_64 kept well up to date
>>>> Filesystem:	EXT4 for both
>>>> MTA:		Postfix 3.4-x (same as you)
>>>> IMAP:		Dovecot running natively on the machine (no jail/chroot)
>>>> Replication:	Secured TCP / master-master (but only one master is
>>>> 		actively used for SMTP and MUA access)
>>>> MUA:		Thunderbird, Maildroid, MS Mail, Gmail client,
>>>> 		Roundcube, LG Mail client on handset etc
>>>> 
>>>>> /I believe it is worthwhile to mention here that I run a poor man's
>>>> fail-over approach (round-robin DNS) as follows: />//>/DNS: mail.server.tld resolves to one IP4 and one IP6 address of each
>>>> server, thus 4 IP addresses in total /
>>>> For me I have a simple pair of servers that replicate mail to each
>>>> other.  Clients connect to the main server only, which is where mail
>>>> comes in.  The second one that experiences the problem with locking is
>>>> only a redundant server which is a secondary MX, that only gets
>>>> occasional SMTP deliveries from time to time.  So the amount of
>>>> replication back from the "secondary" is very minimal; the vast bulk of
>>>> the replication is the primary replicating changes out.
>>>> 
>>>> Reuben
>>>> 
> 



More information about the dovecot mailing list