2.3.1 Replication is throwing scary errors

Reuben Farrelly reuben-dovecot at reub.net
Sun May 6 17:11:56 EEST 2018


Ah yes.  The hash I put below is wrong, but the commit message I quoted 
was right.  This is the last known-good commit, which was made on Mar 6:

https://github.com/dovecot/core/commit/ff5d02182573b1d4306c2a8c36605c98f217ef3b
"doveadm: Unref header search context after use
Fixes memory leak, found by valgrind"

I've subsequently been able to determine that the bad commit was one of 
the dozens that were committed in a big batch on Mar 12, 2018.  I 
definitely see the problem when running with commit 
6418419ec282c887b67469dbe3f541fc4873f7f0 (the last one in that batch).

In other words, one of the commits between those two hashes is the 
problem.  I just need to work out which one...
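
One way to narrow it down without rolling forward commit by commit would 
be to drive git bisect with a small script.  A rough sketch of what I 
have in mind is below (plain Python driving git; the checkout path and 
build steps are examples only, and each good/bad verdict still has to be 
made by hand, since the failure only shows up after the build has been 
replicating for a few hours):

#!/usr/bin/env python3
"""Rough bisect helper between the known-good and known-bad commits above.

This only drives 'git bisect' and builds each candidate revision; deciding
whether replication misbehaves still has to be done by hand after running
the build for a while.  Paths and build steps are examples, not my real setup.
"""
import subprocess

REPO = "/usr/src/dovecot-core"  # example checkout location
GOOD = "ff5d02182573b1d4306c2a8c36605c98f217ef3b"  # Mar 6, last known good
BAD = "6418419ec282c887b67469dbe3f541fc4873f7f0"   # end of the Mar 12 batch

def git(*args):
    """Run a git command inside the Dovecot checkout and return its output."""
    return subprocess.run(["git", "-C", REPO, *args],
                          check=True, capture_output=True, text=True).stdout

def build():
    """Build whatever revision git bisect has checked out (example steps)."""
    subprocess.run(["./autogen.sh"], cwd=REPO, check=True)
    subprocess.run(["./configure"], cwd=REPO, check=True)
    subprocess.run(["make", "-j4"], cwd=REPO, check=True)

if __name__ == "__main__":
    git("bisect", "start")
    git("bisect", "bad", BAD)
    print(git("bisect", "good", GOOD))  # git now checks out the midpoint
    build()
    # Run this build until replication does (or doesn't) break, then mark it:
    #   git -C /usr/src/dovecot-core bisect good    (or: bisect bad)
    # rebuild, and repeat until git names the first bad commit.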

The problem is also seen with git master as of a day or so ago.  I'm 
tracking master-2.3, so I imagine I'm picking up all the 2.3 fixes 
(including the rc fixes) as they go straight in.  So this doesn't appear 
to be fixed yet.

Reuben



On 6/05/2018 11:41 pm, Aki Tuomi wrote:
> Seems you got the hash wrong...
> 
> You probably mean 4a1e2e8e85ad5d4ab43496a5f3cd6b7ff6ac5955, which fixes a memory leak but does not release the sync lock.
> 
> It would be interesting to know whether this commit fixes the issue.
> 
> Looking at the sync code where the lock is taken and released, there is only one recent commit, 8077d714e11388a294f1583e706152396972acce, and that one creates the home directory if it is missing.
> 
> It would also be interesting if you could test with 2.3.2rc1, which was released a few days ago, to see whether this issue goes away.
> 
> Aki
> 
>> On 06 May 2018 at 15:21 Reuben Farrelly <reuben-dovecot at reub.net> wrote:
>>
>>
>> Hi Andy,
>>
>> Funny you say that - I've been testing this very problem for some time
>> today to narrow down the commits where I think the problem started
>> occurring.
>>
>> So far I have been able to roll forward to this commit from git:
>> 74ef8506dcb946acc42722683e49fdcb964eed19
>>
>> "doveadm: Unref header search context after use".  So far I have
>> been running on that code for a few hours and I'm not yet seeing the
>> problem, which indicates the issue was probably introduced after this commit.
>>
>> I'm of the view that the problem started occurring somewhere between
>> that commit above on 6 March and the final 2.3.1 release on 27 March.
>> There are a lot of changes between those two dates, and testing to
>> narrow down to one specific commit is extremely time consuming.
>>
>> It would be really helpful if even one of the Dovecot developers could
>> comment on this and give us some direction (or candidate commits to
>> investigate) or just let us know if the root cause has been found.  This
>> bug is a showstopper and has stopped me tracking master-2.3 for over 3
>> months now, as I can't test later builds or even upgrade to the 2.3.1
>> release while replication is so broken.
>>
>> Reuben
>>
>>
>>
>>> Message: 1
>>> Date: Sun, 6 May 2018 13:21:57 +1000
>>> From: Andy Weal <andy at bizemail.com.au>
>>> To: dovecot at dovecot.org
>>> Subject: 2.3.1 Replication is throwing scary errors
>>> Message-ID: <f4274128-350b-9a74-a344-8e6caeffa112 at bizemail.com.au>
>>> Content-Type: text/plain; charset="utf-8"; Format="flowed"
>>>
>>> Hi all,
>>>
>>> New to the mailing lists, but I have joined up because of the above
>>> "2.3.1 Replication is throwing scary errors" thread.
>>>
>>> Brief system configuration:
>>>    MX1 - Main
>>>       FreeBSD 11.1-RELEASE-p9
>>>       Hosted on a Vultr VM in Sydney AU
>>>       MTA = Postfix 3.4-20180401
>>>       Dovecot = 2.3.1
>>>       File system = ufs
>>>    MX2 - Backup
>>>       FreeBSD 11.1-RELEASE-p9
>>>       Running on bare metal - no VM or jails
>>>       MTA = Postfix 3.4-20180401
>>>       Dovecot = 2.3.1
>>>       File system = ufs (on SSD)
>>>
>>> Brief sequence of events:
>>>
>>>    1. Approx. 10 days back I upgraded both mx1 and mx2 to dovecot 2.3.1_2
>>>       from 2.3.0 (service dovecot stop, portmaster upgrade, service
>>>       dovecot start).
>>>    2. Both systems ran OK with no errors for 10 days.
>>>    3. Last night I shut down mx2 and restarted it a few hours later.
>>>    4. Within minutes I was getting the following types of errors on mx2:
>>>
>>>         May 06 12:56:29 doveadm: Error: Couldn't lock
>>> /var/mail/vhosts/example.net/user1/.dovecot-sync.lock:
>>> fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock,
>>> F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by
>>> pid 1960)
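>>>
>>> To get my head around what that error actually means, I put together a
>>> tiny two-terminal test (plain Python, nothing Dovecot-specific) showing
>>> the same fcntl() behaviour: a second process asking for a write lock
>>> with F_SETLKW simply blocks for as long as the first holder keeps the
>>> lock, which appears to be what doveadm is timing out on after 30
>>> seconds.  The lock file path is just a stand-in:
>>>
>>> # Run "python3 lockdemo.py hold" in one terminal, then
>>> # "python3 lockdemo.py" in another.
>>> import fcntl, os, sys, time
>>>
>>> LOCKFILE = "/tmp/sync-lock-demo"   # stand-in for .dovecot-sync.lock
>>>
>>> def hold_lock(seconds):
>>>     # Terminal 1: take an exclusive (write) lock and sit on it.
>>>     fd = os.open(LOCKFILE, os.O_RDWR | os.O_CREAT, 0o600)
>>>     fcntl.lockf(fd, fcntl.LOCK_EX)   # blocking write lock (F_SETLKW)
>>>     print(f"pid {os.getpid()} holds the write lock")
>>>     time.sleep(seconds)
>>>
>>> def wait_for_lock():
>>>     # Terminal 2: this call blocks until the holder releases the lock;
>>>     # doveadm does the same, but gives up after its 30 second timeout.
>>>     fd = os.open(LOCKFILE, os.O_RDWR | os.O_CREAT, 0o600)
>>>     print(f"pid {os.getpid()} waiting for the write lock...")
>>>     fcntl.lockf(fd, fcntl.LOCK_EX)
>>>     print(f"pid {os.getpid()} got the lock")
>>>
>>> if __name__ == "__main__":
>>>     hold_lock(60) if sys.argv[1:] == ["hold"] else wait_for_lock()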
>>>
>>> Before I venture down the rabbit hole of fault finding and excess
>>> coffee consumption, I was wondering if any of you had any updates on the
>>> problems discussed below.
>>>
>>>
>>> Cheers for now,
>>> Andy
>>>
>>>
>>>
>>> Hi,
>>>
>>> [Formatting is a bit rough, replying from a trimmed digest email]
>>>
>>>> Message: 1
>>>> Date: Fri, 6 Apr 2018 15:04:35 +0200
>>>> From: Michael Grimm <trashcan at ellael.org>
>>>> To: Dovecot Mailing List <dovecot at dovecot.org>
>>>> Subject: Re: 2.3.1 Replication is throwing scary errors
>>>> Message-ID: <E7E7A927-68F8-443F-BA59-E66CED8FE878 at ellael.org>
>>>> Content-Type: text/plain; charset=utf-8
>>>>
>>>> Reuben Farrelly wrote:
>>>>> From: Michael Grimm <trashcan at ellael.org>
>>>>
>>>>>>> [This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails
>>>>>>> at distinct servers.]
>>>>>>> I did upgrade from 2.2.35 to 2.3.1 today, and I am being pounded by
>>>>>>> error messages at server1 (and vice versa at server2) as follows:
>>>>>>> | Apr 2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error:
>>>>>>> dsync(server2.lan): I/O has stalled, \
>>>>>>> no activity for 600 seconds (last sent=mail_change, last
>>>>>>> recv=mail_change (EOL))
>>>>>>> | Apr 2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error:
>>>>>>> Timeout during state=sync_mails \
>>>>>>> (send=changes recv=mail_requests)
>>>>>>> [...]
>>>>>>> | Apr 2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error:
>>>>>>> dsync(server2.lan): I/O has stalled, \
>>>>>>> no activity for 600 seconds (last sent=mail, last recv=mail (EOL))
>>>>>>> | Apr 2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error:
>>>>>>> Timeout during state=sync_mails \
>>>>>>> (send=mails recv=recv_last_common)
>>>>>>> I cannot see any missing replications in my personal account, *but* I
>>>>>>> haven't tested this thoroughly enough. I do have customers being
>>>>>>> serviced at these production servers, *thus* I'm back to 2.2.35 until
>>>>>>> I understand or have learned what is going on.
>>>>
>>>> In my reply to this statement of mine I mentioned that I have seen those
>>>> timeouts quite a few times during the past year. Thus, I upgraded to
>>>> 2.3.1 again, and boom: after some hours I ended up with hanging
>>>> processes [1] like (see Remko's mail in addition) ...
>>>>
>>>> doveadm-server: [IP4/6 <USER1> SOME/MAILBOX import:0/0] (doveadm-server)
>>>>
>>>> ... at server2, paired with a file like ...
>>>>
>>>> -rw------- 1 vmail dovecot uarch 0 Apr 3 16:52
>>>> /home/to/USER1/.dovecot-sync.lock
>>>>
>>>> Corresponding logfile entries at server2 are like ...
>>>>
>>>> Apr 3 17:10:49 <mail.err> server2.lan dovecot: doveadm: Error: Couldn't
>>>> lock /home/to/USER1/.dovecot-sync.lock: \
>>>> fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking
>>>> failed: Timed out after 30 seconds \
>>>> (WRITE lock held by pid 51110)
>>>>
>>>> [1] Even stopping dovecot will not end those processes. One has to
>>>> manually kill them before restarting dovecot.
>>>>
>>>> After one day of testing 2.3.1 with a couple of those episodes of
>>>> locking/timeouts, and now mails missing depending on which server your
>>>> MUA connects to, I went back to 2.2.35. After two days at that version
>>>> I never had such an episode again.
>>>>
>>>>> It's not just you. This issue hit me recently, and it was impacting
>>>>> replication noticeably. I am following git master-2.3.
>>>> [...]
>>>>> There is also a second issue of a long-standing race with replication
>>>>> occurring somewhere whereby if a mail comes in, is written to disk, is
>>>>> replicated and then deleted in short succession, it will reappear
>>>>> again to the MUA. I suspect the mail is being replicated back from
>>>>> the remote. A few people have reported it over the years but it's not
>>>>> reliable or consistent, so it has never been fixed.
>>>>> And lastly there has been an ongoing but seemingly minor issue
>>>>> relating to locking timing out after 30s, particularly on the remote
>>>>> host that is being replicated to. I rarely see the problem on my
>>>>> local disk where almost all of the mail comes in; it's almost always
>>>>> occurring on the replicate/remote system.
>>>>
>>>> It might be time to describe our setups in order to possibly find common
>>>> ground that might trigger this issue you describe and Remko and I ran
>>>> into as well.
>>>>
>>>> Servers:      Cloud instances (both identical), around 25ms latency apart.
>>>>               Intel Core Processor (Haswell, no TSX) (3092.91-MHz K8-class CPU)
>>>>               Both servers are connected via IPsec/racoon tunnels
>>>> OS:           FreeBSD 11.1-STABLE (both servers)
>>>> Filesystem:   ZFS
>>>> MTA:          postfix 3.4-20180401 (postfix delivers via dovecot's LMTP)
>>>> IMAP:         dovecot running in FreeBSD jails (issues with 2.3.1, fine with 2.2.35)
>>>> Replication:  unsecured tcp / master-master
>>>> MUA:          mainly iOS or macOS mail.app, rarely roundcube
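>>>
>>> (As an aside: given that, as you note in [1], stopping dovecot does not
>>> end those wedged doveadm-server processes and they have to be killed by
>>> hand before restarting, something along these lines is roughly what that
>>> manual cleanup amounts to.  This is only a sketch, not Dovecot tooling;
>>> the pgrep pattern is guessed from the process title you posted, so check
>>> what it matches before sending signals on a production box:)
>>>
>>> import os, signal, subprocess
>>>
>>> # Find processes whose command line looks like a stuck doveadm-server
>>> # dsync import, as in the process listing quoted above.
>>> out = subprocess.run(["pgrep", "-f", "doveadm-server"],
>>>                      capture_output=True, text=True)
>>>
>>> for pid in (int(p) for p in out.stdout.split()):
>>>     print(f"terminating doveadm-server pid {pid}")
>>>     os.kill(pid, signal.SIGTERM)   # escalate to SIGKILL only if ignored
>>>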
>>> For me:
>>>
>>> Servers:	Main: VM on a VMWare ESXi local system (light
>>> 		load), local SSD disks (no NFS)
>>> 		Redundant: Linode VM in Singapore, around 250ms away
>>> 		Also no NFS.  Linode use SSDs for IO.
>>> 		There is native IPv6 connectivity between both VMs.
>>> 		As I am using TCPs I don't have a VPN between them -
>>> 		just raw IPv6 end to end.
>>> OS:		Gentoo Linux x86_64 kept well up to date
>>> Filesystem:	EXT4 for both
>>> MTA:		Postfix 3.4-x (same as you)
>>> IMAP:		Dovecot running natively on the machine (no jail/chroot)
>>> Replication:	Secured TCP / master-master (but only one master is
>>> 		actively used for SMTP and MUA access)
>>> MUA:		Thunderbird, Maildroid, MS Mail, Gmail client,
>>> 		Roundcube, LG Mail client on handset etc
>>>
>>>> I believe it is worthwhile to mention here that I run a poor man's
>>>> fail-over approach (round-robin DNS) as follows:
>>>>
>>>> DNS: mail.server.tld resolves to one IP4 and one IP6 address of each
>>>> server, thus 4 IP addresses in total
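>>>
>>> (That style of fail-over relies entirely on the client walking the list
>>> of addresses the name resolves to and trying the next one when a
>>> connection fails.  The snippet below shows what an MUA effectively does
>>> for such a name; mail.server.tld is of course just the placeholder from
>>> your description:)
>>>
>>> import socket
>>>
>>> HOST = "mail.server.tld"   # placeholder round-robin name from above
>>>
>>> # getaddrinfo returns all the A/AAAA records (2 x IPv4 + 2 x IPv6 in
>>> # your case); the client tries them in turn until one accepts.
>>> for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
>>>         HOST, 993, type=socket.SOCK_STREAM):
>>>     try:
>>>         with socket.socket(family, socktype, proto) as s:
>>>             s.settimeout(5)
>>>             s.connect(sockaddr)          # first server that answers wins
>>>             print("connected to", sockaddr)
>>>             break
>>>     except OSError as err:
>>>         print("failed:", sockaddr, err)  # fall through to the next one
>>>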
>>> For me I have a simple pair of servers that replicate mail to each
>>> other.  Clients connect to the main server only, which is where mail
>>> comes in.  The second one, which experiences the problem with locking, is
>>> only a redundant server acting as a secondary MX, and it only gets
>>> occasional SMTP deliveries.  So the amount of
>>> replication back from the "secondary" is very minimal; the vast bulk of
>>> the replication is the primary replicating changes out.
>>>
>>> Reuben
>>>


