2.3.1 Replication is throwing scary errors
Andy Weal
andy at bizemail.com.au
Mon May 7 04:46:32 EEST 2018
Hi all,
Firstly, thank you for both your feedback and suggestions.
My current workaround is to use rsync to keep my standby MX "close" to
up-to-date.
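Nothing fancy - roughly along these lines, run from cron on the primary
(the user and hostname below are placeholders only):

    # push the maildir tree from the primary to the standby MX
    rsync -az --delete /var/mail/vhosts/ \
        vmail@mx2.example.net:/var/mail/vhosts/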
As my servers are not commercial or production, so to speak, I am more
than happy to carry out some tests if that would prove useful to the team.
I have some spare hardware and am more than willing to spin up an extra
VM if required.
regards,
Andy
On 7/05/2018 5:34 AM, Michael Grimm wrote:
> Hi Reuben,
>
> thank you for now digging into this issue reported by some users.
>
> I wonder if there are users out there running master/master replication with Dovecot 2.3.1 who do *NOT* observe the errors mentioned in this thread? If so, please report.
>
> It would be interesting to know whether these failures are seen by only a subset of users.
>
> And please have a look for processes like:
> doveadm-server: [IP4 <user> INBOX import:1/3] (doveadm-server)
>
> These processes will "survive" a dovecot restart ...
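> A quick way to spot and then clean them up by hand (just a sketch; adjust
> the pattern as needed):
>
> # list leftover doveadm-server import workers, then kill them manually
> ps ax | grep '[d]oveadm-server'
> kill <pid>        # or kill -9 <pid> if they refuse to exit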
>
> With kind regards,
> Michael
>
>> On 6. May 2018, at 16:11, Reuben Farrelly <reuben-dovecot at reub.net> wrote:
>>
>> Ah yes. The hash I put below is wrong, but the commit message I quoted was right. This is the last known-good commit, which was made on Mar 6:
>>
>> https://github.com/dovecot/core/commit/ff5d02182573b1d4306c2a8c36605c98f217ef3b
>> "doveadm: Unref header search context after use
>> Fixes memory leak, found by valgrind"
>>
>> I've subsequently been able to determine that the bad commit was one of the dozens which were committed in a big batch on Mar 12, 2018. I definitely see the problem happening when running with commit 6418419ec282c887b67469dbe3f541fc4873f7f0 (last one in that batch).
>>
>> In other words, it's one of the commits between those two hashes that is the problem. I just need to work out which one...
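>> (A git bisect over that range should narrow it down - sketch only, where
>> "testing" a commit means building it and running replication long enough
>> to see whether the lock timeouts appear:)
>>
>> git bisect start
>> git bisect bad  6418419ec282c887b67469dbe3f541fc4873f7f0   # Mar 12 batch, known bad
>> git bisect good ff5d02182573b1d4306c2a8c36605c98f217ef3b   # Mar 6, known good
>> # build and test the commit git checks out, then mark it and repeat:
>> git bisect good        # or: git bisect bad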
>>
>> The problem is also seen with -git as of a day or so ago. I'm tracking master-2.3 so I imagine I'm picking up all the 2.3 fixes as they go straight in (including the rc fixes). So this doesn't appear to be fixed yet.
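>> (For reference, "tracking master-2.3" here just means building from a
>> checkout along the lines of:
>>
>> git clone https://github.com/dovecot/core.git
>> cd core && git checkout master-2.3
>> git pull        # before each rebuild
>>
>> so the 2.3 fixes land in my builds shortly after they are pushed.)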
>>
>> Reuben
>>
>>
>>
>> On 6/05/2018 11:41 pm, Aki Tuomi wrote:
>>> Seems you got the hash wrong...
>>> You probably mean 4a1e2e8e85ad5d4ab43496a5f3cd6b7ff6ac5955, which fixes a memory leak, but does not release sync lock.
>>> It would be interesting to know if this commit fixes the issue.
>>> Looking at the sync code where the lock is taken and released, there is only one recent commit, 8077d714e11388a294f1583e706152396972acce, and that one creates the home directory if it is missing.
>>> It would also be interesting if you could test with 2.3.2rc1, which was released a few days ago, to see whether this issue goes away.
>>> Aki
>>>> On 06 May 2018 at 15:21 Reuben Farrelly <reuben-dovecot at reub.net> wrote:
>>>>
>>>>
>>>> Hi Andy,
>>>>
>>>> Funny you say that - I've been testing this very problem for some time
>>>> today to narrow down the commits where I think the problem started
>>>> occurring.
>>>>
>>>> So far I have been able to roll forward to this commit from git:
>>>> 74ef8506dcb946acc42722683e49fdcb964eed19
>>>>
>>>> "doveadm: Unref header search context after use". So far I have
>>>> been running on that code for a few hours and I'm not yet seeing the
>>>> problem, which indicates the issue was probably introduced after this.
>>>>
>>>> I'm of the view that the problem started occurring somewhere between
>>>> that commit above on 6 March and the final 2.3.1 release on 27 March.
>>>> There are a lot of changes between those two dates, and testing to
>>>> narrow down to one specific commit is extremely time consuming.
>>>>
>>>> It would be really helpful if even one of the Dovecot developers could
>>>> comment on this and give us some direction (or candidate commits to
>>>> investigate) or just let us know if the root cause has been found. This
>>>> bug is a showstopper and has stopped me tracking master-2.3 for over 3
>>>> months now, as I can't test later builds or even upgrade to the 2.3.1
>>>> release while replication is so broken.
>>>>
>>>> Reuben
>>>>
>>>>
>>>>
>>>>> Message: 1
>>>>> Date: Sun, 6 May 2018 13:21:57 +1000
>>>>> From: Andy Weal <andy at bizemail.com.au>
>>>>> To: dovecot at dovecot.org
>>>>> Subject: 2.3.1 Replication is throwing scary errors
>>>>> Message-ID: <f4274128-350b-9a74-a344-8e6caeffa112 at bizemail.com.au>
>>>>> Content-Type: text/plain; charset="utf-8"; Format="flowed"
>>>>>
>>>>> Hi all,
>>>>>
>>>>> New to the mailing lists, but I have joined up because of the above "2.3.1
>>>>> Replication is throwing scary errors" thread.
>>>>>
>>>>>
>>>>> Brief system configuration:
>>>>>     MX1 - Main
>>>>>         FreeBSD 11.1-RELEASE-p9
>>>>>         Hosted on a Vultr VM in Sydney, AU
>>>>>         MTA = Postfix 3.4-20180401
>>>>>         Dovecot = 2.3.1
>>>>>         File system = UFS
>>>>>     MX2 - Backup
>>>>>         FreeBSD 11.1-RELEASE-p9
>>>>>         Running on bare metal - no VM or jails
>>>>>         MTA = Postfix 3.4-20180401
>>>>>         Dovecot = 2.3.1
>>>>>         File system = UFS (on SSD)
>>>>>
>>>>> Brief sequence of events:
>>>>>
>>>>> 1. Approx. 10 days back I upgraded both MX1 and MX2 to Dovecot 2.3.1_2
>>>>>    from 2.3.0 (service dovecot stop, portmaster upgrade, service
>>>>>    dovecot start).
>>>>> 2. Both systems ran OK with no errors for 10 days.
>>>>> 3. Last night I shut down MX2 and restarted it a few hours later.
>>>>> 4. Within minutes I was getting the following types of errors on MX2:
>>>>>
>>>>>    May 06 12:56:29 doveadm: Error: Couldn't lock
>>>>>    /var/mail/vhosts/example.net/user1/.dovecot-sync.lock:
>>>>>    fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock,
>>>>>    F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by
>>>>>    pid 1960)
>>>>>
>>>>> Before I venture down the rabbit hole of fault finding and excess
>>>>> coffee consumption, I was wondering if any of you had any updates on the
>>>>> problems discussed below.
>>>>>
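>>>>> (On FreeBSD, something like the following should at least show what is
>>>>> sitting on that lock - the pid and path are the ones from the error
>>>>> above:)
>>>>>
>>>>> ps -p 1960        # what the lock-holding process actually is
>>>>> fstat /var/mail/vhosts/example.net/user1/.dovecot-sync.lock
>>>>>                   # which processes have the lock file open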
>>>>>
>>>>> Cheers for now,
>>>>> Andy
>>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> [Formatting is a bit rough, replying from a trimmed digest email]
>>>>>
>>>>>> Message: 1
>>>>>> Date: Fri, 6 Apr 2018 15:04:35 +0200
>>>>>> From: Michael Grimm <trashcan at ellael.org>
>>>>>> To: Dovecot Mailing List <dovecot at dovecot.org>
>>>>>> Subject: Re: 2.3.1 Replication is throwing scary errors
>>>>>> Message-ID: <E7E7A927-68F8-443F-BA59-E66CED8FE878 at ellael.org>
>>>>>> Content-Type: text/plain; charset=utf-8
>>>>>>
>>>>>> Reuben Farrelly wrote:
>>>>>>> From: Michael Grimm <trashcan at ellael.org>
>>>>>>
>>>>>>>> [This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at
>>>>>>>> distinct servers.]
>>>>>>>> I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by
>>>>>>>> error messages at server1 (and vice versa at server2) as follows:
>>>>>>>> | Apr 2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error:
>>>>>>>> dsync(server2.lan): I/O has stalled, \
>>>>>>>> no activity for 600 seconds (last sent=mail_change, last
>>>>>>>> recv=mail_change (EOL))
>>>>>>>> | Apr 2 17:12:18 <mail.err> server1.lan dovecot: doveadm: Error: Timeout
>>>>>>>> during state=sync_mails \
>>>>>>>> (send=changes recv=mail_requests)
>>>>>>>> [...]
>>>>>>>> | Apr 2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error:
>>>>>>>> dsync(server2.lan): I/O has stalled, \
>>>>>>>> no activity for 600 seconds (last sent=mail, last recv=mail (EOL))
>>>>>>>> | Apr 2 18:59:03 <mail.err> server1.lan dovecot: doveadm: Error: Timeout
>>>>>>>> during state=sync_mails \
>>>>>>>> (send=mails recv=recv_last_common)
>>>>>>>> I cannot see in my personal account any missing replications, *but* I
>>>>>>>> haven't tested this thoroughly enough. I do have customers being
>>>>>>>> serviced at these productive servers, *thus* I'm back to 2.2.35 until I
>>>>>>>> do understand or have learned what is going on.
>>>>>>
>>>>>> In my reply to this statement of mine I mentioned that I have seen those
>>>>>> timeouts quite some times during the past year. Thus, I upgraded to
>>>>>> 2.3.1 again, and boom: after some hours I ended up in hanging processes
>>>>>> [1] like (see Remko's mail in addition) ...
>>>>>>
>>>>>> doveadm-server: [IP4/6 <USER1> SOME/MAILBOX import:0/0] (doveadm-server)
>>>>>>
>>>>>> ... at server2, paired with a file like ...
>>>>>>
>>>>>> -rw------- 1 vmail dovecot uarch 0 Apr 3 16:52
>>>>>> /home/to/USER1/.dovecot-sync.lock
>>>>>>
>>>>>> Corresponding logfile entries at server2 are like ...
>>>>>>
>>>>>> Apr 3 17:10:49 <mail.err> server2.lan dovecot: doveadm: Error: Couldn't
>>>>>> lock /home/to/USER1/.dovecot-sync.lock: \
>>>>>> fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking
>>>>>> failed: Timed out after 30 seconds \
>>>>>> (WRITE lock held by pid 51110)
>>>>>>
>>>>>> [1] Even stopping dovecot will not end those processes. One has to
>>>>>> manually kill those before restarting dovecot.
>>>>>>
>>>>>> After one day of testing 2.3.1 with a couple of those episodes of
>>>>>> locking/timeouts, and now missing mails depending on which server your
>>>>>> MUA connects to, I went back to 2.2.35. After two days at that version I
>>>>>> never had such an episode again.
>>>>>>
>>>>>>> It's not just you. This issue hit me recently, and it was impacting
>>>>>>> replication noticeably. I am following git master-2.3.
>>>>>> [...]
>>>>>>> There is also a second issue of a long-standing race with replication
>>>>>>> occurring somewhere whereby if a mail comes in, is written to disk, is
>>>>>>> replicated and then deleted in short succession, it will reappear
>>>>>>> again to the MUA. I suspect the mail is being replicated back from
>>>>>>> the remote. A few people have reported it over the years but it's not
>>>>>>> reliable or consistent, so it has never been fixed.
>>>>>>> And lastly there has been an ongoing but seemingly minor issue
>>>>>>> relating to locking timing out after 30s, particularly on the remote
>>>>>>> host that is being replicated to. I rarely see the problem on my
>>>>>>> local disk where almost all of the mail comes in; it's almost always
>>>>>>> occurring on the replicate/remote system.
>>>>>>
>>>>>> It might be time to describe our setups in order to possibly find common
>>>>>> grounds that might trigger this issue you describe and Remko and myself
>>>>>> ran into as well.
>>>>>>
>>>>>> Servers: Cloud Instances (both identical), around 25ms latency apart.
>>>>>> Intel Core Processor (Haswell, no TSX) (3092.91-MHz K8-class CPU)
>>>>>> Both servers are connected via IPsec/racoon tunnels
>>>>>> OS: FreeBSD 11.1-STABLE (both servers)
>>>>>> Filesystem: ZFS
>>>>>> MTA: postfix 3.4-20180401 (postfix delivers via dovecot's LMTP)
>>>>>> IMAP: dovecot running in FreeBSD jails (issues with 2.3.1, fine with
>>>>>> 2.2.35)
>>>>>> Replication: unsecured TCP / master-master
>>>>>> MUA: mainly iOS or macOS Mail.app, rarely Roundcube
>>>>> For me:
>>>>>
>>>>> Servers: Main: VM on a VMWare ESXi local system (light
>>>>> load), local SSD disks (no NFS)
>>>>> Redundant: Linode VM in Singapore, around 250ms away
>>>>> Also no NFS. Linode use SSDs for IO.
>>>>> There is native IPv6 connectivity between both VMs.
>>>>> As I am using TCPs I don't have a VPN between them -
>>>>> just raw IPv6 end to end.
>>>>> OS: Gentoo Linux x86_64 kept well up to date
>>>>> Filesystem: EXT4 for both
>>>>> MTA: Postfix 3.4-x (same as you)
>>>>> IMAP: Dovecot running natively on the machine (no jail/chroot)
>>>>> Replication: Secured TCP / master-master (but only one master is
>>>>> actively used for SMTP and MUA access)
>>>>> MUA: Thunderbird, Maildroid, MS Mail, Gmail client,
>>>>> Roundcube, LG Mail client on handset etc
>>>>>
>>>>>> I believe it is worthwhile to mention here that I run a poor man's
>>>>>> fail-over approach (round-robin DNS) as follows:
>>>>>>
>>>>>> DNS: mail.server.tld resolves to one IP4 and one IP6 address of each
>>>>>> server, thus 4 IP addresses in total
>>>>> For me I have a simple pair of servers that replicate mail to each
>>>>> other. Clients connect to the main server only, which is where mail
>>>>> comes in. The second one that experiences the problem with locking is
>>>>> only a redundant server which is a secondary MX, that only gets
>>>>> occasional SMTP deliveries from time to time. So the amount of
>>>>> replication back from the "secondary" is very minimal; the vast bulk of
>>>>> the replication is the primary replicating changes out.
>>>>>
>>>>> Reuben
>>>>>