Re: 2.3.1 Replication is throwing scary errors
Hi Andy,
Funny you say that - I've been testing this very problem for some time today, trying to narrow down the commits where I think the problem started occurring.
So far I have been able to roll forward to this commit from git: 74ef8506dcb946acc42722683e49fdcb964eed19 ("doveadm: Unref header search context after use"). I have been running on that code for a few hours now and am not yet seeing the problem, which indicates the issue was probably introduced after it.
I'm of the view that the problem started occurring somewhere between that commit above on 6 March and the final 2.3.1 release on 27 March. There are a lot of changes between those two dates, and testing to narrow down to one specific commit is extremely time consuming.
It would be really helpful if even one of the Dovecot developers could comment on this and give us some direction (or candidate commits to investigate) or just let us know if the root cause has been found. This bug is a showstopper and has stopped me tracking master-2.3 for over 3 months now, as I can't test later builds or even upgrade to the 2.3.1 release while replication is so broken.
Reuben
Message: 1
Date: Sun, 6 May 2018 13:21:57 +1000
From: Andy Weal <andy@bizemail.com.au>
To: dovecot@dovecot.org
Subject: 2.3.1 Replication is throwing scary errors
Message-ID: <f4274128-350b-9a74-a344-8e6caeffa112@bizemail.com.au>
Content-Type: text/plain; charset="utf-8"; Format="flowed"
Hi all,
New to the mailing lists, but I have joined up because of the above thread, "2.3.1 Replication is throwing scary errors".

Brief system configuration:

MX1 - Main
- FreeBSD 11.1-RELEASE-p9
- Hosted on a Vultr VM in Sydney, AU
- MTA = Postfix 3.4-20180401
- Dovecot = 2.3.1
- File system = UFS

MX2 - Backup
- FreeBSD 11.1-RELEASE-p9
- Running on bare metal - no VM or jails
- MTA = Postfix 3.4-20180401
- Dovecot = 2.3.1
- File system = UFS (on SSD)
Brief sequence of events:

- Approx. 10 days back, upgraded both mx1 and mx2 to Dovecot 2.3.1_2 from 2.3.0 (service dovecot stop, portmaster upgrade, service dovecot start)
- Both systems ran OK with no errors for 10 days.
- Last night I shut down mx2 and restarted it a few hours later.
- Within minutes I was getting the following types of errors on mx2:

May 06 12:56:29 doveadm: Error: Couldn't lock /var/mail/vhosts/example.net/user1/.dovecot-sync.lock: fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 1960)
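For readers unfamiliar with this failure mode: the error above is dsync's per-user sync lock timing out. Dovecot opens the mailbox's .dovecot-sync.lock file and tries to take an exclusive fcntl() write lock with F_SETLKW, giving up after 30 seconds while another process (here pid 1960) still holds it. A minimal Python sketch of the underlying mechanism - the file name and timings are illustrative only, and this uses a non-blocking attempt where Dovecot blocks for up to 30 seconds:

```python
import errno, fcntl, os, tempfile, time

def try_lock(path):
    """Return False if another process already holds a write lock on path."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError as e:
        if e.errno in (errno.EACCES, errno.EAGAIN):
            return False
        raise

lock_path = os.path.join(tempfile.mkdtemp(), ".dovecot-sync.lock")

child = os.fork()
if child == 0:
    # Child: grab the write lock and hold it, like the process named in
    # the "WRITE lock held by pid ..." part of the log line.
    holder = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    fcntl.lockf(holder, fcntl.LOCK_EX)
    time.sleep(3)
    os._exit(0)

time.sleep(0.5)  # let the child acquire the lock first
busy = not try_lock(lock_path)
print("write lock held by pid", child, "->", busy)
os.waitpid(child, 0)
```

Because fcntl locks are per-process, the contention only shows up when a second process (here, the forked child) holds the lock.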
Before I venture down the rabbit hole of fault finding and excess coffee consumption, I was wondering if any of you had any updates on the problems discussed below.
Cheers for now, Andy
Hi,
[Formatting is a bit rough, replying from a trimmed digest email]
> Message: 1
> Date: Fri, 6 Apr 2018 15:04:35 +0200
> From: Michael Grimm
> To: Dovecot Mailing List <dovecot@dovecot.org>
> Subject: Re: 2.3.1 Replication is throwing scary errors
> Content-Type: text/plain; charset=utf-8
>
> Reuben Farrelly wrote:
>> From: Michael Grimm
>>
>>> [This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct servers.]
>>> I did upgrade from 2.2.35 to 2.3.1 today, and I became pounded by error messages at server1 (and vice versa at server2) as follows:
>>> | Apr 2 17:12:18 server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \
>>>   no activity for 600 seconds (last sent=mail_change, last recv=mail_change (EOL))
>>> | Apr 2 17:12:18 server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \
>>>   (send=changes recv=mail_requests)
>>> [...]
>>> | Apr 2 18:59:03 server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \
>>>   no activity for 600 seconds (last sent=mail, last recv=mail (EOL))
>>> | Apr 2 18:59:03 server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \
>>>   (send=mails recv=recv_last_common)
>>> I cannot see in my personal account any missing replications, *but* I haven't tested this thoroughly enough. I do have customers being serviced at these productive servers, *thus* I'm back to 2.2.35 until I understand or have learned what is going on.
>
> In my reply to this statement of mine I mentioned that I have seen those timeouts quite some times during the past year. Thus, I upgraded to 2.3.1 again, and boom: after some hours I ended up with hanging processes [1] like (see Remko's mail in addition) ...
>
> doveadm-server: [IP4/6 <USER1> SOME/MAILBOX import:0/0] (doveadm-server)
>
> ... at server2, paired with a file like ...
>
> -rw------- 1 vmail dovecot uarch 0 Apr 3 16:52 /home/to/USER1/.dovecot-sync.lock
>
> Corresponding logfile entries at server2 are like ...
>
> Apr 3 17:10:49 server2.lan dovecot: doveadm: Error: Couldn't lock /home/to/USER1/.dovecot-sync.lock: \
>   fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds \
>   (WRITE lock held by pid 51110)
>
> [1] Even stopping dovecot will not end those processes. One has to manually kill them before restarting dovecot.
>
> After one day of testing 2.3.1 with a couple of those episodes of locking/timeout, and now missing mails depending on which server your MUA connects to, I went back to 2.2.35. After two days at that version I never had such an episode again.
>
>> It's not just you. This issue hit me recently, and it was impacting replication noticeably. I am following git master-2.3.
> [...]
>> There is also a second issue of a long-standing race with replication occurring somewhere, whereby if a mail comes in, is written to disk, is replicated and then deleted in short succession, it will reappear again to the MUA. I suspect the mail is being replicated back from the remote. A few people have reported it over the years but it's not reliable or consistent, so it has never been fixed.
>> And lastly there has been an ongoing but seemingly minor issue relating to locking timing out after 30s, particularly on the remote host that is being replicated to. I rarely see the problem on my local disk where almost all of the mail comes in; it's almost always occurring on the replicate/remote system.
>
> It might be time to describe our setups in order to possibly find common grounds that might trigger this issue you describe and Remko and myself ran into as well.
>
> Servers: Cloud instances (both identical), around 25ms latency apart.
>   Intel Core Processor (Haswell, no TSX) (3092.91-MHz K8-class CPU)
>   Both servers are connected via IPsec/racoon tunnels
> OS: FreeBSD 11.1-STABLE (both servers)
> Filesystem: ZFS
> MTA: postfix 3.4-20180401 (postfix delivers via dovecot's LMTP)
> IMAP: dovecot running in FreeBSD jails (issues with 2.3.1, fine with 2.2.35)
> Replication: unsecured tcp / master-master
> MUA: mainly iOS or macOS mail.app, rarely roundcube

For me:

Servers:
  Main: VM on a VMware ESXi local system (light load), local SSD disks (no NFS)
  Redundant: Linode VM in Singapore, around 250ms away. Also no NFS; Linode use SSDs for IO. There is native IPv6 connectivity between both VMs. As I am using TCPS I don't have a VPN between them - just raw IPv6 end to end.
OS: Gentoo Linux x86_64, kept well up to date
Filesystem: EXT4 for both
MTA: Postfix 3.4-x (same as you)
IMAP: Dovecot running natively on the machine (no jail/chroot)
Replication: Secured TCP / master-master (but only one master is actively used for SMTP and MUA access)
MUA: Thunderbird, Maildroid, MS Mail, Gmail client, Roundcube, LG Mail client on handset, etc.

> I believe it is worthwhile to mention here that I run a poor man's fail-over approach (round-robin DNS) as follows:
>
> DNS: mail.server.tld resolves to one IPv4 and one IPv6 address of each server, thus 4 IP addresses in total

For me, I have a simple pair of servers that replicate mail to each other. Clients connect to the main server only, which is where mail comes in. The second one, which experiences the problem with locking, is only a redundant server acting as a secondary MX, and only gets occasional SMTP deliveries from time to time. So the amount of replication back from the "secondary" is very minimal; the vast bulk of the replication is the primary replicating changes out.
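For anyone comparing these setups against their own pair: a master/master replication pair of this kind is typically wired up roughly as below. This is only a sketch based on the Dovecot replication documentation; the host name, port, and password are placeholders, not values from either setup, and listener users/paths may differ per installation.

```
# Sketch only - see the Dovecot replication documentation for the full set
mail_plugins = $mail_plugins notify replication

service replicator {
  process_min_avail = 1
}

# Let the peer connect for dsync (port and password are placeholders)
service doveadm {
  inet_listener {
    port = 12345
  }
}
doveadm_port = 12345
doveadm_password = secret

plugin {
  # tcp: = unsecured replication (as in Michael's setup); tcps: = TLS (as in Reuben's)
  mail_replica = tcp:anotherhost.example.com
}
```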
Reuben
Seems you got the hash wrong...
You probably mean 4a1e2e8e85ad5d4ab43496a5f3cd6b7ff6ac5955, which fixes a memory leak but does not release the sync lock.
It would be interesting to know if this commit fixes the issue.
Looking at the sync code, where the lock is taken and released, there is only one recent commit, 8077d714e11388a294f1583e706152396972acce, and that one creates the home directory if missing.
Would you be able to test with 2.3.2rc1, which was released a few days ago, to see whether this issue goes away?
Aki
On 06 May 2018 at 15:21 Reuben Farrelly reuben-dovecot@reub.net wrote:
[quoted message trimmed]
Ah yes. The hash I put below is wrong but the commit message I quoted was right. This is the last known working good commit, which was on Mar 6:
https://github.com/dovecot/core/commit/ff5d02182573b1d4306c2a8c36605c98f217e... "doveadm: Unref header search context after use Fixes memory leak, found by valgrind"
I've subsequently been able to determine that the bad commit was one of the dozens which were committed in a big batch on Mar 12, 2018. I definitely see the problem happening when running with commit 6418419ec282c887b67469dbe3f541fc4873f7f0 (last one in that batch).
In other words, it's one of the commits between those two hashes that is the problem. Just need to work out which one...
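Narrowing a batch like this by hand can be automated with git bisect: mark the Mar 6 commit good and the last commit of the Mar 12 batch bad, and let a test script decide each step. A runnable sketch on a throwaway repository follows; on the real tree the two endpoints would be the hashes named above, and the test script would be whatever reproduces the lock timeout. All paths and commit messages here are made up for the demo.

```python
import os, subprocess, tempfile
from pathlib import Path

def git(*args, cwd):
    """Run a git command and return its stripped stdout."""
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout.strip()

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
ident = ["-c", "user.email=you@example.com", "-c", "user.name=you"]

# Build a 5-commit history; pretend commits with state > 3 are "bad".
for i in range(1, 6):
    Path(repo, "state").write_text(str(i))
    git("add", "state", cwd=repo)
    git(*ident, "commit", "-qm", f"commit {i}", cwd=repo)

# On the real tree this would be: git bisect start <bad-mar-12-hash> <good-mar-6-hash>
git("bisect", "start", "HEAD", "HEAD~4", cwd=repo)
# bisect run calls the script on each candidate: exit 0 = good, non-zero = bad
subprocess.run(["git", "bisect", "run", "sh", "-c", 'test "$(cat state)" -le 3'],
               cwd=repo, check=True, capture_output=True)
first_bad = git("log", "-1", "--format=%s", "refs/bisect/bad", cwd=repo)
print("first bad:", first_bad)
```

With a reliable reproducer script, bisecting the dozens of commits in that batch takes log2(n) builds instead of testing each one for hours.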
The problem is also seen with -git as of a day or so ago. I'm tracking master-2.3 so I imagine I'm picking up all the 2.3 fixes as they go straight in (including the rc fixes). So this doesn't appear to be fixed yet.
Reuben
On 6/05/2018 11:41 pm, Aki Tuomi wrote:
[quoted message trimmed]
Hi Reuben,
thank you for digging into this issue that some users have now reported.
I wonder if there are users out there using master/master replication with Dovecot 2.3.1 who do *NOT* observe the errors mentioned in this thread. Please report either way; it would be interesting to know whether these failures are seen by only a subset of users.
And please have a closer look for processes like: doveadm-server: [IP4 <user> INBOX import:1/3] (doveadm-server)
These processes will "survive" a dovecot restart ...
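Since stopping dovecot does not end these processes, they have to be found and killed by hand before restarting. On an affected server one would typically do this from the shell with `pgrep -af doveadm-server` to list the survivors and `pkill -TERM -f doveadm-server` to terminate them (escalating to -KILL only if needed). A small sketch of the same SIGTERM semantics, demonstrated on a harmless stand-in process so it is safe to run anywhere:

```python
import os, signal, subprocess

# Stand-in for a stuck doveadm-server worker (the real process name comes
# from the log lines quoted above).
victim = subprocess.Popen(["sleep", "300"])

os.kill(victim.pid, signal.SIGTERM)  # same signal pkill -TERM sends
victim.wait()                        # reap it so it doesn't linger as a zombie
alive = victim.poll() is None
print("stand-in alive after TERM:", alive)
```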
With kind regards, Michael
On 6. May 2018, at 16:11, Reuben Farrelly reuben-dovecot@reub.net wrote:
Ah yes. The hash I put below is wrong but the commit message I quoted was right. This is the last known working good commit, which was on Mar 6:
https://github.com/dovecot/core/commit/ff5d02182573b1d4306c2a8c36605c98f217e... "doveadm: Unref header search context after use Fixes memory leak, found by valgrind"
I've subsequently been able to determine that the bad commit was one of the dozens which were committed in a big batch on Mar 12, 2018. I definitely see the problem happening when running with commit 6418419ec282c887b67469dbe3f541fc4873f7f0 (last one in that batch).
In other words it's one of the commits between those two hashes that is the problem. Just need to work out which one...........
The problem is also seen with -git as of a day or so ago. I'm tracking master-2.3 so I imagine I'm picking up all the 2.3 fixes as they go straight in (including the rc fixes). So this doesn't appear to be fixed yet.
Reuben
On 6/05/2018 11:41 pm, Aki Tuomi wrote:
Seems you got the hash wrong... You probably mean 4a1e2e8e85ad5d4ab43496a5f3cd6b7ff6ac5955, which fixes a memory leak, but does not release sync lock. Interesting if this commit fixes the issue. Looking at the sync code, where the lock is made / released, there is only one commit recently, which is 8077d714e11388a294f1583e706152396972acce and that one creates home directory if missing. Would be interesting if you could test with 2.3.2rc1, which was released few days ago, whether this issue goes away? Aki
On 06 May 2018 at 15:21 Reuben Farrelly reuben-dovecot@reub.net wrote:
Hi Andy,
Funny you say that - I've been testing this very problem for some time today to narrow down the commits where I think the problem started occurring.
So far I have been able to roll forward to commit from git: 74ef8506dcb946acc42722683e49fdcb964eed19
"doveadm: Unref header search context after use" . So far I have been running on that code for a few hours and I'm not yet seeing the problem, which indicates the issue is probably subsequent to this.
I'm of the view that the problem started occurring somewhere between that commit above on 6 March and the final 2.3.1 release on 27 March. There are a lot of changes between those two dates, and testing to narrow down to one specific commit is extremely time consuming.
It would be really helpful if even one of the Dovecot developers could comment on this and give us some direction (or candidate commits to investigate) or just let us know if the root cause has been found. This bug is a showstopper and has stopped me tracking master-2.3 for over 3 months now, as I can't test later builds or even upgrade to the 2.3.1 release while replication is so broken.
Reuben
Message: 1 Date: Sun, 6 May 2018 13:21:57 +1000 From: Andy Weal andy@bizemail.com.au To: dovecot@dovecot.org Subject: 2.3.1 Replication is throwing scary errors Message-ID: f4274128-350b-9a74-a344-8e6caeffa112@bizemail.com.au Content-Type: text/plain; charset="utf-8"; Format="flowed"
Hi all,
New to the mailing lists but have joined up because of above */2.3.1 Replication is throwing scary errors
/*Brief system configuration ??? MX1 - Main ??? ??? Freebsd 11.1-Release-p9 ??? ??? Hosted on a Vultr VM in Sydney AU ??? ??? MTA = Postfix 3.4-20180401 ??? ??? Dovecot = 2.3.1 ??? ??? File system = ufs ??? MX2 - Backup ??? ??? Freebsd 11.1-Release-p9 ??? ???? Running on bare metal - no VM or jails ??? ??? MTA = Postfix 3.4-20180401 ??? ??? Dovecot = 2.3.1 ??? ??? File system = ufs ( on SSD)
*/ /*Brief sequence of events
- apx 10 days back, upgraded both mx1 and mx2 to Dovecot 2.3.1_2 from 2.3.0 (service dovecot stop, portmaster upgrade, service dovecot start)
- both systems ran OK with no errors for 10 days
- last night I shut down mx2 and restarted it a few hours later
- within minutes I was getting the following types of errors on mx2:

    May 06 12:56:29 doveadm: Error: Couldn't lock /var/mail/vhosts/example.net/user1/.dovecot-sync.lock: fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 1960)
Before I venture down the rabbit hole of fault finding and excess coffee consumption, I was wondering if any of you had any updates on the problems discussed below.
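For what it's worth, the failure mode in that log line, one process holding a write-lock while another blocks in F_SETLKW until a timeout, can be reproduced in miniature independent of dovecot. A sketch (POSIX-only; the lock-file name and timings are just for illustration):

```python
# Minimal sketch of the fcntl write-lock contention behind the
# ".dovecot-sync.lock ... F_SETLKW ... WRITE lock held by pid N" error.
# POSIX-only (uses os.fork); file name and sleeps are illustrative.
import fcntl
import os
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), ".dovecot-sync.lock")

pid = os.fork()
if pid == 0:
    # Child plays the stuck doveadm-server: take the write-lock and sit on it.
    f = open(path, "w")
    fcntl.lockf(f, fcntl.LOCK_EX)
    time.sleep(2)
    os._exit(0)

time.sleep(0.5)  # give the child time to acquire the lock first
f = open(path, "w")
try:
    # Probe non-blockingly instead of blocking in F_SETLKW for 30 s.
    fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    blocked = False
except OSError:
    blocked = True  # the state dsync gives up on after its 30 s timeout

print(blocked)
os.waitpid(pid, 0)
```

Two processes are needed because POSIX record locks are per-process; a second lock attempt from the same process would simply succeed.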
Cheers for now, Andy
Hi,
[Formatting is a bit rough, replying from a trimmed digest email]
Message: 1
Date: Fri, 6 Apr 2018 15:04:35 +0200
From: Michael Grimm
To: Dovecot Mailing List
Subject: Re: 2.3.1 Replication is throwing scary errors
Content-Type: text/plain; charset=utf-8

Reuben Farrelly wrote:
> From: Michael Grimm
>>> [This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct servers.]
>>> I did upgrade from 2.2.35 to 2.3.1 today, and I am being pounded by error messages at server1 (and vice versa at server2) as follows:
>>> | Apr 2 17:12:18 server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \
>>>   no activity for 600 seconds (last sent=mail_change, last recv=mail_change (EOL))
>>> | Apr 2 17:12:18 server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \
>>>   (send=changes recv=mail_requests)
>>> [...]
>>> | Apr 2 18:59:03 server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \
>>>   no activity for 600 seconds (last sent=mail, last recv=mail (EOL))
>>> | Apr 2 18:59:03 server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \
>>>   (send=mails recv=recv_last_common)
>>> I cannot see in my personal account any missing replications, *but* I haven't tested this thoroughly enough. I do have customers being serviced at these productive servers, *thus* I'm back to 2.2.35 until I do understand or have learned what is going on.
>
> In my reply to this statement of mine I mentioned that I have seen those timeouts quite some times during the past year. Thus, I upgraded to 2.3.1 again, and boom: after some hours I ended up in hanging processes [1] like (see Remko's mail in addition) ...
>
> doveadm-server: [IP4/6 <USER1> SOME/MAILBOX import:0/0] (doveadm-server)
>
> ... at server2 paired with a file like ...
>
> -rw------- 1 vmail dovecot uarch 0 Apr 3 16:52 /home/to/USER1/.dovecot-sync.lock
>
> Corresponding logfile entries at server2 are like ...
>
> Apr 3 17:10:49 server2.lan dovecot: doveadm: Error: Couldn't lock /home/to/USER1/.dovecot-sync.lock: \
>   fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds \
>   (WRITE lock held by pid 51110)
>
> [1] Even stopping dovecot will not end those processes. One has to manually kill those before restarting dovecot.
>
> After one day of testing 2.3.1 with a couple of those episodes of locking/timeout, and now missing mails depending on which server your MUA will connect to, I went back to 2.2.35. After two days at that version I never had such an episode again.
>
>> It's not just you. This issue hit me recently, and it was impacting replication noticeably. I am following git master-2.3.
> [...]
>> There is also a second issue of a long-standing race with replication occurring somewhere whereby if a mail comes in, is written to disk, is replicated and then deleted in short succession, it will reappear again to the MUA. I suspect the mail is being replicated back from the remote. A few people have reported it over the years but it's not reliable or consistent, so it has never been fixed.
>> And lastly there has been an ongoing but seemingly minor issue relating to locking timing out after 30s, particularly on the remote host that is being replicated to. I rarely see the problem on my local disk where almost all of the mail comes in; it's almost always occurring on the replicate/remote system.
>
> It might be time to describe our setups in order to possibly find common grounds that might trigger this issue you describe and Remko and myself ran into as well.
>
> Servers: Cloud Instances (both identical), around 25ms latency apart.
>   Intel Core Processor (Haswell, no TSX) (3092.91-MHz K8-class CPU)
>   Both servers are connected via IPsec/racoon tunnels
> OS: FreeBSD 11.1-STABLE (both servers)
> Filesystem: ZFS
> MTA: postfix 3.4-20180401 (postfix delivers via dovecot's LMTP)
> IMAP: dovecot running in FreeBSD jails (issues with 2.3.1, fine with 2.2.35)
> Replication: unsecured tcp / master-master
> MUA: mainly iOS or macOS mail.app, rarely roundcube

For me:
Servers:
  Main: VM on a VMware ESXi local system (light load), local SSD disks (no NFS)
  Redundant: Linode VM in Singapore, around 250ms away. Also no NFS; Linode use SSDs for IO.
  There is native IPv6 connectivity between both VMs. As I am using TCPs I don't have a VPN between them - just raw IPv6 end to end.
OS: Gentoo Linux x86_64 kept well up to date
Filesystem: EXT4 for both
MTA: Postfix 3.4-x (same as you)
IMAP: Dovecot running natively on the machine (no jail/chroot)
Replication: Secured TCP / master-master (but only one master is actively used for SMTP and MUA access)
MUA: Thunderbird, Maildroid, MS Mail, Gmail client, Roundcube, LG Mail client on handset, etc.

> I believe it is worthwhile to mention here that I run a poor man's fail-over approach (round-robin DNS) as follows:
>
> DNS: mail.server.tld resolves to one IPv4 and one IPv6 address of each server, thus 4 IP addresses in total

For me I have a simple pair of servers that replicate mail to each other. Clients connect to the main server only, which is where mail comes in. The second one that experiences the problem with locking is only a redundant server which is a secondary MX, that only gets occasional SMTP deliveries from time to time. So the amount of replication back from the "secondary" is very minimal; the vast bulk of the replication is the primary replicating changes out.
Reuben
Hey all,
I've been affected by these replication issues too and finally downgraded back to 2.2.35 since some newly created virtual domains/mailboxes weren't replicated *at all* due to the bug(s).
My setup is more like a master-slave, where I only have a rather small virtual machine as the slave host, which is also only MX 20. The idea was to replicate all mails through dovecot and perform individual (independent) backups on each host.
The clients use a CNAME with a low TTL of 60s so in case my "master" (physical dedicated machine) goes down for a longer period I can simply switch to the slave.
In order for this concept to work, the replication has to work without any issues. Otherwise clients might notice missing mails, or it might even result in conflicts when the master comes back online if the slave was out of sync beforehand.
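The DNS side of such a failover is tiny; a hypothetical zone fragment (names and addresses invented for illustration) would look roughly like:

```
; clients connect via mail.example.com, a CNAME with a 60s TTL
mail    60    IN  CNAME  mx1.example.com.
; repoint the CNAME at mx2 if the master goes down for a longer period
mx1   3600    IN  A      192.0.2.10
mx2   3600    IN  A      192.0.2.20
```

The short TTL on the CNAME is what keeps the switchover window to about a minute; the A records themselves can keep long TTLs.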
On 06.05.18 - 21:34, Michael Grimm wrote:
And please have a look for processes like: doveadm-server: [IP4 <user> INBOX import:1/3] (doveadm-server)
These processes will "survive" a dovecot reboot ...
This is indeed the case. Once the replication processes (doveadm-server) get stuck I had to resort to kill -9 to get rid of them. Something is really wrong there.
As stated multiple times in the #dovecot irc channel I'm happy to test any patches for the 2.3 series in my setup and provide further details if required.
Thanks to all who are participating in this thread and finally these issues get some attention :)
Cheers, Thore
-- Thore Bödecker
GPG ID: 0xD622431AF8DB80F3 GPG FP: 0F96 559D 3556 24FC 2226 A864 D622 431A F8DB 80F3
Hi,
Checking in - this is still an issue with 2.3-master as of today (2.3.devel (3a6537d59)).
I haven't been able to narrow the problem down to a specific commit. The best I have been able to get to is that this commit is relatively good (not perfect but good enough):
d9a1a7cbec19f4c6a47add47688351f8c3a0e372 (from Feb 19, 2018)
whereas this commit:
6418419ec282c887b67469dbe3f541fc4873f7f0 (From Mar 12, 2018)
is pretty bad. Somewhere in between some commit has caused the problem (which may have been introduced earlier) to get much worse.
There seem to be a handful of us with broken systems who are prepared to assist in debugging this and put in our own time to patch, test and get to the bottom of it, but it is starting to look like we're basically on our own.
What sort of debugging, short of bisecting 100+ patches between the commits above, can we do to progress this?
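For scale: a bisect over that window is cheaper than it sounds, since git bisect halves the candidate set each round and so needs only about log2 of the commit count in build-and-test cycles. A back-of-envelope sketch (the commit count is a guess):

```python
# Back-of-envelope cost of bisecting the suspect window: git bisect
# halves the candidate set each round, so the number of build/test
# cycles grows only logarithmically. 120 is a rough guess at the
# number of commits between the good and bad points named above.
import math

candidates = 120
steps = math.ceil(math.log2(candidates))
print(steps)
```

The driver itself would be along the lines of `git bisect start`, `git bisect bad 6418419ec282c887b67469dbe3f541fc4873f7f0`, `git bisect good d9a1a7cbec19f4c6a47add47688351f8c3a0e372`, then `git bisect run ./reproduce.sh`, where reproduce.sh is whatever hypothetical script builds dovecot and exercises replication until the stall appears (or doesn't).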
Reuben
On 7/05/2018 5:54 am, Thore Bödecker wrote:
[...]
Reuben Farrelly reuben-dovecot@reub.net wrote:
Checking in - this is still an issue with 2.3-master as of today (2.3.devel (3a6537d59)).
That doesn't sound good, because I did hope that someone has been working on this issue ...
I haven't been able to narrow the problem down to a specific commit. The best I have been able to get to is that this commit is relatively good (not perfect but good enough):
d9a1a7cbec19f4c6a47add47688351f8c3a0e372 (from Feb 19, 2018)
whereas this commit:
6418419ec282c887b67469dbe3f541fc4873f7f0 (From Mar 12, 2018)
is pretty bad. Somewhere in between some commit has caused the problem (which may have been introduced earlier) to get much worse.
Thanks for the info.
There seem to be a handful of us with broken systems who are prepared to assist in debugging this and put in our own time to patch, test and get to the bottom of it, but it is starting to look like we're basically on our own.
I wonder if there is anyone running a 2.3 master-master replication scheme *without* running into this issue? If so, please let us know: "yes, 2.3 master-master replication runs as rock-stable as it did in 2.2."
Anyone?
I would love to get some feedback from the developers regarding:
#) are commercial customers of yours running 2.3 master-master replication without those issues reported in this thread?
#) do you get reports about these issues outside this ML as well?
#) and ...
What sort of debugging, short of bisecting 100+ patches between the commits above, can we do to progress this?
… what kind of debugging do you suggest?
Regards, Michael
On 31 May 2018, at 17:52, Michael Grimm trashcan@ellael.org wrote:
[...]
What sort of debugging, short of bisecting 100+ patches between the commits above, can we do to progress this?
… what kind of debugging do you suggest?
Aki sent me over some patches recently and I have built a custom package for it for FreeBSD. It's in my pkg repo, which I can forward you if you want it. You need to add some lines to the logging config and then trace those and collaborate with the dovecot community/developers. I have not yet found the time to actively pursue this due to other things on my head. Sorry for that. I hope to do this "soon" but I don't want to pin myself to a commitment that I might not be able to make :)
Cheers Remko
On 31. May 2018, at 18:09, Remko Lodder remko@FreeBSD.org wrote:
On 31 May 2018, at 17:52, Michael Grimm trashcan@ellael.org wrote:
[...]
Aki sent me over some patches recently and I have build a custom package for it for FreeBSD. It’s in my pkg repo which I can forward you if you want it.
Great news. I'd love to test it, thus, could you forward it to me? Thanks.
You need to add some lines to the logging thing and then trace those and collaborate with the dovecot community/developers.
And, please let me know, which config is needed for those logging lines as well.
I have not yet found the time to actively pursue this due to other things on my head. Sorry for that. I hope to do this "soon" but I don't want to pin myself to a commitment that I might not be able to make :)
Well, I will give it a try.
But: more testers might see more in those logging lines ;-)
Regards, Michael
On 1/06/2018 2:47 AM, Michael Grimm wrote:
> On 31. May 2018, at 18:09, Remko Lodder remko@FreeBSD.org wrote:
>> Aki sent me over some patches recently and I have built a custom package for it for FreeBSD. It's in my pkg repo which I can forward you if you want it.
> Great news. I'd love to test it, thus, could you forward it to me? Thanks.

Very good news. I too would love to test, and am more than happy to share config, system setup and logs. Please forward the pkg repo if you wouldn't mind.

>> You need to add some lines to the logging thing and then trace those and collaborate with the dovecot community/developers.
> And, please let me know, which config is needed for those logging lines as well.

As above, please.

> But: more testers might see more in those logging lines ;-)

More than willing to test.

Regards, Andy
Hi Timo,
Yes this seems to work fine so far. I’ll ask the people to add it to the current FreeBSD version..
Cheers Remko
On 6 Jun 2018, at 19:34, Timo Sirainen tss@iki.fi wrote:
Should be fixed by https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49...
Still not quite right for me.
Jun 7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mail, last recv=mail (EOL))
Jun 7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: Timeout during state=sync_mails (send=mails recv=recv_last_common)
I'm not sure if there is an underlying replication error or if the message is just cosmetic, though.
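For anyone trying to gauge how often this happens, the two error lines above are easy to scrape out of syslog. A throwaway sketch (log format copied from the lines quoted in this thread; field layout beyond that is an assumption):

```python
# Pull dsync stall/timeout errors (as quoted above) out of a syslog
# excerpt, capturing the peer host, the stall duration, and the sync
# state the run died in.
import re

log = """\
Jun  7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mail, last recv=mail (EOL))
Jun  7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: Timeout during state=sync_mails (send=mails recv=recv_last_common)
"""

stalls = re.findall(
    r"dsync\(([^)]+)\): I/O has stalled, no activity for (\d+) seconds", log)
states = re.findall(r"Timeout during state=(\w+)", log)
print(stalls)
print(states)
```

Pointing the same two patterns at a real mail log would at least show whether the stalls cluster around particular peers or sync states.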
Reuben
On 7/06/2018 4:55 AM, Remko Lodder wrote:
Hi Timo,
Yes this seems to work fine so far. I’ll ask the people to add it to the current FreeBSD version..
Cheers Remko
On 6 Jun 2018, at 19:34, Timo Sirainen tss@iki.fi wrote:
Should be fixed by https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49...
On 7 Jun 2018, at 07:21, Reuben Farrelly reuben-dovecot@reub.net wrote:
Still not quite right for me.
Jun 7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mail, last recv=mail (EOL)) Jun 7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: Timeout during state=sync_mails (send=mails recv=recv_last_common)
I'm not sure if there is an underlying replication error or if the message is just cosmetic, though.
Reuben
Hi,
Admittedly, I have had a few occurrences of this behaviour as well last night. It happens more sporadically now and seems to be a conflict with my user settings. (My users get added twice to the system, user-domain.tld and user@domain.tld; both are being replicated, and the noreplicate flag is not yet honored in the version I am using, so I cannot bypass that yet.)
I do see messages that arrived on the other machine appearing on the machine that I am using to read these emails. So replication seems to work in that regard (where it obviously did not do so well before).
Cheers Remko
On 7/06/2018 4:55 AM, Remko Lodder wrote:
Hi Timo, Yes this seems to work fine so far. I’ll ask the people to add it to the current FreeBSD version.. Cheers Remko
On 6 Jun 2018, at 19:34, Timo Sirainen tss@iki.fi wrote:
Should be fixed by https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49...
On 2018-06-07 07:34, Remko Lodder wrote:
On 7 Jun 2018, at 07:21, Reuben Farrelly reuben-dovecot@reub.net wrote:
Still not quite right for me.
Jun 7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mail, last recv=mail (EOL)) Jun 7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: Timeout during state=sync_mails (send=mails recv=recv_last_common)
I'm not sure if there is an underlying replication error or if the message is just cosmetic, though.
Admittedly, I have had a few occurrences of this behaviour as well last night. It happens more sporadically now and seems to be a conflict with my user settings. (My users get added twice to the system, user-domain.tld and user@domain.tld; both are being replicated, and the noreplicate flag is not yet honored in the version I am using, so I cannot bypass that yet.)
I do see messages that arrived on the other machine appearing on the machine that I am using to read these emails. So replication seems to work in that regard (where it obviously did not do so well before).
First of all: Major improvement by this patch applied to 2.3.1, there are no more hanging processes.
But: I do find quite a number of error messages like:
Jun 7 06:34:20 mail dovecot: doveadm: Error: Failed to lock mailbox NAME for dsyncing: \
  file_create_locked(/.../USER/mailboxes/NAME/dbox-Mails/.dovecot-box-sync.lock) failed: \
  fcntl(/.../USER/mailboxes/NAME/dbox-Mails/.dovecot-box-sync.lock, write-lock, F_SETLKW) \
  locking failed: Timed out after 30 seconds (WRITE lock held by pid 79452)
These messages are only found on the server which normally receives synced messages (because almost all mail is received via the other master due to MX priorities).
Conclusion: After 12 hours of running a patched FBSD port I do get those error messages, but replication seems to work now. Still, I have the feeling that something else might be going wrong.
@Timo: Wouldn't it be worth adding replicator/aggregator error messages to head, like the ones Aki sent to Remko? That might shed some light on replication issues today and in the future.
Regards, Michael
Michael Grimm trashcan@ellael.org wrote:
First of all: Major improvement by this patch applied to 2.3.1, there are no more hanging processes.
From my point of view: the recent commit from Timo did not only fix those hanging processes ...
But: I do find quite a number of error messages like:
Jun 7 06:34:20 mail dovecot: doveadm: Error: Failed to lock mailbox NAME for dsyncing:
file_create_locked(/.../USER/mailboxes/NAME/dbox-Mails/.dovecot-box-sync.lock)
failed: fcntl(/.../USER/mailboxes/NAME/dbox-Mails/.dovecot-box-sync.lock, write-lock, F_SETLKW)
locking failed: Timed out after 30 seconds (WRITE lock held by pid 79452)
… it fixed it finally!
These messages are only found at that server which is normally receiving synced messages (because almost all mail is received via the other master due to MX priorities).
No wonder: it was completely my fault. I had had "mail_replica =" pointing to itself :-( Copying configs from one server to the other without thinking is bad …
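For contrast, the intended shape of that setting, each side's mail_replica naming its peer and never itself, looks roughly like this (hostnames hypothetical; sketch of the replication plugin config, not a complete setup):

```
# on server1.lan - replicate TO server2:
plugin {
  mail_replica = tcp:server2.lan
}

# on server2.lan - replicate TO server1:
plugin {
  mail_replica = tcp:server1.lan
}
```

With mail_replica pointing at the local host, each dsync run ends up contending with itself for the same sync locks, which is consistent with the lock-timeout noise described above.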
Now, after having fixed this stupid configuration mistake, I can report, after some hours, that from my point of view replication is back to its 2.2.x performance!
I do have to apologise for the noise, sorry.
Regards, Michael
Regarding my comment below - it looks like a false alarm on my part. The commit referenced below hasn't gone into master-2.3 yet, which meant it wasn't included when I rebuilt earlier today. That was an incorrect assumption on my part.
I have since manually patched it into master-2.3 and it looks to be OK so far - touch wood - with 4 hours testing so far.
Reuben
On 7/06/2018 3:21 pm, Reuben Farrelly wrote:
Still not quite right for me.
Jun 7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mail, last recv=mail (EOL)) Jun 7 15:11:33 thunderstorm.reub.net dovecot: doveadm: Error: Timeout during state=sync_mails (send=mails recv=recv_last_common)
I'm not sure if there is an underlying replication error or if the message is just cosmetic, though.
Reuben
On 7/06/2018 4:55 AM, Remko Lodder wrote:
Hi Timo,
Yes this seems to work fine so far. I’ll ask the people to add it to the current FreeBSD version..
Cheers Remko
On 6 Jun 2018, at 19:34, Timo Sirainen tss@iki.fi wrote:
Should be fixed by https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49...
Timo Sirainen:
Should be fixed by https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49...
please ignore my ignorance but shouldn't one add this commit regarding src/doveadm/client-connection-tcp.c ...
https://github.com/dovecot/core/commit/2a3b7083ce4e62a8bd836f9983b223e98e9bc...
... to a vanilla 2.3.1 source tree as well?
I do have to admit that I am absolutely clueless in understanding dovecot's source code, but I came across this commit because it was committed the very same day as the one solving those hanging processes.
I could test it myself, but I am not sure whether that commit would break my productive dovecot instances.
Regards, Michael
On 6/7/18, 3:43 AM, "dovecot on behalf of Michael Grimm" wrote:
> Timo Sirainen:
>> Should be fixed by https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49336
> please ignore my ignorance but shouldn't one add this commit regarding src/doveadm/client-connection-tcp.c ...
> https://github.com/dovecot/core/commit/2a3b7083ce4e62a8bd836f9983b223e98e9bc157
> ... to a vanilla 2.3.1 source tree as well?
[...]
I'm happy to add https://github.com/dovecot/core/commit/2a3b7083ce4e62a8bd836f9983b223e98e9bc... to the FreeBSD port if folks think it should help.
-- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 214-642-9640 E-Mail: larryrtx@gmail.com US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106
On 7 Jun 2018, at 11.43, Michael Grimm trashcan@ellael.org wrote:
Timo Sirainen:
Should be fixed by https://github.com/dovecot/core/commit/a952e178943a5944255cb7c053d970f8e6d49...
please ignore my ignorance but shouldn't one add this commit regarding src/doveadm/client-connection-tcp.c ...
https://github.com/dovecot/core/commit/2a3b7083ce4e62a8bd836f9983b223e98e9bc...
... to a vanilla 2.3.1 source tree as well?
That's a code simplification / cleanup commit. It doesn't fix anything.
Hi all,
Firstly, thank you both for your feedback and suggestions. My current workaround is to use rsync to keep my standby MX "close" to up-to-date. As my servers are not commercial production systems, so to speak, I am more than happy to carry out some tests if that would prove useful to the team. I have some spare hardware and am more than willing to spin up an extra VM if required.
regards, Andy
On 7/05/2018 5:34 AM, Michael Grimm wrote:
Hi Reuben,
thank you for digging into this issue reported by some users, now.
I wonder if there are users out there using master/master replication with dovecot 2.3.1 that do *NOT* observe the errors mentioned in this thread? If so, please report.
It would be interesting to know if these failures are seen by a subset of users, only.
And please have a look for processes like: doveadm-server: [IP4 <user> INBOX import:1/3] (doveadm-server)
These processes will "survive" a dovecot reboot ...
With kind regards, Michael
On 6. May 2018, at 16:11, Reuben Farrelly reuben-dovecot@reub.net wrote:
Ah yes. The hash I put below is wrong but the commit message I quoted was right. This is the last known working good commit, which was on Mar 6:
https://github.com/dovecot/core/commit/ff5d02182573b1d4306c2a8c36605c98f217e... "doveadm: Unref header search context after use Fixes memory leak, found by valgrind"
I've subsequently been able to determine that the bad commit was one of the dozens that were committed in a big batch on Mar 12, 2018. I definitely see the problem happening when running with commit 6418419ec282c887b67469dbe3f541fc4873f7f0 (the last one in that batch).
In other words, it's one of the commits between those two hashes that is the problem. Just need to work out which one...
The problem is also seen with -git as of a day or so ago. I'm tracking master-2.3 so I imagine I'm picking up all the 2.3 fixes as they go straight in (including the rc fixes). So this doesn't appear to be fixed yet.
Reuben
On 6/05/2018 11:41 pm, Aki Tuomi wrote:
Seems you got the hash wrong... You probably mean 4a1e2e8e85ad5d4ab43496a5f3cd6b7ff6ac5955, which fixes a memory leak but does not release the sync lock. Interesting if this commit fixes the issue.

Looking at the sync code, where the lock is taken / released, there is only one recent commit, 8077d714e11388a294f1583e706152396972acce, and that one creates the home directory if missing.

It would be interesting if you could test with 2.3.2rc1, which was released a few days ago, to see whether this issue goes away.

Aki
On 06 May 2018 at 15:21 Reuben Farrelly reuben-dovecot@reub.net wrote:
Hi Andy,
Funny you say that - I've been testing this very problem for some time today to narrow down the commits where I think the problem started occurring.
So far I have been able to roll forward to commit from git: 74ef8506dcb946acc42722683e49fdcb964eed19
"doveadm: Unref header search context after use" . So far I have been running on that code for a few hours and I'm not yet seeing the problem, which indicates the issue is probably subsequent to this.
I'm of the view that the problem started occurring somewhere between that commit above on 6 March and the final 2.3.1 release on 27 March. There are a lot of changes between those two dates, and testing to narrow down to one specific commit is extremely time consuming.
It would be really helpful if even one of the Dovecot developers could comment on this and give us some direction (or candidate commits to investigate) or just let us know if the root cause has been found. This bug is a showstopper and has stopped me tracking master-2.3 for over 3 months now, as I can't test later builds or even upgrade to the 2.3.1 release while replication is so broken.
Reuben
Message: 1 Date: Sun, 6 May 2018 13:21:57 +1000 From: Andy Weal andy@bizemail.com.au To: dovecot@dovecot.org Subject: 2.3.1 Replication is throwing scary errors Message-ID: f4274128-350b-9a74-a344-8e6caeffa112@bizemail.com.au Content-Type: text/plain; charset="utf-8"; Format="flowed"
Hi all,
New to the mailing lists but have joined up because of above */2.3.1 Replication is throwing scary errors
/*Brief system configuration ??? MX1 - Main ??? ??? Freebsd 11.1-Release-p9 ??? ??? Hosted on a Vultr VM in Sydney AU ??? ??? MTA = Postfix 3.4-20180401 ??? ??? Dovecot = 2.3.1 ??? ??? File system = ufs ??? MX2 - Backup ??? ??? Freebsd 11.1-Release-p9 ??? ???? Running on bare metal - no VM or jails ??? ??? MTA = Postfix 3.4-20180401 ??? ??? Dovecot = 2.3.1 ??? ??? File system = ufs ( on SSD)
*/ /*Brief sequence of events
- ??? apx 10 days back upgraded both mx1 and mx2 to dovecot 2.3.1_2 from 2.3.0 ? (service dovecot stop, portmaster upgrade, service dovecot start)
- ??? both systems ran ok with no errors for 10 days.
- ??? Last night I shutdown mx2 and restarted it a few hours later
- ??? within minutes i was getting the following types of errors on mx2
???????? May 06 12:56:29 doveadm: Error: Couldn't lock /var/mail/vhosts/example.net/user1/.dovecot-sync.lock: fcntl(/var/mail/vhosts/example.net/user1/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 1960)
? Before i venture down the rabbit hole of fault finding and excess coffee consumption I was wondering if any of you had any updates on the problems discussed below.
Cheers for now, Andy
Hi,
[Formatting is a bit rough, replying from a trimmed digest email]
/Message: 1 />/Date: Fri, 6 Apr 2018 15:04:35 +0200 />/From: Michael Grimm
https://dovecot.org/mailman/listinfo/dovecot> />/To: Dovecot Mailing List https://dovecot.org/mailman/listinfo/dovecot> />/Subject: Re: 2.3.1 Replication is throwing scary errors />/Message-ID: https://dovecot.org/mailman/listinfo/dovecot> />/Content-Type: text/plain; charset=utf-8 />//>/Reuben Farrelly wrote: />>/From: Michael Grimm https://dovecot.org/mailman/listinfo/dovecot> />//>>>/[This is Dovecot 2.3.1 at FreeBSD STABLE-11.1 running in two jails at distinct servers.] />>>/I did upgrade from 2.2.35 to 2.3.1 today, and I do become pounded by error messages at server1 (and vice versa at server2) as follows: />>>/| Apr 2 17:12:18 server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \ />>>/no activity for 600 seconds (last sent=mail_change, last recv=mail_change (EOL)) />>>/| Apr 2 17:12:18 server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \ />>>/(send=changes recv=mail_requests) />>>/[?] />>>/| Apr 2 18:59:03 server1.lan dovecot: doveadm: Error: dsync(server2.lan): I/O has stalled, \ />>>/no activity for 600 seconds (last sent=mail, last recv=mail (EOL)) />>>/| Apr 2 18:59:03 server1.lan dovecot: doveadm: Error: Timeout during state=sync_mails \ />>>/(send=mails recv=recv_last_common) />>>/I cannot see in my personal account any missing replications, *but* I haven't tested this thoroughly enough. I do have customers being serviced at these productive servers, *thus* I'm back to 2.2.35 until I do understand or have learned what is going on. />//>/In my reply to this statement of mine I mentioned that I have seen those timeouts quite some times during the past year. Thus, I upgraded to 2.3.1 again, and boom: after some hours I ended up in hanging processes [1] like (see Remko's mail in addition) ... />//>/doveadm-server: [IP4/6 <USER1> SOME/MAILBOX import:0/0] (doveadm-server) />//>/? at server2 paired with a file like ? 
>
> -rw------- 1 vmail dovecot uarch 0 Apr 3 16:52 /home/to/USER1/.dovecot-sync.lock
>
> Corresponding logfile entries at server2 are like ...
>
> Apr 3 17:10:49 server2.lan dovecot: doveadm: Error: Couldn't lock /home/to/USER1/.dovecot-sync.lock: \
>   fcntl(/home/to/USER1/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds \
>   (WRITE lock held by pid 51110)
>
> [1] Even stopping dovecot will not end those processes. One has to kill them manually before restarting dovecot.
>
> After one day of testing 2.3.1, with a couple of those locking/timeout episodes and now mail going missing depending on which server your MUA connects to, I went back to 2.2.35. After two days on that version I have not had such an episode again.
>
>> It's not just you. This issue hit me recently, and it was impacting
>> replication noticeably. I am following git master-2.3.
> [...]
>> There is also a second issue: a long-standing race in replication whereby if a mail comes in, is written to disk, is replicated and then deleted in short succession, it will reappear to the MUA. I suspect the mail is being replicated back from the remote. A few people have reported it over the years, but it's not reliably reproducible, so it has never been fixed.
>> And lastly, there has been an ongoing but seemingly minor issue with locking timing out after 30s, particularly on the remote host being replicated to. I rarely see the problem on my local disk, where almost all of the mail comes in; it almost always occurs on the replica/remote system.
>
> It might be time to describe our setups in order to find common ground that might trigger this issue you describe and that Remko and I ran into as well.
>
> Servers: cloud instances (both identical), around 25 ms latency apart.
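Not part of the original thread, but possibly useful to anyone chasing the same stale `.dovecot-sync.lock`: the fcntl error line itself names the PID holding the lock, so a small helper (purely illustrative, written against the log format quoted above) can pull it out before you inspect and, if necessary, manually kill the hung doveadm-server process:

```shell
#!/bin/sh
# Extract the PID from a dovecot "locking failed ... WRITE lock held by pid N"
# error line. Illustrative sketch only; adapt the log path to your syslog setup.

lock_holder_pid() {
    # Reads log lines on stdin, prints the PID from the first matching line.
    sed -n 's/.*lock held by pid \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example, using the error line quoted earlier in the thread:
echo 'Apr  3 17:10:49 server2.lan dovecot: doveadm: Error: locking failed: Timed out after 30 seconds (WRITE lock held by pid 51110)' \
    | lock_holder_pid
# prints: 51110

# Then inspect the process (e.g. "ps -p 51110") before killing it by hand,
# since - per the report above - stopping dovecot does not reap these processes.
```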
> Intel Core Processor (Haswell, no TSX) (3092.91-MHz K8-class CPU)
> Both servers are connected via IPsec/racoon tunnels
> OS: FreeBSD 11.1-STABLE (both servers)
> Filesystem: ZFS
> MTA: postfix 3.4-20180401 (postfix delivers via dovecot's LMTP)
> IMAP: dovecot running in FreeBSD jails (issues with 2.3.1, fine with 2.2.35)
> Replication: unsecured tcp / master-master
> MUA: mainly iOS or macOS mail.app, rarely roundcube

For me:

Servers:
- Main: VM on a VMware ESXi local system (light load), local SSD disks (no NFS)
- Redundant: Linode VM in Singapore, around 250 ms away. Also no NFS; Linode uses SSDs for IO.
There is native IPv6 connectivity between both VMs. As I am using tcps I don't have a VPN between them - just raw IPv6 end to end.
OS: Gentoo Linux x86_64, kept well up to date
Filesystem: ext4 for both
MTA: Postfix 3.4-x (same as you)
IMAP: Dovecot running natively on the machine (no jail/chroot)
Replication: secured TCP / master-master (but only one master is actively used for SMTP and MUA access)
MUA: Thunderbird, Maildroid, MS Mail, Gmail client, Roundcube, LG mail client on handset, etc.
> I believe it is worthwhile to mention here that I run a poor man's fail-over approach (round-robin DNS) as follows:
>
> DNS: mail.server.tld resolves to one IPv4 and one IPv6 address of each server, thus 4 IP addresses in total

For me, I have a simple pair of servers that replicate mail to each other. Clients connect to the main server only, which is also where mail comes in. The second server, the one that experiences the locking problem, is only a redundant secondary MX that gets occasional SMTP deliveries from time to time. So the amount of replication back from the "secondary" is minimal; the vast bulk of the replication is the primary pushing changes out.
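For anyone reading along who hasn't set up replication: both setups above are variants of the standard Dovecot master-master dsync replication configuration. A minimal sketch follows, with hostname, port, and password purely illustrative (not taken from either setup); the "tcps" scheme is the TLS-secured variant I use, Michael's uses plain "tcp" over IPsec:

```conf
# Minimal master-master replication sketch (values illustrative).
mail_plugins = $mail_plugins notify replication

service replicator {
  process_min_avail = 1
}

service aggregator {
  fifo_listener replication-notify-fifo {
    user = vmail
  }
  unix_listener replication-notify {
    user = vmail
  }
}

# doveadm connection used by dsync; same settings on both servers,
# with mail_replica pointing at the peer.
doveadm_port = 4117
doveadm_password = example-secret

plugin {
  mail_replica = tcps:mx2.example.net:4117
}
```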
Reuben
participants (8):
- Aki Tuomi
- Andy Weal
- Larry Rosenman
- Michael Grimm
- Remko Lodder
- Reuben Farrelly
- Thore Bödecker
- Timo Sirainen