Occasional lock timeouts on Linode VM with Dovecot Replication
I've been seeing periodic entries in my dovecot logs like this:
dovecot[3464]: dsync-server(kaylene): Error: Couldn't lock /home/kaylene/.dovecot-sync.lock: Timed out after 30 seconds: 3 Time(s)
dovecot[3464]: dsync-server(reuben): Error: Couldn't lock /home/reuben/.dovecot-sync.lock: Timed out after 30 seconds: 1 Time(s)
They occur several times per day, but there is no obvious cause and I am not aware of any problems they are causing. (They could be behind some "reappearing UID" type messages that are also logged periodically, but I can't be sure.)
They occur on a lightly loaded Linode VM, KVM Paravirtualised and with only local SSD disk storage. The VM is a Gentoo Linux VM running the latest kernels that Linode provide. I also saw this problem under Xen.
The dovecot setup is dsync replication between two hosts, with about 150ms of latency between them. The host on which I am seeing these messages (lightning) is a dovecot replica of another system (thunderstorm). I am using Maildir storage.
Thunderstorm sees the vast majority of the client side reads and writes and lightning just functions as a not-so-active replica.
Thunderstorm is also a VM but on VMware (also on SSDs). This system has never had this problem.
I've had this across many dovecot versions going back many months now, so it's impossible to pinpoint when it started. I am currently running the dovecot -git master-2.2 branch.
I've never seen disk latency in excess of 30s on any system either so I doubt that raw IO is the cause.
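For illustration only, and not Dovecot's actual locking code: the sketch below shows the general way a lock-file timeout behaves. The waiter gives up whenever another process holds the lock for the whole waiting window, whatever the holder is busy with, so a timeout by itself does not have to mean slow disk IO. The helper name `lock_with_timeout`, the demo file name and the use of flock() are all assumptions made for this example.

```python
import fcntl
import os
import tempfile
import time


def lock_with_timeout(path, timeout_secs, poll_secs=0.05):
    """Try to take an exclusive flock() on path, giving up after timeout_secs.

    Returns the open file descriptor on success, or None on timeout.
    Hypothetical helper for illustration; not how Dovecot implements it.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    deadline = time.monotonic() + timeout_secs
    while True:
        try:
            # Non-blocking attempt; raises BlockingIOError if someone holds it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if time.monotonic() >= deadline:
                os.close(fd)
                return None
            time.sleep(poll_secs)


if __name__ == "__main__":
    # Demo lock file in a temp directory, not a real mailbox path.
    lock_path = os.path.join(tempfile.mkdtemp(), "demo-sync.lock")
    holder = lock_with_timeout(lock_path, timeout_secs=1)  # first taker wins
    waiter = lock_with_timeout(lock_path, timeout_secs=1)  # second one waits, then gives up
    print("holder got lock:", holder is not None)
    print("waiter got lock:", waiter is not None)
    os.close(holder)
```

In the demo the second attempt fails even though the disk is idle, simply because the first descriptor still holds the lock.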
I don't have any settings specified in 10-mail.conf in the Mail processes section relating to locking or mmap.
Has anyone else experienced this, or does anyone have ideas about where to look next to determine the root cause?
Is this a common warning to see in cloud hosted/shared environments?
Reuben
Reuben,
On Sunday, July 17, 2016 04:18:45 PM Reuben Farrelly wrote:
> I've been seeing periodic entries in my dovecot logs like this:
> dovecot[3464]: dsync-server(kaylene): Error: Couldn't lock /home/kaylene/.dovecot-sync.lock: Timed out after 30 seconds: 3 Time(s)
> dovecot[3464]: dsync-server(reuben): Error: Couldn't lock /home/reuben/.dovecot-sync.lock: Timed out after 30 seconds: 1 Time(s)
> They occur several times per day, but don't appear to have any obvious cause and I am not aware of any problems this is causing. [They could be the cause of some reappearing UID type messages that also periodically are logged, but I can't be sure]
> They occur on a lightly loaded Linode VM, KVM Paravirtualised and with only local SSD disk storage. The VM is a Gentoo Linux VM running the latest kernels that Linode provide. I also saw this problem under Xen.
I am running the same setup: Gentoo with replicating Dovecot on Linode VMs. The only difference is that I am using NFS, whereas it seems you are using local disk. I have never had issues like the ones you're experiencing. My mail VMs get pretty loaded at times due to ASSP and mail volume, so I would not think it is load related whatsoever.
If you feel it might be specific to that VM you might request Linode move it to a new host machine. I have had one of my mail servers have some issues before and it was host related. Linode opened a ticket and migrated it about the time I got the first Nagios notification. If you get Linode to migrate the VM and it continues, you can rule out the host at least.
> Is this a common warning to see in cloud hosted/shared environments?
Not to my knowledge, I have never seen that error before.
--
William L. Thomson Jr.
Obsidian-Studios, Inc.
http://www.obsidian-studios.com
Hi again,
Thanks for your response William, answers inline:
On 21/07/2016 1:58 AM, William L. Thomson Jr. wrote:
> Reuben,
> On Sunday, July 17, 2016 04:18:45 PM Reuben Farrelly wrote:
>> I've been seeing periodic entries in my dovecot logs like this:
>> dovecot[3464]: dsync-server(kaylene): Error: Couldn't lock /home/kaylene/.dovecot-sync.lock: Timed out after 30 seconds: 3 Time(s)
>> dovecot[3464]: dsync-server(reuben): Error: Couldn't lock /home/reuben/.dovecot-sync.lock: Timed out after 30 seconds: 1 Time(s)
>> They occur several times per day, but don't appear to have any obvious cause and I am not aware of any problems this is causing. [They could be the cause of some reappearing UID type messages that also periodically are logged, but I can't be sure]
>> They occur on a lightly loaded Linode VM, KVM Paravirtualised and with only local SSD disk storage. The VM is a Gentoo Linux VM running the latest kernels that Linode provide. I also saw this problem under Xen.
> I am running the same setup: Gentoo with replicating Dovecot on Linode VMs. The only difference is that I am using NFS, whereas it seems you are using local disk. I have never had issues like the ones you're experiencing. My mail VMs get pretty loaded at times due to ASSP and mail volume, so I would not think it is load related whatsoever.
Thanks - yes - looks to be unrelated to load then.
> If you feel it might be specific to that VM you might request Linode move it to a new host machine. I have had one of my mail servers have some issues before and it was host related. Linode opened a ticket and migrated it about the time I got the first Nagios notification. If you get Linode to migrate the VM and it continues, you can rule out the host at least.
I've already ruled out the host. I had this Linode in the Fremont farm all of last year, and migrated it to Singapore earlier this year. The errors remained, which to me more or less rules out the host hardware as the problem (I suppose it is possible both hosts were about equally affected, but that's not very likely). I've also moved from Xen to KVM, and that didn't make the problem go away either.
>> Is this a common warning to see in cloud hosted/shared environments?
> Not to my knowledge, I have never seen that error before.
I am not seeing it on VMware here on my main host (I don't think the error has ever been logged here). It has the same filesystem, same version of dovecot and same arch; the only difference that I can think of is the latency of about 130ms between the two replica hosts.
Can anyone advise what I can do to debug the problem further? The error message isn't much help in determining where to look next.
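One low-effort starting point, sketched under assumptions: tally the timeouts per user to see whether they cluster on particular mailboxes. The regex is modelled only on the log lines quoted in this thread (including the optional "N Time(s)" summary suffix), and `tally_lock_timeouts` is a hypothetical helper, not part of Dovecot; other log formats would need the pattern adjusted.

```python
import re
from collections import Counter

# Pattern modelled on the lines quoted in this thread. The trailing
# ": N Time(s)" summary count is optional and defaults to 1 when absent.
TIMEOUT_RE = re.compile(
    r"dsync-server\((?P<user>[^)]+)\): Error: Couldn't lock "
    r"(?P<path>\S+): Timed out after (?P<secs>\d+) seconds"
    r"(?:: (?P<times>\d+) Time\(s\))?"
)


def tally_lock_timeouts(lines):
    """Count lock-timeout errors per user across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        # finditer copes with several entries fused onto one line.
        for m in TIMEOUT_RE.finditer(line):
            counts[m.group("user")] += int(m.group("times") or 1)
    return counts


if __name__ == "__main__":
    sample = [
        "dovecot[3464]: dsync-server(kaylene): Error: Couldn't lock "
        "/home/kaylene/.dovecot-sync.lock: Timed out after 30 seconds: 3 Time(s)",
        "dovecot[3464]: dsync-server(reuben): Error: Couldn't lock "
        "/home/reuben/.dovecot-sync.lock: Timed out after 30 seconds: 1 Time(s)",
    ]
    for user, n in tally_lock_timeouts(sample).most_common():
        print(user, n)
```

If the raw log lines carry timestamps, the same loop could be extended to bucket by hour and compare against when syncs run on the other host.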
Thanks, Reuben