On Tue, 25 Jan 2011 00:13:53 +0200, Timo Sirainen tss@iki.fi wrote:
On 24.1.2011, at 23.06, list@airstreamcomm.net wrote:
We tested the patch you suggested with no success. We are seeing timestamps straying into the 100's of seconds of difference, which does not reflect the perceivable drift that shows on our systems clock (which is negligible, sub 1 second). .. Created dotlock file's timestamp is different than current time (1295900137 vs 1295899898): /mail/username/Maildir/dovecot-uidlist
I see you have:
dotlock_use_excl: no
There was another reason why the above error could happen with that setting. http://hg.dovecot.org/dovecot-1.2/rev/ab81fbb195e2 fixes it.
But anyway, the main problem you have is that Dovecot is waiting for hundreds of seconds for the uidlist lock. It shouldn't. Meaning your NFS server is answering way too slowly to Dovecot.
Timo,
Thanks for the quick reply.
Per the dovecot wiki on nfs http://wiki.dovecot.org/NFS:
Multi-server setup that tries to flush NFS caches:
mmap_disable = yes dotlock_use_excl = no # only needed with NFSv2, NFSv3+ supports O_EXCL and it's faster mail_nfs_storage = yes # v1.1+ only mail_nfs_index = yes # v1.1+ only
We configured our systems to match these settings, as we are trying to use a multi-server setup.
We certainly have not excluded our NFS server as the culprit for the long wait, but at this point we are running two server's against that NFS share, a postfix machine and a dovecot machine, both accessing the disk without error or any latency. It's only when we add a second dovecot machine to the mix when we start to see the dotlock errors. In our configuration we have a load balancer that is distributing traffic automatically to the pop and imap services on our cluster of machines. Once we add a new server to the mix the load balancer automatically directs new traffic, and we start seeing the errors:
Jan 24 13:20:39 10.123.128.105 dovecot: pop3-login: Login: user=<mpbixler>, method=PLAIN, rip=68.65.35.174, lip=64.33.128.105 Jan 24 13:20:42 10.123.128.105 dovecot: POP3(mpbixler): Disconnected: Logged out top=0/0, retr=0/0, del=0/8234, size=1450555980 Jan 24 13:40:46 10.123.128.105 dovecot: pop3-login: Login: user=<mpbixler>, method=PLAIN, rip=68.65.35.174, lip=64.33.128.105 Jan 24 13:40:47 10.123.128.105 dovecot: POP3(mpbixler): Disconnected: Logged out top=0/0, retr=0/0, del=0/8234, size=1450555980 Jan 24 14:14:28 10.123.128.108 dovecot: pop3-login: Login: user=<mpbixler>, method=PLAIN, rip=68.65.35.174, lip=64.33.128.108 Jan 24 14:14:28 10.123.128.108 dovecot: POP3(mpbixler): Created dotlock file's timestamp is different than current time (1295900068 vs 1295899828): /mail/m/p/mpbixler/Maildir/dovecot-uidlist Jan 24 14:14:28 10.123.128.108 dovecot: POP3(mpbixler): Created dotlock file's timestamp is different than current time (1295900068 vs 1295899828): /mail/m/p/mpbixler/Maildir/dovecot-uidlist Jan 24 14:14:32 10.123.128.108 dovecot: POP3(mpbixler): Created dotlock file's timestamp is different than current time (1295900072 vs 1295899833): /mail/m/p/mpbixler/Maildir/dovecot-uidlist Jan 24 14:14:33 10.123.128.108 dovecot: POP3(mpbixler): Disconnected: Logged out top=0/0, retr=1/7449, del=0/8235, size=1450563412 Jan 24 14:34:36 10.123.128.105 dovecot: pop3-login: Login: user=<mpbixler>, method=PLAIN, rip=68.65.35.174, lip=64.33.128.105 Jan 24 14:34:40 10.123.128.105 dovecot: POP3(mpbixler): Disconnected: Logged out top=0/0, retr=1/10774, del=0/8236, size=1450574168 Jan 24 14:54:53 10.123.128.105 dovecot: pop3-login: Login: user=<mpbixler>, method=PLAIN, rip=68.65.35.174, lip=64.33.128.105 Jan 24 14:54:54 10.123.128.105 dovecot: POP3(mpbixler): Disconnected: Logged out top=0/0, retr=0/0, del=0/8236, size=1450574168
As you can see here, our current mail server 10.123.128.105 is handling the traffic without incident, and once we add 10.123.128.108 to the cluster the dovecot service errors out. From the last time the user was logged in at 13:40:47 to the login on the new server at 14:14:28 is about 34 minutes, which from my understanding of the code is not possible for a lock file to exist that long, and does not reflect in the time stamps in the error log.
Pretty stumped at this point, as our NFS server appears to be running smoothly. All the packet captures we have run show that all the NFS related traffic is moving back and forth quickly and without error.
Thanks,
Michael