[Dovecot] Sudden, large numbers of "Timeout while waiting for lock for transaction log ..."
Hi,
Up until yesterday, our environment which consists an NFS maildir file store with multiple front end servers, was working fine. We've verified that the server clocks and machines clocks are in sync.
Starting yesterday afternoon, We are getting ~850 log entries of the form 'Timeout while waiting for lock for transaction log files' or 'Our dotlock file was modified, assuming it wasn't overridden (kept it 180 sec)
Based on packet capture, just one of these index files shows 28553 GETATTRs queries in in a one minute.
At this happened at exactly the same time on all of our servers, it is pretty clear that the back-end system (or network) is a major factor although nobody will confess that they made any changes.
It would be helpful, and very appreciated, to get more information about what this might be so that we can nudge the correct people to undo whatever it is that they didn't do.
We are running currently dovecot 1.1.3
an 6 01:21:02 earth-griffen dovecot: IMAP(zabala): Timeout while waiting for lock for transaction log file /var/spool/dovecot/indexes/z/zabala/.INBOX/dovecot.index.log Jan 6 01:21:02 earth-griffen dovecot: IMAP(zabala): Our dotlock file /var/spool/maildir/z/zabala/dovecot-uidlist.lock was deleted (kept it 180 secs) Jan 6 01:21:02 earth-griffen dovecot: IMAP(zabala): Timeout while waiting for lock for transaction log file /var/spool/dovecot/indexes/z/zabala/.INBOX/dovecot.index.log Jan 6 01:21:02 earth-griffen dovecot: IMAP(zabala): Our dotlock file /var/spool/maildir/z/zabala/dovecot-uidlist.lock was modified (1231233482 vs 1231233662), assuming it wasn't overridden (kept it 180 secs) Jan 6 01:21:02 earth-griffen dovecot: IMAP(zabala): Timeout while waiting for lock for transaction log file /var/spool/dovecot/indexes/z/zabala/.INBOX/dovecot.index.log Jan 6 01:21:02 earth-griffen dovecot: IMAP(zabala): Our dotlock file /var/spool/maildir/z/zabala/dovecot-uidlist.lock was modified (1231233482 vs 1231233662), assuming it wasn't overridden (kept it 180 secs) Jan 6 01:21:04 fire-griffen dovecot: IMAP(sukwon): Timeout while waiting for lock for transaction log file /var/spool/dovecot/indexes/s/sukwon/.Trash/dovecot.index.log
-- Jack Stewart Academic Computing Services, IMSS, California Institute of Technology jstewart@caltech.edu 626-395-4690 office
On Tue, 2009-01-06 at 23:11 -0800, Jack Stewart wrote:
an 6 01:21:02 earth-griffen dovecot: IMAP(zabala): Timeout while waiting for lock for transaction log file /var/spool/dovecot/indexes/z/zabala/.INBOX/dovecot.index.log
This is the main problem. So indexes are also on NFS? What locking are they using (dovecot -n output)?
Timo Sirainen wrote:
On Tue, 2009-01-06 at 23:11 -0800, Jack Stewart wrote:
an 6 01:21:02 earth-griffen dovecot: IMAP(zabala): Timeout while waiting for lock for transaction log file /var/spool/dovecot/indexes/z/zabala/.INBOX/dovecot.index.log
This is the main problem. So indexes are also on NFS? What locking are they using (dovecot -n output)?
Yes, the indexes are also on NFS.
The locking is fcntl() - the default.
Below is the dovecot -n output.
One of the guys that I work with appears to clear up the problem for now
- I think it was via manual deletion of the indexes and processes that were logging errors but I won't know for sure until he gets in.
1.1.3: /etc/dovecot.d/imap-server-its-caltech-edu.cfg
base_dir: /var/run/dovecot/imap-server-its/ syslog_facility: local4 protocols: imap imaps listen: *:10143 ssl_listen: *:10993 ssl_ca_file: /etc/pki/dovecot/certs/caltech-ca.pem ssl_cert_file: /etc/pki/dovecot/certs/imap-server.its.caltech.edu.pem ssl_key_file: /etc/pki/dovecot/private/imap-server.its.caltech.edu.key disable_plaintext_auth: yes login_dir: /var/run/dovecot/imap-server-its//login login_executable: /opt/dovecot/libexec/dovecot/imap-login login_processes_count: 16 login_max_processes_count: 2048 max_mail_processes: 4096 mail_max_userip_connections: 1024 verbose_proctitle: yes mail_location: maildir:/var/spool/maildir/%1Ln/%Ln:INDEX=/var/spool/dovecot/indexes/%1Ln/%Ln mail_debug: yes mail_full_filesystem_access: yes mmap_disable: yes mail_nfs_storage: yes mail_nfs_index: yes mail_plugins: fts fts_squat imap_capability: IMAP4rev1 SASL-IR SORT THREAD=REFERENCES MULTIAPPEND UNSELECT LITERAL+ CHILDREN NAMESPACE LOGIN-REFERRALS UIDPLUS LIST-EXTENDED I18NLEVEL=1 imap_client_workarounds: delay-newmail netscape-eoh pop3_reuse_xuidl: yes pop3_uidl_format: UID%v-%u namespace: type: private separator: . prefix: Mail. inbox: yes list: yes subscriptions: yes auth default: mechanisms: plain login verbose: yes passdb: driver: ldap args: /etc/dovecot.conf-ldap userdb: driver: static args: uid=vmail gid=mail home=/var/spool/maildir/%1Ln/%Ln socket: type: listen master: path: /var/run/dovecot/imap-server-its/auth-master mode: 384 user: vmail group: mail
On Wed, 2009-01-07 at 08:32 -0800, Jack Stewart wrote:
Timo Sirainen wrote:
On Tue, 2009-01-06 at 23:11 -0800, Jack Stewart wrote:
an 6 01:21:02 earth-griffen dovecot: IMAP(zabala): Timeout while waiting for lock for transaction log file /var/spool/dovecot/indexes/z/zabala/.INBOX/dovecot.index.log
This is the main problem. So indexes are also on NFS? What locking are they using (dovecot -n output)?
Yes, the indexes are also on NFS.
The locking is fcntl() - the default.
I'm guessing that's the problem. NFS locking seems to break/hang randomly sometimes. Can you somehow restart the NFS server locking daemon?
Timo Sirainen wrote:
On Wed, 2009-01-07 at 08:32 -0800, Jack Stewart wrote:
Timo Sirainen wrote:
an 6 01:21:02 earth-griffen dovecot: IMAP(zabala): Timeout while waiting for lock for transaction log file /var/spool/dovecot/indexes/z/zabala/.INBOX/dovecot.index.log This is the main problem. So indexes are also on NFS? What locking are
On Tue, 2009-01-06 at 23:11 -0800, Jack Stewart wrote: they using (dovecot -n output)?
Yes, the indexes are also on NFS.
The locking is fcntl() - the default.
I'm guessing that's the problem. NFS locking seems to break/hang randomly sometimes. Can you somehow restart the NFS server locking daemon?
restarting the server would be tough but I'll see what can be done. It's an EMC Celerra with head failover so in theory it should be straight forward.
There is another red flag. The Celerra is configured for snapshots and the problem started around the time one of the snapshots occurred. I'm fairly sure that the snapshots use file or volume locking.
Jack Stewart wrote:
Yes, the indexes are also on NFS.
The locking is fcntl() - the default.
I'm guessing that's the problem. NFS locking seems to break/hang randomly sometimes. Can you somehow restart the NFS server locking daemon?
I changed the /etc/hosts.allow so that any connection from the server is allowed (i.e. ALL: server_ip). This may only be specific to redhat enterprise 5. Since making this change the error message in the log files has gone away - not only for our IMAP servers but also for some Xen servers that were logging the same errors.
The core problem appears to be that portmapper/nlockmgr uses tcpwrappers or /etc/hosts.allow to determine if connections are accepted.
On occasion, the NFS server initiates a connection to nlockmgr on the client. It appears this communication is used to make sure locking information is accurate.
I have not found a bullet proof method of restricting the ports for nlockmgr so 'ALL: server_ip' ensures that a communication to nlockmgr is possible. IP Tables still apply so the risk to the system is minimal.
A strange problem, but I figure that people should know. Not as annoying as the out of the box telnet vulnerability in Solaris 10, but close.
---Jack
participants (2)
-
Jack Stewart
-
Timo Sirainen