Greetings dovecot mailing list.
I have implemented a relatively big dovecot setup (250k users) and overall I am very pleased with dovecot functionality and performance.
Setup description:
- dovecot 1.0.x
- FreeBSD 6.3
- Postfix (using dovecot deliver as LDA).
- OpenLDAP backend
- Storage is NFS (Clariion EMC NFSd for Maildir, and FreeBSD NFSd for Indexes).
- Locking is fcntl using rpc.lockd.
- Users are accessing mail using POP3 and IMAP (IMAP mainly via Squirrelmail, but also direct)
- 3 frontends for POP/SMTP and 2 frontends for IMAP (webmail). Round Robin DNS
My problem:
I am having issues where POP3, IMAP and deliver processes get stuck, apparently waiting on a device.
fstat shows:
bash# fstat -p 93522
USER    CMD   PID    FD    MOUNT       INUM     MODE        SZ|DV   R/W
302870  pop3  93522  root  /           2        drwxr-xr-x  512     r
302870  pop3  93522  wd    /home/mnt5  51592    drwxr-xr-x  80      r
302870  pop3  93522  text  /usr        121619   -r-xr-xr-x  436616  r
302870  pop3  93522  0*    internet stream tcp
302870  pop3  93522  1*    internet stream tcp
302870  pop3  93522  2*    pipe c778aa48 <-> c778a990       0       rw
302870  pop3  93522  3     /dev        24       crw-rw-rw-  random  r
302870  pop3  93522  5*    pipe ce440b28 <-> ce440be0       0       rw
302870  pop3  93522  6*    pipe ce440be0 <-> ce440b28       0       rw
302870  pop3  93522  7     /home/mnt5  9010290  -rw-------  1493    rw
302870  pop3  93522  8     -           -        bad         -
302870  pop3  93522  9     -           -        bad         -
302870  pop3  93522  10    -           -        bad
And the inode in question on /home/mnt5 is a dot-nfs file, indicating a stale lock:
bash# ls -li | grep 9010290
9010290 -rw------- 1 302870 42 1493 Apr 3 18:05 .nfs.0668c236.6d524.4
ktrace on the pid shows absolutely no activity.
The pop3 process is unkillable, and I end up with pop3 processes from that user stacking up, as well as deliver processes for that user. Not healthy. I was under the impression that POP3 would exit when a lock is set, preventing more than one pop3 process per user, but that doesn't seem to be the case.
Stopping dovecot entirely leaves these stale pop3/imap/deliver processes hanging, even with shutdown_clients = yes.
The Windows-style solution (reboot) seems to be the only way to get rid of the locked processes.
So: has anyone else observed this behavior, and perhaps found the magic cure?
I wonder if there is a way to implement a "max wall-clock time" per dovecot process type (e.g. terminate the process after 120 sec for deliver, 600 sec for pop3, etc.), as a crude "garbage collector".
Any hints/suggestions are welcome.
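To make the idea concrete, here is a rough sketch of such a watchdog as an external script, not anything built into dovecot. The LIMITS table, the function names and the use of `ps -axo pid,etime,comm` are assumptions for illustration only. Note also that a process blocked in an uninterruptible NFS wait would most likely ignore the signal as well, so at best this would clean up processes that are still killable.

```python
import os
import signal
import subprocess

# Hypothetical wall-clock limits in seconds, keyed by command name
# (the values are the examples mentioned above, not tested settings).
LIMITS = {"deliver": 120, "pop3": 600}


def parse_etime(etime):
    """Convert a ps 'etime' value ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:          # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_overdue(ps_output, limits=LIMITS):
    """Return PIDs whose elapsed time exceeds the limit for their command."""
    overdue = []
    for line in ps_output.splitlines()[1:]:   # skip the header line
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        pid, etime, comm = fields
        limit = limits.get(comm.strip())
        if limit is not None and parse_etime(etime) > limit:
            overdue.append(int(pid))
    return overdue


def reap():
    """Send SIGTERM to every overdue process (assumed to run as root)."""
    out = subprocess.run(
        ["ps", "-axo", "pid,etime,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in find_overdue(out):
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # process exited between the ps call and the kill
```

Something like this could call reap() from cron every minute; whether SIGTERM (or even SIGKILL) has any effect on the stuck processes is exactly the open question.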
-- Søren Schrøder