[Dovecot] Defunct processes w/ v1.2.1
I have limited information at the moment (currently distracted by other projects) but I'm hoping that someone has already found and solved this problem.
I have a v1.2.1 installed on my development box waiting for me to tinker with the configuration. On several occasions, I have noticed that all the dovecot processes are listed as "defunct". I don't recall seeing anything significant in the logs the last time I checked. The system is not under any load.
Has anyone else encountered this before? If not, I will dig deeper. BTW, ignore the fd limit warning below; limits are set in the init script.
# /local/bin/dovecot -n # 1.2.1: /local/pkg/dovecot-1.2.1/root/etc/dovecot.conf Warning: fd limit 1024 is lower than what Dovecot can use under full load (more than 10240). Either grow the limit or change login_max_processes_count and max_mail_processes settings # OS: Linux 2.4.21-58.ELsmp i686 Red Hat Enterprise Linux AS release 3 (Taroon Update 9) listen: localhost ssl_listen: * ssl_cert_file: /usr/share/ssl/certs/mail.encs.pem ssl_key_file: /usr/share/ssl/certs/mail.encs.pem login_dir: /var/run/dovecot/login login_executable: /local/pkg/dovecot-1.2.1/root/libexec/dovecot/imap-login login_processes_count: 32 login_max_processes_count: 2048 max_mail_processes: 8192 first_valid_uid: 200 mail_location: mbox:~/mail:INBOX=/var/spool/mail/%1u/% u:INDEX=/local/data/dovecot/indexes/mail/%1u/%u namespace: type: private separator: / inbox: yes list: yes subscriptions: yes namespace: type: private separator: / prefix: mail/ hidden: yes list: no subscriptions: yes namespace: type: private separator: / prefix: ~/mail/ hidden: yes list: no subscriptions: yes namespace: type: private separator: / prefix: ~%u/mail/ hidden: yes list: no subscriptions: yes auth default: passdb: driver: pam userdb: driver: passwd
-- Chris O'Regan chris@encs.concordia.ca Senior Unix Systems Administrator, Academic IT Services Faculty of Engineering and Computer Science Concordia University, Montreal, Canada
On Tue, 2009-07-21 at 14:45 -0400, Chris O'Regan wrote:
I have a v1.2.1 installed on my development box waiting for me to tinker with the configuration. On several occasions, I have noticed that all the dovecot processes are listed as "defunct". I don't recall seeing anything significant in the logs the last time I checked. The system is not under any load.
What exactly is "all"? Do you mean just dovecot-auth processes, or do you really mean everything, including imap-login, dovecot-auth, imap, pop3, ..?
If it's only dovecot-auth processes, it's probably a PAM issue. If it's more than dovecot-auth, is everything still anyway working?
On Tue, 2009-07-21 at 15:08 -0400, Timo Sirainen wrote:
On Tue, 2009-07-21 at 14:45 -0400, Chris O'Regan wrote:
I have a v1.2.1 installed on my development box waiting for me to tinker with the configuration. On several occasions, I have noticed that all the dovecot processes are listed as "defunct". I don't recall seeing anything significant in the logs the last time I checked. The system is not under any load.
What exactly is "all"? Do you mean just dovecot-auth processes, or do you really mean everything, including imap-login, dovecot-auth, imap, pop3, ..?
If it's only dovecot-auth processes, it's probably a PAM issue. If it's more than dovecot-auth, is everything still anyway working?
It just happened again:
root 2904 1 0 14:32 ? 00:00:00 /local/bin/dovecot root 2907 2904 0 14:32 ? 00:00:00 dovecot-auth root 2919 2904 0 14:32 ? 00:00:00 dovecot-auth -w dovecot 7026 2904 0 15:02 ? 00:00:00 [imap-login <defunct>] [and many other imap-login processes, all defunct]
If I telnet to "imap", a connection is established but there is no greeting. I get "connection refused" if I try to access "imaps" from Thunderbird.
No errors in the logs, except for some messages that I believe are related to an internal Nessus scan (repeated a few times):
Jul 21 15:01:45 XXX dovecot: imap-login: Disconnected (no auth attempts): rip=XXX.XXX.XXX.XXX, lip=XXX.XXX.XXX.XXX, TLS handshaking: Disconnected Jul 21 15:02:20 XXX dovecot: imap-login: Disconnected (no auth attempts): rip=XXX.XXX.XXX.XXX, lip=XXX.XXX.XXX.XXX, TLS handshaking: SSL_accept() failed: error:1406B0CB:SSL routines:GET_CLIENT_MASTER_KEY:peer error no cipher
Hrmmm...look at the time of the imap-login process...that can't be a coincidence. Nessus is scanning our production IMAP servers, too, but as I mentioned, they are running an older version of Dovecot, namely v1.1.14.
-- Chris O'Regan chris@encs.concordia.ca Senior Unix Systems Administrator, Academic IT Services Faculty of Engineering and Computer Science Concordia University, Montreal, Canada
On Tue, 2009-07-21 at 15:32 -0400, Chris O'Regan wrote:
What exactly is "all"? Do you mean just dovecot-auth processes, or do you really mean everything, including imap-login, dovecot-auth, imap, pop3, ..?
If it's only dovecot-auth processes, it's probably a PAM issue. If it's more than dovecot-auth, is everything still anyway working?
It just happened again:
root 2904 1 0 14:32 ? 00:00:00 /local/bin/dovecot root 2907 2904 0 14:32 ? 00:00:00 dovecot-auth root 2919 2904 0 14:32 ? 00:00:00 dovecot-auth -w dovecot 7026 2904 0 15:02 ? 00:00:00 [imap-login <defunct>] [and many other imap-login processes, all defunct]
OK, so only imap-login processes are defunct.
If I telnet to "imap", a connection is established but there is no greeting.
But it anyway says "connected to localhost" and only after that hangs?
What does it say if you strace dovecot process?
It just happened again:
root 2904 1 0 14:32 ? 00:00:00 /local/bin/dovecot root 2907 2904 0 14:32 ? 00:00:00 dovecot-auth root 2919 2904 0 14:32 ? 00:00:00 dovecot-auth -w dovecot 7026 2904 0 15:02 ? 00:00:00 [imap-login <defunct>] [and many other imap-login processes, all defunct]
OK, so only imap-login processes are defunct.
That appears to be the case.
BTW, I have 74 connections from our Nessus server to imaps that are in the CLOSE_WAIT state.
If I telnet to "imap", a connection is established but there is no greeting.
But it anyway says "connected to localhost" and only after that hangs?
Correct:
# telnet 0 imap Trying 0.0.0.0... Connected to 0 (0.0.0.0). Escape character is '^]'.
No greeting, and not response if I try to send it commands.
What does it say if you strace dovecot process?
I can't strace the imap-login processes because they are zombies. I have traced the other processes, connecting to localhost:imap and then disconnecting.
# parent dovecot process
Process 2904 attached - interrupt to quit
write(9, "\21\0\0\0\0\0\0\0\1\0\0\0\253\33\0\0\1\1\0\0\0\0\0\0\2"...,
128
# dovecot-auth
Process 2907 attached - interrupt to quit
gettimeofday({1248207577, 195021}, {240, 0}) = 0
gettimeofday({1248207577, 195094}, NULL) = 0
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=7,
events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=0, events=POLLIN|
POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=3, events=POLLIN|POLLPRI|POLLERR|
POLLHUP|POLLNVAL}], 4, 818) = 0
gettimeofday({1248207578, 23147}, {240, 0}) = 0
gettimeofday({1248207578, 23213}, NULL) = 0
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=7,
events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=0, events=POLLIN|
POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=3, events=POLLIN|POLLPRI|POLLERR|
POLLHUP|POLLNVAL}], 4, 5000) = 0
gettimeofday({1248207583, 33171}, {240, 0}) = 0
gettimeofday({1248207583, 33238}, NULL) = 0
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=7,
events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=0, events=POLLIN|
POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=3, events=POLLIN|POLLPRI|POLLERR|
POLLHUP|POLLNVAL}], 4, 5000) = 0
gettimeofday({1248207588, 43177}, {240, 0}) = 0
gettimeofday({1248207588, 43245}, NULL) = 0
poll([{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=7,
events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=0, events=POLLIN|
POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=3, events=POLLIN|POLLPRI|POLLERR|
POLLHUP|POLLNVAL}], 4, 5000) = 0
gettimeofday({1248207593, 53183}, {240, 0}) = 0
gettimeofday({1248207593, 53249}, NULL) = 0
poll(
# dovecot-auth -w
Process 2919 attached - interrupt to quit
gettimeofday({1248207639, 375100}, {240, 0}) = 0
poll(
-- Chris O'Regan chris@encs.concordia.ca Senior Unix Systems Administrator, Academic IT Services Faculty of Engineering and Computer Science Concordia University, Montreal, Canada
On Tue, 2009-07-21 at 16:25 -0400, Chris O'Regan wrote:
# parent dovecot process Process 2904 attached - interrupt to quit write(9, "\21\0\0\0\0\0\0\0\1\0\0\0\253\33\0\0\1\1\0\0\0\0\0\0\2"..., 128
Process 2904 detached
Interesting. The dovecot master process is stuck. What fd is 9 (or whatever it's the next time it's hanging)? You can check it with:
ls -l /proc/2904/fd/9
On Tue, 2009-07-21 at 16:28 -0400, Timo Sirainen wrote:
On Tue, 2009-07-21 at 16:25 -0400, Chris O'Regan wrote:
# parent dovecot process Process 2904 attached - interrupt to quit write(9, "\21\0\0\0\0\0\0\0\1\0\0\0\253\33\0\0\1\1\0\0\0\0\0\0\2"..., 128
Process 2904 detached Interesting. The dovecot master process is stuck. What fd is 9 (or whatever it's the next time it's hanging)? You can check it with:
ls -l /proc/2904/fd/9
Or that'll probably just show "socket:[12345]"? gdb backtrace is probably more helpful:
gdb -p 2904 bt full
On Tue, 2009-07-21 at 16:25 -0400, Chris O'Regan wrote:
# parent dovecot process Process 2904 attached - interrupt to quit write(9, "\21\0\0\0\0\0\0\0\1\0\0\0\253\33\0\0\1\1\0\0\0\0\0\0\2"..., 128
Process 2904 detached
This should fix it: http://hg.dovecot.org/dovecot-1.2/rev/a878a0897eb9
Hopefully sigprocmask() is implemented widely enough nowadays.
On Tue, 2009-07-21 at 17:16 -0400, Timo Sirainen wrote:
On Tue, 2009-07-21 at 16:25 -0400, Chris O'Regan wrote:
# parent dovecot process Process 2904 attached - interrupt to quit write(9, "\21\0\0\0\0\0\0\0\1\0\0\0\253\33\0\0\1\1\0\0\0\0\0\0\2"..., 128
Process 2904 detached This should fix it: http://hg.dovecot.org/dovecot-1.2/rev/a878a0897eb9
Hopefully sigprocmask() is implemented widely enough nowadays.
Wow, 2.5 hours from problem report to patch! I will give this a try, but not right now. I seem to have the reverse-midas touch and I think it's best if I stop being root for the day. :-/
Many, many thanks!
-- Chris O'Regan chris@encs.concordia.ca Senior Unix Systems Administrator, Academic IT Services Faculty of Engineering and Computer Science Concordia University, Montreal, Canada
On 7/21/2009, Chris O'Regan (chris@encs.concordia.ca) wrote:
Wow, 2.5 hours from problem report to patch!
Yeah, I was a little disappointed in how long that one took. Timo is slowing down since he moved into his new job - they must have him working on other things now.
;)
j/k of course...
--
Best regards,
Charles
This should fix it: http://hg.dovecot.org/dovecot-1.2/rev/a878a0897eb9
Hopefully sigprocmask() is implemented widely enough nowadays.
I've confirmed that Nessus was triggering the problem. After applying the patch, Dovecot survived a Nessus scan! Many, many thanks for fixing this so quickly!
-- Chris O'Regan chris@encs.concordia.ca Senior Unix Systems Administrator, Academic IT Services Faculty of Engineering and Computer Science Concordia University, Montreal, Canada
participants (3)
-
Charles Marcus
-
Chris O'Regan
-
Timo Sirainen