[Dovecot] dovecot-auth stops responding
Running dovecot 1.2.4 (imap only) on AIX out of inetd. I accept
passwords (PAM for auth) and kerberos tickets. Occasionally we see
dovecot-auth stop responding. I just kill it off and another process
is spawned which works fine. It doesn't look like it is hitting any
ulimits (11MB for memory). I issued a kill -11 and ran the core through
the debugger. Here is what it says:
dbx dovecot-auth core
Type 'help' for help.
[using memory image in core]
reading symbolic information ...

Segmentation fault in __fd_poll at 0x900000000117634 ($t1)
0x900000000117634 (__fd_poll+0x98) e8410028 ld r2,0x28(r1)
(dbx) where
__fd_poll(??, ??, ??) at 0x900000000117634
ioloop-poll.poll(__listptr = 0x00000001104dd330, __nfds = 145, __timeout = 1051), line 121 in "poll.h"
io_loop_handler_run(ioloop = 0x000000011001bfb0), line 157 in "ioloop-poll.c"
io_loop_run(ioloop = 0x000000011001bfb0), line 335 in "ioloop.c"
main(argc = 1, argv = 0x0ffffffffffffad8), line 347 in "main.c"
Here are my ulimits:
ulimit -a
time(seconds) unlimited
file(blocks) 2097151
data(kbytes) 1048576
stack(kbytes) 65536
memory(kbytes) 65536
coredump(blocks) 2097151
nofiles(descriptors) 32768
Any help would be appreciated. Thanks, Jonathan
On Wed, 2009-09-09 at 14:12 -0400, Jonathan Siegle wrote:
Running dovecot 1.2.4(imap only) on AIX out of inetd. I accept
passwords(PAM for auth) and kerberos tickets. Occasionally we see
dovecot-auth stop responding. I just kill it off and another process
is spawned which works fine.
If you use PAM then I guess you should have auth worker processes too? So one dovecot-auth and 1+ "dovecot-auth -w"? Does it help if you kill off the worker processes? What if you kill imap-login processes instead?
Segmentation fault in __fd_poll at 0x900000000117634 ($t1)
0x900000000117634 (__fd_poll+0x98) e8410028 ld r2,0x28(r1)
(dbx) where
__fd_poll(??, ??, ??) at 0x900000000117634
ioloop-poll.poll(__listptr = 0x00000001104dd330, __nfds = 145,
This just shows that it was waiting for something to do. Does AIX have something like strace or truss? Use it to see what's happening when a request comes.
On Sep 9, 2009, at 4:16 PM, Timo Sirainen wrote:
On Wed, 2009-09-09 at 14:12 -0400, Jonathan Siegle wrote:
Running dovecot 1.2.4(imap only) on AIX out of inetd. I accept passwords(PAM for auth) and kerberos tickets. Occasionally we see dovecot-auth stop responding. I just kill it off and another process is spawned which works fine.
If you use PAM then I guess you should have auth worker processes too?
Yes
So one dovecot-auth and 1+ "dovecot-auth -w"?
Yes. One of each.
Does it help if you kill off the worker processes?
It only helps when I kill dovecot-auth, not dovecot-auth -w.
What if you kill imap-login processes instead?
I don't have imap-login processes associated with inetd spawned
dovecot. imap-login is what I call in inetd. I don't see them in the
process table.
Segmentation fault in __fd_poll at 0x900000000117634 ($t1)
0x900000000117634 (__fd_poll+0x98) e8410028 ld r2,0x28(r1)
(dbx) where
__fd_poll(??, ??, ??) at 0x900000000117634
ioloop-poll.poll(__listptr = 0x00000001104dd330, __nfds = 145,
This just shows that it was waiting for something to do. Does AIX have something like strace or truss? Use it to see what's happening when a request comes.
Yes I have truss. So tomorrow when this happens I'll do truss -f -p on
the dovecot-auth process?
Oh, my dovecot.conf file has this for pam:
passdb pam {
  args = max_requests=1
}
because of the expired-password problem with kerberos.
thanks, Jonathan
On Thu, 2009-09-10 at 14:32 -0400, Jonathan Siegle wrote:
It only helps when I kill dovecot-auth, not dovecot-auth -w.
Interesting..
What if you kill imap-login processes instead?
I don't have imap-login processes associated with inetd spawned
dovecot.
Why do you use inetd? I'm currently wondering if I should bother adding inetd support to Dovecot v2.0.
Yes I have truss. So tomorrow when this happens I'll do truss -f -p on
the dovecot-auth process?
Yes. Also having auth_debug=yes enabled might show something useful. What does it log last before it stops responding?
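Since only killing the main dovecot-auth process (and not "dovecot-auth -w") helps, the truss run has to attach to the right pid. The following is a minimal sketch of one way to pick it out, assuming the process list is captured as text with a pid column followed by the command line (for example from `ps -e -o pid,args`). The function names and the assumption that the processes appear as "dovecot-auth" and "dovecot-auth -w" in that listing are illustrative, not taken from Dovecot itself.

```python
from typing import List


def find_auth_master_pids(ps_output: str) -> List[int]:
    """Return pids of dovecot-auth processes that are not '-w' workers.

    Expects one process per line: a numeric pid, then the command line.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the header line and anything without a numeric pid.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        pid, argv = int(fields[0]), fields[1:]
        # Compare only the executable's base name, ignoring its directory.
        name = argv[0].rsplit("/", 1)[-1]
        if name == "dovecot-auth" and "-w" not in argv[1:]:
            pids.append(pid)
    return pids


def truss_command(pid: int) -> List[str]:
    """Build the 'truss -f -p <pid>' invocation mentioned in the thread."""
    return ["truss", "-f", "-p", str(pid)]
```

Feeding the captured process listing to `find_auth_master_pids` and passing the result to `truss_command` gives the command line to run as root while the hang is in progress.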
On Thu, Sep 10, 2009 at 11:43 AM, Timo Sirainen <tss@iki.fi> wrote:
On Thu, 2009-09-10 at 14:32 -0400, Jonathan Siegle wrote:
It only helps when I kill dovecot-auth, not dovecot-auth -w.
Interesting..
What if you kill imap-login processes instead?
I don't have imap-login processes associated with inetd spawned dovecot.
Why do you use inetd? I'm currently wondering if I should bother adding inetd support to Dovecot v2.0.
Yes I have truss. So tomorrow when this happens I'll do truss -f -p on the dovecot-auth process?
Yes. Also having auth_debug=yes enabled might show something useful. What does it log last before it stops responding?
I've run across what seems to be the same issue with 1.2.4. I upgraded from a 1.0 release which was not having any issues; however, I'm very afraid to revert because the server gets killed if it has to rebuild caches/indexes. I don't have this issue on less heavily loaded 1.2.4 installations, only on this server, which handles roughly 6000 mailboxes and maintains a higher overall load.
The difference is we aren't doing PAM, we have it disabled. We do SQL authentication only. Exact same symptoms, the server and all active connections remain online; however, new connections coming in via POP3/IMAP hang. The connection is made, but the banner is never shown.
The log, with auth_debug enabled, doesn't seem to show anything useful. When this begins happening it only shows a ton of connections being dropped for remaining idle too long, i.e. POP3/IMAP clients connect, never get a banner, and therefore never send login credentials, and dovecot eventually drops the connection.
System affected is Linux 2.6.18 (CentOS 5.2). Any help diagnosing and fixing this would be greatly appreciated. I'm not sure where to go from here.
- N
On Thu, 2009-09-10 at 12:22 -0700, Nathan M wrote:
The difference is we aren't doing PAM, we have it disabled. We do SQL authentication only. Exact same symptoms, the server and all active connections remain online; however, new connections coming in via POP3/IMAP hang. The connection is made, but the banner is never shown.
So killing dovecot-auth fixes the problem? What if you set login_process_per_connection=no?
On Thu, Sep 10, 2009 at 12:47 PM, Timo Sirainen <tss@iki.fi> wrote:
So killing dovecot-auth fixes the problem? What if you set login_process_per_connection=no?
Next time it happens I'll just try killing dovecot-auth. Thus far the fix has been fairly crude:
killall dovecot
/usr/local/sbin/dovecot
That always gets it answering. I believe by killing dovecot it takes down dovecot-auth as well.
I will also set login_process_per_connection=no and see if that clears anything up.
The issue is very random, can't predict it at all, but typically 2-3 times per day, so hopefully I'll know quickly if we're making progress.
- N
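For reference, a minimal sketch of how the two settings suggested in this thread might look in a Dovecot 1.x dovecot.conf. The setting names and values come from the messages above; the comments and the placement at the top level of the file are assumptions, so check them against your own configuration.

```
# Reuse login processes for several connections instead of one per connection
login_process_per_connection = no

# Log more detail about authentication requests
auth_debug = yes
```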
On Thu, Sep 10, 2009 at 12:47 PM, Timo Sirainen <tss@iki.fi> wrote:
On Thu, 2009-09-10 at 12:22 -0700, Nathan M wrote:
The difference is we aren't doing PAM, we have it disabled. We do SQL authentication only. Exact same symptoms, the server and all active connections remain online; however, new connections coming in via POP3/IMAP hang. The connection is made, but the banner is never shown.
So killing dovecot-auth fixes the problem? What if you set login_process_per_connection=no?
Timo, setting login_process_per_connection=no significantly stabilized the system. It was locking up 3-4 times a day, and now it seems to lock up only about once per month. Something is still wrong, but it is much less significant. I also need to update to the latest release when time allows, since two point releases have come out since this discussion.
- N
On Sep 10, 2009, at 2:43 PM, Timo Sirainen wrote:
On Thu, 2009-09-10 at 14:32 -0400, Jonathan Siegle wrote:
It only helps when I kill dovecot-auth, not dovecot-auth -w.
Interesting..
What if you kill imap-login processes instead?
I don't have imap-login processes associated with inetd spawned dovecot.
Why do you use inetd? I'm currently wondering if I should bother
adding inetd support to Dovecot v2.0.
We ran imap (Uwash) for years. Is there a reason not to run it from
inetd?
Yes I have truss. So tomorrow when this happens I'll do truss -f -p
on the dovecot-auth process?
Yes. Also having auth_debug=yes enabled might show something useful. What does it log last before it stops responding?
I found something in syslog today:
local0.log.20090916:Sep 16 11:58:01 tr27n18.aset.psu.edu dovecot: auth (default): BUG: Worker sent reply with id 1, expected 2
local0.log.20090916:Sep 16 11:58:01 tr27n18.aset.psu.edu dovecot: auth (default): worker-server(mck2,146.186.125.214): Aborted: Worker is buggy
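Because the hang is intermittent, it can help to check older syslog files for the same message. Below is a minimal sketch that scans log lines for the "Worker sent reply with id ..., expected ..." error and returns the two ids. The regular expression is written against the lines quoted in this thread only, and the function name is hypothetical; other Dovecot versions may word the message differently.

```python
import re
from typing import Iterable, List, Tuple

# Matches both "auth (default):" and "auth(default):" as seen in the thread.
MISMATCH_RE = re.compile(
    r"auth\s*\(default\): BUG: Worker sent reply with id (\d+), expected (\d+)"
)


def find_id_mismatches(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (received_id, expected_id) for each line reporting a mismatch."""
    hits = []
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits
```

Called on an open log file (a file object iterates over its lines), it returns an empty list when the error never occurred.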
On Sep 16, 2009, at 1:35 PM, Jonathan Siegle wrote:
On Sep 10, 2009, at 2:43 PM, Timo Sirainen wrote:
Yes. Also having auth_debug=yes enabled might show something useful. What does it log last before it stops responding?
I found something in syslog today:
local0.log.20090916:Sep 16 11:58:01 dovecot: auth(default): BUG: Worker sent reply with id 1, expected 2
local0.log.20090916:Sep 16 11:58:01 dovecot: auth(default): worker-server(foo,146.186.125.214): Aborted: Worker is buggy
Found thread from August ("Aborted: Worker is buggy") and am applying
patch as suggested.
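The log message says the auth process received a reply tagged with id 1 while it was waiting for id 2. The following is a minimal sketch of that kind of request/reply id check, to show how such a mismatch can arise. It is not Dovecot's code: the class and method names are hypothetical, and the scenario (a reply to an abandoned request arriving late) is only one assumed way to reach this state.

```python
from typing import Optional


class ReplyIdMismatch(Exception):
    """Raised when a reply does not carry the id the caller is waiting for."""


class WorkerConnection:
    """Tracks the id of the single outstanding request to a worker."""

    def __init__(self) -> None:
        self._last_id = 0
        self._expected: Optional[int] = None

    def send_request(self) -> int:
        # Each new request gets the next id and replaces the expected id,
        # even if the previous request was never answered.
        self._last_id += 1
        self._expected = self._last_id
        return self._last_id

    def handle_reply(self, reply_id: int) -> None:
        if reply_id != self._expected:
            raise ReplyIdMismatch(
                f"Worker sent reply with id {reply_id}, "
                f"expected {self._expected}"
            )
        self._expected = None


conn = WorkerConnection()
first = conn.send_request()   # id 1, never answered in time
second = conn.send_request()  # id 2 is now the expected reply
try:
    conn.handle_reply(first)  # the late reply to request 1 arrives
except ReplyIdMismatch as exc:
    message = str(exc)
```

In this sketch the late reply is rejected rather than matched to the wrong request, which mirrors the "Aborted: Worker is buggy" line that follows the BUG message in the log.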
On Wed, 2009-09-16 at 13:50 -0400, Jonathan Siegle wrote:
I found something in syslog today:
local0.log.20090916:Sep 16 11:58:01 dovecot: auth(default): BUG: Worker sent reply with id 1, expected 2
local0.log.20090916:Sep 16 11:58:01 dovecot: auth(default): worker-server(foo,146.186.125.214): Aborted: Worker is buggy
Found thread from August ("Aborted: Worker is buggy") and am applying
patch as suggested.
Did that fix dovecot-auth hanging?
On Oct 27, 2009, at 7:36 PM, Timo Sirainen wrote:
On Wed, 2009-09-16 at 13:50 -0400, Jonathan Siegle wrote:
I found something in syslog today:
local0.log.20090916:Sep 16 11:58:01 dovecot: auth(default): BUG: Worker sent reply with id 1, expected 2
local0.log.20090916:Sep 16 11:58:01 dovecot: auth(default): worker-server(foo,146.186.125.214): Aborted: Worker is buggy
Found thread from August ("Aborted: Worker is buggy") and am applying patch as suggested.
Did that fix dovecot-auth hanging?
We have not tested it fully yet. It should be back on the todo list
next week.
-Jonathan
On Oct 28, 2009, at 10:29 AM, Jonathan Siegle wrote:
We have not tested it fully yet. It should be back on the todo list next week.
-Jonathan
Timo, This patch is functioning without error for a few weeks now.
Thanks, Jonathan
participants (3): Jonathan Siegle, Nathan M, Timo Sirainen