Timo Sirainen wrote:
> On Apr 17, 2008, at 6:56 PM, richs@whidbey.net wrote:
>> We recently began seeing server crashes in our cluster related to "pop3-login", which is causing "oom-killer" to be invoked. The server only recovers after a reboot.
> So oom-killer doesn't solve the issue? Then it's likely it has nothing to do with pop3-login, OOM killer just selects a bad target to kill (and Dovecot happily restarts a new pop3-login process) while the real memory-eating process stays alive. Can you check with ps what process(es) are eating all the memory?
That's a good point. Actually, oom-killer does solve the issue initially, but in every case the server eventually locks up (around 30 minutes later).
Unfortunately at this point "ps" and "top" can't run, so we haven't been able to collect much information. Here's a complete look at the "oom-killer" events:
Apr 17 07:48:42 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:11 mail2 kernel: klogd invoked oom-killer: gfp_mask=0x4d0, order=0, oomkilladj=0
Apr 17 07:49:12 mail2 kernel: klogd invoked oom-killer: gfp_mask=0x4d0, order=0, oomkilladj=0
Apr 17 07:49:12 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:12 mail2 kernel: klogd invoked oom-killer: gfp_mask=0x4d0, order=0, oomkilladj=0
Apr 17 07:49:13 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:13 mail2 kernel: klogd invoked oom-killer: gfp_mask=0x4d0, order=0, oomkilladj=0
Apr 17 07:49:13 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:13 mail2 kernel: klogd invoked oom-killer: gfp_mask=0x4d0, order=0, oomkilladj=0
Apr 17 07:49:13 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:13 mail2 kernel: Out of memory: Killed process 20771 (clamd).
Apr 17 07:49:13 mail2 kernel: pop3 invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:14 mail2 kernel: Out of memory: Killed process 20825 (exim).
Apr 17 07:49:14 mail2 kernel: imap-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:14 mail2 kernel: Out of memory: Killed process 20678 (pop3-login).
Apr 17 07:49:14 mail2 kernel: init invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:14 mail2 kernel: Out of memory: Killed process 20958 (exim).
Apr 17 07:49:14 mail2 kernel: imap-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 08:22:42 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Only after this last entry, at 8:22, did Dovecot and the other processes stop responding (until we rebooted at 8:36).
It could be a coincidence that this started after we moved to Dovecot 1.1rc3, but why would "pop3-login" appear so often, and not other Dovecot processes (e.g. the plain "pop3" or "imap" workers, which should be consuming much more memory)?
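To put numbers on that, a small Python sketch along these lines could tally which process names show up as having invoked oom-killer and which processes were actually killed. As far as I understand it, the name before "invoked oom-killer" is the process whose allocation request triggered the OOM killer, which is not necessarily the process holding the memory. The function name summarize_oom and the embedded SAMPLE text are only illustrative; the regular expressions assume the syslog format shown in the log above.

```python
import re
from collections import Counter

# Matches e.g. "kernel: pop3-login invoked oom-killer:"
INVOKED = re.compile(r"kernel: (\S+) invoked oom-killer")
# Matches e.g. "Out of memory: Killed process 20771 (clamd)."
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")

# Three entries copied from the log above, used only as demo input.
SAMPLE = (
    "Apr 17 07:49:13 mail2 kernel: klogd invoked oom-killer: "
    "gfp_mask=0x4d0, order=0, oomkilladj=0\n"
    "Apr 17 07:49:13 mail2 kernel: pop3-login invoked oom-killer: "
    "gfp_mask=0xd0, order=0, oomkilladj=0\n"
    "Apr 17 07:49:13 mail2 kernel: Out of memory: "
    "Killed process 20771 (clamd).\n"
)


def summarize_oom(log_text):
    """Return (Counter of invoking process names, list of (pid, name) killed)."""
    invokers = Counter(INVOKED.findall(log_text))
    killed = [(int(pid), name) for pid, name in KILLED.findall(log_text)]
    return invokers, killed


if __name__ == "__main__":
    invokers, killed = summarize_oom(SAMPLE)
    for name, count in invokers.most_common():
        print("%-12s invoked oom-killer %d time(s)" % (name, count))
    for pid, name in killed:
        print("killed pid %d (%s)" % (pid, name))
```

Because the patterns are applied with findall over the whole text, it should not matter whether the entries arrive one per line or run together.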
We'll see if we can get any more information the next time this happens (only about twice a week at the moment).
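Since "ps" and "top" are unusable once the box is in trouble, one option would be to record memory usage ahead of time, e.g. from cron every minute, so there is something to look at after the reboot. Below is a minimal sketch of that idea, assuming a Linux /proc filesystem with a VmRSS field in /proc/<pid>/status. The function names top_rss and format_snapshot, the limit of 10 and the output format are my own choices, not anything Dovecot provides.

```python
import os
import time


def top_rss(limit=10, proc_root="/proc"):
    """Return up to `limit` (rss_kb, pid, name) tuples, largest RSS first."""
    rows = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return rows  # no /proc available here, nothing to report
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        name, rss_kb = "?", None
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    if line.startswith("Name:"):
                        name = line.split(None, 1)[1].strip()
                    elif line.startswith("VmRSS:"):
                        rss_kb = int(line.split()[1])  # value is in kB
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or unreadable entry
        if rss_kb is not None:  # kernel threads report no VmRSS
            rows.append((rss_kb, int(entry), name))
    rows.sort(reverse=True)
    return rows[:limit]


def format_snapshot(rows):
    """Format rows as one timestamped line per process."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return "\n".join(
        "%s pid=%d rss_kb=%d name=%s" % (stamp, pid, rss, name)
        for rss, pid, name in rows
    )


if __name__ == "__main__":
    print(format_snapshot(top_rss()))
```

Appending the output to a file on local disk should leave a record of the largest processes in the minutes before the next lockup, which would show whether pop3-login is really the one growing.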
Thanks!
-Rich