[Dovecot] Dovecot 1.1rc3 "Out of Memory" crashes from pop3-login?
We recently began seeing server crashes in our cluster related to "pop3-login", which is causing "oom-killer" to be invoked. The server only recovers after a reboot.
From dovecot.conf:
login_process_per_connection = yes
login_processes_count = 16
login_max_processes_count = 256
(Should we try switching processes_per_connection to no?)
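For reference, the effective login settings can be double-checked on a node; a minimal sketch, assuming Dovecot 1.1's -n/-a config-dump options behave as expected:

  # show settings that differ from the built-in defaults
  dovecot -n | grep login_
  # show every setting, including defaults such as login_process_size
  dovecot -a | grep login_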
From dovecot.log:
dovecot: Apr 17 07:48:39 Info: POP3(AAAAAAA): Disconnected: Logged out top=0/0, retr=0/0, del=0/154, size=10250932
dovecot: Apr 17 07:48:40 Info: POP3(BBBBBBB): Disconnected: Logged out top=0/0, retr=2/3729, del=2/2, size=3695
dovecot: Apr 17 07:48:40 Info: imap-login: Login: user=<CCCCCC>, method=PLAIN, rip=10.20.10.129
dovecot: Apr 17 07:48:41 Info: IMAP(CCCCCCC): Disconnected: Logged out bytes=551/7888
dovecot: Apr 17 07:49:06 Error: child 20678 (login) killed with signal 9
dovecot: Apr 17 07:49:12 Info: pop3-login: Login: user=<DDDDDDDD>, method=PLAIN, rip=66.165.6.233
dovecot: Apr 17 07:49:12 Info: pop3-login: Login: user=<EEEEEEEE>, method=PLAIN, rip=71.222.224.38
From /var/log/messages:
Apr 17 07:48:42 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:48:42 mail2 kernel: [<c0450ace>] out_of_memory+0x3b/0x179
Apr 17 07:48:42 mail2 kernel: [<c0451f01>] __alloc_pages+0x1fe/0x27e
Apr 17 07:48:42 mail2 kernel: [<c05c6696>] tcp_sendmsg+0x4e7/0x994
Apr 17 07:48:42 mail2 kernel: [<c05dd85d>] inet_sendmsg+0x35/0x3f
Apr 17 07:48:42 mail2 kernel: [<c059967c>] do_sock_write+0xa3/0xaa
Apr 17 07:48:42 mail2 kernel: [<c059b653>] sock_writev+0x9b/0xb4
Apr 17 07:48:42 mail2 kernel: [<c04352a1>] autoremove_wake_function+0x0/0x2d
Apr 17 07:48:50 mail2 kernel: [<c059bd76>] lock_sock+0x8e/0x96
Apr 17 07:48:50 mail2 kernel: [<c048bba9>] inotify_d_instantiate+0x3c/0x5f
Apr 17 07:48:50 mail2 kernel: [<c05fb6bf>] _spin_lock_bh+0x8/0x18
Apr 17 07:48:50 mail2 kernel: [<c059bc63>] release_sock+0xc/0x91
Apr 17 07:48:50 mail2 kernel: [<c05fb6bf>] _spin_lock_bh+0x8/0x18
Apr 17 07:49:11 mail2 kernel: [<c059bd76>] lock_sock+0x8e/0x96
Apr 17 07:49:11 mail2 kernel: [<c04063f2>] do_IRQ+0xa5/0xae
Apr 17 07:49:11 mail2 kernel: [<c040492e>] common_interrupt+0x1a/0x20
Apr 17 07:49:11 mail2 kernel: [<c059b5b8>] sock_writev+0x0/0xb4
Apr 17 07:49:11 mail2 kernel: [<c046986a>] do_readv_writev+0x166/0x277
Apr 17 07:49:11 mail2 kernel: [<c048d240>] ep_ptable_queue_proc+0x55/0x65
Apr 17 07:49:11 mail2 kernel: [<c05c5d6b>] tcp_poll+0x19/0x127
Apr 17 07:49:11 mail2 kernel: [<c0471673>] sys_fstat64+0x1e/0x23
Apr 17 07:49:11 mail2 kernel: [<c04699b2>] vfs_writev+0x37/0x43
Apr 17 07:49:11 mail2 kernel: [<c0469df6>] sys_writev+0x3c/0x96
Apr 17 07:49:11 mail2 kernel: [<c0403e95>] sysenter_past_esp+0x56/0x79
Apr 17 07:49:11 mail2 kernel: =======================
Apr 17 07:49:11 mail2 kernel: Mem-info:
Apr 17 07:49:11 mail2 kernel: DMA per-cpu:
Apr 17 07:49:11 mail2 kernel: cpu 0 hot: high 0, batch 1 used:0
Apr 17 07:49:11 mail2 kernel: cpu 0 cold: high 0, batch 1 used:0
Apr 17 07:49:11 mail2 kernel: cpu 1 hot: high 0, batch 1 used:0
Apr 17 07:49:11 mail2 kernel: cpu 1 cold: high 0, batch 1 used:0
Apr 17 07:49:11 mail2 kernel: DMA32 per-cpu: empty
Apr 17 07:49:11 mail2 kernel: Normal per-cpu:
Apr 17 07:49:11 mail2 kernel: cpu 0 hot: high 186, batch 31 used:27
Apr 17 07:49:11 mail2 kernel: cpu 0 cold: high 62, batch 15 used:51
Apr 17 07:49:11 mail2 kernel: cpu 1 hot: high 186, batch 31 used:30
Apr 17 07:49:11 mail2 kernel: cpu 1 cold: high 62, batch 15 used:58
Apr 17 07:49:11 mail2 kernel: HighMem per-cpu:
Apr 17 07:49:11 mail2 kernel: cpu 0 hot: high 186, batch 31 used:129
Apr 17 07:49:11 mail2 kernel: cpu 0 cold: high 62, batch 15 used:6
Apr 17 07:49:11 mail2 kernel: cpu 1 hot: high 186, batch 31 used:124
Apr 17 07:49:11 mail2 kernel: cpu 1 cold: high 62, batch 15 used:10
Apr 17 07:49:11 mail2 kernel: Free pages: 2871096kB (2863824kB HighMem)
Apr 17 07:49:11 mail2 kernel: Active:90263 inactive:9647 dirty:30 writeback:0 unstable:0 free:717774 slab:215197 mapped:2666 pagetables:1253
Apr 17 07:49:11 mail2 kernel: DMA free:3588kB min:68kB low:84kB high:100kB active:0kB inactive:0kB present:16384kB pages_scanned:0 all_unreclaimable? yes
Apr 17 07:49:11 mail2 kernel: lowmem_reserve[]: 0 0 880 4848
Apr 17 07:49:11 mail2 kernel: DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Apr 17 07:49:11 mail2 kernel: lowmem_reserve[]: 0 0 880 4848
Apr 17 07:49:11 mail2 kernel: Normal free:3684kB min:3756kB low:4692kB high:5632kB active:184kB inactive:20kB present:901120kB pages_scanned:361 all_unreclaimable? yes
Apr 17 07:49:11 mail2 kernel: lowmem_reserve[]: 0 0 0 31744
Apr 17 07:49:11 mail2 kernel: HighMem free:2863824kB min:512kB low:4748kB high:8988kB active:360868kB inactive:38568kB present:4063232kB pages_scanned:0 all_unreclaimable? no
-Rich
On Apr 17, 2008, at 6:56 PM, richs@whidbey.net wrote:
We recently began seeing server crashes in our cluster related to "pop3-login", which is causing "oom-killer" to be invoked. The server only recovers after a reboot.
So oom-killer doesn't solve the issue? Then it likely has nothing to do with pop3-login; the OOM killer just selects a bad target to kill (and Dovecot happily restarts a new pop3-login process) while the real memory-eating process stays alive. Can you check with ps what process(es) are eating all the memory?
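Something along these lines usually makes the culprit visible while the box is still responsive; a sketch, assuming a procps-style ps:

  # largest resident-memory consumers first
  ps -eo pid,rss,vsz,args --sort=-rss | head -n 15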
Timo Sirainen wrote:
On Apr 17, 2008, at 6:56 PM, richs@whidbey.net wrote:
We recently began seeing server crashes in our cluster related to "pop3-login", which is causing "oom-killer" to be invoked. The server only recovers after a reboot.
So oom-killer doesn't solve the issue? Then it likely has nothing to do with pop3-login; the OOM killer just selects a bad target to kill (and Dovecot happily restarts a new pop3-login process) while the real memory-eating process stays alive. Can you check with ps what process(es) are eating all the memory?
That's a good point. Actually, oom-killer does solve the issue initially, but in every case the server eventually locks up (around 30 minutes later).
Unfortunately at this point "ps" and "top" can't run, so we haven't been able to collect much information. Here's a complete look at the "oom-killer" events:
Apr 17 07:48:42 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:11 mail2 kernel: klogd invoked oom-killer: gfp_mask=0x4d0, order=0, oomkilladj=0
Apr 17 07:49:12 mail2 kernel: klogd invoked oom-killer: gfp_mask=0x4d0, order=0, oomkilladj=0
Apr 17 07:49:12 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:12 mail2 kernel: klogd invoked oom-killer: gfp_mask=0x4d0, order=0, oomkilladj=0
Apr 17 07:49:13 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:13 mail2 kernel: klogd invoked oom-killer: gfp_mask=0x4d0, order=0, oomkilladj=0
Apr 17 07:49:13 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:13 mail2 kernel: klogd invoked oom-killer: gfp_mask=0x4d0, order=0, oomkilladj=0
Apr 17 07:49:13 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:13 mail2 kernel: Out of memory: Killed process 20771 (clamd).
Apr 17 07:49:13 mail2 kernel: pop3 invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:14 mail2 kernel: Out of memory: Killed process 20825 (exim).
Apr 17 07:49:14 mail2 kernel: imap-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:14 mail2 kernel: Out of memory: Killed process 20678 (pop3-login).
Apr 17 07:49:14 mail2 kernel: init invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 07:49:14 mail2 kernel: Out of memory: Killed process 20958 (exim).
Apr 17 07:49:14 mail2 kernel: imap-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Apr 17 08:22:42 mail2 kernel: pop3-login invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Only after this last entry, at 8:22, do Dovecot and the other processes stop responding (until we rebooted at 8:36).
It could be a coincidence that this started after we moved to Dovecot 1.1rc3, but why would "pop3-login" appear so often, and not other Dovecot processes (e.g. the plain "pop3" or "imap" workers, which should be consuming much more memory)?
We'll see if we can get any more information the next time this happens (only about twice a week at the moment).
Thanks!
-Rich
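Because ps and top stop working once memory is gone, it may help to snapshot process memory to disk ahead of time so there is data to look at after the lockup; a minimal sketch (the log path and interval are only placeholders):

  # append a timestamped memory snapshot every 60 seconds
  while true; do
    date >> /var/log/mem-snapshot.log
    ps -eo pid,rss,vsz,args --sort=-rss | head -n 20 >> /var/log/mem-snapshot.log
    sleep 60
  done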
On Apr 17, 2008, at 7:36 PM, richs@whidbey.net wrote:
Timo Sirainen wrote:
On Apr 17, 2008, at 6:56 PM, richs@whidbey.net wrote:
We recently began seeing server crashes in our cluster related to "pop3-login", which is causing "oom-killer" to be invoked. The server only recovers after a reboot.
So oom-killer doesn't solve the issue? Then it likely has nothing to do with pop3-login; the OOM killer just selects a bad target to kill (and Dovecot happily restarts a new pop3-login process) while the real memory-eating process stays alive. Can you check with ps what process(es) are eating all the memory?
That's a good point. Actually, oom-killer does solve the issue initially, but in every case the server eventually locks up (around 30 minutes later).
Unfortunately at this point "ps" and "top" can't run, so we haven't been able to collect much information. Here's a complete look at the "oom-killer" events:
What do you have login_process_size set to? The default should be 64MB, so unless you changed that, a single login process shouldn't be able to use up all the memory (but all of them at once could, of course). So maybe setting login_process_per_connection=no and login_max_processes_count=(something small) could show something useful.
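Rough worst-case arithmetic, assuming the settings quoted earlier: with login_process_per_connection = yes, login_max_processes_count = 256 and the default login_process_size of 64 MB, the login processes alone are permitted up to 256 * 64 MB = 16 GB of address space, far more than the roughly 5 GB of RAM visible in the Mem-info dump (16 MB DMA + ~880 MB Normal + ~4 GB HighMem), even though each process normally uses only a small fraction of that limit.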
Timo Sirainen wrote:
On Apr 17, 2008, at 7:36 PM, richs@whidbey.net wrote:
Timo Sirainen wrote:
We recently began seeing server crashes in our cluster related to "pop3-login", which is causing "oom-killer" to be invoked. The server only recovers after a reboot. So oom-killer doesn't solve the issue? Then it's likely it has nothing to do with pop3-login, OOM killer just selects a bad target to kill (and Dovecot happily restarts a new pop3-login process) while
On Apr 17, 2008, at 6:56 PM, richs@whidbey.net wrote: the real memory-eating process stays alive. Can you check with ps what process(es) are eating all the memory?
That's a good point. Actually, oom-killer does solve the issue initially, but in every case the server eventually locks up (around 30 minutes later).
Unfortunately at this point "ps" and "top" can't run, so we haven't been able to collect much information. Here's a complete look at the "oom-killer" events:
What do you have login_process_size set to? The default should be 64MB, so unless you changed that, a single login process shouldn't be able to use up all the memory (but all of them at once could, of course). So maybe setting login_process_per_connection=no and login_max_processes_count=(something small) could show something useful.
login_process_size was commented out, so it must've been the 64MB default. We'll try what you suggested and let you know what we see. Thanks, Timo!
-Rich
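For reference, the change being tried would look something like this in dovecot.conf; the process counts below are only illustrative values, not taken from the thread:

  login_process_per_connection = no
  # a handful of long-lived login processes, so a leaking one stands out in ps
  login_processes_count = 4
  login_max_processes_count = 8
  # per-process VSZ limit in MB; left at the 64 MB default
  #login_process_size = 64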