Hi all,
I am getting to you because I am pretty desperate by now. Please read this and share your ideas.
I set up a new Mail server with roughly 750 active accounts on it, based on the usual dovecot+postfix+amavis+sql based vmail setup. This is not the first time I did something like this, but admittedly my largest setup so far.
I took my time to setup everything pretty straight and stable and migrated all accounts in time with almost no glitches using imapc and pop3c. Things started out smoothly - until the first business morning when the log suddenly began to pop messages like these:
dovecot: master: Error: service(imap): fork() failed: Resource temporarily unavailable dovecot: master: Error: service(pop3): fork() failed: Resource temporarily unavailable
The error is not confined to a certain service. There are no other messages in dmesg or syslog which could help in closing in on the root cause.
The server always runs at load around 0.4 to 0.5 and has plenty of RAM and Disk, too.
Of course I checked resource limits and cranked up everything I could possibly think of to numbers that should easily suffice:
/etc/security/limits.conf: daemon soft nproc 2048 daemon hard nproc 3128 dovecot soft nproc 2048 dovecot hard nproc 3128 vmail soft nproc 2048 vmail hard nproc 3128 vmail soft nofile 32768 vmail hard nofile 49152
/etc/systemd/system/dovecot.service.d/ulimits.conf: [Service] LimitNOFILE=65534
However, all my attempts on resolving the problem were futile so far.
To work around the problem I implemented a log watcher which constantly looks for the fork failure and dumps some debug info before restarting dovecot to make the problem "go away" - until dovecot hits whatever barrier there is again.
Here are today's numbers I gathered from this:
forkdump_03-01_09-42: pop=3, imap=446, all=503, postfix:22, sys: 716 forkdump_03-01_10-09: pop=9, imap=444, all=510, postfix:28, sys: 741 forkdump_03-01_10-42: pop=5, imap=450, all=507, postfix:25, sys: 748 forkdump_03-01_11-18: pop=5, imap=451, all=512, postfix:29, sys: 757 forkdump_03-01_11-48: pop=3, imap=474, all=511, postfix:37, sys: 765 forkdump_03-01_12-23: pop=7, imap=444, all=511, postfix:22, sys: 752 forkdump_03-01_12-53: pop=3, imap=444, all=503, postfix:24, sys: 747 forkdump_03-01_13-21: pop=3, imap=458, all=504, postfix:31, sys: 766 forkdump_03-01_13-41: pop=4, imap=451, all=505, postfix:23, sys: 755 forkdump_03-01_14-35: pop=5, imap=436, all=502, postfix:31, sys: 768 forkdump_03-01_14-59: pop=3, imap=445, all=501, postfix:27, sys: 764 forkdump_03-01_15-28: pop=6, imap=452, all=507, postfix:25, sys: 769
There you see the number of pop3 and imap childs in the process list as well as the total number of all dovecot processes and all postfix processes, summarized directly after the fork problem appears in the log. The last column is the sum of all processes in the system.
Here is a run where the numbers are measured as the server approaches the last entry in the list above.
15:26:15: pop=3, imap=432, all=485, VSZ: 14353k, postfix:30, sys: 749 15:26:30: pop=4, imap=435, all=490, VSZ: 14497k, postfix:30, sys: 754 15:26:45: pop=3, imap=439, all=492, VSZ: 14560k, postfix:28, sys: 754 15:27:00: pop=3, imap=444, all=497, VSZ: 14711k, postfix:28, sys: 760 15:27:15: pop=4, imap=442, all=496, VSZ: 14673k, postfix:25, sys: 756 15:27:30: pop=4, imap=444, all=497, VSZ: 14721k, postfix:25, sys: 757 15:27:46: pop=3, imap=446, all=498, VSZ: 14756k, postfix:25, sys: 758 15:28:01: pop=3, imap=446, all=497, VSZ: 14718k, postfix:24, sys: 756 15:28:16: pop=3, imap=455, all=507, VSZ: 15018k, postfix:25, sys: 767 15:28:18: pop=3, imap=455, all=507, VSZ: 15018k, postfix:25, sys: 767 15:28:20: pop=3, imap=452, all=504, VSZ: 14931k, postfix:25, sys: 764 ---- service(pop3): fork() failed: Resource temporarily unavailable ---- 15:28:22: pop=4, imap=452, all=505, VSZ: 14954k, postfix:25, sys: 770 15:28:24: pop=4, imap=452, all=505, VSZ: 14954k, postfix:25, sys: 770 15:28:26: pop=4, imap=451, all=504, VSZ: 14925k, postfix:25, sys: 769 15:28:28: pop=5, imap=451, all=506, VSZ: 14980k, postfix:25, sys: 771 ---- dovecot restart complete ---- 15:28:30: pop=3, imap=11, all=22, VSZ: 485k, postfix:25, sys: 282 15:29:00: pop=4, imap=65, all=83, VSZ: 2293k, postfix:26, sys: 344 15:29:31: pop=3, imap=91, all=111, VSZ: 3129k, postfix:27, sys: 373
It always seems to happen shortly after the number of dovecot processes goes beyond 500. The VSZ column is the VSZ sum of all dovecot procs.
The number of files opened by vmail (lsof -u <vmailuid> | wc -l) at that moment usually is between 13k and 15k.
sysctl fs.file-nr mostly is around 16k total but the system limit is set way higher (at 1532853)
However, as you can see in the following dump, there is no process_limit or whatever being set to 500 or 512 or similar.
# --------- su vmail -c "ulimit -Ha:" ---------------
core file size (blocks, -c) unlimited data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 60026 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 49152 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) unlimited cpu time (seconds, -t) unlimited max user processes (-u) 3128 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited
# ---------------- dovecot -n (mildly anonymized) ---------------
# 2.2.18: /etc/dovecot/dovecot.conf # Pigeonhole version 0.4.8 (0c4ae064f307+) # OS: Linux 4.4.0-8-generic x86_64 Ubuntu Xenial Xerus (development branch) xfs auth_cache_negative_ttl = 0 auth_cache_size = 50 M auth_mechanisms = plain login default_client_limit = 10000 dict { quota = mysql:/etc/dovecot/dovecot-dict-sql.conf.ext } hostname = my.mail.host imap_client_workarounds = delay-newmail lda_mailbox_autocreate = yes lda_mailbox_autosubscribe = yes lda_original_recipient_header = X-Original-To listen = * mail_location = sdbox:/mailsrv/users/%d/%u mail_plugins = quota notify mail_log fts managesieve_notify_capability = mailto managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy include variables body enotify environment mailbox date index ihave duplicate namespace inbox { inbox = yes location = mailbox Drafts { auto = subscribe special_use = \Drafts } mailbox Junk { auto = subscribe special_use = \Junk } mailbox Sent { auto = subscribe special_use = \Sent } mailbox Trash { auto = subscribe special_use = \Trash } prefix = } passdb { args = /etc/dovecot/dovecot-sql.conf.ext driver = sql } plugin { fts_autoindex = yes quota = dict:User quota::proxy::quota sieve = file:~/sieve;active=~/.dovecot.sieve sieve_dir = ~/sieve stats_refresh = 30s stats_track_cmds = yes } postmaster_address = postmaster@my.mail.host protocols = " imap lmtp sieve pop3" service anvil { client_limit = 11227 } service auth { client_limit = 11324 unix_listener /mailsrv/spool/private/auth { group = postfix mode = 0666 user = postfix } } service dict { unix_listener dict { mode = 0600 user = vmail } } service imap-login { process_limit = 10000 process_min_avail = 5 service_count = 0 vsz_limit = 2 G } service imap { process_limit = 2048 vsz_limit = 2 G } service lmtp { inet_listener lmtp { address = 127.0.0.1 port = 24 } } service managesieve-login { inet_listener sieve { port = 4190 } } service pop3-login { process_limit = 1024 process_min_avail = 3 service_count = 0 vsz_limit = 2 G } service pop3 { process_limit = 1024 } service stats { fifo_listener stats-mail { mode = 0600 user = vmail } } ssl_cert =
So, here I am, out of ideas for the moment. Do you have any?
-- kind regards, Jonathan