[Dovecot] Significant performance problems
Hi all,
I'm sure my issues are a result of misconfiguration, but I'm hoping someone can point me in the right direction. I'm getting pressure to move us back to GroupWise, which I desperately want to avoid :-/
We're running Dovecot 1.2.9 on Ubuntu 10.04 LTS with Postfix. The server is a VM with 1 vCPU and 4GB of RAM. We serve about 10,000 users, with anywhere from 500-1000 logged in at any one time. Messages are stored in Maildir format on two NFS servers (one for staff, the other for students).
Today I implemented the "High performance" setup described here: http://wiki.dovecot.org/NFS (mainly moving indexes off of NFS, since I'm only using the one server).
I also added imapproxy to our webmail client server (SOGo). The vast majority of our users come in over the web.
We currently see load averages spiking into the 20-30 range. When this happens, service crawls to a near standstill, and ultimately the SOGo client starts crashing out.
I'm wondering if anything jumps out at anybody here - feel free to mock if/when you find an obvious configuration problem. I just want it to work :-)
dovecot -n
# 1.2.9: /etc/dovecot/dovecot.conf
# OS: Linux 2.6.32-25-server x86_64 Ubuntu 10.04.1 LTS
log_timestamp: %Y-%m-%d %H:%M:%S
protocols: imap imaps pop3 pop3s managesieve
listen(default): *
listen(imap): *
listen(pop3): *
listen(managesieve): *:2000
ssl_cert_file: /etc/dovecot/certs/mail_nhusd_k12_ca_us.crt
ssl_key_file: /etc/dovecot/certs/mail_nhusd_k12_ca_us.key
disable_plaintext_auth: no
login_dir: /var/run/dovecot/login
login_executable(default): /usr/lib/dovecot/imap-login
login_executable(imap): /usr/lib/dovecot/imap-login
login_executable(pop3): /usr/lib/dovecot/pop3-login
login_executable(managesieve): /usr/lib/dovecot/managesieve-login
login_process_per_connection: no
login_process_size: 512
login_processes_count: 20
login_max_processes_count: 3000
login_max_connections: 64
max_mail_processes: 2048
mail_max_userip_connections(default): 20
mail_max_userip_connections(imap): 20
mail_max_userip_connections(pop3): 10
mail_max_userip_connections(managesieve): 10
mail_access_groups: staffmailusers
mail_privileged_group: dovecot
mail_uid: mail
mail_gid: 502
mail_location: maildir:~/Maildir:INDEX=/var/indexes/%u
mail_nfs_storage: yes
mbox_write_locks: fcntl dotlock
mail_executable(default): /usr/lib/dovecot/imap
mail_executable(imap): /usr/lib/dovecot/imap
mail_executable(pop3): /usr/lib/dovecot/pop3
mail_executable(managesieve): /usr/lib/dovecot/managesieve
mail_plugins(default): acl imap_acl quota imap_quota expire
mail_plugins(imap): acl imap_acl quota imap_quota expire
mail_plugins(pop3):
mail_plugins(managesieve):
mail_plugin_dir(default): /usr/lib/dovecot/modules/imap
mail_plugin_dir(imap): /usr/lib/dovecot/modules/imap
mail_plugin_dir(pop3): /usr/lib/dovecot/modules/pop3
mail_plugin_dir(managesieve): /usr/lib/dovecot/modules/managesieve
namespace:
  type: private
  separator: /
  inbox: yes
  list: yes
  subscriptions: yes
namespace:
  type: shared
  separator: /
  prefix: shared/%%u/
  location: maildir:%%h/Maildir:INDEX=~/Maildir/shared/%%u
  list: children
lda:
  deliver_log_format: %$ -- FROM=%f SUBJECT=%s
  mail_plugins: cmusieve acl expire
  log_path:
  info_log_path:
  syslog_facility: mail
  postmaster_address: postmaster@nhusd.k12.ca.us
  hostname: mail.nhusd.k12.ca.us
  auth_socket_path: /var/run/dovecot/auth-master
auth default:
  passdb:
    driver: pam
  passdb:
    driver: ldap
    args: /etc/dovecot/dovecot-ldap.conf
  userdb:
    driver: ldap
    args: /etc/dovecot/dovecot-ldap.conf
  socket:
    type: listen
    master:
      path: /var/run/dovecot/auth-master
      mode: 384
plugin:
  quota: maildir:User quota
  quota_rule: *:storage=9G
  quota_rule2: Trash:storage=200M
  acl: vfile
  acl_shared_dict: file:/home/staff/dovecot/shared-mailboxes
  expire: Trash 7 Trash/* 7 Spam 30
  expire_dict: proxy::expire
  sieve: ~/.dovecot.sieve
  sieve_dir: ~/sieve
  sieve_extensions: +imapflags
dict:
  expire: mysql:/etc/dovecot/dovecot-dict-expire.conf
-- Chris Hobbs Director, Technology New Haven Unified School District
On 7.10.2010, at 0.32, Chris Hobbs wrote:
We currently see load averages spiking into the 20-30 range. When this happens, service crawls to a near standstill, and ultimately the SOGo client starts crashing out.
Is the load CPU load or disk I/O load? If I/O load, what NFS operations are peaking there, or all of them? Pretty graphs of nfsstat output would be nice.
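If you don't have that instrumented yet, something like the following on the Dovecot box would show it (assuming the standard sysstat and nfs-utils tools are installed):

  vmstat 5          # high numbers in the "wa" column mean the CPUs are mostly waiting on I/O
  iostat -x 5       # per-device utilization and await times
  nfsstat -c -3     # per-operation NFSv3 client counters; run twice and diff to see what's moving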
login_processes_count: 20
Probably could use fewer than 20.
login_max_connections: 64
And this could be higher. In general you should have maybe 1-2x the number of login processes than CPU cores.
mail_nfs_storage: yes
You said you have only one server accessing mails. So set this to "no".
mail_location: maildir:~/Maildir:INDEX=/var/indexes/%u
..
namespace:
  type: shared
  separator: /
  prefix: shared/%%u/
  location: maildir:%%h/Maildir:INDEX=~/Maildir/shared/%%u
The INDEX path here is wrong now.
Also you could try if maildir_very_dirty_syncs=yes is helpful.
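Put together, the relevant dovecot.conf changes might look roughly like this (illustrative values, not a tested drop-in; the shared-namespace INDEX path is just one possible local location):

  login_processes_count = 4            # roughly 1-2x the CPU core count
  login_max_connections = 256          # let each login process handle more connections
  mail_nfs_storage = no                # only one server accesses the maildirs
  maildir_very_dirty_syncs = yes       # worth testing
  # shared namespace, with its index also moved off NFS, e.g.:
  #   location = maildir:%%h/Maildir:INDEX=/var/indexes/shared/%%u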
On 10/6/10 5:22 PM, Timo Sirainen wrote:
Is the load CPU load or disk I/O load? If I/O load, what NFS operations are peaking there, or all of them? Pretty graphs of nfsstat output would be nice.
If I'm reading the output of our monitoring system correctly, the CPU is spending quite a bit of time in WAIT state, so I assume that means it is I/O bound?
After 19 minutes of uptime, nfsstat looks like this (I'm not monitoring it [yet], so I don't have pretty graphs :/ ):
Client nfs v3:
null         getattr      setattr      lookup       access       readlink
0         0% 20389    15% 7615      5% 42198    31% 26896    20% 58        0%
read         write        create       mkdir        symlink      mknod
6923      5% 5825      4% 4178      3% 29        0% 0         0% 0         0%
remove       rmdir        rename       link         readdir      readdirplus
2771      2% 0         0% 2500      1% 77        0% 7238      5% 1428      1%
fsstat       fsinfo       pathconf     commit
3         0% 4         0% 2         0% 5690      4%
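Until proper monitoring is in place, even a crude timestamped log is enough to graph the deltas later, e.g. something like:

  while true; do date; nfsstat -c -3; sleep 60; done >> /var/log/nfsstat-client.log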
login_processes_count: 20
Probably could use fewer than 20.

login_max_connections: 64
And this could be higher. In general you should have maybe 1-2x the number of login processes than CPU cores.

Since this is in a virtual environment, I went ahead and ramped up the number of CPUs to 4, since I have the processors to spare.

mail_nfs_storage: yes
You said you have only one server accessing mails. So set this to "no".

Done.

mail_location: maildir:~/Maildir:INDEX=/var/indexes/%u
..
namespace:
  type: shared
  separator: /
  prefix: shared/%%u/
  location: maildir:%%h/Maildir:INDEX=~/Maildir/shared/%%u
The INDEX path here is wrong now.

Fixed - luckily most of our users don't share, so this shouldn't have had a huge impact.

Also you could try if maildir_very_dirty_syncs=yes is helpful.

Done.
Will report back tomorrow on how much these fixes help. Really appreciate the effort.
-- Chris Hobbs Director, Technology New Haven Unified School District
On 10/6/2010 7:41 PM, Chris Hobbs wrote:
On 10/6/10 5:22 PM, Timo Sirainen wrote:
login_processes_count: 20
Probably could use fewer than 20.
login_max_connections: 64
And this could be higher. In general you should have maybe 1-2x the number of login processes than CPU cores.

Since this is in a virtual environment, I went ahead and ramped up the number of CPUs to 4, since I have the processors to spare.
Is your disk a virtual disk as well? Have you checked performance?
Something like:
hdparm -tT /dev/sda
On a ZFS RAID 10 of 10 7200RPM SATA drives, I get about 100MB/s.
At work we have an EMC SAN, and I get like 350MB/s on that beast - but if this is just a VHD on a Windows box, that could be an issue...
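hdparm only exercises reads; for a rough write check you could do something like this against the index volume (the path is just an example, and the scratch file is removed afterwards):

  dd if=/dev/zero of=/var/indexes/ddtest bs=1M count=512 conv=fdatasync
  rm /var/indexes/ddtest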
Rick
On 10/6/10 6:28 PM, Rick Romero wrote:
Is your disk a virtual disk as well? Have you checked performance?
Something like:
hdparm -tT /dev/sda
On a ZFS RAID 10 of 10 7200RPM SATA drives, I get about 100MB/s.
At work we have an EMC SAN, and I get like 350MB/s on that beast - but if this is just a vhd on a Windows box, that could be an issue..
The disks are virtual on an EMC iSCSI SAN. I know you're supposed to run hdparm when things are quiet, but trying it now (somewhat quiet) I get:
on the NFS server:
 Timing cached reads:   11844 MB in 2.00 seconds = 5928.81 MB/sec
 Timing buffered disk reads:  176 MB in 3.00 seconds = 58.62 MB/sec

on the dovecot server (where indexes are stored):
 Timing cached reads:   12310 MB in 2.00 seconds = 6162.13 MB/sec
 Timing buffered disk reads:  230 MB in 3.02 seconds = 76.22 MB/sec
-- Chris Hobbs Director, Technology New Haven Unified School District
imapproxy can only take you from "doesn't work" to "might as well not work", in my experience. If at all possible, look into a stateful web client.
-bdh
On Oct 6, 2010, at 6:32 PM, Chris Hobbs chobbs@nhusd.k12.ca.us wrote:
Hi all,
I'm sure my issues are a result of misconfiguration, but I'm hoping someone can point me in the right direction. I'm getting pressure to move us back to GroupWise, which I desperately want to avoid :-/
We're running dovecot 1.2.9 on Ubuntu 10.4 LTS+postfix. The server is a VM with 1 vCPU and 4GB of RAM. We serve about 10,000 users with anywhere from 500-1000 logged in at any one time. Messages are stored in Maildir format on two NFS servers (one for staff, the other for students).
Today I implemented the "High performance" setup described here: http://wiki.dovecot.org/NFS (mainly moving indexes off of NFS, since I'm only using the one server).
I also added imapproxy to our webmail client server (SOGo). The vast majority of our users come in over the web.
We currently see load averages spiking into the 20-30 range. When this happens, service crawls to a near standstill, and ultimately the SOGo client starts crashing out.
I'm wondering if anything jumps out at anybody here - feel free to mock if/when you find an obvious configuration problem. I just want it to work :-)
Chris,
-----Original Message----- Subject: [Dovecot] Significant performance problems
I'm sure my issues are a result of misconfiguration, but I'm hoping someone can point me in the right direction. I'm getting pressure to move us back to GroupWise, which I desperately want to avoid :-/
We're running dovecot 1.2.9 on Ubuntu 10.4 LTS+postfix. The server is a VM with 1 vCPU and 4GB of RAM. We serve about 10,000 users with anywhere from 500-1000 logged in at any one time. Messages are stored in Maildir format on two NFS servers (one for staff, the other for students).
Are the webmail interface and IMAP proxy also running on this server? What does memory utilization look like on the server? How much is being used by applications, and how much is free for filesystem cache? What mount options are you using on your NFS exports (on the NFS client side)?
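A quick way to check the last two of those from the Dovecot box (standard procps and nfs-utils commands):

  free -m       # the "cached" figure is roughly what's left for the filesystem cache
  nfsstat -m    # lists each NFS mount with the options it was actually mounted with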
We run 60k accounts with about 10k concurrent sessions across 12 servers. Each server has 4 cores and 8GB of RAM, and mounts 16 NFS exports spread across two servers. The servers handle close to 1k concurrent sessions each without breaking a load of 1.
The keys seem to be keeping NFS IO latency down, and allowing the server to cache as much as possible. If the Dovecot server is always having to go back to NFS for client data, and the NFS server doesn't have enough memory to cache filesystem metadata and/or spindles to access the data in a timely manner, you're going to hit a pain point pretty quick.
Try bumping up the RAM on both servers to 8+GB, and make sure that you don't have any mount options that would prevent the client from caching data - noac for example is a killer. You could also try mounting with noac, and disabling or turning down speculative readahead on the NFS server. Have you followed all of your storage vendor's block alignment guidelines when setting up the LUNs and virtual disks?
-Brad
Brandon 'Brad' Davidson
Virtualization Systems Administrator
University of Oregon Information Services
(541) 346-8098
brandond@uoregon.edu
Chris,
-----Original Message----- Subject: Re: [Dovecot] Significant performance problems
Try bumping up the RAM on both servers to 8+GB, and make sure that you don't have any mount options that would prevent the client from caching data - noac for example is a killer. You could also try mounting with noac, and disabling or turning down speculative readahead on the NFS server.
Sorry - don't try noac, try noatime! Big difference!
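For reference, a noatime NFS mount in /etc/fstab might look something like this (server name, export, and mountpoint are placeholders):

  nfs-server:/export/staff  /home/staff  nfs  rw,hard,intr,noatime  0  0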
As an additional data point, I will say that we see Dovecot processes for 900 concurrent users consume about 3GB of memory. If your system is anything like ours, you've probably got less than 1GB of memory left for the kernel to use as filesystem cache. Throw as much memory as you can spare at the Dovecot and NFS servers, and see what happens.
-Brad
For documentation's sake, here's what I've done so far:
- Implemented Timo's fixes to my config file (fixed shared INDEX, adjusted NFS settings for the reality of only one server hitting it)
- Installed imapproxy on the webmail server at the recommendation of the developers of that product (SOGo)
- Modified my NFS mount with noatime to reduce I/O hits there. Need to figure out what Brad's suggestions about readahead on the server mean (see the sketch after this list).
- Threw gobs of RAM at both the dovecot server (went from 4GB to 8) and the NFS server (from 1GB to 8). Also cranked up vCPUs on each to 4.
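On the readahead question, one knob worth looking at is the block-device readahead on the NFS server; a sketch (device name is a placeholder, and the right value depends on the workload):

  blockdev --getra /dev/sdb        # current readahead, in 512-byte sectors
  blockdev --setra 128 /dev/sdb    # turn speculative readahead down for small random I/O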
I hope that's enough to get things working much better tomorrow morning. I'll be back to report or beg for more. I really appreciate the quick responses and helpful advice.
I do have one more idea I'll throw out there. Everything I've got here is virtual. I only have the one Dovecot/Postfix server running now, and the impression I get from you all is that that should be adequate for my load. What would the collective opinion be of simply removing the NFS server altogether and mounting the virtual disk holding my messages directly to the dovecot server? I give up the ability to have a failover dovecot/postfix server, which was my motivation for using NFS in the first place, but a usable system probably trumps a redundant one.
Chris
On 10/6/10 4:32 PM, Chris Hobbs wrote:
Hi all,
I'm sure my issues are a result of misconfiguration, but I'm hoping someone can point me in the right direction. I'm getting pressure to move us back to GroupWise, which I desperately want to avoid :-/
-- Chris Hobbs Director, Technology New Haven Unified School District
On Wed, 06 Oct 2010 21:42:57 -0700, Chris Hobbs chobbs@nhusd.k12.ca.us wrote:
For documentation's sake, here's what I've done so far:
I do have one more idea I'll throw out there. Everything I've got here is virtual. I only have the one Dovecot/Postfix server running now, and the impression I get from you all is that that should be adequate for my load. What would the collective opinion be of simply removing the NFS server altogether and mounting the virtual disk holding my messages directly to the dovecot server? I give up the ability to have a failover dovecot/postfix server, which was my motivation for using NFS in the first place, but a usable system probably trumps a redundant one.
Chris
I have done some tests here that show that NFS is a major overhead compared to a local filesystem on an iSCSI volume. I have tested only NFSv4 with Linux clients and server. Finally we went with a couple of mail servers that mount an OCFS2 shared volume - this setup also has some drawbacks in terms of complexity.
You could also achieve a redundant mail system with a local FS (XFS for example) over an iSCSI volume - one server stays standby and will mount the volume and bring up a floating IP if the primary goes down. You could automate such a setup with heartbeat/pacemaker or another cluster manager. Though, in such a setup you cannot load-balance if you are serving only one mail domain.
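A minimal crm shell sketch of that idea, just to illustrate the shape of it (resource names, device, and IP are made up, and a real cluster also needs fencing):

  primitive mail_fs ocf:heartbeat:Filesystem params device=/dev/sdb1 directory=/srv/mail fstype=xfs
  primitive mail_ip ocf:heartbeat:IPaddr2 params ip=192.0.2.50 cidr_netmask=24
  group mail_svc mail_fs mail_ip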
Best regards
--
Luben Karavelov
Research and development
Spectrum Net JSC
36, D-r G. M. Dimitrov Blvd.
1797 Sofia
Mobile: +359 884332840
url: www.spnet.net
Chris,
On 10/6/10 9:42 PM, "Chris Hobbs" chobbs@nhusd.k12.ca.us wrote:
- Modified my NFS mount with noatime to reduce i/o hits there. Need to figure out what Brad's suggestions about readahead on the server mean.
It's been a while since I mucked with Linux as an NFS server; I've been on Netapp for a while. There may be fewer knobs than I recall.
I do have one more idea I'll throw out there. Everything I've got here is virtual. I only have the one Dovecot/Postfix server running now, and the impression I get from you all is that that should be adequate for my load. What would the collective opinion be of simply removing the NFS server altogether and mounting the virtual disk holding my messages directly to the dovecot server?
If you're not planning on doing some sort of HA failover or load balancing, and have the option to make your storage direct-attached instead of NAS, it might be worth trying. There's not much to be gained from NFS in a single-node configuration.
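If you go that route, attaching the LUN directly with open-iscsi and mounting it locally is roughly (target IQN and portal address are placeholders, and the device name depends on how the LUN shows up):

  iscsiadm -m discovery -t sendtargets -p 192.0.2.10
  iscsiadm -m node -T iqn.2010-10.com.example:mailstore -p 192.0.2.10 --login
  mount -o noatime /dev/sdb1 /srv/mail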
-Brad
On 10/6/10 11:26 PM, Brandon Davidson wrote:
I do have one more idea I'll throw out there. Everything I've got here is virtual. I only have the one Dovecot/Postfix server running now, and the impression I get from you all is that that should be adequate for my load. What would the collective opinion be of simply removing the NFS server altogether and mounting the virtual disk holding my messages directly to the dovecot server?

If you're not planning on doing some sort of HA failover or load balancing, and have the option to make your storage direct-attached instead of NAS, it might be worth trying. There's not much to be gained from NFS in a single-node configuration.
I ended up implementing this just now. Still had load issues this morning and I'm hoping that removing NFS helps me out.
Chris
Chris Hobbs Director, Technology New Haven Unified School District
participants (7)
- Brad Davidson
- Brandon Davidson
- Brian Hayden
- Chris Hobbs
- Luben Karavelov
- Rick Romero
- Timo Sirainen