IMAP dovecot\postgres low authentication performance
Hi. We have a performance problem with IMAP authentication through PostgreSQL. Our servers (Modoboa-based) hold a large number of persistent IMAP connections (5000-50000). Current performance is about 3000 successful authentications per hour, and we see no visible reason for such a low rate. As a result, after a network failure or server restart all clients try to reconnect, but restoring the connection pool takes hours or even tens of hours. Judging by the logs after a restart, a huge number of auth requests are closed by timeout after 70-90 seconds. The PostgreSQL database is not overloaded while the connections are being restored, and the PostgreSQL connection pool (100) does not overflow. Manually run SQL auth queries are fast, and the tables have indexes. So I guess there is a bottleneck somewhere in the dovecot auth service or the PostgreSQL driver.
I couldn't find any settings in the documentation that directly control the number of connections from the auth service to PostgreSQL or the performance of the driver. Is there any way to manage this? Does it make sense to put pgbouncer in front of the database? What else could be a bottleneck in our configuration, and how can we regulate the number of simultaneous authentications via PostgreSQL? I would be grateful for any advice on how to increase the performance to at least 100 authentications per second.
Thanks in advance for your help. Anatoliy Zhestov.
---------------- configs of server with 48 core, 184 gb mem ---------------
dovecot 2.3.16
# 2.3.16 (7e2e900c1a): /etc/dovecot/dovecot.conf
# Pigeonhole version 0.5.16 (09c29328)
# OS: Linux 5.15.0-126-generic x86_64 Ubuntu 22.04.5 LTS
# Hostname: imap.ourcompany.net
auth_cache_negative_ttl = 0
auth_cache_size = 20 M
auth_cache_ttl = 3 hours
auth_master_user_separator = *
auth_mechanisms = plain login
default_client_limit = 131072
default_process_limit = 131072
dict {
  quota = pgsql:/etc/dovecot/dovecot-dict-sql.conf.ext
}
mail_location = maildir:~/Maildir
mail_plugins = quota
managesieve_notify_capability = mailto
managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy include variables body enotify environment mailbox date index ihave duplicate mime foreverypart extracttext editheader imapflags notify vnd.dovecot.pipe vnd.dovecot.execute
namespace inbox {
  inbox = yes
  location =
  mailbox Drafts {
    auto = subscribe
    special_use = \Drafts
  }
  mailbox Junk {
    auto = subscribe
    special_use = \Junk
  }
  mailbox Sent {
    auto = subscribe
    special_use = \Sent
  }
  mailbox Trash {
    auto = subscribe
    special_use = \Trash
  }
  prefix =
}
passdb {
  args = /etc/dovecot/dovecot-sql.conf.ext
  driver = sql
}
passdb {
  args = /etc/dovecot/dovecot-sql-master.conf.ext
  driver = sql
  master = yes
  pass = yes
}
plugin {
  quota = dict:User quota::proxy::quota
  sieve = ~/.dovecot.sieve
  sieve_dir = ~/sieve
  sieve_execute_bin_dir = /usr/lib/dovecot/sieve-execute
  sieve_execute_socket_dir = .
  sieve_extensions = +notify +imapflags +editheader +body +vnd.dovecot.execute +vnd.dovecot.pipe
  sieve_pipe_bin_dir = /usr/lib/dovecot/sieve-execute
  sieve_pipe_socket_dir = .
  sieve_plugins = sieve_extprograms
}
protocols = " imap lmtp sieve"
service anvil {
  client_limit = 393219
}
service auth {
  client_limit = 524288
  unix_listener /var/spool/postfix/private/auth {
    group = postfix
    mode = 0666
    user = postfix
  }
  unix_listener auth-radicale {
    group = radicale
    mode = 0666
    user = radicale
  }
  unix_listener auth-userdb {
    user = vmail
  }
}
service config {
  vsz_limit = 1 G
}
service dict {
  unix_listener dict {
    mode = 0600
    user = vmail
  }
}
service imap {
  executable = imap postlogin
  process_limit = 131072
  vsz_limit = 1 G
}
service lmtp {
  unix_listener /var/spool/postfix/private/dovecot-lmtp {
    group = postfix
    mode = 0600
    user = postfix
  }
  vsz_limit = 512 M
}
service pop3 {
  executable = pop3 postlogin
}
service postlogin {
  executable = script-login /usr/local/bin/postlogin.sh
  service_count = 1000
  process_min_avail = 5
  user = modoboa
}
service send-to-click-frontapp1 {
  executable = script /usr/lib/dovecot/sieve-execute/clickhouse_log_frontapp1.sh
  unix_listener send-to-click-frontapp1 {
    group = vmail
    mode = 0666
    user = vmail
  }
  user = dovenull
  vsz_limit = 2 G
}
service send-to-click {
  executable = script /usr/lib/dovecot/sieve-execute/clickhouse_log.sh
  unix_listener send-to-click {
    group = vmail
    mode = 0666
    user = vmail
  }
  user = dovenull
  vsz_limit = 2 G
}
service stats {
  unix_listener stats-reader {
    group = vmail
    mode = 0660
    user = vmail
  }
  unix_listener stats-writer {
    group = vmail
    mode = 0660
    user = vmail
  }
  vsz_limit = 2 G
}
ssl_cert = </etc/letsencrypt/live/imap.ourcompany.net/fullchain.pem
ssl_cipher_list = EECDH+ECDSA+AESGCM:EECDH+aRSA+AESGCM:EECDH+ECDSA+SHA384:EECDH+ECDSA+SHA256:EECDH+aRSA+SHA384:EECDH+aRSA+SHA256:EECDH+aRSA+RC4:EECDH:EDH+aRSA:!aNULL:!eNULL:!LOW:!3DES:!MD5:!EXP:!PSK:!SRP:!DSS:!RC4
ssl_client_ca_dir = /etc/ssl/certs
ssl_dh = # hidden, use -P to show it
ssl_key = # hidden, use -P to show it
ssl_min_protocol = TLSv1
stats_writer_socket_path =
userdb {
  driver = prefetch
}
userdb {
  args = /etc/dovecot/dovecot-sql.conf.ext
  driver = sql
}
protocol lmtp {
  mail_plugins = quota sieve
  postmaster_address = postmaster@imap.ourcompany.net
}
protocol imap {
  mail_max_userip_connections = 50
  mail_plugins = quota imap_quota
}
/etc/dovecot/dovecot-sql-master.conf.ex
driver = pgsql
connect = host=127.0.0.1 port=5432 dbname=modoboa user=modoboa password=XXXXXXXXXXXXXXXXX
password_query = SELECT email AS user, password FROM core_user WHERE email='%u' and is_active and master_user
/etc/dovecot/dovecot-sql.conf.ext
driver = pgsql
connect = host=127.0.0.1 port=5432 dbname=modoboa user=modoboa password=XXXXXXXXXXXXXXXXX
user_query = SELECT '/srv/vmail/%d/%n' AS home, 1003 as uid, 1003 as gid, '*:bytes=' || mb.quota || 'M' AS quota_rule FROM admin_mailbox mb INNER JOIN admin_domain dom ON mb.domain_id=dom.id INNER JOIN core_user u ON u.id=mb.user_id WHERE (mb.is_send_only IS NOT TRUE OR '%s' NOT IN ('imap', 'pop3', 'lmtp')) AND mb.address='%n' AND dom.name='%d'
password_query = SELECT email AS user, password, '/srv/vmail/%d/%n' AS userdb_home, 1003 AS userdb_uid, 1003 AS userdb_gid, CONCAT('*:bytes=', mb.quota, 'M') AS userdb_quota_rule FROM core_user u INNER JOIN admin_mailbox mb ON u.id=mb.user_id INNER JOIN admin_domain dom ON mb.domain_id=dom.id WHERE (mb.is_send_only IS NOT TRUE OR '%s' NOT IN ('imap', 'pop3')) AND email='%u' AND is_active AND dom.enabled
iterate_query = SELECT email AS user FROM core_user
/usr/local/bin/postlogin.sh
#!/bin/sh
PATH="/usr/bin:/usr/local/bin:/bin"
psql -c "UPDATE core_user SET last_login=now() WHERE username='$USER'" > /dev/null
exec "$@"
/usr/lib/dovecot/sieve-execute/clickhouse_log_frontapp1.sh
#!/bin/bash
cd /tmp
logger -i -t SIEVE_TEST "$1,$2,$3,$4"
clickhouse-client -h XXXXXXXXXXXXXXXXX -d default -u some_user --password XXXXXXXXXXXXXXXXX \
--param_origin_from="$1" \
--param_origin_to="$2" \
--param_message_id="$3" \
--param_forward="$4" \
-q "INSERT INTO frontapp1_redirects
(origin_from_email,origin_to_email,message_id,forward_email) VALUES
({origin_from:String},{origin_to:String},{message_id:String},{forward:String})"
/usr/lib/dovecot/sieve-execute/clickhouse_log.sh
#!/bin/bash
cd /tmp
logger -i -t SIEVE_TEST "$1,$2,$3,$4"
clickhouse-client -h XXXXXXXXXXXXXXXXX -d default -u some_user --password XXXXXXXXXXXXXXXXX \
--param_origin_from="$1" \
--param_origin_to="$2" \
--param_message_id="$3" \
--param_forward="$4" \
-q "INSERT INTO frontapp_redirects
(origin_from_email,origin_to_email,message_id,forward_email) VALUES
({origin_from:String},{origin_to:String},{message_id:String},{forward:String})"
On Mon, Feb 3, 2025 at 12:51 PM Marc <Marc@f1-outsourcing.eu> wrote:
Current performance is about 3000 successful authentications per hour. No
I don't really get this. Aren't authentication attempts limited by TCP? So it does not really matter whether you have mariadb, ldap or http, you are always limited to 150-200 r/s. Once you have a connection, you can easily go to something like >9000 r/s.
If I divide your 3000 auths/h by 3600, you are at about 1 r/s? Maybe you have a lot of iowait because of some recovery?
No, we don't have a noticeable iowait problem as far as I can see (at least while the number of connections is below 20-30K). The problem appears when thousands of clients try to reconnect at the same time, and according to the documentation the auth service should only make a simple request to postgres for this. In theory this should not be related to I/O. It is just an SQL query.
I don't really get this. Aren't authentication attempts limited by TCP? So it does not really matter whether you have mariadb, ldap or http, you are always limited to 150-200 r/s.
Can you explain? We also have SMTP connections and postfix on the same servers, but problems arise only at the stage of connecting to IMAP via postgres. How can this be related to TCP limits?
Additionally, I would like to note that we can have hundreds of connections from one IP. Could this lead to a race condition with the anvil service and an increase in connection timeouts?
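For reference, the rate arithmetic discussed above can be reproduced with a short shell sketch. The function names are made up for illustration, and the 40000-connection figure is only the order of magnitude mentioned elsewhere in this thread, not a measured value.

```shell
#!/bin/sh
# Convert an hourly authentication count into a per-second rate.
# $1 = authentications per hour.
auths_per_second() {
    awk -v n="$1" 'BEGIN { printf "%.2f\n", n / 3600 }'
}

# Estimate how many hours it takes to re-authenticate a set of clients.
# $1 = clients that need to reconnect, $2 = authentications per hour.
hours_to_reconnect() {
    awk -v c="$1" -v r="$2" 'BEGIN { printf "%.1f\n", c / r }'
}

auths_per_second 3000           # prints 0.83
hours_to_reconnect 40000 3000   # prints 13.3
```

At the observed rate, tens of thousands of reconnecting clients would need many hours, which matches the recovery times reported in the original post.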
No, we don't have a noticeable iowait problem as far as I can see (at least while the number of connections is below 20-30K). The problem appears when thousands of clients try to reconnect at the same time, and according to the documentation the auth service should only make a simple request to postgres for this. In theory this should not be related to I/O. It is just an SQL query.
Oh yes? What is this /usr/local/bin/postlogin.sh then? I don't know, you have to look at what is different after the restart.
I don't really get this. Aren't authentication attempts limited by TCP? So it does not really matter whether you have mariadb, ldap or http, you are always limited to 150-200 r/s.
Can you explain? We also have SMTP connections and postfix on the same servers, but problems arise only at the stage of connecting to IMAP via postgres. How can this be related to TCP limits?
Probably not, as you only restarted. The limit applies when you have to create a new connection instead of using an existing one.
Additionally, I would like to note that we can have hundreds of connections from one IP. Could this lead to a race condition with the anvil service and an increase in connection timeouts?
But you already had this before the restart, didn't you? You have to look for what is different.
On Mon, Feb 3, 2025 at 10:29 PM Marc <Marc@f1-outsourcing.eu> wrote:
Oh yes? What is this /usr/local/bin/postlogin.sh then? I don't know, you have to look at what is different after the restart.
Do you mean that the contents of this file are not cached? Or that there is some limit on the number of simultaneous requests to read it? The content of this file does not look too heavy:
psql -c "UPDATE core_user SET last_login=now() WHERE username='$USER'" > /dev/null
Probably not, as you only restarted. The limit applies when you have to create a new connection instead of using an existing one.
I don't see a way to reuse an existing connection if the number of persistent connections after a restart has to grow from 0 to 40K. Am I missing something obvious? (We tried changing the client_limit value for imap-login in order to reduce the number of processes, but in our experience it works correctly only with values of 3-5; at higher values imap-login stops responding quite quickly.)
Do you mean that the contents of this file are not cached? Or that there is some limit on the number of simultaneous requests to read it? The content of this file does not look too heavy:
psql -c "UPDATE core_user SET last_login=now() WHERE username='$USER'" > /dev/null
I would put a '&' at the end, so at least nothing is waiting for this. It does not seem to be critical, and it is also not efficient. Can't you make a trigger in the database so that you do not even need a shell script?
Probably not, as you only restarted. The limit applies when you have to create a new connection instead of using an existing one.
I don't see a way to reuse an existing connection if the number of persistent connections after a restart has to grow from 0 to 40K. Am I missing something obvious? (We tried changing the client_limit value for imap-login in order to reduce the number of processes, but in our experience it works correctly only with values of 3-5; at higher values imap-login stops responding quite quickly.)
I can't really write anything more useful. On the client (MUA) side you can't do anything, as these are all 'unique' connections. You can only optimize your connection to the database. But I am not using this kind of setup, so I don't really know. I am using ldap, and the ldap lookups are cached by sssd or nslcd.
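The '&' suggestion above could look roughly like the following variant of postlogin.sh. This is only a sketch: the UPDATE_CMD variable and the record_last_login function are hypothetical names added so the database command can be swapped out when testing, and the SQL statement (including the way $USER is interpolated into it) is kept as posted in the thread.

```shell
#!/bin/sh
# Sketch of a post-login script that does not make the login wait for the UPDATE.
PATH="/usr/bin:/usr/local/bin:/bin"

# Record the login time for user $1 without blocking the caller.
# UPDATE_CMD is a hypothetical override; it defaults to psql as in the original script.
record_last_login() {
    # The subshell starts the command in the background and exits at once,
    # so the command is detached from this script and nothing waits for it.
    ( ${UPDATE_CMD:-psql} -c "UPDATE core_user SET last_login=now() WHERE username='$1'" >/dev/null 2>&1 & )
}

record_last_login "$USER"

# Hand over to the next executable in the chain when one was passed in.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

Whether this changes the login rate noticeably would still have to be measured; it only removes the wait for psql from the login path.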
On 3. Feb 2025, at 7.05, Anatoliy Zhestov via dovecot <dovecot@dovecot.org> wrote:
Hi. We have a performance problem with IMAP authentication through PostgreSQL. Our servers (Modoboa-based) hold a large number of persistent IMAP connections (5000-50000). Current performance is about 3000 successful authentications per hour, and we see no visible reason for such a low rate. As a result, after a network failure or server restart all clients try to reconnect, but restoring the connection pool takes hours or even tens of hours. Judging by the logs after a restart, a huge number of auth requests are closed by timeout after 70-90 seconds. The PostgreSQL database is not overloaded while the connections are being restored, and the PostgreSQL connection pool (100) does not overflow. Manually run SQL auth queries are fast, and the tables have indexes. So I guess there is a bottleneck somewhere in the dovecot auth service or the PostgreSQL driver.
Are you sure the problem is authentication / pgsql? You could test with looping "doveadm auth lookup $user" rapidly. Of course for different users to avoid them coming from cache. Or if you can reproduce it that way, try if the same happens for repeating the same user so it does come from cache.
On Mon, Feb 3, 2025 at 8:04 PM Timo Sirainen <timo@sirainen.com> wrote:
Are you sure the problem is authentication / pgsql? You could test with looping "doveadm auth lookup $user" rapidly. Of course for different users to avoid them coming from cache. Or if you can reproduce it that way, try if the same happens for repeating the same user so it does come from cache.
I tested under conditions where 90% of the IMAP connections were already established. The auth cache is enabled, so I guess tests with the same user are not relevant.
-------- less loaded server
ps waux|grep imap-login|wc -l
24977
netstat -n | grep ":993" | grep -ic established
24868
find /srv/vmail/*/ -depth -maxdepth 1 -type d |grep -v "/$"|awk -F '/' '{print $5"@"$4}' > /tmp/userlist
cat /tmp/userlist |wc -l
13285
time $(for i in $(cat /tmp/userlist); do doveadm auth lookup $i; done)
passdb: command not found
real 5m43.336s
user 1m4.634s
sys 2m32.215s
echo "13285 / 343" |bc
38 (per second)
--------- high loaded server
ps waux|grep imap-login|wc -l
52264
netstat -n | grep ":993" | grep -ic established
52773
cat /tmp/userlist |wc -l
37727
time $(for i in $(cat /tmp/userlist); do doveadm auth lookup $i; done)
passdb:: command not found
real 26m26.193s
user 3m55.742s
sys 18m21.957s
echo "37727 / 1586" |bc
23 (per second)
So sequential requests run at fewer than 40 and 25 queries per second respectively, which is much slower than I would like but not awful. However, with parallel requests things seem to get a lot worse: most requests die by timeout. Unfortunately, I can't measure this now on busy servers without the risk of affecting clients.
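A side note on the measurement: the "command not found" message in the transcript appears because `time $(...)` runs the loop inside a command substitution and then tries to execute its output as a command. A sketch that times the lookups themselves, plus a parallel variant, could look like this. LOOKUP_CMD and the function names are hypothetical, the parallelism value in the usage comment is arbitrary, and `xargs -P` is an extension available in GNU and BSD xargs.

```shell
#!/bin/bash
# Hypothetical override for testing; defaults to the command used in the thread.
LOOKUP_CMD=${LOOKUP_CMD:-"doveadm auth lookup"}

# Sequential lookups: one per line of the user list in $1, output discarded.
lookup_sequential() {
    while IFS= read -r user; do
        # stdin comes from /dev/null so the lookup cannot consume the user list
        $LOOKUP_CMD "$user" < /dev/null
    done < "$1" > /dev/null
}

# Parallel lookups: $1 = user list, $2 = number of concurrent lookups.
lookup_parallel() {
    xargs -P "$2" -n 1 $LOOKUP_CMD < "$1" > /dev/null
}

# Usage (not run automatically):
#   time lookup_sequential /tmp/userlist
#   time lookup_parallel /tmp/userlist 16
```

Starting with a low parallelism and raising it step by step keeps the risk for connected clients small while still showing at what point lookups start to time out.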
And the logs look like this:
Jan 27 18:51:46 imapserver dovecot: imap-login: Error: Timeout while finishing login (waited 44 secs): user=<aaa@bbb>, method=PLAIN, rip=xx.xx.xx.xx, lip=xx.xx.xx.xx, TLS: Connection closed, session=<3BnDhrQs6MANO0fO>
Jan 27 18:51:46 imapserver dovecot: imap-login: Error: Timeout while finishing login (waited 40 secs): user=<bbb@ccc>, method=PLAIN, rip=xx.xx.xx.xx, lip=xx.xx.xx.xx, TLS: Connection closed, session=<0f7IhrQsRuIDEJ9E>
Jan 27 18:51:46 imapserver dovecot: imap-login: Error: Timeout while finishing login (waited 41 secs): user=<ccc@ddd>, method=PLAIN, rip=xx.xx.xx.xx, lip=xx.xx.xx.xx, TLS: Connection closed, session=<f8TFhrQshMMDDlat>
Jan 27 18:51:46 imapserver dovecot: imap-login: Error: Timeout while finishing login (waited 40 secs): user=<ddd@eee>, method=PLAIN, rip=xx.xx.xx.xx, lip=xx.xx.xx.xx, TLS: Connection closed, session=<lknKhrQspOMDEJ9E>
Jan 27 18:51:47 imapserver dovecot: imap-login: Error: Timeout while finishing login (waited 40 secs): user=<eee@fff>, method=PLAIN, rip=xx.xx.xx.xx, lip=xx.xx.xx.xx, TLS: Connection closed, session=<4dLJhrQsXOAS2fSG>
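To see how the timeouts are distributed over time, log lines like the ones above can be counted per minute. This is a minimal sketch that assumes the classic syslog timestamp layout shown in the excerpt; the function name is made up, and the log file path is whatever file dovecot logs to on the host.

```shell
#!/bin/sh
# Count "Timeout while finishing login" errors per minute.
# $1 = path to a syslog-format log file (month, day, HH:MM:SS as the first three fields).
timeouts_per_minute() {
    grep 'imap-login: Error: Timeout while finishing login' "$1" |
        awk '{ print $1, $2, substr($3, 1, 5) }' |   # keep month, day and HH:MM
        sort | uniq -c
}

# Usage (the path is an assumption):
#   timeouts_per_minute /var/log/mail.log
```

Each output line is a count followed by the minute it belongs to, which makes it easier to compare the timeout rate with the number of logins that succeed in the same period.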
participants (3)
- Anatoliy Zhestov
- Marc
- Timo Sirainen