[Dovecot] Random timeouts on mailboxes
Hello,
My Dovecot server is currently having random issues when users try to access their mailboxes. When the issue occurs, the connection times out. It is random in that it happens more or less often and does not depend on the user or on the way they check their mailbox. The duration of the problem is also random: sometimes less than a minute, sometimes several hours. When a user can't connect to their mailbox, they can generally still receive their email on their BlackBerry (this is just to illustrate that the problem doesn't seem to be linked to an account). While the issue lasts, users can't fetch the headers of any folder (INBOX, custom, sent, ...) and can't read messages whose headers the mail client already knows. During a timeout, users can still send email (same jail, though with an error when the client tries to save a copy in the sent folder) and can use, for example, the calendar server (on the same machine).
Dovecot is running on 32-bit FreeBSD 7.0 with 4 GB RAM, an Intel Xeon quad-core @ 1.86 GHz, and 3 x 500 GB SATA-2 disks in RAID-5. The box hosts jails, including the mail jails (imap + smtp, clamav + spamassassin). The mail jails are new (since August 2008) but worked great since the beginning of this year. The server currently hosts 122 accounts.
My first thought was an I/O issue: maybe the disks are too busy, or paging is causing the timeouts. I checked with the vmstat and top commands, but nothing stands out: there is always some free memory (between 90 and 300 MB) and very little paging, generally around 1 MB. Page faults stay under a hundred; on the few occasions they exceed a hundred (generally staying under 200), the next snapshot is back under 100. I set the display to refresh every 5 seconds.
I shut down all jails not directly related to the mail service, but the problem still occurred. I also moved clamav and spamassassin away from imap and smtp to a different box. After that I went through the Dovecot config to lighten it, and (only) disabled fsync. I upgraded the RAM; the added memory is being used, but nothing changed.
Staffing changes constantly here: we were around 10-15% more before the problem started than we are now, which is why I don't think it is linked to the number of users. I tried recreating some of the accounts that had the issue, but the problem appeared again. On Friday I upgraded Dovecot from 1.1.7 to 1.1.11 (the original installation predates 1.1.7; I upgraded some time ago).
I used a command script to log Thunderbird's IMAP activity; everything looks fine, but there are no timestamps in the logs. So for now I'm only sure that this is not a problem of erroneous packets or data being sent. I looked at the TB doc ( http://www.mozilla.org/projects/nspr/reference/html/prlog.html#25306 ), but there is no directive to add a timestamp to each log line.
The network is fine. I tried different configurations to make sure no device is doing something weird.
Now I don't know what else to check to identify the issue. If anyone has an idea I haven't mentioned here, or if I have misinterpreted the data, I'll be glad to know.
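A possible workaround for the missing timestamps is to stamp each line at the moment it is appended to the log file. The Python sketch below is only an illustration under assumptions: the function names are hypothetical, the log is assumed to be a plain text file that grows line by line, and the timestamp records when the line was read, not when Thunderbird wrote it.

```python
import time
from datetime import datetime
from typing import Callable, Iterable, Iterator


def stamp_lines(lines: Iterable[str],
                clock: Callable[[], datetime] = datetime.now) -> Iterator[str]:
    """Prefix each line with the time at which it was read."""
    for line in lines:
        yield f"{clock().isoformat(timespec='seconds')} {line.rstrip()}"


def follow(path: str, poll_seconds: float = 0.5) -> Iterator[str]:
    """Yield complete lines appended to `path`, similar to `tail -f`.

    Runs until interrupted. Starts at the current end of the file.
    """
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, 2)  # 2 = seek relative to the end of the file
        pending = ""
        while True:
            chunk = handle.readline()
            if not chunk:
                time.sleep(poll_seconds)  # nothing new yet, wait and retry
                continue
            pending += chunk
            if pending.endswith("\n"):  # only emit once the line is complete
                yield pending
                pending = ""
```

Something like `for line in stamp_lines(follow(path_to_log)): print(line)` would then print the stamped lines while the client is running, so a timeout reported by a user can be matched against the IMAP commands sent around that time.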
Regards,
-- Bastien Semene Network & System Administrator
On Mon, 9 Mar 2009, Bastien Semene wrote:
Hello,
What are your settings for login_max_processes_count and the associated settings? They also apply to SSL connections, see: http://wiki.dovecot.org/LoginProcess?highlight=%28login_max_processes_count%...
I encountered this kind of problem in v1.0x when the number of connections reached the limit. I set up a log showing the number of imap processes owned by non-root users and the number of file descriptors used by the dovecot process. When I get hit by the connection refusal problem, the number of file descriptors does not drop overnight, when the user count decreases.
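The process-count half of such a log could be sketched in Python as follows. This is not the script actually used, only a minimal illustration: the function names are hypothetical, the mail processes are assumed to show up in ps under the command name `imap`, and the `ps -axo user,comm` invocation is an assumption that may need adjusting per platform. Counting file descriptors is left out because it is platform-specific (for example fstat on FreeBSD or /proc/<pid>/fd on Linux).

```python
import subprocess
from collections import Counter


def count_mail_processes(ps_output: str, command: str = "imap") -> Counter:
    """Count processes named `command` per non-root user.

    `ps_output` is expected to have two columns, user and command name,
    preceded by one header line.
    """
    counts: Counter = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue
        user, comm = fields[0], fields[1].strip()
        # Exact match, so login processes such as "imap-login" are not counted.
        if comm == command and user != "root":
            counts[user] += 1
    return counts


def snapshot() -> int:
    """Run ps once and return the total number of non-root imap processes."""
    output = subprocess.run(["ps", "-axo", "user,comm"],
                            capture_output=True, text=True, check=True).stdout
    return sum(count_mail_processes(output).values())
```

Calling `snapshot()` at a fixed interval and writing the result to a file with a timestamp gives a record that can be compared with the times at which connections were refused.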
Bye,
Steffen Kaiser
Hello,
I tracked the number of processes and it was very close to the limit. I read your link and think it is the solution to my problem, so I modified the config to fit our needs. There was no problem yesterday or today. Problem-free days are rare but have happened before, so I can't be certain; however, the number of processes did reach the previous limit today. I'll report back to the list if the problem persists.
Thank you, Andreas, for the idea, but TB is installed locally and I think Steffen's answer is the right one. It fits the symptoms perfectly.
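To notice earlier when the process count approaches the configured limit, a check along these lines could be used. It is a hedged Python sketch, not a tested monitoring tool: the function names and the 90% threshold are hypothetical, the config is assumed to use plain `name = value` lines with `#` for comments, and the parser ignores any nesting inside blocks. Which measured count should be compared with which limit depends on the setup.

```python
import re
from typing import Optional


def read_setting(conf_text: str, name: str) -> Optional[int]:
    """Return the integer value of the first active `name = N` line.

    Commented-out lines (starting with '#') are ignored; None means the
    setting is not set explicitly and the built-in default applies.
    """
    pattern = re.compile(r"^\s*" + re.escape(name) + r"\s*=\s*(\d+)\s*(?:#.*)?$")
    for line in conf_text.splitlines():
        match = pattern.match(line)
        if match:
            return int(match.group(1))
    return None


def near_limit(current: int, limit: int, ratio: float = 0.9) -> bool:
    """True when `current` has reached at least `ratio` of `limit`."""
    return current >= limit * ratio
```

For example, `read_setting(conf_text, "login_max_processes_count")` returns the configured value, and `near_limit(count, limit)` can then trigger a log entry or an alert before users start seeing timeouts.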
-- Bastien Semene Network & System Administrator