Hello,
My Dovecot server is currently having random issues when users try to access their mailboxes. When the issue occurs, the connection times out. It's random in that it happens more or less often and does not depend on the user or on the way they check their mailbox. The duration of the problem is random too: sometimes it's less than a minute, sometimes several hours. When a user can't connect to his mailbox, he can generally still receive his emails on his BlackBerry (this is just to illustrate that the problem doesn't seem to be linked to the account). While the issue occurs, users can't fetch the headers from any folder (INBOX, custom, sent, ...) and can't read mails whose headers the mail client already knows. During a timeout, users can still send emails (same jail, though with an error when the client tries to save a copy in the Sent folder) and use the calendar server, for example (on the same machine).
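
Since the outages start and stop at random, a small probe that logs in on a schedule and records how long each attempt takes could pin down exactly when they begin and end. The following is a minimal sketch in Python, not something already running on the server: the host, user and password are hypothetical arguments, and it assumes plain IMAP on port 143 with plaintext login allowed for a test account (otherwise `imaplib.IMAP4_SSL` would be needed instead).

```python
import imaplib
import sys
import time
from datetime import datetime


def timed_probe(action, clock=time.monotonic):
    """Run action() once and return (ok, seconds, detail)."""
    start = clock()
    try:
        detail = action()
        ok = True
    except (OSError, imaplib.IMAP4.error) as exc:
        # Socket timeouts and refused connections are OSError subclasses.
        detail = repr(exc)
        ok = False
    return ok, clock() - start, detail


def check_inbox(host, user, password, port=143, timeout=30):
    """Log in, open INBOX read-only and report its message count."""
    # The timeout= argument needs Python 3.9 or later.
    with imaplib.IMAP4(host, port, timeout=timeout) as conn:
        conn.login(user, password)
        status, data = conn.select("INBOX", readonly=True)
        count = data[0].decode() if data and data[0] else "?"
        return f"{status} {count} messages"
    # Leaving the with-block sends LOGOUT, so on a failure the measured
    # time may include that LOGOUT attempt as well.


def watch(host, user, password, interval=60):
    """Probe forever, printing one timestamped line per attempt."""
    while True:
        ok, secs, detail = timed_probe(lambda: check_inbox(host, user, password))
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {'OK' if ok else 'FAIL'} {secs:.1f}s {detail}", flush=True)
        time.sleep(interval)


if __name__ == "__main__" and len(sys.argv) == 4:
    # Usage (all values hypothetical): python3 imap_probe.py HOST USER PASSWORD
    watch(*sys.argv[1:4])
```

Running one copy from inside the mail jail and another from a client machine would show whether both see the timeout at the same moment, and the timestamps can then be lined up with vmstat output or the Dovecot logs.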
Dovecot is running on FreeBSD 7.0 (32-bit) with 4 GB RAM, an Intel Xeon quad-core @ 1.86 GHz, and 3 x 500 GB SATA-2 disks in RAID-5. The box hosts several jails, including the mail jails (imap + smtp, clamav + spamassassin). The mail jails are new (since August 2008) but worked great since the beginning of this year. The server currently hosts 122 accounts.
My first thought was an I/O issue: maybe the disks are too busy, or there is paging that results in a timeout. I checked with the vmstat and top commands, but nothing shows up. There is always some free memory (between 90 and 300 MB) and very little paging, generally around 1 MB. The page faults stay under a hundred, and the few times they go above a hundred (generally staying under 200), the next sample is back under 100. I set the display to refresh every 5 seconds. I shut down all jails not directly related to the mail service, but the problem still occurred. I also moved clamav and spamassassin away from imap and smtp to a different box. After that I went through the Dovecot config to lighten it, and only disabled fsync. I upgraded the RAM; the added memory is being used, but nothing changed. Staff numbers change constantly here, and we had more users before the problem started than we have now (around 10-15% more), which is why I don't think it is linked to the number of users. I tried recreating some of the accounts that had the issue, but the problem appeared again. On Friday I upgraded Dovecot from 1.1.7 to 1.1.11 (the original installation was older than 1.1.7; I had already upgraded some time ago).
I used a command script to log Thunderbird's IMAP activity. Everything looks fine, but there are no timestamps in the logs, so all I'm sure of now is that the problem is not caused by an erroneous packet or command being sent. I looked at the Thunderbird docs ( http://www.mozilla.org/projects/nspr/reference/html/prlog.html#25306 ) but there's no directive to put a timestamp on each log line.
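
One possible workaround for the missing timestamps is to add them from outside, by following the log file while Thunderbird writes it and prefixing each new line with the time it was read. The sketch below is only an illustration under that assumption. The log path is a hypothetical argument, and the timestamps show when a line reached the file, which may lag behind the real IMAP event if Thunderbird buffers its log output.

```python
import os
import sys
import time
from datetime import datetime


def now_stamp():
    """Local wall-clock time with millisecond precision."""
    return datetime.now().isoformat(sep=" ", timespec="milliseconds")


def stamp_lines(lines, clock=now_stamp):
    """Prefix each line with the time at which it was read."""
    for line in lines:
        yield f"{clock()} {line}"


def follow(path, poll=0.5):
    """Yield complete lines appended to path, similar to `tail -f`."""
    with open(path, "r", errors="replace") as log:
        log.seek(0, os.SEEK_END)  # skip what is already in the file
        pending = ""
        while True:
            chunk = log.readline()
            if not chunk:
                time.sleep(poll)  # nothing new yet, wait and retry
                continue
            pending += chunk
            if pending.endswith("\n"):  # only emit whole lines
                yield pending
                pending = ""


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (path is hypothetical): python3 stamp_log.py /path/to/imap.log
    for stamped in stamp_lines(follow(sys.argv[1])):
        sys.stdout.write(stamped)
        sys.stdout.flush()
```

Redirecting the output to a second file gives a timestamped copy of the IMAP log that can be compared with the server-side logs around the time of a timeout.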
The network is fine. I tried different configurations to make sure no device is doing something weird.
Now I don't know what else to check to identify the issue. If anyone has an idea I haven't mentioned here, or if I have misinterpreted the data, I'll be glad to know.
Regards,
-- Bastien Semene, Network & System Administrator