On 11/04/2011 08:43 PM, Timo Sirainen wrote:
On Sat, 2011-10-22 at 21:21 +0200, Gordon Grubert wrote:
Hello,
Our Dovecot server crashes completely without any really useful log messages. The error log is attached. The only way to get Dovecot running again is a complete system restart.
How often does it break? If a "complete system restart" is really needed to fix it, it doesn't sound like a Dovecot problem. Check whether it's enough to stop Dovecot, and then make sure there aren't any dovecot processes lying around afterwards.

So far, the problem has occurred three times, most recently a few days ago. The last "crash" happened during the night, so we used the opportunity to debug the system in detail.
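Timo's suggestion, checking that no dovecot processes are left over after stopping the service, can be scripted. The following is a minimal sketch, assuming Linux and a readable `/proc`; the helper names `decode_cmdline` and `find_processes` are hypothetical, and matching on the substring "dovecot" in the command line is an assumption (it relies on the Dovecot binaries living under a path that contains "dovecot").

```python
import os


def decode_cmdline(raw: bytes) -> str:
    """Turn the NUL-separated contents of /proc/<pid>/cmdline into one string."""
    return raw.replace(b"\0", b" ").decode(errors="replace").strip()


def find_processes(pattern: str = "dovecot", proc: str = "/proc") -> list[tuple[int, str]]:
    """Return (pid, command line) for every process whose command line
    contains `pattern`. Linux-only: scans /proc/<pid>/cmdline."""
    matches = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc, entry, "cmdline"), "rb") as f:
                cmdline = decode_cmdline(f.read())
        except OSError:
            continue  # process exited meanwhile, or we lack permission
        # Kernel threads have an empty cmdline and never match.
        if pattern in cmdline:
            matches.append((int(entry), cmdline))
    return sorted(matches)


if __name__ == "__main__":
    leftovers = find_processes()
    for pid, cmdline in leftovers:
        print(pid, cmdline)
    if not leftovers:
        print("no dovecot processes found")
```

Run it after stopping Dovecot; any line it prints is a process that survived the shutdown.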
You could be right that it's not a Dovecot problem. Besides Dovecot, we found other processes that were hanging and could not be killed with "kill -9". We also found something all of these processes had in common: they hung while trying to access the mailbox volume. We therefore repaired the filesystem. Now we're watching the system ...
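On Linux, a process that cannot be killed with "kill -9" is usually in uninterruptible sleep (state `D`), typically waiting for disk or filesystem I/O; it does not act on SIGKILL until that operation returns. A minimal sketch for listing such processes, assuming Linux and `/proc`; the helpers `parse_stat` and `stuck_processes` are hypothetical names. Where readable, `/proc/<pid>/wchan` names the kernel function the process is waiting in, which can point at the filesystem involved.

```python
import os


def parse_stat(line: str) -> tuple[int, str, str]:
    """Parse pid, comm and state from a /proc/<pid>/stat line.
    comm may contain spaces or ')', so split on the LAST ')'."""
    pid_part, rest = line.split(" (", 1)
    comm, after = rest.rsplit(")", 1)
    state = after.split()[0]
    return int(pid_part), comm, state


def stuck_processes(proc: str = "/proc") -> list[tuple[int, str, str]]:
    """Return (pid, comm, wchan) for every process in state 'D'."""
    stuck = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process vanished or stat was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(proc, entry, "wchan")) as f:
                wchan = f.read().strip() or "?"
        except OSError:
            wchan = "?"  # wchan may be restricted to privileged users
        stuck.append((pid, comm, wchan))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm, wchan in stuck_processes():
        print(f"{pid}\t{comm}\twaiting in {wchan}")
```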
Oct 11 09:55:23 mailserver2 dovecot: master: Error: service(imap): Initial status notification not received in 30 seconds, killing the process
Oct 11 09:56:23 mailserver2 dovecot: imap-login: Error: master(imap): Auth request timed out (received 0/12 bytes)
Kind of looks like the auth process is hanging. You could see if stracing it shows anything useful. Also, are any errors logged about LDAP? Is LDAP running on the same server?

Dovecot authenticates against Postfix, and Postfix has an LDAP connection. LDAP runs on an external cluster, and no errors are reported there.
We hope that the filesystem error was the cause and that repairing it has fixed the problem.
Best regards, Gordon