Understanding why Dovecot unexpectedly died
Hi list!
I use Dovecot 1.2.17 (I can't upgrade right now, for many reasons), controlled by Pacemaker (I have an HA cluster). I now see that Pacemaker often restarts Dovecot. I wrote my own script to manage Dovecot, since Pacemaker does not have its own.
The "monitor" section of my script looks like this:
monitor)
    if [ ! -e $OCF_RESKEY_pid ]; then
        echo "stopped (no pidfile)"
        echo "DOVECOT STOPPED - NO PIDFILE" | /usr/bin/logger -p local0.info -t DOVECOT-MONITOR -i
        exit $OCF_NOT_RUNNING
    else
        /bin/ps axuwf | /bin/grep `/bin/cat $OCF_RESKEY_pid` | /bin/grep -v grep > /dev/null 2>&1
        if [ $? -ne 0 ]; then
            echo "stopped"
            echo "DOVECOT STOPPED - NO PROCESS" | /usr/bin/logger -p local0.info -t DOVECOT-MONITOR -i
            exit $OCF_NOT_RUNNING
        else
            if [ "`/bin/netstat -tupan | /bin/grep dovecot | /bin/grep $OCF_RESKEY_bindaddr | /usr/bin/wc -l`" -ne 0 ]; then
                exit $OCF_SUCCESS
            else
                echo "DOVECOT STOPPED - NO LISTEN [`/bin/netstat -tupan | /bin/grep dovecot`]" | /usr/bin/logger -p local0.info -t DOVECOT-MONITOR -i
                exit $OCF_ERR_GENERIC
            fi
        fi
    fi
    exit $OCF_SUCCESS
    ;;
I added the logger calls just now to try to understand why it dies. When Pacemaker restarts Dovecot, I see these lines in my syslog:
Nov 15 18:59:09 mail01 DOVECOT-MONITOR[530]: DOVECOT STOPPED - NO LISTEN [tcp 0 0 192.168.33.1:37545 192.168.33.3:3306 ESTABLISHED 637/dovecot-auth
Nov 15 18:59:09 mail01 DOVECOT-MONITOR[530]: tcp 0 0 192.168.33.1:37537 192.168.33.3:3306 ESTABLISHED 529/dovecot-auth]
So no "dovecot" process is listening anymore. Normally I have these:
tcp 0 0 0.0.0.0:110 0.0.0.0:* LISTEN 634/dovecot
tcp 0 0 0.0.0.0:143 0.0.0.0:* LISTEN 634/dovecot
tcp 0 0 0.0.0.0:993 0.0.0.0:* LISTEN 634/dovecot
tcp 0 0 0.0.0.0:995 0.0.0.0:* LISTEN 634/dovecot
tcp 0 0 192.168.33.1:40994 192.168.33.3:3306 VERBUNDEN 891/dovecot-auth
tcp 0 0 192.168.33.1:40984 192.168.33.3:3306 VERBUNDEN 638/dovecot-auth
tcp6 0 0 :::110 :::* LISTEN 634/dovecot
tcp6 0 0 :::143 :::* LISTEN 634/dovecot
tcp6 0 0 :::993 :::* LISTEN 634/dovecot
tcp6 0 0 :::995 :::* LISTEN 634/dovecot
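The difference between the failing and the normal listing can also be tested mechanically. The following is only a sketch, assuming `netstat -tupan` output in the column layout shown here (state in column 6, "PID/program" in column 7). The function name `count_listeners` is hypothetical and not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -tupan` output on stdin and print the
# number of sockets in LISTEN state owned by exactly the given program name.
# Assumption: column 6 is the state and column 7 is "PID/program".
count_listeners() {
    prog=$1
    awk -v prog="$prog" '
        $6 == "LISTEN" {
            split($7, owner, "/")      # "634/dovecot" -> owner[2] = "dovecot"
            if (owner[2] == prog) n++  # exact match, so "dovecot-auth" is skipped
        }
        END { print n + 0 }            # print 0 when nothing matched
    '
}
```

It would be fed from the live command, for example `/bin/netstat -tupan | count_listeners dovecot`. Matching on the LISTEN state and the exact program name keeps established connections of dovecot-auth from being counted as listeners.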
In the mail.log and mail.err I can't see anything but:
Nov 15 18:59:13 mail01 dovecot: Dovecot v1.2.17 starting up
Nov 15 18:59:13 mail01 dovecot: auth-worker(default): mysql: Connected to 192.168.33.3 (exim)
And in the syslog there is nothing about Dovecot...
Any idea?
Thanks a lot! Luca Bertoncello (lucabert@lucabert.de)
On Sat, 15 Nov 2014, Luca Bertoncello wrote:
I use Dovecot 1.2.17 (I can't upgrade right now, for many reasons), controlled by Pacemaker (I have an HA cluster). I now see that Pacemaker often restarts Dovecot. I wrote my own script to
Please define "often". If it happens very often, try starting Dovecot from a script and catching its output, e.g.:
#!/bin/bash
logf=/tmp/dovecot.start.log
(
    /../sbin/dovecot -F
    rc=$?
    echo $(date) rc=$rc
    exit $rc
) >>"$logf" 2>&1
manage Dovecot, since Pacemaker does not have its own.
The "monitor" section of my script looks like this:
monitor)
    if [ ! -e $OCF_RESKEY_pid ]; then
        echo "stopped (no pidfile)"
        echo "DOVECOT STOPPED - NO PIDFILE" | /usr/bin/logger -p local0.info -t DOVECOT-MONITOR -i
        exit $OCF_NOT_RUNNING
    else
        /bin/ps axuwf | /bin/grep `/bin/cat $OCF_RESKEY_pid` | /bin/grep -v grep > /dev/null 2>&1
This is vague and catches many false positives if the PID is low. Doesn't your system accept:
    if ! ps `/bin/cat $OCF_RESKEY_pid` >/dev/null 2>&1; then
to query one particular process ID?
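A minimal sketch of such a single-PID check, assuming the pidfile holds one numeric PID. The function name `pid_is_running` is hypothetical, and `kill -0` is swapped in for `ps` here as another common way to test whether a process exists without actually signalling it. It also assumes the caller is allowed to signal the process (for example, it runs as root).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the PID stored in the given pidfile
# belongs to a process that currently exists.
pid_is_running() {
    pidfile=$1
    # A missing or unreadable pidfile counts as "not running".
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    # Reject empty or non-numeric content instead of passing it on.
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    # kill -0 delivers no signal; it only reports whether the PID can be signalled.
    kill -0 "$pid" 2>/dev/null
}
```

In the monitor action this could stand in for the ps and grep pipeline, for example `pid_is_running "$OCF_RESKEY_pid" || exit $OCF_NOT_RUNNING`. Because the PID is tested directly, a low PID can no longer match unrelated text in the process listing.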
        if [ $? -ne 0 ]; then
            echo "stopped"
            echo "DOVECOT STOPPED - NO PROCESS" | /usr/bin/logger -p local0.info -t DOVECOT-MONITOR -i
            exit $OCF_NOT_RUNNING
        else
How about logging the output of:
    lsof -p `/bin/cat $OCF_RESKEY_pid`
    lsof -c dovecot
    netstat -tupan
to a temporary file, say /tmp/dovecot.monitor.log?
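One way to collect those snapshots is sketched below, under the assumption that appending to a single log file is acceptable. The helper name `log_snapshot` is hypothetical. It runs whatever command it is given and records the time, the command line, the output and the exit status.

```shell
#!/bin/sh
# Hypothetical helper: append a timestamped snapshot of a command's output
# to a log file. Usage: log_snapshot LOGFILE COMMAND [ARGS...]
log_snapshot() {
    logfile=$1
    shift
    {
        echo "=== $(date) === $*"
        "$@" 2>&1                    # run the command, capture stdout and stderr
        echo "--- exit status: $?"   # status of the command just run
    } >>"$logfile"
}

# Intended use inside the monitor action (log path as suggested above):
# log_snapshot /tmp/dovecot.monitor.log lsof -p "$(cat "$OCF_RESKEY_pid")"
# log_snapshot /tmp/dovecot.monitor.log lsof -c dovecot
# log_snapshot /tmp/dovecot.monitor.log netstat -tupan
```

Recording the exit status next to the output makes it possible to tell an empty result from a command that failed to run.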
            if [ "`/bin/netstat -tupan | /bin/grep dovecot | /bin/grep $OCF_RESKEY_bindaddr | /usr/bin/wc -l`" -ne 0 ]; then
                exit $OCF_SUCCESS
            else
                echo "DOVECOT STOPPED - NO LISTEN [`/bin/netstat -tupan | /bin/grep dovecot`]" | /usr/bin/logger -p local0.info -t DOVECOT-MONITOR -i
                exit $OCF_ERR_GENERIC
            fi
        fi
    fi
    exit $OCF_SUCCESS
    ;;
Steffen Kaiser