[Dovecot] Server power loss and "Dovecot is already running with PID xxx"
Hi,
I'm running Dovecot 1.0.7 (with various patches) on CentOS 5.2.
The server has suffered a couple of power loss events. Dovecot is run as a standalone server.
The problem is that dovecot refuses to start up at boot because the PID file from before the power loss is left behind. The message is as follows:
$ /sbin/service dovecot start
Starting Dovecot Imap: Error: Dovecot is already running with PID 10825 (read from /var/run/dovecot/master.pid)
Fatal: Invalid configuration in /etc/dovecot.conf
[FAILED]

(Note: there is nothing wrong in the configuration file so the error message is somewhat misleading.)
I looked at the release notes of 1.0.xx releases and they didn't mention this.
Is this already a known problem? Should the start-up logic be made more robust (e.g. check whether a process corresponding to the PID actually exists)?
-- 
Pekka Savola, Netcore Oy
Systems. Networks. Security.
"You each name yourselves king, yet the kingdom bleeds."
-- George R.R. Martin: A Clash of Kings
On Tue, 2008-07-01 at 00:14 +0300, Pekka Savola wrote:
> $ /sbin/service dovecot start
> Starting Dovecot Imap: Error: Dovecot is already running with PID 10825 (read from /var/run/dovecot/master.pid)
> Fatal: Invalid configuration in /etc/dovecot.conf
> [FAILED]
>
> (Note: there is nothing wrong in the configuration file so the error message is somewhat misleading.)
Yes, it's a bit misleading. But I don't think I'll bother fixing it before rewriting the master/config handling for v2.0.
> Is this already a known problem? Should the start-up logic be made more robust (e.g. check whether a process corresponding to the PID actually exists)?
It already checks if the PID exists, but it doesn't check what that process is (and I don't think there is a portable way to do it anyway). I don't think it's too much to ask to delete the master.pid if in rare situations it fails to start due to a PID conflict.
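To illustrate the "what is that process" part: the following is a minimal, Linux-only sketch in Python, not anything Dovecot does. It assumes a /proc filesystem with a per-PID cmdline file, which is exactly the kind of non-portable check referred to above. The function names and the proc_root parameter are made up for this example; proc_root exists only so the functions can be tried against a fake directory tree.

```python
import os

def process_name(pid, proc_root="/proc"):
    """Return the basename of argv[0] for pid, or None if it cannot be read.

    Linux-specific assumption: <proc_root>/<pid>/cmdline holds the
    NUL-separated argument list of the process.
    """
    try:
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as f:
            argv0 = f.read().split(b"\0", 1)[0]
    except OSError:
        return None              # no such PID, or not readable
    if not argv0:
        return None              # e.g. a kernel thread with an empty cmdline
    return os.path.basename(argv0.decode(errors="replace"))

def pid_belongs_to(pid, expected_name, proc_root="/proc"):
    """True only if pid exists and its command name equals expected_name."""
    return process_name(pid, proc_root) == expected_name
```

With something like this, a start-up script could treat master.pid as stale when pid_belongs_to(pid, "dovecot") is false, but only on systems that provide /proc in this form.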
On Jul 1, 2008, at 12:51 AM, Timo Sirainen wrote:
> > Is this already a known problem? Should the start-up logic be made more robust (e.g. check whether a process corresponding to the PID actually exists)?
>
> It already checks if the PID exists, but it doesn't check what that process is (and I don't think there is a portable way to do it anyway). I don't think it's too much to ask to delete the master.pid if in rare situations it fails to start due to a PID conflict.
This is a pet peeve of mine for many services started at boot time.
Since the ordering of service startup is usually fairly static, a
*LOT* of times process IDs are nearly identical from one boot to the
next. Depending on which way they drift, the PID recorded before the
crash will already be in use by some other process when your service
starts. This drove me NUTS with Sun's LDAP server.
Many recent OSes are now using memory-based filesystems for /var/run,
or otherwise clear out /var/run at boot time. But if a process stores
its PID somewhere else, you're SOL (much like Sun One Directory Server
does).
The problem with having to remove a master.pid file on boot is that
you might have a BUNCH of clients or customers that are using your
system, and you're either asleep at 3am when the server kicked over,
or in another state. It's not a problem if you have staff watching
machines reboot. ;-)
Sorry, had to kibitz.
Sean
PS I often add an 'rm $PID' line to the init.d script, and let the
server die if it can't bind to the port. That doesn't work
with everything, though.
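A rough sketch of the idea in the PS, written in Python rather than as an init.d fragment: remove whatever PID file is lying around, then rely on the port bind to fail if a real instance is already running. The function name, the host and port arguments, and the PID file handling are all assumptions for illustration; this is not taken from any actual init script.

```python
import errno
import os
import socket

def start_after_clearing_pidfile(pid_path, host, port):
    """Drop any old PID file, then try to bind the service port.

    Returns the listening socket, or None if the address is already in
    use, which is what a second copy of the server would run into.
    """
    try:
        os.remove(pid_path)          # the 'rm $PID' step
    except FileNotFoundError:
        pass                         # nothing was left behind

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None              # a live instance owns the port
        raise

    with open(pid_path, "w") as f:   # record our own PID only after binding
        f.write(f"{os.getpid()}\n")
    return sock
```

Note the weakness this shares with the init.d trick: if a live instance does exist, its PID file has already been deleted by the time the bind fails, and a service that doesn't bind a fixed port gets no protection at all.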
On Tue, 1 Jul 2008, Timo Sirainen wrote:
> > $ /sbin/service dovecot start
> > Starting Dovecot Imap: Error: Dovecot is already running with PID 10825 (read from /var/run/dovecot/master.pid)
> > Fatal: Invalid configuration in /etc/dovecot.conf
> > [FAILED]
> >
> > (Note: there is nothing wrong in the configuration file so the error message is somewhat misleading.)
>
> Yes, it's a bit misleading. But I don't think I'll bother fixing it before rewriting the master/config handling for v2.0.
>
> > Is this already a known problem? Should the start-up logic be made more robust (e.g. check whether a process corresponding to the PID actually exists)?
>
> It already checks if the PID exists, but it doesn't check what that process is (and I don't think there is a portable way to do it anyway). I don't think it's too much to ask to delete the master.pid if in rare situations it fails to start due to a PID conflict.
Getting back to this after another power loss.
It doesn't seem that the current logic is working; there is no program with the PID that's in master.pid, and dovecot (1.0.7 + RHEL patches) refuses to start.

root: /root$ /sbin/service dovecot start
Starting Dovecot Imap: Error: Dovecot is already running with PID 2746 (read from /var/run/dovecot/master.pid)
Fatal: Invalid configuration in /etc/dovecot.conf
[FAILED]
root: /root$ more /var/run/dovecot/master.pid
2746
root: /root$ ps auxw | grep 2746
root 31714 0.0 0.1 4116 584 pts/1 R+ 20:19 0:00 grep 2746
-- 
Pekka Savola, Netcore Oy
Systems. Networks. Security.
"You each name yourselves king, yet the kingdom bleeds."
-- George R.R. Martin: A Clash of Kings
On Aug 2, 2008, at 8:22 PM, Pekka Savola wrote:
> It doesn't seem that the current logic is working; there is no program with the PID that's in master.pid, and dovecot (1.0.7 + RHEL patches) refuses to start.
>
> root: /root$ /sbin/service dovecot start
> Starting Dovecot Imap: Error: Dovecot is already running with PID 2746 (read from /var/run/dovecot/master.pid)
> Fatal: Invalid configuration in /etc/dovecot.conf
> [FAILED]
> root: /root$ more /var/run/dovecot/master.pid
> 2746
> root: /root$ ps auxw | grep 2746
> root 31714 0.0 0.1 4116 584 pts/1 R+ 20:19 0:00 grep 2746
SELinux perhaps? It checks this by kill()ing the process and seeing if
it returns ESRCH. If not, it assumes the process exists. If you have
SELinux enabled, perhaps it always returns EPERM for the call.
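For readers following along, the probe described above can be sketched like this. It is a Python illustration, not Dovecot's code, and it assumes the check is made with signal 0 (the usual way to test for a process without actually signalling it). The names probe_pid and assumed_running are made up for the example.

```python
import os

def probe_pid(pid):
    """Classify pid with a signal-0 kill(): no signal is delivered,
    only the existence and permission checks are performed."""
    if pid <= 0:
        raise ValueError("pid must be positive")  # 0 and negatives address process groups
    try:
        os.kill(pid, 0)
    except ProcessLookupError:   # errno ESRCH: no such process
        return "ESRCH"
    except PermissionError:      # errno EPERM: exists, but we may not signal it
        return "EPERM"
    return "OK"

def assumed_running(pid):
    """Mirror the rule quoted above: anything other than ESRCH counts as running."""
    return probe_pid(pid) != "ESRCH"
```

Under this rule an EPERM result is treated the same as success, so any process that happens to hold the recorded PID, or any policy that answers EPERM, makes the PID file look live.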
On Mon, 4 Aug 2008, Timo Sirainen wrote:
> > It doesn't seem that the current logic is working; there is no program with the PID that's in master.pid, and dovecot (1.0.7 + RHEL patches) refuses to start.
> >
> > root: /root$ /sbin/service dovecot start
> > Starting Dovecot Imap: Error: Dovecot is already running with PID 2746 (read from /var/run/dovecot/master.pid)
> > Fatal: Invalid configuration in /etc/dovecot.conf
> > [FAILED]
> > root: /root$ more /var/run/dovecot/master.pid
> > 2746
> > root: /root$ ps auxw | grep 2746
> > root 31714 0.0 0.1 4116 584 pts/1 R+ 20:19 0:00 grep 2746
>
> SELinux perhaps? It checks this by kill()ing the process and seeing if it returns ESRCH. If not, it assumes the process exists. If you have SELinux enabled, perhaps it always returns EPERM for the call.
'getenforce' says disabled, so no. This is pretty strange -- I looked at the code, basically duplicated the logic there, and could not reproduce the problem with a smaller piece of code. It also doesn't happen every time: I killed dovecot with the KILL signal (leaving the PID file behind), and after that it started up without problems. Unless you have other ideas about what to look for, I guess this will remain a mystery.
-- 
Pekka Savola, Netcore Oy
Systems. Networks. Security.
"You each name yourselves king, yet the kingdom bleeds."
-- George R.R. Martin: A Clash of Kings
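The KILL experiment described in the message above can be approximated with a small stand-alone script. This is only a sketch of that kind of test, not the code that was actually used: it stands a sleeping child process in for the server, writes the child's PID to a temporary file, kills it, and probes the recorded PID before and after. All names here are made up for the example.

```python
import os
import signal
import subprocess
import sys
import tempfile

def pidfile_claims_running(pid_path):
    """True if the PID stored in pid_path passes a signal-0 probe."""
    with open(pid_path) as f:
        pid = int(f.read().strip())
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False                 # ESRCH: nothing has that PID
    except PermissionError:
        return True                  # EPERM: someone else's process has it
    return True

def kill_and_recheck():
    """Leave a PID file behind after SIGKILL; return the probe result before and after."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = os.path.join(tmp, "master.pid")
        with open(pid_path, "w") as f:
            f.write(f"{child.pid}\n")
        before = pidfile_claims_running(pid_path)
        child.send_signal(signal.SIGKILL)
        child.wait()                 # reap the child so its PID really disappears
        after = pidfile_claims_running(pid_path)
    return before, after
```

On an otherwise idle system the stale file is normally detected after the kill, which is consistent with the clean restart reported above. After a reboot, by contrast, the recorded PID may already belong to some unrelated process.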
On Mon, 04 Aug 2008 at 12:40 +0300, Pekka Savola wrote:
> On Mon, 4 Aug 2008, Timo Sirainen wrote:
> > > It doesn't seem that the current logic is working; there is no program with the PID that's in master.pid, and dovecot (1.0.7 + RHEL patches) refuses to start.
> > >
> > > root: /root$ /sbin/service dovecot start
> > > Starting Dovecot Imap: Error: Dovecot is already running with PID 2746 (read from /var/run/dovecot/master.pid)
> > > Fatal: Invalid configuration in /etc/dovecot.conf
> > > [FAILED]
> > > root: /root$ more /var/run/dovecot/master.pid
> > > 2746
> > > root: /root$ ps auxw | grep 2746
> > > root 31714 0.0 0.1 4116 584 pts/1 R+ 20:19 0:00 grep 2746
> >
> > SELinux perhaps? It checks this by kill()ing the process and seeing if it returns ESRCH. If not, it assumes the process exists. If you have SELinux enabled, perhaps it always returns EPERM for the call.
>
> 'getenforce' says disabled, so no. This is pretty strange -- I looked at the code, basically duplicated the logic there, and could not reproduce the problem with a smaller piece of code. It also doesn't happen every time: I killed dovecot with the KILL signal (leaving the PID file behind), and after that it started up without problems. Unless you have other ideas about what to look for, I guess this will remain a mystery.
The init script installed for dovecot in RHEL is not so perfect; try using the one from Fedora (http://cvs.fedoraproject.org/viewcvs/rpms/dovecot/devel/dovecot.init?rev=1.6&view=auto). A new init script will be added in RHEL 5.3.
Dan
-- Fedora and Red Hat package maintainer
participants (4): Dan Horák, Pekka Savola, Sean Kamath, Timo Sirainen