Bill Cole wrote:
At 10:20 PM +0400 5/14/08, Eugene wrote:
Hi people,
From: Adam McDougall <mcdouga9@egr.msu.edu>
I would just like to mention a circumstance that happened to me this Sunday. We had a total power outage in our building, longer than our UPSes could last, and we don't have a generator for servers (nor is it economical or needed). When the power came back on, my local NTP server came up at the same time as my mail servers, as well as a majority of my other servers. My servers tried to step their time to be in sync with my local NTP server, which was still busy trying to sync itself with outside sources, which takes a while, so my mail servers did not get an answer. Later, dovecot died because the time finally synced, and I found out why pretty quickly (I have seen this before), but this was an unusual situation.
My point is, we had an unusual circumstance, and even though I've taken steps to have my mail servers sync their time at boot and run ntpd afterwards, there are some circumstances in which this is not enough, and dovecot still died. It's not always because someone was lazy about their time setup.
My point exactly. It's amazing how some people are quick to ramble about someone else's administrative incompetence without taking time to read the situation.
I most certainly did read your description of the situation, and my use of the phrase "administrative incompetence" should not be taken personally. I did not say (or mean) "administrator incompetence" and would not try to make that sort of judgment at a distance.
(One person even suggested hacking the dovecot startup script to run ntpdate -- useless as ntpd already occupies the ports).
That's one of the things that "ntpdate -u" is good for.
Fact is, ntpd can take an unpredictable amount of time before the initial time step. That delay can't be controlled, and it would be unreasonable to hold back mail services until the step is guaranteed to have completed. Then dovecot dies, and the admin (who is not always immediately available) has to start it manually anyway (especially as it is not clear what to do with possibly unsynced timestamps) -- only after the unnecessary downtime.
Or you can have an external watchdog that re-launches Dovecot if it dies. This approach handles a broader set of failure modes and on some OSes is a built-in feature of the startup subsystem.
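For anyone who wants to see what such a watchdog amounts to, here is a minimal sketch in Python. It is not how launchd or init do it, and nothing here comes from Dovecot itself: the function name supervise, the restart cap and the delay are all made up for illustration. It runs a command in the foreground, relaunches it when it exits with a non-zero status, and gives up after a fixed number of relaunches so that a crash loop cannot spin forever.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=5, delay=1.0):
    """Run cmd in the foreground and relaunch it whenever it fails.

    Returns the list of exit statuses seen. The cap on relaunches
    (max_restarts) is an illustrative safety limit, not a standard value.
    """
    statuses = []
    while True:
        status = subprocess.call(cmd)  # blocks until the child exits
        statuses.append(status)
        if status == 0:
            break  # clean shutdown: do not relaunch
        if len(statuses) > max_restarts:
            break  # too many failures: stop and leave it to the admin
        time.sleep(delay)  # brief pause so a crash loop does not spin
    return statuses


if __name__ == "__main__" and len(sys.argv) > 1:
    # The command to supervise is whatever is passed on the command line.
    print(supervise(sys.argv[1:]))
```

The supervised program has to stay in the foreground for this to work, since the watchdog only notices the exit of the process it started itself.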
Because Dovecot may be running in an environment with an external watchdog, perhaps one like launchd or classical SysV/Solaris init that can catch the exit of the process it spawned and use it to trigger an immediate respawn, adding an internal respawn inside Dovecot that will not cause breakage on any system is not as simple as it may seem.
So, the question is: why on earth can't we add a single line of code to dovecot to restart itself after terminating?
You can do just that yourself if you believe that it is the best option for your circumstances and adequate to handle the problem you are having. One line of code might well do the trick you want on your system. If Timo puts the functionality in the code he distributes, it will need to be a great deal more than one line of code.
The problem I see is that an external script that *unconditionally* relaunches dovecot could be a terrible problem. It's better for dovecot to do it itself for this particular failure, because it's the only one that knows that it was just a date issue, and that relaunching is safe.
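To make the difference between unconditional and conditional relaunching concrete, here is a rough Python sketch of a relauncher that restarts the daemon only when the daemon itself says the exit was caused by a clock step. Everything specific in it is an assumption: EXIT_CLOCK_STEPPED is a hypothetical exit status invented for this example (I am not claiming Dovecot reports the condition this way), and run_with_policy and its restart cap are likewise made up.

```python
import subprocess

# Hypothetical exit status meaning "stopped only because the clock
# stepped backwards". Invented for this sketch; not a Dovecot value.
EXIT_CLOCK_STEPPED = 75


def run_with_policy(cmd, max_restarts=3):
    """Relaunch cmd only when it exits with EXIT_CLOCK_STEPPED.

    Returns (number_of_launches, last_exit_status). Any other exit
    status, including a clean 0, ends the loop without a relaunch.
    """
    launches = 0
    while True:
        status = subprocess.call(cmd)  # blocks until the child exits
        launches += 1
        if status != EXIT_CLOCK_STEPPED:
            return launches, status  # unknown cause: do not relaunch blindly
        if launches > max_restarts:
            return launches, status  # clock keeps stepping: give up
```

The point of the sketch is only that the decision needs information from the daemon. Without some signal like this, an outside script cannot tell a harmless date issue from a failure where relaunching would do damage.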