Re: [Dovecot] Time moved backwards
Hi people,
From: Adam McDougall mcdouga9@egr.msu.edu
I would just like to mention a circumstance that happened to me this Sunday. We had a total power outage in our building, longer than our UPSes could last, and we don't have a generator for servers (nor is it economical or needed). When the power came back on, my local NTP server came up at the same time as my mail servers, as well as a majority of my other servers. My servers tried to step their time to be in sync with my local NTP server, which was still busy trying to sync itself with outside sources, which takes a while, so my mail servers did not get an answer. Later, dovecot died because the time finally synced, and I found out why pretty quickly (I have seen this before), but this was an unusual situation.
My point is, we had an unusual circumstance, and even though I've taken steps to have my mail servers sync their time at boot and run ntpd afterwards, there are some circumstances in which this is not enough, and dovecot still died. It's not always because someone was lazy about their time setup.
My point exactly. It's amazing how some people are quick to ramble about someone else's administrative incompetence without taking time to read the situation. (One person even suggested hacking the dovecot startup script to run ntpdate -- useless as ntpd already occupies the ports).
Fact is, ntpd can take an unpredictable amount of time before the initial time step. That delay can't be controlled, and it would be unreasonable to delay starting mail services until it is guaranteed to complete. Then dovecot dies, and the admin (who is not always immediately available) has to start it manually anyway (especially as it is not clear what to do with possibly unsynced timestamps) -- only after the unnecessary downtime. So, the question is: why on earth can't we add a single line of code to dovecot to restart itself after terminating?
Kind of reminds me of the "fsck_y_enable=YES" option in rc.conf. Without it, if fsck does not like something during reboot, the server just sits at the single-user prompt, waiting for (expensive) console operations.
Best wishes Eugene
At 10:20 PM +0400 5/14/08, Eugene wrote:
My point exactly. It's amazing how some people are quick to ramble about someone else's administrative incompetence without taking time to read the situation.
I most certainly did read your description of the situation, and my use of the phrase "administrative incompetence" should not be taken personally. I did not say (or mean) "administrator incompetence" and would not try to make that sort of judgment at a distance.
(One person even suggested hacking the dovecot startup script to run ntpdate -- useless as ntpd already occupies the ports).
That's one of the things that "ntpdate -u" is good for.
Fact is, ntpd can take an unpredictable amount of time before the initial time step. That delay can't be controlled, and it would be unreasonable to delay starting mail services until it is guaranteed to complete. Then dovecot dies, and the admin (who is not always immediately available) has to start it manually anyway (especially as it is not clear what to do with possibly unsynced timestamps) -- only after the unnecessary downtime.
Or you can have an external watchdog that re-launches Dovecot if it dies. This approach handles a broader set of failure modes and on some OS's is a built-in feature of the startup subsystem.
Dovecot may be running in an environment with an external watchdog, perhaps one like launchd or classical SysV/Solaris init, that can catch the exit of the process it spawned and use it to trigger an immediate respawn. This means that adding an internal respawn inside Dovecot that will not cause breakage on any system is not as simple as it may seem.
So, the question is: why on earth can't we add a single line of code to dovecot to restart itself after terminating?
You can do just that yourself if you believe that it is the best option for your circumstances and adequate to handle the problem you are having. One line of code might well do the trick you want on your system. If Timo puts the functionality in the code he distributes, it will need to be a great deal more than one line of code.
Kind of reminds me of the "fsck_y_enable=YES" option in rc.conf. Without it, if fsck does not like something during reboot, the server just sits at the single-user prompt, waiting for (expensive) console operations.
Which is actually the right choice in some circumstances.
--
Bill Cole
bill@scconsult.com
The problem I see is that an external script that *unconditionally* relaunches dovecot could be a terrible problem. It's better for dovecot to do it itself for this particular failure, because it's the only one that knows that it was just a date issue and that relaunching is safe.
But as Timo has explained, simply relaunching might *not* be safe.
johannes
On May 15, 2008, at 5:12 PM, Neal Becker wrote:
The problem I see is that an external script that *unconditionally* relaunches dovecot could be a terrible problem. It's better for dovecot to do it itself for this particular failure, because it's the only one that knows that it was just a date issue and that relaunching is safe.
If someone wants to code this relaunching, feel free to do it and if
the code looks good I'll include it.
I'll maybe try fixing this some other way for v1.2 (verify all
timestamp comparison code, make timeout handling work if clock moves,
log only a warning. I'm not sure if I'm going to use a monotonic
clock, since I'd like to know when time moves backwards or too much
forwards and add some hooks there to do things like update dotlock
file mtimes).
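As a rough illustration of the trade-off mentioned here (this is not Dovecot's code, and not necessarily what v1.2 will do), a process can keep both clocks: the monotonic clock for timeouts, and a comparison of the two to notice when the wall clock has been stepped. The class name `ClockWatcher` and the `tolerance` parameter are made up for this sketch.

```python
import time


class ClockWatcher:
    """Detect wall-clock steps by comparing against the monotonic clock.

    Hypothetical helper for illustration only; names are not from Dovecot.
    """

    def __init__(self, tolerance=1.0, wall=time.time, mono=time.monotonic):
        self.tolerance = tolerance  # seconds of disagreement to ignore
        self._wall = wall
        self._mono = mono
        self._last_wall = wall()
        self._last_mono = mono()

    def check(self) -> float:
        """Return the wall-clock step since the previous check.

        Negative: the wall clock moved backwards.
        Positive: it jumped forwards.
        0.0: the two clocks agree within the tolerance.
        """
        now_wall, now_mono = self._wall(), self._mono()
        # Elapsed wall time minus elapsed monotonic time is the size of the step.
        step = (now_wall - self._last_wall) - (now_mono - self._last_mono)
        self._last_wall, self._last_mono = now_wall, now_mono
        return step if abs(step) > self.tolerance else 0.0
```

A caller could run `check()` from its main loop and, on a non-zero result, log a warning and run whatever hooks it needs (for example refreshing lock file mtimes) instead of exiting.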
At 10:12 AM -0400 5/15/08, Neal Becker wrote:
The problem I see is that an external script that *unconditionally* relaunches dovecot could be a terrible problem. It's better for dovecot to do it itself for this particular failure, because it's the only one that knows that it was just a date issue and that relaunching is safe.
That certainly does not need to be the case. Dovecot does log the reason in a trivially parsed manner, so a purpose-built watchdog could rather easily detect this particular failure mode. One truly simple change that could be made that would facilitate restarting under this special situation would be to have a specific exit value for Dovecot self-destructing in a time reversal, so a model where a parent process (e.g. launchd) is playing the watchdog role could use the exit value to decide whether to relaunch. That would be less likely to run into conflict with existing practice than internal logic terminating the existing processes and relaunching.
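To make the exit-value idea concrete, here is a minimal sketch of a parent process acting as a conditional watchdog. It assumes a dedicated exit status for "time moved backwards", which does not exist in Dovecot as discussed in this thread; the value 75 and the names `TIME_REVERSAL_EXIT`, `should_relaunch` and `supervise` are placeholders for illustration.

```python
import subprocess
import time

# Hypothetical exit status meaning "terminated because time moved backwards".
# No such value is defined anywhere in this thread; 75 is an arbitrary placeholder.
TIME_REVERSAL_EXIT = 75


def should_relaunch(exit_code: int) -> bool:
    """Relaunch only for the one failure mode known to be a clock issue."""
    return exit_code == TIME_REVERSAL_EXIT


def supervise(argv, max_restarts=3, pause=1.0):
    """Run argv in the foreground and relaunch it only after a time-reversal exit.

    Returns the exit code of the last run.
    """
    restarts = 0
    while True:
        code = subprocess.call(argv)
        if not should_relaunch(code) or restarts >= max_restarts:
            return code
        restarts += 1
        time.sleep(pause)  # avoid a tight respawn loop if the clock keeps jumping
```

Any other exit status is passed straight back to the caller, so crashes with an unknown cause are not blindly restarted. The restart cap is there for the same reason.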
On the other hand, a more subtle handling of this issue internally without terminating at all is probably the best approach, since only Dovecot itself can really know whether an immediate relaunch after a time reversal is really safe or how to make it so.
For the specific problem of "infant mortality" at boot time that initiated this thread, the best approach is still prevention. Dovecot is far from the only daemon that will run into trouble if time jumps backwards, and there are widely used approaches (such as blocking the startup procedure on a successful ntpdate and using sound hardware whose clock doesn't drift too much in the first place) that minimize the risk of time reversal after sensitive daemons have started. If the problem of time stepping backwards after boot is really *common* then it may well be a dangerous cosmetic approach to just make Dovecot auto-recover (internally or externally) because it happens to be the only daemon that watches for and reacts to such an event. It is impossible to prevent every backwards time step, but preventing the predictable cases system-wide is a sounder approach than making one daemon adapt to what should be a very rare event.
-- Bill Cole bill@scconsult.com
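For readers who want to see what "blocking the startup procedure on a successful ntpdate" could look like, here is a small sketch. It assumes the sync command exits non-zero when it fails to set the clock; the function name `wait_for_sync`, the retry counts and the server name in the usage note are all illustrative.

```python
import subprocess
import time


def wait_for_sync(sync_cmd, attempts=10, delay=30.0):
    """Run sync_cmd until it exits 0.

    Returns True as soon as one attempt succeeds, False if every attempt fails.
    """
    for attempt in range(attempts):
        if subprocess.call(sync_cmd) == 0:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the time server a chance to finish syncing
    return False
```

A boot script could call something like `wait_for_sync(["ntpdate", "-u", "ntp.example.org"])` (hypothetical server name) and start time-sensitive daemons only once it returns True. What to do when it returns False is a local policy decision.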
Here's another thought:
From man ntpd:
If the -x option is included on the command line, the clock will never be stepped and only slew corrections will be used.
The issues should be carefully explored before deciding to use the -x option. The maximum slew rate possible is limited to 500 parts-per-million (PPM) as a consequence of the correctness principles on which the NTP protocol and algorithm design are based. As a result, the local clock can take a long time to converge to an acceptable offset, about 2,000 s for each second the clock is outside the acceptable range. During this interval the local clock will not be consistent with any other network clock and the system cannot be used for distributed applications that require correctly synchronized network time.
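The "about 2,000 s for each second" figure follows directly from the 500 PPM limit, and a few lines of arithmetic show what it means for larger offsets. The function name `slew_seconds` is just for this sketch.

```python
MAX_SLEW_PPM = 500  # maximum slew rate from the man page excerpt above


def slew_seconds(offset_seconds: float, slew_ppm: float = MAX_SLEW_PPM) -> float:
    """Seconds of real time needed to slew away an offset at the given rate."""
    # At N PPM the clock gains or loses N microseconds per second of real time.
    return abs(offset_seconds) * 1_000_000 / slew_ppm


for offset in (1, 60, 3600):
    secs = slew_seconds(offset)
    print(f"{offset:>5} s offset: {secs:>10.0f} s ({secs / 86400:.2f} days)")
```

So a clock that is an hour off after a long outage would need roughly 83 days to converge by slewing alone, which is why -x is not a drop-in fix for the boot-time case discussed in this thread.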
participants (5): Bill Cole, Eugene, Johannes Berg, Neal Becker, Timo Sirainen