[Dovecot] Time moved backwards

Eugene

13 May 2008

10:13 a.m.

Hello,

I would like to suggest a change in handling of 'Time moved backwards' problem. Right now dovecot just dies. So, the scenario:

Colocation server is shut down for some reason. The internal time drifts.
Server is started again.
Dovecot starts successfully.
In about a minute, NTP daemon feels confident about adjusting the system time.
Dovecot sees the changed time and dies.
Admin has to notice that, log in and restart Dovecot manually.

I suggest that Dovecot simply terminate the current connections (causing the client to reconnect) or -- if the time change is really that much of a problem -- to restart itself automatically. Maybe a config option could be introduced.

Best wishes Eugene

Timo Sirainen

13 May

10:23 a.m.

On Tue, 2008-05-13 at 11:13 +0400, Eugene wrote:

...

I suggest that Dovecot simply terminate the current connections (causing the client to reconnect) or -- if the time change is really that much of a problem -- to restart itself automatically.

I guess terminating all current connections and restarting all processes would be pretty safe, but it's not really a high priority change for me..

...

Maybe a config option could be introduced.

There are too many settings already.

"Andraž 'ruskie' Levstik"

10:30 a.m.

On 09:23:57 2008-05-13 Timo Sirainen <tss@iki.fi> wrote:

...

On Tue, 2008-05-13 at 11:13 +0400, Eugene wrote:

...
I suggest that Dovecot simply terminate the current connections (causing the client to reconnect) or -- if the time change is really that much of a problem -- to restart itself automatically.
I guess terminating all current connections and restarting all processes would be pretty safe, but it's not really a high priority change for me..

...
Maybe a config option could be introduced.

There are too many settings already.

Or simply launch ntpd with the -s or whatever the appropriate switch to adjust time on bootup and ensure it starts pre-dovecot...

-- Andraž "ruskie" Levstik Source Mage GNU/Linux Games grimoire guru Geek/Hacker/Tinker

Be sure brain is in gear before engaging mouth. Ryle hira.

Key id = F4C1F89C Key fingerprint = 6FF2 8F20 4C9D DB36 B5B6 F134 884D 72CC F4C1 F89C

Eugene

10:31 a.m.

Hi Timo,

From: "Timo Sirainen" <tss@iki.fi>

...

...
I suggest that Dovecot simply terminate the current connections (causing the client to reconnect) or -- if the time change is really that much of a problem -- to restart itself automatically.

I guess terminating all current connections and restarting all processes would be pretty safe, but it's not really a high priority change for me..

Nevertheless, it would be very nice if you could fix it. It's a fairly big availability problem (for us, at least). And after all, if we are terminating already, adding a simple spawn call before that should not take much time?

Best wishes Eugene

Charles Marcus

1:20 p.m.

On 5/13/2008, Eugene (genie@geniechka.ru) wrote:

...

...
I guess terminating all current connections and restarting all processes would be pretty safe, but it's not really a high priority change for me..

...

Nevertheless, it would be very nice if you could fix it. It's a fairly big availability problem (for us, at least).

The problem is not so much how dovecot deals with this issue, the problem is, why is your server having such drastic problems keeping its time sane?

Fix that, and your problem disappears.

Best regards,

Charles

Adam McDougall

10:13 p.m.

Charles Marcus wrote:

...

On 5/13/2008, Eugene (genie@geniechka.ru) wrote:

...
...
I guess terminating all current connections and restarting all processes would be pretty safe, but it's not really a high priority change for me..

...
Nevertheless, it would be very nice if you could fix it. It's a fairly big availability problem (for us, at least).

The problem is not so much how dovecot deals with this issue, the problem is, why is your server having such drastic problems keeping its time sane?

Fix that, and your problem disappears.

I would just like to mention a circumstance that happened to me this Sunday. We had a total power outage in our building, longer than our UPSes could last, and we don't have a generator for servers (nor is it economical or needed). When the power came back on, my local NTP server came on at the same time as my mail servers, as well as a majority of my other servers. My servers tried to step their time to be in sync with my local NTP server, which was still busy trying to sync itself with outside sources, which takes a while, so my mail servers did not get an answer. Later, dovecot died because the time finally synced, and I found out why pretty quickly (I have seen this before), but this was an unusual situation.

My point is, we had an unusual circumstance, and even though I've taken steps to have my mail servers sync their time at boot and run ntpd afterwards, there are some circumstances in which this is not enough, and dovecot still died. It's not always because someone was lazy about their time setup. But it doesn't cause me "big availability problems" since in general, my time is fine.

Scott Silva

10:33 p.m.

on 5-13-2008 12:13 PM Adam McDougall spake the following:

...

Charles Marcus wrote:

...
On 5/13/2008, Eugene (genie@geniechka.ru) wrote:

...
...
I guess terminating all current connections and restarting all processes would be pretty safe, but it's not really a high priority change for me..

...
Nevertheless, it would be very nice if you could fix it. It's a fairly big availability problem (for us, at least).

The problem is not so much how dovecot deals with this issue, the problem is, why is your server having such drastic problems keeping its time sane?

Fix that, and your problem disappears.

I would just like to mention a circumstance that happened to me this Sunday. We had a total power outage in our building, longer than our UPSes could last, and we don't have a generator for servers (nor is it economical or needed). When the power came back on, my local NTP server came on at the same time as my mail servers, as well as a majority of my other servers. My servers tried to step their time to be in sync with my local NTP server, which was still busy trying to sync itself with outside sources, which takes a while, so my mail servers did not get an answer. Later, dovecot died because the time finally synced, and I found out why pretty quickly (I have seen this before), but this was an unusual situation.

My point is, we had an unusual circumstance, and even though I've taken steps to have my mail servers sync their time at boot and run ntpd afterwards, there are some circumstances in which this is not enough, and dovecot still died. It's not always because someone was lazy about their time setup. But it doesn't cause me "big availability problems" since in general, my time is fine.

This would be a good case for running ntpdate on startup at least on the ntp server. Just point it to a reliable outside server. AFAIR RedHat and clones do this in the init script for ntpd.

-- MailScanner is like deodorant... You hope everybody uses it, and you notice quickly if they don't!!!!

Bruce Bodger

10:37 p.m.

On 5/13/08 3:33 PM, Scott Silva wrote:

...

...
This would be a good case for running ntpdate on startup at least on the ntp server. Just point it to a reliable outside server. AFAIR RedHat and clones do this in the init script for ntpd.

...and how much more TIME shall we spend rechewing this non-dovecot issue?

B. Bodger

Bill Cole

4:48 p.m.

At 11:31 AM +0400 5/13/08, Eugene wrote:

...

Hi Timo,

From: "Timo Sirainen" <tss@iki.fi>

...
...
I suggest that Dovecot simply terminate the current connections (causing the client to reconnect) or -- if the time change is really that much of a problem -- to restart itself automatically.

...
I guess terminating all current connections and restarting all processes would be pretty safe, but it's not really a high priority change for me..

Nevertheless, it would be very nice if you could fix it. It's a fairly big availability problem (for us, at least).

Then you have a badly broken system. There is no explanation for time going backwards on a server on a frequent unplanned basis that is not reducible to administrative incompetence or malfunctioning hardware (and the latter as a chronic issue can be seen as just a special case of the former.)

...

And after all, if we are terminating already, adding a simple spawn call before that should not take much time?

A system clock that moves backwards is indicative of a problem. Having a service respawn itself as a response to a problem that is outside of its control (i.e. the respawn is not itself a fix) is begging for trouble, because that behavior has to be carefully controlled to prevent it from contributing to a cascading problem. On a system whose clock is untrustworthy, this is a significant challenge. The effort to do that sort of code correctly just to accommodate people with broken systems seems like a terrible waste.

On the other hand, writing a freestanding watchdog for a critical service is (or at least should be) something any good sysadmin can do. If you are stuck with hardware so broken that it jumps backwards in time without warning but not so broken that you can get it replaced, and it lives in a network or resource environment that prevents you from fixing the core problem, you can adapt to the breakage yourself.
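A freestanding watchdog of the kind described above could be sketched roughly as follows. This is a minimal illustration in Python, not anything shipped with Dovecot: the names `RestartThrottle`, `watchdog_step`, `max_restarts`, and `window` are invented for the example, and the liveness check and restart command are left to the caller. It measures intervals on the monotonic clock, so a wall-clock step does not confuse the rate limit, and it refuses to restart more than a few times per window so that the watchdog itself cannot feed a cascading failure.

```python
import time
from collections import deque


class RestartThrottle:
    """Allow at most max_restarts restarts within any span of `window` seconds."""

    def __init__(self, max_restarts=3, window=600.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window = window
        self.clock = clock        # monotonic: unaffected by wall-clock steps
        self._history = deque()   # monotonic times of recent restarts

    def allow(self):
        now = self.clock()
        # Forget restarts that have aged out of the window.
        while self._history and now - self._history[0] >= self.window:
            self._history.popleft()
        if len(self._history) >= self.max_restarts:
            return False          # too many recent restarts: leave it to a human
        self._history.append(now)
        return True


def watchdog_step(is_running, restart, throttle):
    """One polling pass. Returns 'ok', 'restarted', or 'gave_up'."""
    if is_running():
        return "ok"
    if throttle.allow():
        restart()
        return "restarted"
    return "gave_up"
```

In practice `is_running` and `restart` would wrap whatever the local init system provides, and `watchdog_step` would be called from a loop or a cron job. The `gave_up` result is the place to page an administrator rather than keep retrying.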

-- Bill Cole
bill@scconsult.com

"Andraž 'ruskie' Levstik"

4:58 p.m.

On 15:48:42 2008-05-13 Bill Cole <dovecot-20061108@billmail.scconsult.com> wrote:

...

At 11:31 AM +0400 5/13/08, Eugene wrote:

...
Hi Timo,

From: "Timo Sirainen" <tss@iki.fi>

...
...
I suggest that Dovecot simply terminate the current connections (causing the client to reconnect) or -- if the time change is really that much of a problem -- to restart itself automatically.

...
I guess terminating all current connections and restarting all processes would be pretty safe, but it's not really a high priority change for me..

Nevertheless, it would be very nice if you could fix it. It's a fairly big availability problem (for us, at least).

Then you have a badly broken system. There is no explanation for time going backwards on a server on a frequent unplanned basis that is not reducible to administrative incompetence or malfunctioning hardware (and the latter as a chronic issue can be seen as just a special case of the former.)

Harsh...

...

...
And after all, if we are terminating already, adding a simple spawn call before that should not take much time?

A system clock that moves backwards is indicative of a problem. Having a service respawn itself as a response to a problem that is outside of its control (i.e. the respawn is not itself a fix) is begging for trouble, because that behavior has to be carefully controlled to prevent it from contributing to a cascading problem. On a system whose clock is untrustworthy, this is a significant challenge. The effort to do that sort of code correctly just to accommodate people with broken systems seems like a terrible waste.

On the other hand, writing a freestanding watchdog for a critical service is (or at least should be) something any good sysadmin can do. If you are stuck with hardware so broken that it jumps backwards in time without warning but not so broken that you can get it replaced, and it lives in a network or resource environment that prevents you from fixing the core problem, you can adapt to the breakage yourself.

I use monit for monitoring services... It has worked great so far... auto restarts etc... depending on configuration, along with a web-based interface, one can see a quick overview and control things (the web interface is optional, of course).

-- Andraž "ruskie" Levstik Source Mage GNU/Linux Games grimoire guru Geek/Hacker/Tinker

Be sure brain is in gear before engaging mouth. Ryle hira.

Key id = F4C1F89C Key fingerprint = 6FF2 8F20 4C9D DB36 B5B6 F134 884D 72CC F4C1 F89C

Charles Marcus

5:29 p.m.

...

...
...
Nevertheless, it would be very nice if you could fix it. It's a fairly big availability problem (for us, at least).

...

...
Then you have a badly broken system. There is no explanation for time going backwards on a server on a frequent unplanned basis that is not reducible to administrative incompetence or malfunctioning hardware (and the latter as a chronic issue can be seen as just a special case of the former.)

...

Harsh...

Maybe... but it is still true...

Best regards,

Charles

Bill Cole

5:52 p.m.

At 3:58 PM +0200 5/13/08, Andraž 'ruskie' Levstik wrote:

...

On 15:48:42 2008-05-13 Bill Cole <dovecot-20061108@billmail.scconsult.com> wrote:

...
At 11:31 AM +0400 5/13/08, Eugene wrote:

...
Hi Timo,

From: "Timo Sirainen" <tss@iki.fi>

...
...
I suggest that Dovecot simply terminate the current connections (causing the client to reconnect) or -- if the time change is really that much of a problem -- to restart itself automatically.

...
I guess terminating all current connections and restarting all processes would be pretty safe, but it's not really a high priority change for me..

Nevertheless, it would be very nice if you could fix it. It's a fairly big availability problem (for us, at least).

Then you have a badly broken system. There is no explanation for time going backwards on a server on a frequent unplanned basis that is not reducible to administrative incompetence or malfunctioning hardware (and the latter as a chronic issue can be seen as just a special case of the former.)

Harsh...

I think it is not so harsh if you read what I wrote carefully.

Part of what I meant to convey was that the real circumstances of a clock jumping backwards ought to be rare and predictable, such as a long period unpowered, either long enough to drain the clock battery or just long enough for the system clock to drift more than 1/8 of a second. If your system clock doesn't stay pretty close across a regular reboot, you have a hardware problem (most likely a dead clock battery...)

I should probably also note that I did not use 'incompetence' as a generic term and it does not mean 'stupid' or 'bad' or anything else more general, vague, and pejorative.

-- Bill Cole bill@scconsult.com

Anton Yuzhaninov

11:51 a.m.

Timo Sirainen wrote:

...

On Tue, 2008-05-13 at 11:13 +0400, Eugene wrote:

...
I suggest that Dovecot simply terminate the current connections (causing the client to reconnect) or -- if the time change is really that much of a problem -- to restart itself automatically.

I guess terminating all current connections and restarting all processes would be pretty safe, but it's not really a high priority change for me..

IMHO more robust is to use clock_gettime(CLOCK_MONOTONIC, ..) for timeouts and just work fine even if time was changed via settimeofday().
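To illustrate the idea of timing against a monotonic clock, here is a minimal sketch in Python rather than Dovecot's C. The `Deadline` class is hypothetical and only shows the technique: a timeout computed from `time.monotonic()` keeps counting correctly when the wall clock is stepped, because the monotonic clock never goes backwards.

```python
import time


class Deadline:
    """A timeout measured on the monotonic clock instead of the wall clock."""

    def __init__(self, seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + seconds

    def remaining(self):
        """Seconds left before expiry, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        return self._clock() >= self._expires_at
```

A caller would create `Deadline(30.0)` when a connection goes idle and poll `expired()`. Monotonic readings have no calendar meaning, so they only replace the wall clock for measuring intervals, not for producing timestamps.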

-- WBR, Anton Yuzhaninov Rambler Mail

Timo Sirainen

11:56 a.m.

On Tue, 2008-05-13 at 12:51 +0400, Anton Yuzhaninov wrote:

...

Timo Sirainen wrote:

...
On Tue, 2008-05-13 at 11:13 +0400, Eugene wrote:

...
I suggest that Dovecot simply terminate the current connections (causing the client to reconnect) or -- if the time change is really that much of a problem -- to restart itself automatically.

I guess terminating all current connections and restarting all processes would be pretty safe, but it's not really a high priority change for me..

IMHO more robust is to use clock_gettime(CLOCK_MONOTONIC, ..) for timeouts and just work fine even if time was changed via settimeofday().

Two problems with that:

clock_gettime() doesn't work everywhere (e.g. OSX).
With the current design gettimeofday() has to be called anyway to get the current real timestamp, causing extra unnecessary work.

Anyway that's not the main problem. I'm more concerned about timestamp comparison code where one or both of the timestamps come from filesystem. It might cause some corruption, such as dotlock being deleted while another process still holds it.
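The filesystem-timestamp concern can be made concrete with a small sketch. This is not Dovecot's dotlock code; `lock_age`, `looks_stale`, and the numbers are hypothetical and only show why comparing "now" against a file's mtime is fragile when the two values do not come from one consistent clock.

```python
def lock_age(now, lock_mtime):
    """Apparent age in seconds of a lock file: wall-clock 'now' minus the file's mtime."""
    return now - lock_mtime


def looks_stale(now, lock_mtime, stale_timeout):
    """Naive staleness test. Only valid if 'now' and 'lock_mtime' share one sane clock."""
    return lock_age(now, lock_mtime) > stale_timeout


# Lock written at t=990, then the clock is stepped back to t=900:
# the age is negative, so the lock can never look stale by this test.
backwards_age = lock_age(900, 990)

# A process still using a pre-step reading (t=1000) sees a lock that was
# written after the step (mtime 905) as 95 seconds old, although it is fresh.
wrongly_stale = looks_stale(1000, 905, stale_timeout=60)
```

The second case is the dangerous one: a lock that looks stale gets deleted while its owner still believes it holds it. A monotonic clock does not help here, because the mtime is recorded in wall-clock time by the filesystem.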

Quentin Garnier

10:32 a.m.

On Tue, May 13, 2008 at 11:13:39AM +0400, Eugene wrote:

...

Hello,

I would like to suggest a change in handling of 'Time moved backwards' problem. Right now dovecot just dies. So, the scenario:

Colocation server is shut down for some reason. The internal time drifts.

Server is started again.

Dovecot starts successfully.

In about a minute, NTP daemon feels confident about adjusting the system time.

Dovecot sees the changed time and dies.

Admin has to notice that, login and restart Dovecot manually.

The admin should run ntpdate before launching ntpd and dovecot. ntpd will _never_ move time backwards under normal drifting conditions (it has other ways of coping with that).

-- Quentin Garnier - cube@cubidou.net - cube@NetBSD.org "See the look on my face from staying too long in one place [...] every time the morning breaks I know I'm closer to falling" KT Tunstall, Saving My Face, Drastic Fantastic, 2007.

Eugene

10:39 a.m.

Hello,

From: "Quentin Garnier" <cube@cubidou.net>

...

...

In about a minute, NTP daemon feels confident about adjusting the system time.

The admin should run ntpdate before launching ntpd and dovecot. ntpd will _never_ move time backwards under normal drifting conditions (it has other ways of coping with that).

Please read carefully. ntpd IS run before dovecot, but a change of time happens some time later. Of course, dovecot starting script can be hacked to sleep for some time, but it feels like a wrong way to solve a problem.

Best wishes Eugene

Quentin Garnier

11:21 a.m.

On Tue, May 13, 2008 at 11:39:54AM +0400, Eugene wrote:

...

Hello,

From: "Quentin Garnier" <cube@cubidou.net>

...
...

In about a minute, NTP daemon feels confident about adjusting the system time.

The admin should run ntpdate before launching ntpd and dovecot. ntpd will _never_ move time backwards under normal drifting conditions (it has other ways of coping with that).

Please read carefully. ntpd IS run before dovecot, but a change of time

I'm not the one having trouble reading, here. The proper way to start a system is to run ntp*date* (as early as possible) and then ntpd.

NTP documentation has even more details how to do this properly.

http://support.ntp.org/bin/view/Support/StartingNTP4#Section_7.1.1.

Eugene

11:48 a.m.

Hello,

From: "Quentin Garnier" <cube@cubidou.net>

...

I'm not the one having trouble reading, here. The proper way to start a system is to run ntp*date* (as early as possible) and then ntpd.

That's what you say, and it is far from being officially accepted. NTP project clearly deprecates ntpdate for several reasons. In addition, "the clock should not be stepped until a consistent offset has been observed for a sanity interval, currently 15 minutes". So ntpd may in principle step time again. http://www.ntp.org/ntpfaq/NTP-s-config.htm

Eugene

"Andraž 'ruskie' Levstik"

11:56 a.m.

On 10:48:28 2008-05-13 "Eugene" <genie@geniechka.ru> wrote:

...

Hello,

From: "Quentin Garnier" <cube@cubidou.net>

...
I'm not the one having trouble reading, here. The proper way to start a system is to run ntp*date* (as early as possible) and then ntpd.

That's what you say, and it is far from being officially accepted. NTP project clearly deprecates ntpdate for several reasons. In addition, "the clock should not be stepped until a consistent offset has been observed for a sanity interval, currently 15 minutes". So ntpd may in principle step time again. http://www.ntp.org/ntpfaq/NTP-s-config.htm

Eugene

Maybe you should use openntpd... that's what works fine for me...

-- Andraž "ruskie" Levstik Source Mage GNU/Linux Games grimoire guru Geek/Hacker/Tinker

Be sure brain is in gear before engaging mouth. Ryle hira.

Key id = F4C1F89C Key fingerprint = 6FF2 8F20 4C9D DB36 B5B6 F134 884D 72CC F4C1 F89C

Bill Cole

6:26 p.m.

At 12:48 PM +0400 5/13/08, Eugene wrote:

...

Hello,

From: "Quentin Garnier" <cube@cubidou.net>

...
I'm not the one having trouble reading, here. The proper way to start a system is to run ntp*date* (as early as possible) and then ntpd.

That's what you say, and it is far from being officially accepted.

There is nothing 'official' in regards to how you start up your system that should carry more influence than having it start up correctly.

...

NTP project clearly deprecates ntpdate for several reasons.

I think that is an incorrect statement, and to the degree that it is correct, it still does not mean that it is reasonable to have an unstable clock.

The practice of running ntpdate as a cron job rather than a ntp daemon certainly is deprecated and always has been, but even that is not so much because of how well it works (on most systems it is perfectly functional) but rather because it does not scale across the net: the public NTP infrastructure gets what Dr. Mills calls "little fireballs of congestion" as the cron jobs fire, and that has bad impacts on everyone's NTP-based time accuracy.

On the other hand, ntpdate run once at boot still serves a purpose, even if you are running the latest ntpd with the initial stepping functionality rolled in and enabled, and it is helpful to note Per Hedeland's succinct and practical counterpoint in the FAQ immediately following Dr. Mills' long explanation of the various edge cases involved in choosing when to step and when not to.

...

In addition, "the clock should not be stepped until a consistent offset has been observed for a sanity interval, currently 15 minutes". So ntpd may in principle step time again. http://www.ntp.org/ntpfaq/NTP-s-config.htm

You are misreading and/or misrepresenting the context of the piece you quote. A good NTP daemon is properly quite conservative about stepping the clock at times *other than* at boot, and should take precautions against trusting clocks that seem wrong once it has what should be a trustworthy synchronization and characterization of its own local clock.

-- Bill Cole
bill@scconsult.com

Bill Cole

4:57 p.m.

At 11:13 AM +0400 5/13/08, Eugene wrote:

...

Hello,

I would like to suggest a change in handling of 'Time moved backwards' problem. Right now dovecot just dies. So, the scenario:

Colocation server is shut down for some reason. The internal time drifts.

Server is started again.

Dovecot starts successfully.

In about a minute, NTP daemon feels confident about adjusting the system time.

That's broken. Either your startup is running in the wrong order, it is missing a step, or your NTP daemon is misconfigured.

This sort of problem is why some OS's default startup procedure is intentionally designed to block on 'ntpdate' running successfully. You are likely to be better off with a system that is obviously not working than one which started and then was subjected to a backwards clock change, which can harm more than Dovecot.
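The "block until the clock is right" startup step could be sketched as a small gate that runs before the mail services are launched. This is an illustrative Python sketch under stated assumptions: `wait_for_clock_sync` and its parameters are invented here, and `is_synced` stands for whatever check the local system offers (for example, a wrapper that reports whether a one-shot time-set command succeeded). No particular NTP tool's interface is assumed.

```python
import time


def wait_for_clock_sync(is_synced, timeout=300.0, poll_interval=5.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Block until is_synced() returns True.

    Returns True once the clock is reported as synchronised, or False if
    `timeout` seconds pass first. The timeout is measured on the monotonic
    clock, so the very clock step we are waiting for cannot distort it.
    """
    deadline = clock() + timeout
    while True:
        if is_synced():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

A startup script would call this first and refuse to launch Dovecot on `False`, which gives the "obviously not working" failure mode described above instead of a service that starts and then sees the clock jump.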

-- Bill Cole
bill@scconsult.com
