[Dovecot] Time moved backwards by 4398 seconds

newer
[Dovecot] Private flags and shared...

Anders

20 Jun 2008 20 Jun '08

11:53 a.m.

Dovecot (v1.1.rc8) died tonight, with an error about time moving backwards by 4398 seconds. I can see from logs that this has happend a few times before with the imap processes, without me noticing. I sure noticed the master process missing, though :-). I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96 Below is a patch (untested) to work around this issue. Do you see something wrong with this approach, apart from the uglyness? I just picked the 4395-4400 values by chance. Can you figure out how big the window should be? Thanks, Anders. --- ./src/lib/ioloop.c-orig 2008-06-20 10:45:54.000000000 +0200 +++ ./src/lib/ioloop.c 2008-06-20 10:47:36.000000000 +0200 @@ -230,8 +230,13 @@ struct timeval tv, tv_call; unsigned int t_id; - if (gettimeofday(&ioloop_timeval, &ioloop_timezone) < 0) - i_fatal("gettimeofday(): %m"); + /* The Linux gettimeofday() will sometimes jump forward + * by approximately 4398 seconds. Ignore that reading. */ + do { + if (gettimeofday(&ioloop_timeval, &ioloop_timezone) < 0) + i_fatal("gettimeofday(): %m"); + } while (4395 < (ioloop_timeval.tv_sec - ioloop_time) + && (ioloop_timeval.tv_sec - ioloop_time) < 4400); /* Don't bother comparing usecs. */ if (ioloop_time > ioloop_timeval.tv_sec) {

Show replies by date

Johannes Berg

20 Jun 20 Jun

11:59 a.m.

On Fri, 2008-06-20 at 10:53 +0200, Anders wrote:

...

Dovecot (v1.1.rc8) died tonight, with an error about time moving backwards by 4398 seconds. I can see from logs that this has happend a few times before with the imap processes, without me noticing. I sure noticed the master process missing, though :-).

I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96

The thread puts it down to buggy hardware and puts a workaround into the kernel where it belongs, not in dovecot.

johannes

Anders

12:10 p.m.

Johannes Berg <johannes@sipsolutions.net> writes:

...

On Fri, 2008-06-20 at 10:53 +0200, Anders wrote:

...
I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96

The thread puts it down to buggy hardware and puts a workaround into the kernel where it belongs, not in dovecot.

That's not helpful.

By that line, the entire "time moved backwards" thing does not belong in Dovecot.

Anyway, I was not proposing the patch to be included, just asking for advice as to whether it would be safe. I even noted that it was ugly.

As I am already compiling Dovecot myself, I prefer a patch there, rather than diverting from the distribution kernel.

Cheers, Anders.

Johannes Berg

12:28 p.m.

On Fri, 2008-06-20 at 11:10 +0200, Anders wrote:

...

Johannes Berg <johannes@sipsolutions.net> writes:

...
On Fri, 2008-06-20 at 10:53 +0200, Anders wrote:

...
I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96

The thread puts it down to buggy hardware and puts a workaround into the kernel where it belongs, not in dovecot.

That's not helpful.

By that line, the entire "time moved backwards" thing does not belong in Dovecot.

Why? That's a different thing, dovecot is detecting that something is wrong and that it will be unsafe for it to continue operating. That's an entirely different class than trying to work around the detected problem, imho.

...

Anyway, I was not proposing the patch to be included, just asking for advice as to whether it would be safe. I even noted that it was ugly.

Ok. Yeah, it does seem safe, but as Timo said it'll loop there in case there is an actual forward jump.

...

As I am already compiling Dovecot myself, I prefer a patch there, rather than diverting from the distribution kernel.

Heh, ok.

johannes

Anders

1:20 p.m.

Johannes Berg wrote:

...

On Fri, 2008-06-20 at 11:10 +0200, Anders wrote:

...
Johannes Berg <johannes@sipsolutions.net> writes:

...
On Fri, 2008-06-20 at 10:53 +0200, Anders wrote:

...
I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96 The thread puts it down to buggy hardware and puts a workaround into the kernel where it belongs, not in dovecot. That's not helpful.

By that line, the entire "time moved backwards" thing does not belong in Dovecot.

Why? That's a different thing, dovecot is detecting that something is wrong and that it will be unsafe for it to continue operating. That's an entirely different class than trying to work around the detected problem, imho.

This bug causes Dovecot to run the IO loop in the future for one iteration, and then die when we get back to present time.

By the time Dovecot dies, some damage could already have happend, for example if ioloop_time is stored to permanent storage.

BTW, can you send a link to the post with the resolution for the kernel? I didn't manage to find any final conclusion, but I would like to propose the fix to our kernel provider :).

Thanks, Anders.

Johannes Berg

1:33 p.m.

...

This bug causes Dovecot to run the IO loop in the future for one iteration, and then die when we get back to present time.

By the time Dovecot dies, some damage could already have happend, for example if ioloop_time is stored to permanent storage.

Hmm, ok, I was under the impression it aborted early enough.

...

BTW, can you send a link to the post with the resolution for the kernel? I didn't manage to find any final conclusion, but I would like to propose the fix to our kernel provider :).

I didn't actually look up any fix just there were fixes in the thread. Didn't check which one if any was merged though.

johannes

Bill Cole

6:35 p.m.

At 11:10 AM +0200 6/20/08, Anders wrote:

...

Johannes Berg <johannes@sipsolutions.net> writes:

...
On Fri, 2008-06-20 at 10:53 +0200, Anders wrote:

...
I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96

The thread puts it down to buggy hardware and puts a workaround into the kernel where it belongs, not in dovecot.

I think it is more accurate to say "hardware being used for a purpose its designers did not intend" instead. Using the TSC as a clock has been iffy for quite some time, and defaulting to it in the kernel is a risky design choice and must be implemented with extreme caution. It's not that the hardware is buggy,but rather that it does things by design that are not obvious from a high-level description.

...

That's not helpful.

By that line, the entire "time moved backwards" thing does not belong in Dovecot.

I suspect that you don't understand why that is in Dovecot. Timo has explained it in detail a few times, but the bottom line is simple: running through the same system-clock time more than once induces a very real risk of destroying mail.

...

Anyway, I was not proposing the patch to be included, just asking for advice as to whether it would be safe. I even noted that it was ugly.

"Safe" is subjective. I think it would be safer (at the cost of a bounded amount of time) to nanosleep or maybe usleep once and retry the call rather than to go into the loop.

...

As I am already compiling Dovecot myself, I prefer a patch there, rather than diverting from the distribution kernel.

You might even be better off configuring your system to not use the TSC as a clock source. There's a strong chance that you won't really be sacrificing anything that you actually use.

-- Bill Cole bill@scconsult.com

Anders

21 Jun 21 Jun

1:04 a.m.

Bill Cole wrote:

...

At 11:10 AM +0200 6/20/08, Anders wrote:

...

...
By that line, the entire "time moved backwards" thing does not belong in Dovecot.

I suspect that you don't understand why that is in Dovecot. Timo has explained it in detail a few times, but the bottom line is simple: running through the same system-clock time more than once induces a very real risk of destroying mail.

... and running through the same system clock twice is exactly what you will do if you fail to detect a temporary forward jump of 1.2 hours when it happens.

I might follow your advice of "notsc", it makes me a bit uncomfortable that gettimeofday() can fail for other applications as well.

Cheers, Anders.

Anders

9 Jul 9 Jul

11:53 a.m.

Bill Cole wrote:

...

You might even be better off configuring your system to not use the TSC as a clock source. There's a strong chance that you won't really be sacrificing anything that you actually use.

Time to conclude on this.

Rather than going along with my Dovecot hack, I went and changed to "clocksource=hpet", though I am not quite sure about the difference.

As I was about to reboot the server anyway, I applied the updates from Red Hat, including this one (RHSA-2008:0519-24):

Due to a regression, "gettimeofday" may have gone backwards on certain x86 hardware. This issue was quite dangerous for time-sensitive systems, such as those used for transaction systems and databases, and may have caused applications to produce incorrect results, or even crash.

(The same kind of patch has been going in and out of the Ubuntu kernel, where it appears to cause trouble for suspend)

I have not seen the problem since applying those two changes, and am not currently willing to find out whether one of them is enough :)

Cheers, Anders.

Timo Sirainen

20 Jun 20 Jun

12:13 p.m.

On Fri, 2008-06-20 at 10:53 +0200, Anders wrote:

...

Dovecot (v1.1.rc8) died tonight, with an error about time moving backwards by 4398 seconds. I can see from logs that this has happend a few times before with the imap processes, without me noticing. I sure noticed the master process missing, though :-).

I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96

Below is a patch (untested) to work around this issue. Do you see something wrong with this approach, apart from the uglyness?

Only problem I can see is that if there's a legitimate jump of 4395 seconds it'll busy-loop for 5 seconds before continuing. Probably not very likely to happen.

6240

Age (days ago)

6259

Last active (days ago)

List overview

9 comments

4 participants

participants (4)

Anders
Bill Cole
Johannes Berg
Timo Sirainen