[Dovecot] Time moved backwards by 4398 seconds
Dovecot (v1.1.rc8) died tonight, with an error about time moving backwards by 4398 seconds. I can see from logs that this has happend a few times before with the imap processes, without me noticing. I sure noticed the master process missing, though :-). I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96 Below is a patch (untested) to work around this issue. Do you see something wrong with this approach, apart from the uglyness? I just picked the 4395-4400 values by chance. Can you figure out how big the window should be? Thanks, Anders. --- ./src/lib/ioloop.c-orig 2008-06-20 10:45:54.000000000 +0200 +++ ./src/lib/ioloop.c 2008-06-20 10:47:36.000000000 +0200 @@ -230,8 +230,13 @@ struct timeval tv, tv_call; unsigned int t_id; - if (gettimeofday(&ioloop_timeval, &ioloop_timezone) < 0) - i_fatal("gettimeofday(): %m"); + /* The Linux gettimeofday() will sometimes jump forward + * by approximately 4398 seconds. Ignore that reading. */ + do { + if (gettimeofday(&ioloop_timeval, &ioloop_timezone) < 0) + i_fatal("gettimeofday(): %m"); + } while (4395 < (ioloop_timeval.tv_sec - ioloop_time) + && (ioloop_timeval.tv_sec - ioloop_time) < 4400); /* Don't bother comparing usecs. */ if (ioloop_time > ioloop_timeval.tv_sec) {
On Fri, 2008-06-20 at 10:53 +0200, Anders wrote:
Dovecot (v1.1.rc8) died tonight, with an error about time moving backwards by 4398 seconds. I can see from logs that this has happend a few times before with the imap processes, without me noticing. I sure noticed the master process missing, though :-).
I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96
The thread puts it down to buggy hardware and puts a workaround into the kernel where it belongs, not in dovecot.
johannes
Johannes Berg <johannes@sipsolutions.net> writes:
On Fri, 2008-06-20 at 10:53 +0200, Anders wrote:
I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96
The thread puts it down to buggy hardware and puts a workaround into the kernel where it belongs, not in dovecot.
That's not helpful.
By that line, the entire "time moved backwards" thing does not belong in Dovecot.
Anyway, I was not proposing the patch to be included, just asking for advice as to whether it would be safe. I even noted that it was ugly.
As I am already compiling Dovecot myself, I prefer a patch there, rather than diverting from the distribution kernel.
Cheers, Anders.
On Fri, 2008-06-20 at 11:10 +0200, Anders wrote:
Johannes Berg <johannes@sipsolutions.net> writes:
On Fri, 2008-06-20 at 10:53 +0200, Anders wrote:
I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96
The thread puts it down to buggy hardware and puts a workaround into the kernel where it belongs, not in dovecot.
That's not helpful.
By that line, the entire "time moved backwards" thing does not belong in Dovecot.
Why? That's a different thing, dovecot is detecting that something is wrong and that it will be unsafe for it to continue operating. That's an entirely different class than trying to work around the detected problem, imho.
Anyway, I was not proposing the patch to be included, just asking for advice as to whether it would be safe. I even noted that it was ugly.
Ok. Yeah, it does seem safe, but as Timo said it'll loop there in case there is an actual forward jump.
As I am already compiling Dovecot myself, I prefer a patch there, rather than diverting from the distribution kernel.
Heh, ok.
johannes
Johannes Berg wrote:
On Fri, 2008-06-20 at 11:10 +0200, Anders wrote:
Johannes Berg <johannes@sipsolutions.net> writes:
On Fri, 2008-06-20 at 10:53 +0200, Anders wrote:
I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96 The thread puts it down to buggy hardware and puts a workaround into the kernel where it belongs, not in dovecot. That's not helpful.
By that line, the entire "time moved backwards" thing does not belong in Dovecot.
Why? That's a different thing, dovecot is detecting that something is wrong and that it will be unsafe for it to continue operating. That's an entirely different class than trying to work around the detected problem, imho.
This bug causes Dovecot to run the IO loop in the future for one iteration, and then die when we get back to present time.
By the time Dovecot dies, some damage could already have happend, for example if ioloop_time is stored to permanent storage.
BTW, can you send a link to the post with the resolution for the kernel? I didn't manage to find any final conclusion, but I would like to propose the fix to our kernel provider :).
Thanks, Anders.
This bug causes Dovecot to run the IO loop in the future for one iteration, and then die when we get back to present time.
By the time Dovecot dies, some damage could already have happend, for example if ioloop_time is stored to permanent storage.
Hmm, ok, I was under the impression it aborted early enough.
BTW, can you send a link to the post with the resolution for the kernel? I didn't manage to find any final conclusion, but I would like to propose the fix to our kernel provider :).
I didn't actually look up any fix just there were fixes in the thread. Didn't check which one if any was merged though.
johannes
At 11:10 AM +0200 6/20/08, Anders wrote:
Johannes Berg <johannes@sipsolutions.net> writes:
On Fri, 2008-06-20 at 10:53 +0200, Anders wrote:
I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96
The thread puts it down to buggy hardware and puts a workaround into the kernel where it belongs, not in dovecot.
I think it is more accurate to say "hardware being used for a purpose its designers did not intend" instead. Using the TSC as a clock has been iffy for quite some time, and defaulting to it in the kernel is a risky design choice and must be implemented with extreme caution. It's not that the hardware is buggy,but rather that it does things by design that are not obvious from a high-level description.
That's not helpful.
By that line, the entire "time moved backwards" thing does not belong in Dovecot.
I suspect that you don't understand why that is in Dovecot. Timo has explained it in detail a few times, but the bottom line is simple: running through the same system-clock time more than once induces a very real risk of destroying mail.
Anyway, I was not proposing the patch to be included, just asking for advice as to whether it would be safe. I even noted that it was ugly.
"Safe" is subjective. I think it would be safer (at the cost of a bounded amount of time) to nanosleep or maybe usleep once and retry the call rather than to go into the loop.
As I am already compiling Dovecot myself, I prefer a patch there, rather than diverting from the distribution kernel.
You might even be better off configuring your system to not use the TSC as a clock source. There's a strong chance that you won't really be sacrificing anything that you actually use.
-- Bill Cole bill@scconsult.com
Bill Cole wrote:
At 11:10 AM +0200 6/20/08, Anders wrote:
By that line, the entire "time moved backwards" thing does not belong in Dovecot.
I suspect that you don't understand why that is in Dovecot. Timo has explained it in detail a few times, but the bottom line is simple: running through the same system-clock time more than once induces a very real risk of destroying mail.
... and running through the same system clock twice is exactly what you will do if you fail to detect a temporary forward jump of 1.2 hours when it happens.
I might follow your advice of "notsc", it makes me a bit uncomfortable that gettimeofday() can fail for other applications as well.
Cheers, Anders.
Bill Cole wrote:
You might even be better off configuring your system to not use the TSC as a clock source. There's a strong chance that you won't really be sacrificing anything that you actually use.
Time to conclude on this.
Rather than going along with my Dovecot hack, I went and changed to "clocksource=hpet", though I am not quite sure about the difference.
As I was about to reboot the server anyway, I applied the updates from Red Hat, including this one (RHSA-2008:0519-24):
- Due to a regression, "gettimeofday" may have gone backwards on certain x86 hardware. This issue was quite dangerous for time-sensitive systems, such as those used for transaction systems and databases, and may have caused applications to produce incorrect results, or even crash.
(The same kind of patch has been going in and out of the Ubuntu kernel, where it appears to cause trouble for suspend)
I have not seen the problem since applying those two changes, and am not currently willing to find out whether one of them is enough :)
Cheers, Anders.
On Fri, 2008-06-20 at 10:53 +0200, Anders wrote:
Dovecot (v1.1.rc8) died tonight, with an error about time moving backwards by 4398 seconds. I can see from logs that this has happend a few times before with the imap processes, without me noticing. I sure noticed the master process missing, though :-).
I was puzzled that it was always 4398 seconds, in particular because this server runs an NTP daemon. A little searching for this problem shows that it is an issue with the Linux kernel gettimeofday(), see e.g. http://lkml.org/lkml/2007/8/23/96
Below is a patch (untested) to work around this issue. Do you see something wrong with this approach, apart from the uglyness?
Only problem I can see is that if there's a legitimate jump of 4395 seconds it'll busy-loop for 5 seconds before continuing. Probably not very likely to happen.
participants (4)
-
Anders
-
Bill Cole
-
Johannes Berg
-
Timo Sirainen