On 21.08.24 11:35, Timo Sirainen wrote:
[Lots and lots of "but my NTP sync is much more precise than that" in the FreeBSD thread]
> The way Dovecot works is:
> - It finds the next timeout, sees that it happens in e.g. 5 milliseconds.
> - Then it calls kqueue() to wait for I/O for max 5 milliseconds.
> - Then it notices that it actually returned more than 105 milliseconds later, and logs a warning about it.
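In code, that is roughly the following pattern (a minimal sketch, not Dovecot's actual implementation; I measure against CLOCK_MONOTONIC here, but which clock Dovecot really consults is exactly one of the questions below):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    int kq = kqueue();
    struct kevent ev;
    struct timespec timeout = { 0, 5 * 1000000L };  /* next timeout: 5 ms */
    struct timespec before, after;
    long elapsed_ms;

    clock_gettime(CLOCK_MONOTONIC, &before);
    (void)kevent(kq, NULL, 0, &ev, 1, &timeout);    /* wait for I/O, max 5 ms */
    clock_gettime(CLOCK_MONOTONIC, &after);

    elapsed_ms = (after.tv_sec - before.tv_sec) * 1000
               + (after.tv_nsec - before.tv_nsec) / 1000000;
    if (elapsed_ms >= 105)                          /* 100+ ms late, as above */
        printf("Warning: wait took %ld ms instead of 5\n", elapsed_ms);
    return 0;
}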
I think that more information is needed to pinpoint possible causes, and one of the open questions is: Which clock does Dovecot look at to determine how long it *actually* stayed dormant? On Linux, software that needs a monotonically increasing "time" to derive guaranteed-unique IDs from often looks at the kernel uptime - which is essentially a count of ticks since bootup, and is *not* corrected by NTP.
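For reference, the candidate clocks are all one clock_gettime() call away; CLOCK_UPTIME is FreeBSD's name for that ticks-since-boot notion, and Linux spells roughly the same thing CLOCK_BOOTTIME (the comments state my understanding of each clock's NTP behavior):

#include <stdio.h>
#include <time.h>

static void show(const char *name, clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) == 0)
        printf("%-10s %lld.%09ld\n", name, (long long)ts.tv_sec, ts.tv_nsec);
}

int main(void)
{
    show("realtime", CLOCK_REALTIME);    /* wall clock, stepped/slewed by NTP */
    show("monotonic", CLOCK_MONOTONIC);  /* never steps; Linux still slews it */
#ifdef CLOCK_UPTIME
    show("uptime", CLOCK_UPTIME);        /* FreeBSD: seconds since boot */
#endif
#ifdef CLOCK_BOOTTIME
    show("boottime", CLOCK_BOOTTIME);    /* Linux counterpart, incl. suspend */
#endif
    return 0;
}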
Similarly, it should be determined whether the timeouts of the I/O function called (i.e., kqueue()) are or aren't influenced by NTP's corrections to the system time.
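That is testable without touching Dovecot: let kevent() expire a fixed timeout in a loop while the NTP daemon is slewing, and compare the wait as seen by the two clocks (a sketch; the 100 ms interval and the sample count are arbitrary choices):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <stdio.h>
#include <time.h>

/* Repeatedly let kevent() expire a 100 ms timeout and report how long the
 * wait was according to CLOCK_REALTIME vs. CLOCK_MONOTONIC.  If the kernel
 * derives the timeout from the NTP-disciplined clock, the two columns
 * should drift apart while the daemon slews. */
int main(void)
{
    int kq = kqueue();
    struct timespec timeout = { 0, 100 * 1000000L };
    struct kevent ev;

    for (int i = 0; i < 600; i++) {                  /* ~1 minute of samples */
        struct timespec r0, m0, r1, m1;

        clock_gettime(CLOCK_REALTIME,  &r0);
        clock_gettime(CLOCK_MONOTONIC, &m0);
        (void)kevent(kq, NULL, 0, &ev, 1, &timeout); /* no events: pure timeout */
        clock_gettime(CLOCK_REALTIME,  &r1);
        clock_gettime(CLOCK_MONOTONIC, &m1);
        printf("real %.6f s  mono %.6f s\n",
               (r1.tv_sec - r0.tv_sec) + (r1.tv_nsec - r0.tv_nsec) / 1e9,
               (m1.tv_sec - m0.tv_sec) + (m1.tv_nsec - m0.tv_nsec) / 1e9);
    }
    return 0;
}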
The third piece of information I'd like to have is which client software provides that NTP sync to the machine: ntpd, chronyd, something else?
(As an example of why this is relevant: several hundred deviations of 100 ms or more per day, if they all point in the same direction, add up to tens of seconds per day, i.e., well over 115 ppm. ntpd refuses to do *slews* correcting by more than 500 ppm; if the OS clock's frequency error exceeds that, ntpd would need to do *steps* every now and then, and in a default configuration, an ntpd will refuse to do a *second* step and *die* instead. And if the reference clock sways *back and forth*, ntpd should very likely complain about its sources' jitter in the logs. chronyd, however, is more ruthless about whacking the local clock into "sync" with the external sources, and much more inclined to define "sync" as "low offset" rather than also taking frequency stability into account the way ntpd does.)
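Spelling that arithmetic out, taking 300 such one-sided deviations per day as an assumed concrete figure:

\[
300 \times 0.1\,\mathrm{s} = 30\,\mathrm{s/day}, \qquad
\frac{30\,\mathrm{s}}{86\,400\,\mathrm{s}} \approx 347\,\mathrm{ppm},
\]

i.e., every 10 s of one-sided daily drift corresponds to \(10 / 86\,400 \approx 116\,\mathrm{ppm}\), so a few of those already approach ntpd's 500 ppm slew limit.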
> Also, this is kind of a problem when it does happen. Since Dovecot thinks the time moved e.g. 100ms forward, it adjusts all timeouts to happen 100ms backwards. If this wasn't a true time jump, then these timeouts now happen 100ms earlier.
That is, of course, a dangerous approach if you do *not* have a guarantee that the timeouts of the I/O function called are *otherwise* true to the requested duration. But shouldn't the other concurrently-running timeouts notice an actual discontinuity of the timescale just the same as the first one did? Maybe some sort of "N 'nay's needed for a vote of no confidence" mechanism would be safer; see the sketch below.
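To make the idea concrete, a minimal sketch of such a vote (the function name, the quorum of 3, and the tolerance are all made up for illustration; this is not Dovecot code):

#include <stdbool.h>
#include <stdlib.h>

#define JUMP_QUORUM     3   /* "nays" needed for the no-confidence vote */
#define JUMP_TOLERANCE  20  /* ms of spread still counted as "the same jump" */

static long pending_skew_ms = 0;
static int  agreeing_reports = 0;

/* Hypothetical helper: each timeout that fires noticeably late reports its
 * observed skew here.  Only when JUMP_QUORUM consecutive reports agree on
 * roughly the same skew do we conclude that the timescale really jumped;
 * a lone late wakeup is treated as scheduling noise instead. */
bool report_skew(long skew_ms)
{
    if (labs(skew_ms - pending_skew_ms) <= JUMP_TOLERANCE) {
        agreeing_reports++;
    } else {
        pending_skew_ms = skew_ms;   /* new candidate jump */
        agreeing_reports = 1;
    }
    if (agreeing_reports >= JUMP_QUORUM) {
        agreeing_reports = 0;
        return true;                 /* caller may now re-anchor its timeouts */
    }
    return false;
}

Only when report_skew() returns true would the event loop shift all its deadlines; a single outlier would merely be logged.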
Kind regards,
Jochen Bern
Systems Engineer
Binect GmbH