put forth on 1/20/2011 8:32 AM:
Secondly, we thought the issues were due to NTP, as the time stamps vary so widely, so we rebuilt our NTP servers and found closer stratum 1 source clocks to synchronize to, hoping it would alleviate the problem, but the dotlock errors returned after about 12 hours. We have fcntl locking set in our configuration file, but it is our understanding from looking at the source code that this file is locked with dotlock.
Any help troubleshooting is appreciated.
On 2011-01-20 8:57 AM, Stan Hoeppner wrote:
From your description it sounds as if you're syncing ntpd on each of the 4 servers against an external time source, first stratum 2/3 sources, then stratum 1 sources, in an attempt to cure this problem.
In a clustered server environment, _always_ run a local physical box/router ntpd server (preferably two) that queries a set of external sources and services your internal machine queries. With RTTs all on your LAN, and using the same internal time sources for every query, this clock drift issue should be eliminated. Obviously, when you first set this up, stop ntpd and run ntpdate to get an initial time sync for each cluster host.
You're much better off running one ntp server than two. With just two servers providing time, if they drift from one another, for whatever reason, there is no way to tell which one has the correct time. If you need to ensure the time is correct, peer at least 3 machines together, then they can take care of themselves if one drifts.
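To illustrate the two-versus-three point, here is a minimal Python sketch of majority agreement between clocks. It is not ntpd's actual selection algorithm, only a simplified model of the reasoning; the host names, the readings, and the 50 ms tolerance are all made up for the example.

```python
from itertools import combinations


def find_falsetickers(clocks, tolerance=0.05):
    """Return the names of clocks that disagree with a majority of the group.

    clocks: dict mapping a (hypothetical) host name to its clock reading in seconds.
    tolerance: largest difference in seconds still counted as agreement (assumed value).
    Returns None when no clock is backed by a majority, meaning the bad
    clock cannot be identified.
    """
    names = list(clocks)
    # Every clock trivially agrees with itself.
    agree = {name: 1 for name in names}
    for a, b in combinations(names, 2):
        if abs(clocks[a] - clocks[b]) <= tolerance:
            agree[a] += 1
            agree[b] += 1

    majority = len(names) // 2 + 1
    good = [name for name in names if agree[name] >= majority]
    if not good:
        return None
    return sorted(name for name in names if name not in good)


# Two servers that have drifted apart: no way to tell which one is right.
two = find_falsetickers({"ntp1": 100.00, "ntp2": 100.50})

# Three peers: the two that agree outvote the one that drifted.
three = find_falsetickers({"ntp1": 100.00, "ntp2": 100.01, "ntp3": 100.50})
```

With two disagreeing clocks the function returns None, since neither has a majority behind it. With three, the drifted clock is the only one without a partner and is singled out.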