Tuesday, June 3, 2014

Exadata -- NTP Problem, actually a Windows Server problem

I ran into an incident on an Exadata X3-2 that was related to the NTP time synchronization service.
What prompted me to analyze the situation was the time on the compute nodes: it was falling behind, and the gap had grown to almost 40 seconds in a month.
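
To put that gap in perspective, here is a rough conversion of the drift into parts per million (ppm), the unit normally used for clock frequency error. This is only a back-of-the-envelope sketch; the 30-day month and the function name drift_ppm are my own assumptions, not something measured on the system.

```python
def drift_ppm(gap_seconds: float, period_days: float) -> float:
    """Average clock drift in parts per million over the given period."""
    period_seconds = period_days * 86_400  # seconds in a day
    return gap_seconds / period_seconds * 1_000_000


# Assumption: "a month" is taken as 30 days.
print(f"{drift_ppm(40, 30):.2f} ppm")  # prints "15.43 ppm"
```

A drift of this size is what you would expect from a clock that is free-running on its own oscillator rather than being disciplined by a time server.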


Normally, when you buy an Exadata machine, the field engineer configures the Exadata NTP service during deployment. It runs as a standard ntpd and is configured in /etc/ntp.conf.
The actual configuration was as follows:
#### BEGIN Generated by Exadata. DO NOT MODIFY ####
# 12650539
restrict default mask 0.0.0.0 noquery nomodify notrap
restrict 192.168.0.99 mask 255.255.255.255 nomodify notrap noquery
server 192.168.0.99 prefer
#### END Generated by Exadata ####

Note that my NTP time server's IP address is 192.168.0.99 and its hostname is erman.prt.

So the first thing I looked at was ntp.conf, but I couldn't find any anomalies there.

Then I checked the ntpd service:
service ntpd status
ntpd (pid  8261) is running...

ntpd was running, so no problem there.

After that, I used the ntpq query program to get a general view of the NTP synchronization state.

ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
*LOCAL(0) .LOCL. 10 l 24 64 37 0.000 0.000 0.001
erman.prt .LOCL. 1 u 18 64 37 0.147 26.399 15.870

Note that * marks the source that ntpd has currently selected for synchronization (the system peer).
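
If you want to check this on several nodes at once, the leading * can be picked out of the ntpq -p output with a few lines of Python. This is only a sketch: the function name selected_peer is hypothetical, and the sample text is simply the output shown above pasted into a string.

```python
from typing import Optional

# The ntpq -p output from above, pasted as-is.
SAMPLE = """\
remote refid st t when poll reach delay offset jitter
==============================================================================
*LOCAL(0) .LOCL. 10 l 24 64 37 0.000 0.000 0.001
erman.prt .LOCL. 1 u 18 64 37 0.147 26.399 15.870
"""


def selected_peer(ntpq_output: str) -> Optional[str]:
    """Return the remote marked with '*', or None if no row carries it."""
    for line in ntpq_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("*"):
            fields = stripped[1:].split()
            if fields:
                return fields[0]
    return None


print(selected_peer(SAMPLE))  # prints "LOCAL(0)"
```

On a healthy compute node the function should return the time server's name rather than LOCAL(0).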

So the ntpq output pointed straight at the problem.
The LOCAL clock had been chosen for synchronization.
ntpd had somehow decided that the LOCAL clock was a better source than erman.prt.
Before investigating further, I noticed that the refid column was showing .LOCL. for erman.prt.

This made me think that my time server, erman.prt, was not syncing itself from an NTP pool.
erman.prt was a Windows 2008 server.
So I checked the Windows 2008 time synchronization service using the w32tm tool.
C:\>w32tm /monitor
erman.prt *** PDC *** [192.168.0.99]:
ICMP: 0ms delay.
NTP: +0.0000000s offset from erman.prt
RefID: 'LOCL' [76.79.67.76]

The output supported my suspicion. My time server was not syncing with a remote NTP pool server: its RefID was LOCL, and the IP address 76.79.67.76 was just the ASCII representation of LOCL (L->76, O->79, etc.) :)
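
The ASCII trick is easy to verify. The sketch below treats the four dotted numbers as bytes and tries to read them as text; the function name refid_to_text is hypothetical, and returning None for values that are not printable ASCII is my own choice for the example.

```python
from typing import Optional


def refid_to_text(dotted: str) -> Optional[str]:
    """Interpret a dotted-quad RefID as four ASCII characters, if possible."""
    octets = bytes(int(part) for part in dotted.split("."))
    try:
        text = octets.decode("ascii")
    except UnicodeDecodeError:
        return None  # at least one byte is above 127, so this is not ASCII
    text = text.rstrip("\x00")  # short codes are padded with zero bytes
    return text if text and text.isprintable() else None


print(refid_to_text("76.79.67.76"))    # prints "LOCL"
print(refid_to_text("85.119.80.232"))  # prints "None" (a real IP address)
```

So when w32tm or ntpq shows a RefID that decodes to a short word like LOCL, it is a reference clock code and not the address of an upstream server.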

Then I asked our Windows Server admins to configure the Windows 2008 time synchronization service so that it would sync its time from an external NTP pool.

The Windows admins did what I asked and synchronized the Windows server with uk.pool.ntp.org.
(Extra info: NTP uses UDP port 123 in both directions, from source to target and back.)

Windows logged the following message:
The time service is now synchronizing the system time with the time source uk.pool.ntp.org (ntp.m|0x1|192.168.0.99:123->85.119.80.232:123).

w32tm on Windows confirmed that.

w32tm /monitor
erman.prt *** PDC *** [192.168.0.99]:
ICMP: 0ms delay.
NTP: +0.0000000s offset from erman.prt
RefID: resntp-a-vip.lon.bitfolk.com [85.119.80.232]

After these actions, I restarted the ntpd service on the Exadata compute nodes.
service ntpd stop
service ntpd start

Unfortunately, the problem continued.
ntpd on Linux was still preferring the LOCAL clock for time synchronization.
As seen below, although the reference clock reported for erman.prt is now 85.119.80.232, ntpd still chooses LOCAL(0) as the more reliable time source.
ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
*LOCAL(0) .LOCL. 10 l 23 64 1 0.000 0.000 0.001
erman.prt 85.119.80.232 5 u 22 64 1 0.116 8.789 0.001


At this point, I decided to dig deeper and had a look at the associations:

ntpq> as
ind assID status conf reach auth condition last_event cnt
===========================================================
1 51929 9014 yes yes none reject reachable 1
2 51930 9014 yes yes none reject reachable 1

As you can see above, both time sources are reachable, so this is not a firewall issue.
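
The status column (9014) is a hexadecimal bit field, and the other columns are just a decoded view of it. The sketch below shows how I understand the layout from the ntpq documentation: the top bits are flags, followed by a 3-bit selection code, a 4-bit event counter and a 4-bit event code. The function name and the exact selection names are assumptions and may differ between NTP versions.

```python
# Selection codes as I understand them from the ntpq documentation
# (assumption: names can vary slightly between NTP versions).
SELECT = ["reject", "falsetick", "excess", "outlier",
          "candidate", "backup", "sys.peer", "pps.peer"]


def decode_peer_status(word: int) -> dict:
    """Split a 16-bit peer status word into its fields."""
    return {
        "config": bool(word & 0x8000),      # association is configured
        "authenable": bool(word & 0x4000),  # authentication enabled
        "authentic": bool(word & 0x2000),   # authentication succeeded
        "reach": bool(word & 0x1000),       # peer is reachable
        "select": SELECT[(word >> 8) & 0x7],
        "event_count": (word >> 4) & 0xF,
        "event_code": word & 0xF,           # ntpq printed "reachable" for 4
    }


print(decode_peer_status(0x9014))
```

For 9014 this gives config and reach set, a selection of reject, one event and event code 4, which matches the "yes yes none reject reachable 1" row printed by ntpq.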

Then I looked at the variables of the problematic association, which is erman.prt.

ntpq> rv 51930
assID=51930 status=9014 reach, conf, 1 event, event_reach,
srcadr=erman.prt srcport=123, dstadr=192.168.0.11,
dstport=123, leap=00, stratum=5, precision=-6, rootdelay=48.676,
rootdispersion=7900.940, refid=85.119.80.232, reach=001, unreach=1,
hmode=3, pmode=4, hpoll=6, ppoll=6, flash=400 peer_dist, keyid=0, ttl=0,
offset=-6.259, delay=0.106, dispersion=7945.313, jitter=0.001,
reftime=d737f5f3.d38433d6 Tue, Jun 3 2014 10:21:23.826,
org=d737fc6b.7f8c64fd Tue, Jun 3 2014 10:48:59.498,
rec=d737fc6b.812a0af7 Tue, Jun 3 2014 10:48:59.504,
xmt=d737fc6b.81231be4 Tue, Jun 3 2014 10:48:59.504,
filtdelay= 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtoffset= -6.26 0.00 0.00 0.00 0.00 0.00 0.00 0.00,
filtdisp= 15.63 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0

Finally, I found the reason here:
flash=400 peer_dist
This flash code means “distance threshold exceeded”.

The dispersion and rootdispersion values were also very high: about 7.9 seconds each (ntpq reports them in milliseconds).

Root Dispersion:
This is a number indicating the maximum error relative to the primary reference source at the root of the synchronization subnet, in seconds. Only positive values greater than zero are possible.

Dispersion:
Represents the maximum error of the local clock relative to the reference clock.

To fix this, I immediately raised the corresponding limit by adding the following line to /etc/ntp.conf on the compute nodes:

tos maxdist 16
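
To see why a limit of 16 is enough here, the root distance can be estimated from the rv output above. The sketch below is a simplified version of the root distance formula in RFC 5905 (half the total delay plus the dispersions and jitter); the ntpd version on the node may compute it slightly differently, and the default threshold of 1.5 seconds used for comparison is an assumption that can vary by version. The function and parameter names are mine.

```python
def root_distance_s(delay_ms: float, rootdelay_ms: float,
                    dispersion_ms: float, rootdispersion_ms: float,
                    jitter_ms: float, age_s: float = 0.0,
                    phi: float = 15e-6) -> float:
    """Approximate root distance in seconds (simplified from RFC 5905)."""
    total_ms = ((delay_ms + rootdelay_ms) / 2
                + dispersion_ms + rootdispersion_ms + jitter_ms)
    # phi * age_s accounts for the error growing since the last update.
    return total_ms / 1000 + phi * age_s


# Values taken from the "rv 51930" output (all in milliseconds).
dist = root_distance_s(delay_ms=0.106, rootdelay_ms=48.676,
                       dispersion_ms=7945.313, rootdispersion_ms=7900.940,
                       jitter_ms=0.001)

ASSUMED_DEFAULT_MAXDIST = 1.5  # assumption, check your ntpd version
print(f"root distance ~ {dist:.3f} s")
print("exceeds default:", dist > ASSUMED_DEFAULT_MAXDIST)
print("within maxdist 16:", dist < 16)
```

With these numbers the estimate comes out just under 16 seconds, far above the default limit but inside the raised one, which is consistent with the peer_dist flash code.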

Then I restarted ntpd and checked again after a while:

ntpq -p
remote refid st t when poll reach delay offset jitter

LOCAL(0) .LOCL. 10 l 56 64 3 0.000 0.000 0.001
*erman.prt 85.119.80.232 5 u 53 64 3 0.107 -5.625 0.458


Finally, it worked properly. ntpd chose what it needed to choose: my time server.
But this did not satisfy me :)
I was suspicious of the Windows side. I knew that Windows was using W32Time (w32tm) as the time server, and that it could be the cause of this dispersion.

So I have requested a check against the following registry key: 
SYSTEM\CurrentControlSet\Services\W32Time\Config\LocalClockDispersion

LocalClockDispersion was set to 10, as I expected. I believe this is the Windows default.
So we set it to 0 and restarted the W32Time service on Windows:
net stop w32time
net start w32time

After that, I removed tos maxdist 16 from /etc/ntp.conf and restarted ntpd on Linux.

Now everything works as it should (without any modification to the ntp.conf file).

[root@osrvdb01 ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l   14   64   17    0.000    0.000   0.001
*erman.prt .LOCL.           1 u   18   64   17    0.114    2.140   0.583
