Wednesday, November 2, 2016

Exadata -- Cell NTP problem, real life diagnostics

I faced an NTP time synchronization problem at one of my Exadata customers.
The problem was observed only on the cells; the Compute Nodes had no trouble synchronizing time from the NTP server.

Before going further, I want to give a brief overview of the Exadata network. I will only cover the difference between Compute Nodes and cell nodes in terms of network access paths.

The NTP-related network path for the Exadata Storage cells goes through the Ethernet switch that comes built in with the Exadata Rack. (This path is the only way to reach the NTP server from a cell.)
The NTP-related network path for the Compute Nodes, however, may go through either the Ethernet switch or the bonded client access interface, which is directly connected to the client's network.

So, let's say your client access network and the NTP server's network are both 192.168.0.* and your bonded client interfaces have IP addresses from that same network (192.168.0.*). In that case, your compute node reaches the NTP server directly through the client network interfaces. (I have seen this in practice.)

On the other hand, when your cell management interfaces (the only network involved in reaching the NTP server from the cell nodes) have 10.10.10.* addresses and your NTP server's network is 192.168.0.*, the connection is no longer direct; the traffic has to pass through a gateway.

Here is a demo showing this difference:

The compute node with IP 192.168.0.82 reaches the NTP server directly:
[root@osrvdb01 ~]# traceroute 192.168.0.14
traceroute to 192.168.0.14 (192.168.0.14), 30 hops max, 40 byte packets
 1  ntpserver (192.168.0.14)  0.086 ms  0.110 ms  0.103 ms

The cell node with IP 10.10.10.104 reaches the NTP server indirectly, through the gateway:
[root@osrvcel03 ~]# traceroute 192.168.0.14
traceroute to 192.168.0.14 (192.168.0.14), 30 hops max, 40 byte packets
 1  10.10.10.1 (10.10.10.1)  0.101 ms  0.124 ms  0.118 ms
 2  ntpserver (192.168.0.14)  0.188 ms  0.186 ms  0.181 ms
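
The difference between the two traceroute outputs comes down to whether the source address and the NTP server sit in the same subnet. Here is a minimal Python sketch of that check. It assumes a /24 netmask for both networks (the ranges above are only given as 192.168.0.* and 10.10.10.*), and the function name is hypothetical:

```python
import ipaddress


def is_directly_reachable(source_ip: str, target_ip: str, prefix_len: int = 24) -> bool:
    """Return True if target_ip falls in the same subnet as source_ip.

    prefix_len=24 is an assumption derived from the 192.168.0.* and
    10.10.10.* ranges; use the real netmask of your interfaces.
    """
    network = ipaddress.ip_network(f"{source_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(target_ip) in network


if __name__ == "__main__":
    ntp_server = "192.168.0.14"
    for label, ip in [("compute node", "192.168.0.82"), ("cell node", "10.10.10.104")]:
        path = "directly" if is_directly_reachable(ip, ntp_server) else "through a gateway"
        print(f"{label} {ip} reaches {ntp_server} {path}")
```

When the check returns False, the packets depend on whatever routing is done between the two networks, which is exactly where the problem in this post was hiding.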

Having seen this difference and the general picture, let's take a look at a scenario that I faced at one of my clients.
I am describing this scenario because the diagnostic work done to determine the cause was important. The network team did not accept that there was an underlying network/routing problem until I delivered the necessary diagnostics to them.

Scenario:
  • The cell nodes were not synchronizing time from the NTP server. This could be seen with the "date" command.

  • The cellwall was not the cause, because the rules for reaching the NTP server were present in the cellwall rule list:
[root@osrvcel01 ~]# /etc/init.d/cellwall state|grep 192.168.0.14
    0     0 ACCEPT     udp  --  bondib0 *       192.168.0.14         0.0.0.0/0           udp spt:123 
    0     0 ACCEPT     tcp  --  bondib0 *       192.168.0.14         0.0.0.0/0           tcp spt:53 
    0     0 ACCEPT     udp  --  bondib0 *       192.168.0.14         0.0.0.0/0           udp spt:53 
    0     0 ACCEPT     udp  --  eth0   *       192.168.0.14         0.0.0.0/0           udp spt:123 
    0     0 ACCEPT     tcp  --  eth0   *       192.168.0.14         0.0.0.0/0           tcp spt:53 
   14  1591 ACCEPT     udp  --  eth0   *       192.168.0.14         0.0.0.0/0           udp spt:53 
  • The ntpq -p output showed LOCAL(0) as the preferred time source (rather than 192.168.0.14). There was a "*" sign in front of LOCAL(0), which meant that the NTP server was not being used for time synchronization.
  • The "ntpdate -dv 192.168.0.14" command gave the following output (note that this is only a small piece of the full output, but it shows the problem):
host found : 192.168.0.14
transmit(192.168.0.14)
receive(192.168.0.3)
receive: server not found
transmit(192.168.0.14)
receive(192.168.0.3)
receive: server not found
transmit(192.168.0.14)
receive(192.168.0.3)
receive: server not found
  • The "tracert" command to 192.168.0.14 was hanging after the first hop:
[root@osrvcel01 ~]# tracert 192.168.0.14
traceroute to 192.168.0.14 (192.168.0.14), 30 hops max, 40 byte packets
1 (10.10.10.1) 0.121 ms 0.113 ms *
2 * * *
3 * * *
4 * * *
5 * * *
6 * *
  • The most important clue was in the nslookup output:
[root@osrvcel01 ~]# nslookup 10.10.10.1
;; reply from unexpected source: 192.168.0.3#53, expected 192.168.0.14#53

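As you can see, the reply was coming from an unexpected address (the same 192.168.0.3 that appeared in the ntpdate "receive" lines, although the requests were sent to 192.168.0.14). This "reply from unexpected source" message suggested that some device in the middle of the path was performing improper NAT or something similar.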
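
The same mismatch can be picked out of the ntpdate debug output mechanically. The following is a minimal Python sketch, not part of the original diagnostics; the function name is hypothetical, and it only assumes the "transmit(...)" / "receive(...)" line format shown in the excerpt above:

```python
import re


def find_unexpected_replies(ntpdate_debug_output: str):
    """Return (sent_to, replied_from) pairs where the reply came from an
    address other than the one the request was transmitted to."""
    mismatches = []
    last_sent = None
    for raw_line in ntpdate_debug_output.splitlines():
        line = raw_line.strip()
        sent = re.fullmatch(r"transmit\(([^)]+)\)", line)
        if sent:
            last_sent = sent.group(1)
            continue
        received = re.fullmatch(r"receive\(([^)]+)\)", line)
        if received and last_sent is not None and received.group(1) != last_sent:
            mismatches.append((last_sent, received.group(1)))
    return mismatches


# The excerpt of "ntpdate -dv 192.168.0.14" output from the scenario above.
SAMPLE = """\
host found : 192.168.0.14
transmit(192.168.0.14)
receive(192.168.0.3)
receive: server not found
transmit(192.168.0.14)
receive(192.168.0.3)
receive: server not found
transmit(192.168.0.14)
receive(192.168.0.3)
receive: server not found
"""

if __name__ == "__main__":
    for sent_to, replied_from in find_unexpected_replies(SAMPLE):
        print(f"sent to {sent_to}, but the reply came from {replied_from}")
```

A non-empty result is the kind of concrete evidence that is easy to hand over to a network team: the client asked one address and a different one answered.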

After I delivered these diagnostic outputs to the network team, the issue was resolved.

That is, they accepted that this was a problem in the company network and that the cause was in the server doing the IP routing. (They added a static route to that IP routing machine.)

After they implemented the fix, I restarted the NTP daemon on the cells and saw that the NTP server became the preferred source and that the time on the cell nodes was synchronized from it:

[root@osrvcel01 ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l   12   64   37    0.000    0.000   0.001
*192.168.0.14    109.74.206.120   3 u   13   64   37    0.222    7.665   7.230
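
Checking which peer carries the "*" marker can also be scripted, for example to verify all cells after a fix. This is a minimal Python sketch under the assumption that the "ntpq -p" output has already been captured as text; the function name is hypothetical:

```python
def selected_peer(ntpq_output: str):
    """Return the remote name on the line that starts with '*'
    (the peer ntpd is currently synchronizing to), or None if there is none."""
    for raw_line in ntpq_output.splitlines():
        line = raw_line.strip()
        if line.startswith("*"):
            fields = line[1:].split()
            if fields:
                return fields[0]
    return None


# The "ntpq -p" output shown above, taken after the network fix.
AFTER_FIX = """\
remote refid st t when poll reach delay offset jitter
==============================================================================
LOCAL(0) .LOCL. 10 l 12 64 37 0.000 0.000 0.001
*192.168.0.14 109.74.206.120 3 u 13 64 37 0.222 7.665 7.230
"""

if __name__ == "__main__":
    peer = selected_peer(AFTER_FIX)
    if peer == "192.168.0.14":
        print("cell is synchronizing from the NTP server")
    else:
        print(f"cell is NOT using the NTP server (selected peer: {peer})")
```

Before the fix, the same function would have returned LOCAL(0), since that was the line carrying the "*" sign.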

So again: although we are DBAs, Apps DBAs or Exadata admins, and although our network knowledge is limited, we can still do network diagnostics from the OS layer and still point to the cause in the network. This diagnostic work is important because, at the end of the day, we are responsible for the Exadata Machine and all of its problems, even those caused by things residing outside our scope.
