Wednesday, September 17, 2014

Exadata-- TNS-12170 error diagnostics -- a detailed approach

It has been a while since my article about Exadata.
This will be about an ORA-12170 error diagnostics in Exadata.. I find it quite interesting, as the cause of the problem mentioned above, was an unexpected one..
This post will be like short story , but I will try to give some detailed information during the transitions as well.
Anyways.. The story is based on a problem reported by one of my customers..
Customer has an quarter-rack Exadata X3 machine, configured to host an Oracle EBS 11i Production database..
We have configured this Exadata Machine and have implemented EBS on it.  We migrated the EBS database from Solaris to this Exadata machine, and migrated the EBS Apps Tier from Solaris to a new Oracle Linux server, as well.
It was successfull project , as we did hard and complicated works during the migration.(the source systems were Solaris, and database was 10gR2 and so on.)
As you may guess , the customer was also using Discoverer for reporting purposes.
Discoverer Desktop edition have been running all over the place , and no issues were reported about this product till yesterday.

The problematic situation has arisen when a bunch of the employees tried to connect to the mentioned EBS database located in Exadata..
They were using Discoverer tool to connect to the database, and encountering ORA-12170/TNS-12170 errors during their connection attempts..

I was already familiar with TNS-12170 error , and knew it was related with the client-server network of the corresponding database environment..
Even so, I made a quick search about this error, and found the following;

Cause: The server shut down because connection establishment with a client failed to complete within the allotted time interval. This may be a result of network/system delays; or this may indicate that a malicious client is trying to cause a Denial of Service attack on the server. 


Action: If the error occurred because of a slow network/system, reconfigure the SQLNET.INBOUND_CONNECT_TIMEOUT parameter in sqlnet.ora to a larger value. If a malicious client is suspected, use the address in sqlnet.log to identify the source and restrict access. Note that logged addresses may not be reliable as they can be forged (e.g. in TCP/IP). 

So, as you see above, the search have provided the familiar information about the error, and a generic action plan that can be checked and implemented accordingly..

So far so good...
On the other hand, this was a quite new Exadata... It was configured by the Professionals and EBS was running without a problem.. In fact, all the other Discoverer Clients were connecting to the database without any problems.. 
These facts made me think that, there should have been a specific network problem between these problematic clients and the Exadata Database Machine ..

Here is what I did for the analysis;

First, checked the firewall in the db nodes and did not see any deny-rule there.
Then thought that, using wireshark for capturing the client-server network communication to see the break might be useful. But I quickly give up that thought when I saw the infiband network device configuration..
The infiniband network was as follows;

DO NOT REMOVE THESE LINES ####
%GENERATED BY CELL% ####
DEVICE=bondib0
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.11.2
NETMASK=255.255.252.0
NETWORK=192.168.7.0
BROADCAST=192.168.11.255
BONDING_OPTS="mode=active-backup miimon=100 downdelay=5000 updelay=5000 num_grat_arp=100"
IPV6INIT=no
MTU=65520

So infiband/private interconnect was on network 192.168.7.0...
Then I realized that these problematic clients were on the same network .. I mean, they were using the ip addresses from 192.168.7.200 to 19.2.168.7.220..

Afterwards; I took a look to the route configuration of the db nodes in this Exadata Machine.

route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
10.10.10.0 * 255.255.255.0 U 0 0 0 eth0
192.168.0.0 * 255.255.252.0 U 0 0 0 bondeth0
192.168.7.0 * 255.255.252.0 U 0 0 0 bondib0
169.254.0.0 * 255.255.0.0 U 0 0 0 bondib0
default 192.168.0.3 0.0.0.0 UG 0 0 0 bondeth0

And yes! The cause was there.. The route declaration for all the ip address in the 192.168.7.0  network was directed from the bondib0 device..
Bondib is the bond device of internal network in Exadata..
So, it was normal for these clients to not to be able to connect to the databases on this Exadata, as Infiniband network is an internal network...


It is needless to say that there was a bad luck or misconfiguration in the network layer of this environment ..
I mean that the infiniband network should have been never used for clients, so bondib which is a private network should have been non routable and unique..

On the other hand, as I have faced this situation, I needed to solve:)

I knew that the network is derived from the subnet mask.

For example: If you want to have an private interconnect ip addres as 192.168.10.1 and if you want to have a subnet mask 255.255.252.0 , you get 192.168.8.0 as your network ,and this effects your route configurations..

Exactly as follows;

010 = 00001010
252 = 11111100
--------------------- AND
8 =   00001000  --> so the network is 192.168.8.0


Related MOS notes:
The Basics of IPv4 Subnet and Oracle Clusterware ( Doc ID 1386709.1 ) 
How to calculate the correct subnet for an interface ( Doc ID 1059759.1 ) 
How to Validate Network and Name Resolution Setup for the Clusterware and RAC ( Doc ID 1054902.1 ) 

So, in order to fix this; there were 3 ways/methods to prefer;

Methods:

1) Changing Exadata's infiniband network
2) Changing the network of these problematic clients.
3) Adding the routes to the Exadata Db nodes for overwriting the general route configuration for accessing those clients.

Considerations:

1) Changing Exadata infiniband network was a risky thing, and it wasnt worth changing it to make 15-20 discoverer clients happy. 
Additional info: 
The steps are mentioned in the Oracle Exadata Database MachineOwner's Guide 11g Release 2 Page 7-60
Also in  Changing IP addresses on Exadata Database Machine ( Doc ID 1317159.1 ) 

2) In my opinion , it was the best way, but the customer didnt prefer this one.

***3) "Seemed like the the second best way in this situation. So , I applied this method as follows.."*****

For adding specific routes , I used route command.
I executed the following commands on db nodes one by one.

route add -host 192.168.7.200 dev bondeth0
route add -host 192.168.7.201 dev bondeth0
route add -host 192.168.7.202 dev bondeth0
route add -host 192.168.7.203 dev bondeth0
route add -host 192.168.7.204 dev bondeth0
route add -host 192.168.7.205 dev bondeth0
route add -host 192.168.7.206 dev bondeth0
route add -host 192.168.7.207 dev bondeth0
route add -host 192.168.7.208 dev bondeth0
route add -host 192.168.7.209 dev bondeth0
route add -host 192.168.7.210 dev bondeth0
route add -host 192.168.7.211 dev bondeth0
....
Above ip addresses were the ip addresses of the problematic Discoverer clients.

Also, to make these routes permenant; I have added these scripts in to the /etc/rc.local files of the Db nodes as follows;

#BEGIN DO NOT REMOVE Added by Oracle Exadata 
if [ -x /etc/rc.d/rc.Oracle.Exadata ]; then
. /etc/rc.d/rc.Oracle.Exadata
fi
#END DO NOT REMOVE Added by Oracle Exadata ###########

#DO NOT REMOVE Added by Erman  :)##
#ADD ROUTE FOR DISCOVERER PROBLEM##
route add -host 192.168.7.100 dev bondeth0
route add -host 192.168.7.101 dev bondeth0
route add -host 192.168.7.102 dev bondeth0
route add -host 192.168.7.103 dev bondeth0
route add -host 192.168.7.104 dev bondeth0
route add -host 192.168.7.105 dev bondeth0
route add -host 192.168.7.106 dev bondeth0
route add -host 192.168.7.107 dev bondeth0
route add -host 192.168.7.108 dev bondeth0
route add -host 192.168.7.109 dev bondeth0
route add -host 192.168.7.110 dev bondeth0
route add -host 192.168.7.111 dev bondeth0
###########################################

I had to add these routes in rc.local because the route -p option is not there in Linux.. -p option is only available in windows. :)

So that was it . TNS-12170 errors have dissapeared immediately after adding those routes..

2 comments :

  1. Perfect...thanks a lot for knowledge sharing. Please upload me so that community can learn more and more. your blog is awesome

    ReplyDelete
  2. Can Cihan HırçınNovember 2, 2014 at 8:01 AM

    great workflow and diagnosis. thanks for sharing

    ReplyDelete