Wednesday, January 4, 2017

ODA - "Investigating an OAKCLI Problem", OAKERR:6000 Exception encountered in Module getMasterNode OAKERR:6002 No master node was found

Recently investigated an oakcli problem, which was encountered in a virtualized ODA X4.
The problem was obvious and it was blocking the administration.
All the oakcli commands were encountering the following errors;

OAKERR:6000 Exception encountered in Module getMasterNode
OAKERR:6002 No master node was found

I have did a detailed investigation and fixed the error.
The cause was interesting and that's why I want to share it with you.

Note: There is no document or article written for this specific issue in Oracle Support.


I first looked at the odabaseagent.log..

ODA_BASE /opt/oracle/oak/log/$HOMENAME/odabaseagent/odabaseagent.log

Howerer; there was nothing there. That is, I executed a failing command (actually all the oakcli commands were failing) "oakcli show vm", but the odabaseagent.log was idle.

Then, I checked the oakcli.log, /opt/oracle/oak/log/ODA_NAME/client/oakcli.log and saw the following there.

2017-01-02 20:42:03.199: [ OAKCLI][10737392] OakCli started.... !
2017-01-02 20:42:03.233: [ OAKCLI][10737392] SERVER IPC CONNECT STR: (ADDRESS=(PROTOCOL=IPC)(KEY=__OAKD_CONNECT_STR))
2017-01-02 20:42:03.257: [ OAKCLI][10737392] Cli's version is: 2
2017-01-02 20:42:03.293: [CLSFRAME][10737392] Inited lsf context 0xf11da0
2017-01-02 20:42:03.329: [CLSFRAME][10737392] Initing CLS Framework messaging
2017-01-02 20:42:03.347: [CLSFRAME][10737392] New Framework state: 2
2017-01-02 20:42:03.365: [CLSFRAME][10737392] M2M is starting...
2017-01-02 20:42:03.390: [ CRSCOMM][10737392] Ipc: Starting send thread
2017-01-02 20:42:03.420: [ CRSCOMM][1468736] Ipc: sendWork thread started.
2017-01-02 20:42:03.586: [ COMMCRS][10590528]clsc_connect: (0x7f53f80009a0) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=__OAKD_CONNECT_STR))

2017-01-02 20:42:03.619: [ CRSCOMM][10737392][FFAIL] IpcC: Could not connect to (ADDRESS=(PROTOCOL=IPC)(KEY=__OAKD_CONNECT_STR)) ret = 9
2017-01-02 20:42:03.637: [CLSFRAME][10737392] Failure at IPC connect to server:2
2017-01-02 20:42:03.655: [CLSFRAME][10737392] Unable to start module-to-module comms: 1
2017-01-02 20:42:03.674: [ OAKCLI][10737392] OakCli unable to connect to the server
2017-01-02 20:42:03.692: [ OAKCLI][10737392] OAKD is exiting with exit code: -1

There was a connection problem recorded in the oakcli log. It was an IPC connection and oakcli was compilaning about a listener, but that listener must not have been  a regular database listener.

Then I modified the scripts(mostly perl) residing in the oak directory(/opt/oracle/oak....) and put some debug lines in them to make my own debugging.

With this move, I  could see the flow was actually as follows...

Note that: in parallel, I also checked the log files of the programs that are visited during the flow and had a clearer view of the flow and the endpoint of it.

"oakcli show vm" command --> executes doakcli.pl -> connects to the dcliagent.py listening on the port 8881 -> dcliagent.py tries to reach the "oakd", but it can not reach the oakd and failing with: "DEBUG Failed to connect to oakd"

oakcli could reach the dcliagents running on both of the ODA BASE nodes (listening from the private interconnect ip address and port 8881 ), as it could be seen from the following doakcli.logs. The error OAKERR.:6002 was also in doakcli logs but it was the result of an exception..

2017-01-02 20:47:45,928 doakcli_log DEBUG trying to get the master node
2017-01-02 20:47:45,928 doakcli_log DEBUG Trying to contact http://192.168.16.28:8881
2017-01-02 20:47:46,373 doakcli_log INFO 1 call to isMaster() returned false
2017-01-02 20:47:46,373 doakcli_log DEBUG Trying to contact http://192.168.16.27:8881
2017-01-02 20:47:46,845 doakcli_log INFO 0 call to isMaster() returned false
2017-01-02 20:47:46,845 doakcli_log ERROR OAKERR:6000 Exception encountered in Module getMasterNode
OAKERR:6002 No master node was found

The errors we saw, were actually the results.
The connections to the dcliagents were successful. However, the oakd, which was the core of the process, was not running. 
 Because oakd could not be reached, the agent code were breaking and the agent was saying that "No master node was found"..
It should say oakd was not running.

Anyways, I started to search the answer  for the question "why".. Why wasn't oakd running? We are talking about an Engineered System, right? So why wasn't such an important process running?

In order to be sure, I used the various "ps" commands and concluded that oakd was really not running on any of the ODA Base nodes.

So I tried to start it..
I used oakcli restart oak but it didn't help. Nothing happened, no logs generated.. Probably it was also dependent to oakd.

Then I checked all the files in /etc directory, because I knew that oakd was started automatically during the boot process of ODA Machines.

I used a simple but handy "grep" for this.

cd /etc
grep -R oak *

The grep command displayed lots of lines and most of them were irrelevant.. But seeing the relevant file was not so hard.

The file in question was inittab (/etc/inittab).
The inittab file was containing a line similar to the following;

h3:35:respawn:/etc/init.d/init.oak run >/dev/null 2>&1 </dev/null

I will not go in to the details about inittab (actually I want to but it is out of the scope and I m very tired :) 
However, I must say : what the line above means is, "init.oak script is run automatically during the boot process and relevant runlevel changes. init.oak was configured to be run during by the boot process automatically and it was under the control of Linux.  So in case of a failure, it was configured to be restarted automatically by Linux." 

The interesting thing was, the init.oak file was empty!
It was containing zero lines, zero characters.
I checked another ODA (in another customer environment) and saw that it was filled with the code, the code/script to start/stop/restart the oakd process.

At that moment, I concluded that, the problem was in init.oak. It was empty!
This was also explaining why I could start oakd manually.. That is, I could start oakd manually for working around this error as I documented in the following post almost 1 year ago. 

This workaround was written for the same issue .. The same issue in the same ODA...

Thanks to my customer, who didn't accept this workaround, I analyzed the problem deeper and I m here today for giving you the fix:)

As for the fix,  I copied the init.oak from another ODA.. 
I copied it to both of the ODA BASE nodes (to /etc/init.d directories) and then run "/etc/init.d/init.oak run" command in these ODA BASE nodes.
The issue dissapeared without a need to reboot ODA BASE domain. Moreover, oakd could start without any problems after rebooting the ODA BASE nodes.

Note: the reason why init.oak was empty, is still a mystery.. Who or what deleted its contents is uncertain.. No trace or logs for it.. I will also share it with you , if I 'll find that reason..

No comments :

Post a Comment