Wednesday, June 4, 2014

Rdbms-- Standby problem analysis -- a real life example

To check whether if our standby services are running; we check 2 things in general. 1) Are the changes in our primary database are transffered to our standby datasbaee? 2) Are those changes actually applied  to our Standby Databases? The second control ,which is meant to check the MRP services, is very important, as we cant say that our standby services are working properly without seeing the logs are actually applied to our standby databases.
So to control or schedule an alert for standby services should be in two phases.
- check if our logs are transferred , if they are transferred then check if they are applied

In this blog post; I will demonstrate a problematic standby environment , which seems not working even at first glance.

In this kind problematic standby situation, I usually check both the primary and standby sites..

Now suppose that we receive an email from the Customer Site, complaining about a Standby problem; here is what we do for diagnosing the situation in general;

Production Database
---------------------------------------------------------------------------------
We check if the changes are transferred to the Standby site?

select sequence#
from v$archived_log
where standby_dest='NO'
minus
select sequence#
from v$archived_log
where standby_dest='YES'
minus
select min(sequence#) from v$archived_log

 SEQUENCE#
----------
     81092
     81093
     81096
     81101
     81143
     81144
     81146
     81153
     81154
     81155
     81158

 SEQUENCE#
----------
     81159
     81168
     81172
     81176
     81177

In this case there are gaps as you see.. So by looking to the output of above, we can say that there is something wrong between archiver and rfs .. We can take 81092 and 81093 as our references for diagnosing further.. It is also interesting that after 81093, the logs have continued to be transferred ( 81094, 81095) till the log with sequence no 81096

Then we open alert log of our production databsae and look for the errors, especially error which are recorded while the archiver was processing our problematic archives..

WARN: ARC1: Terminating pid 11757 hung on an I/O operation
WARN: ARC1: Terminating pid 11763 hung on an I/O operation
krsv_proc_kill: Killing 1 processes (Process by index)
krsv_proc_kill: Killing 1 processes (Process by index)
ARC1: Error 16198 due to hung I/O operation to LOG_ARCHIVE_DEST_2

Hmm.. As we see above, it seems we had an IO problem.. Note that : this kind of IO problem can be related with filesystem, permissions or even this problem can be seen  because of the network  between the prod and the standby sites.

Note that : after analyzing the alert log a little further, we see that archive log was terminated and restart after these errors..

Next thing we check is the trace of the problematic Archiver Process.

RFS network connection lost at host 'PRODSTDB' error 3113
Fail to ping standby 'PRODSTDB', error = 3113 
*** 2014-06-03 14:59:43.207 4320 krsh.c
ARC2: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (3113)
Retry every 5 seconds up to 1 times
Redo shipping client performing standby login
*** 2014-06-03 14:59:43.284 4645 krsu.c
Logged on to standby successfully 
Client logon and security negotiation successful!
RFS network connection lost at host 'PRODSTDB' error 1089

Hmm , as you see above , we had a Network problem in that time..
Those info almost explain the missing archives on the standby site..

Okay, we can handle that.. But lets look at the standby Site.

Standby Database
---------------------------------------------------------------------------------

We continue our checks in the Standby Site.
First of all , we check the standby service processes;


SQL> select process ,block#,sequence#,status from v$managed_standby;

PROCESS       BLOCK#  SEQUENCE# STATUS
--------- ---------- ---------- ------------
ARCH               0          0 CONNECTED
ARCH               0          0 CONNECTED
ARCH               0          0 CONNECTED
ARCH               0          0 CONNECTED
RFS                0          0 IDLE
RFS                0          0 IDLE
RFS            41447      81232 IDLE

Hmm, in this example as you see above, there are no MRP processes running. This is a big problem..  It seems we are not applying any redo in to this standby database..

Information about the processes:
  • Remote file server (RFS)
    The remote file server (RFS) process receives redo data from the primary database either in the form of archived redo logs or standby redo logs.
  • Archiver (ARCn)
    If standby redo logs are being used, the ARCn process archives the standby redo logs that are to be applied by the managed recovery process (MRP).
  • Managed recovery process (MRP)
    The managed recovery process (MRP) applies information from the archived redo logs to the standby database. When performing managed recovery operations, log apply services automatically apply archived redo logs to maintain transactional synchronization with the primary database.

Next, we check the alert log file of the standby database and see the following;

RFS[1063]: Selected log 5 for thread 1 sequence 81092 dbid 209281643 branch 797443307

So , RFS selects the problematic log, but not archives it..
Actually, RFS selects the log and then encounters the following

RFS[1064]: Possible network disconnect with primary database
Tue Jun 03 14:02:00 2014
RFS[1065]: Possible network disconnect with primary database
Tue Jun 03 14:02:00 2014
RFS[1063]: Possible network disconnect with primary database

And when we check the trace of RFS process; we see the network errors..

*** 2014-06-03 14:02:00.179 1684 krsb.c
krsb_stream_receive: Error 3113 reaping meta-data buffer
*** 2014-06-03 14:02:00.180 4320 krsh.c
RFS[1063]: Possible network disconnect with primary database

So we have 2 problems now;
1)some log file are not transferred because of a network problem
2)Our Mrp process is not running.

we can handle problem 1, as the network will be one way or another..
For diagnosing problem 2; we execute the following query.
Note that : this query reports both max sequence no of transferred logs and max sequence that was applied in to the Standby Database;

SELECT ARCHIVED_THREAD#, ARCHIVED_SEQ#, APPLIED_THREAD#, APPLIED_SEQ#
  FROM V$ARCHIVE_DEST_STATUS;  

ARCHIVED_THREAD# ARCHIVED_SEQ# APPLIED_THREAD# APPLIED_SEQ#
---------------- ------------- --------------- ------------
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0

ARCHIVED_THREAD# ARCHIVED_SEQ# APPLIED_THREAD# APPLIED_SEQ#
---------------- ------------- --------------- ------------
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0

ARCHIVED_THREAD# ARCHIVED_SEQ# APPLIED_THREAD# APPLIED_SEQ#
---------------- ------------- --------------- ------------
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               0             0               0            0
               1         81235               1        46598

As you see , we have a huge difference.. We didnt apply any logs to the standby database since the log with sequence number 46498 and we are at 81235 in terms of log sequence..

Lastly we check the alert log of the standby database.. We search for 46498.. We want to see what happened after that, and here it is..

Sun Nov 03 09:02:01 2013
Media Recovery Log /backup/prod/fast_recovery_area/o1_mf_1_46598_97cx751r_.arc
WARNING: File being created with same name as in Primary
Existing file may be overwritten
Errors in file /u1/oracle/PROD/db/tech_st/11.2.0.3/test/PROD/trace/PROD_pr00_9494.trc:
ORA-01119: error in creating database file '/u02/test/system12.dbf'
ORA-27054: NFS file system where the file is created or resides is not mounted with correct options
Linux-x86_64 Error: 13: Permission denied
File #23 added to control file as 'UNNAMED00023'

So , here is the problem... A file named system12.dbf was created in the Production Database, and because there is not a similar directory  in the Standby site, this file could not be created in the standby site .. Normally, this unfortunate situation prevents our Log apply services from applying the logs.. If we had used DB_FILE_NAME_CONVERT parameter set, then there would be no problems because of this action, As this parameter converts the filename of a new datafile on the primary database to a filename on the standby database..

In conclusion, 

We have seen how to check a standby database environment in case of a problematic situation. Also, we could use the same controls for a daily check routine.  

More importantly, in this post, we have seen a problematic situation that have made us recall the following facts:

Use db_file_name_convert if you have a different directory structure in Standby database..
Always check applied logs...Checking transferred logs alone is not enough .

No comments :

Post a Comment

If you will ask a question, please don't comment here..

For your questions, please create an issue into my forum.

Forum Link: http://ermanarslan.blogspot.com.tr/p/forum.html

Register and create an issue in the related category.
I will support you from there.