Friday, July 24, 2020

Database Forensics - Incomplete/Interrupted Tablespace Drop operation / INTERNAL DDL / cause & effect

Being fast is an advantage. However, it always involves some risks.

Yes! it is priceless to complete the required work quickly.  However; there is always a big risk, when we try to be fast, try to gain some time during an important work.. For instance, gaining some time by not doing the checks we have to do...

As I always say, as long as there is a mistake that can be corrected, mistakes can always be made.
The real problem is making a mistake that cannot be corrected.
The same rule applies in the database world.. Recovarable, interruptible and restartable operations are safer.. On the other hand, we need to approach the unrecoverable and uninterruptable operations more seriously.


After these important advices, I want to tell you an interesting story.
Don't worry :) We will not switch our context, so as I mentioned in the title, this topic is in the database forensics field. Maybe, that's why it sounds interesting to me..

Did you ever encounter a situation where some tables in the database dropped automatically and internally by Oracle itself? Well, I did...

Did you ever see a situation where your datafiles are present but some of tables in the relevant tablespace are missing? I saw this too :)

In this story, we will be dealing with some dropped tables. Some important ones..

Suppose the business users or some functional super users called you and told that they are getting some errors on their application. Normally, what you would do is, checking the error and finding a solution, right? What you would generally see is an expection or an error stack that tells you the reasons, and with those reasons, you do some searches on the internet, or in the vendor's support website and consider applying the recommended solutions.

However, what if you conclude that the problem is much more critical? What if the error you see in that error stack tells you that some of the tables are missing.
Consider a scenario where you have thousands of tables dropped in a second and you realize it some time after..

Well, in this kind of a stiuation, you will have 2 important tasks.

First one is, you will have restore those tables, and recover them ..
This is logical, right :) However, there are some challenges for this.

For instance, you can't flashback a table if it was dropped with purge option.

If the application continued to run even after some tables were dropped, then Flashback database is not an option either.. I mean Flashback database option will flashback the database but then you will lose the changes that were done on the other tables which were available during this period between the time that tables were dropped and the time you realized that.

RMAN restore and databsae recovery are your options but you will have to be fast.. In addition you should have built a restore and recovery mechanism and a supporting infrastructure according to your RPO and RTO to use rman restore and recovery in this kind of a situation.

Using Log Miner  is another option.. You can mine the change operations which have occured after those tables were dropped and then apply them as DML instructions, after you flasback the database just before the point in time those tables were dropped.

So no need to panic :) -- Of course if you are an expert in these things and know what you are doing.

The second task is finding the reason behind these dropped tables. Who or what dropped them? What kind of a statement caused this to happen?

Well, in order to complete this task, you will check the following;

First, you will check database audit records. (if audit is enabled)

If audit is not enabled or not configured correctly, you will check the redologs/archivelogs generated during the time that tables were dropped.  This way, at least you can see the statement that was caused the issue and if you are lucky, you can also see the database user that executed the statement.

If the operation was a SYS operation, I mean if the operation was done using as sysdba as sysoper, and if AUDIT_SYS_OPERATIONS parameter was set to true, you can also see the statement and the user that executed it, by checking the OS audit files.

There are other methods to find the cause, but let's focus on our story.

So we have some dropped tables, and we checked the audit trail, but it was empty. (db audit was disabled)
Then we checked the redologs generated during that period using Log Miner and saw that those tables were dropped using DROP table purge statements and these statements/operations was recorded as INTERNAL DDL operations in redo. INTERNAL DDL means that, it is done by Oracle itself. Weird, but is was clearly seen so in v$logmnr_contents..

Alert log was not giving us any clues, so we checked the SYS aud files.
In one of the SYS aud files, we saw a DROP TABLESPACE INCLUDING CONTENTS AND DATAFILES statement. The tablespace name used in this operations was the one that the dropped tables belong. However; all the data files of that tablespace was still there, they were online and the tablespace looked healthy.

Probably, there was attempt for dropping the tablespace, maybe a wrong attempt.. There was an attempt, but it was clear that it wasn't successful, because we had all the data files in place.

Still, we focused on that.. We checked the internal mechanism of Oracle's tablespace drop and found that, if we execute a drop tablespace statement with include contents and data files clause, first; all the contents are dropped. (Tables etc).. Oracle executes drop commands internally. (This was the cause of that INTERNAL DDL operations recorded in Redologs)
Then, the OS files (db files) are deleted.. Lastly, we get 'Tablespace dropped.' message.

So if we interrupt (for instance using CTRL+c) this drop tablespace process, we may experience unstable situations where our datafiles are present but some of tables in the relevant tablespace are missing.

Well, I think we have already come to the end of this story :) 

A tablespace operation was executed and then it was interrupted using CTRL+C or it was aborted using some other method. So, Oracle started dropping the tables belong to that tablespace, and actually if it could finish dropping them, it would drop all the datafiles and remove the tablespace from the configuration... 
However, it was stopped and could do a little part of its work only. But still, even that little part caused great problems, as those dropped tables were critical..

What about the remaining tablespace and all those data files?  As the dropping the contents part of the drop tablespace operation did not complete, those data files and the tablespace itself were probably healthy. Still, we should made sure that all the data files were online , they didn't have any errors and they didn't need any recovery.
A query which can be used for that -> select file#, substr(name, 1, 50), status, error, recover from v$datafile_header where TABLESPACE_NAME='tablespace_name';

Yes. We came to the end of our article... As we have seen in our story, the faster we detect and realize, the better... This is correct almost in any cases.. Especially, for the problems that require restore and recovery.. This is for sure. The later we find the problem (the later we realize), the harder our work gets. It is needless to say that we should not skip important checkpoints to save time.
So these seem to be my last recommendations in this post.. I hope this helps :)

No comments :

Post a Comment

If you will ask a question, please don't comment here..

For your questions, please create an issue into my forum.

Forum Link: http://ermanarslan.blogspot.com.tr/p/forum.html

Register and create an issue in the related category.
I will support you from there.