Thursday, November 24, 2016

RAC -- ORA-01578: ORACLE data block corrupted, "computed block checksum: 0x0", Bad header found during dbv

Yesterday, an interesting issue was escalated to me.
It was a production 2-node RAC environment, and the instances on node 1 could not be started.
It was interesting because the instances on node 2, which belong to the same databases as the instances on node 1, could be started and used without any errors.
The instances on node 1 were seeing the disks, but they were reporting lots of corruptions.
dbv, when executed from node 1, was also reporting lots of corruptions, but the interesting thing was the computed block checksum.
The computed block checksum reported by dbv on node 1 had the value 0x0, which means the checksum was okay.
Even so, dbv was still reporting the blocks as corrupt.
The corruptions were actually being reported for the contents of the blocks.
So the checksum was okay, but the contents of the blocks were not as expected.
In other words, Oracle Database and dbv were thinking that the problematic blocks should not be in their current location, as they belonged to some other place in the datafile/datafiles.
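
To illustrate how a block can pass its checksum and still be wrong for its location, here is a small toy sketch in Python. It is not Oracle's actual block format or checksum code; the layout (a 4-byte address, a 2-byte checksum, then the payload) and the function names xor16, make_block and verify are assumptions made up for this illustration. The only idea it borrows is that the checksum is stored so that recomputing it over the whole block gives 0 when the block is intact.

```python
import struct

def xor16(block: bytes) -> int:
    """XOR all little-endian 16-bit words of the block together."""
    result = 0
    for (word,) in struct.iter_unpack("<H", block):
        result ^= word
    return result

def make_block(address: int, payload: bytes) -> bytes:
    """Toy layout (an assumption): 4-byte address, 2-byte checksum, payload."""
    body = struct.pack("<IH", address, 0) + payload
    checksum = xor16(body)  # chosen so the finished block XORs to 0
    return struct.pack("<IH", address, checksum) + payload

def verify(block: bytes, expected_address: int):
    """Return (checksum_ok, address_ok) for a block read from expected_address."""
    checksum_ok = xor16(block) == 0
    (stored_address,) = struct.unpack_from("<I", block)
    return checksum_ok, stored_address == expected_address

# A healthy block that really belongs at address 0x1f6179e3.
block = make_block(0x1f6179e3, b"rowdata!")

print(verify(block, 0x1f6179e3))  # (True, True)
print(verify(block, 0x00400be3))  # (True, False): intact, but in the wrong place
```

The second call is the situation described above: nothing inside the block is damaged, so the checksum is fine, but the address stored in the block does not match the place it was read from.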

Here is an example output that I gathered from the environment.
It is a dbv output, produced for a datafile that had corrupted blocks.

Page 3043 is marked corrupt
Corrupt block relative dba: 0x00400be3 (file 1, block 3043)
Bad header found during dbv:
Data in bad block:
type: 6 format: 2 rdba: 0x1f6179e3
last change scn: 0x0008.a95bf9e0 seq: 0x1 flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0xf9e00601
check value in block header: 0xfe88
computed block checksum: 0x0

The important values and strings in this output were:

rdba: 0x1f6179e3  --> this is hex. When converted to binary, its first 10 bits correspond to the file number. In this case, it is file 125.
Corrupt block relative dba: 0x00400be3  --> file 1, block 3043
computed block checksum: 0x0
-Bad header found during dbv-
So, again, the checksum was okay, but the rdba stored in the block was different from the rdba of the block that dbv was reading. dbv reported "Bad header found" as well. So the placement issue was obvious. In other words, the blocks were healthy (computed block checksum is 0x0), but their contents were actually the contents of different blocks.
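
The conversion above can be checked with a few lines of Python. This is a minimal sketch, assuming the usual smallfile layout where the upper 10 bits of the 32-bit rdba are the relative file number and the lower 22 bits are the block number. The function name decode_rdba is just something I made up for the example.

```python
FILE_BITS = 10   # upper bits: relative file number (assumed smallfile layout)
BLOCK_BITS = 22  # lower bits: block number within the file

def decode_rdba(rdba: int):
    """Split a 32-bit rdba into (relative file number, block number)."""
    file_no = rdba >> BLOCK_BITS
    block_no = rdba & ((1 << BLOCK_BITS) - 1)
    return file_no, block_no

# Values taken from the dbv output above.
location = decode_rdba(0x00400be3)  # where dbv was reading
header = decode_rdba(0x1f6179e3)    # what the block header claims

print(location)  # (1, 3043)
print(header)    # (125, 2193891)

if location != header:
    print("block header does not match its location: misplaced block")
```

So the block found at file 1, block 3043 was claiming to be a block of file 125, which is the mismatch that dbv reports as a bad header.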

To solve this, I checked the OS layer of node 1, since node 2 was not encountering the problem.

At first, I suspected the storage, as the redundancy of the diskgroups was "external". However, the problem turned out to be in the multipath configuration.

The problem was in the multipath, since there were conflicting paths.

Here is what I mean by conflicting paths:

There was an ASM disk named IBM19 (the customer was using asmlib).

This disk was based on a multipath device called mpath13. That is, IBM19 was created using mpath13.

When I used multipath -ll, I saw that mpath13 was there.

However, when I checked the situation from the ASM perspective using oracleasm querydisk, I saw that the disk IBM19 was based on 2 devices, mpath13 and mpath40. mpath40 was displayed in the first line of the oracleasm querydisk output, and since asmlib disks go through the first path that the OS gives them, Oracle was reaching the disk through the wrong path. (It should have gone through mpath13, but it was going through mpath40.)

Note that node 2 was also seeing the ASM disk IBM19 in the same way. The only difference was that on node 2, mpath13 was displayed in the first line of the oracleasm querydisk output, so Oracle was reaching the disk through mpath13, and thus there were no problems on node 2.

--mpath40 was based on newly added disks, and it was not included in any ASM diskgroup, although it was formatted using "oracleasm createdisk".

So, the multipath devices were conflicting. oracleasm was somehow mixing them up.

In other words, Oracle was using mpath40 to reach the disk that was supposed to be reached through mpath13, and thus Oracle was reaching the wrong disks.

To fix this, I removed the devices that mpath40 was based on. I used echo "scsi remove-single-device" (written to /proc/scsi/scsi) and then executed multipathd.

The problem went away. After that, I added the devices back using echo "scsi add-single-device", and the conflict didn't appear again.

At the end of the day, the instance on node 1 could be started. (I only recreated the undo tablespace of node 1, because after the database went into the open state, undo block corruptions were reported, probably caused by the earlier instance terminations.)
