Monday, January 26, 2015

Linux -- I/O hang : io_schedule+0x42/0x5c, corrupted controlfile , scsi error , sd 0:0:0:0: timing out command, waited 360s

This post will  be about an I/O error which caused a  controlfile of an EBS Oracle Database to be corrupted.
Recoverying the controlfile was not a big challenge, as we had multiplexed controlfiles for this database.
Recoverying a controlfile problem: shutdown the databases; delete the problematic controlfile; copy a proper controlfile with the name of the deleted controlfile; open the database.. That's it.

As I said; recoverying from a controlfile corruption was not a big deal..
On the other hand, diagnosing this kind of errors was a little tricky. 

In this post; you will find diagnostics in OS level (linux) .. Diagnostics for finding the cause of such an error on I/O subsystem..  We have to diagnose these kind of errors.. We cant let it go, because it is a production system and there is no guarantee that we will not end up with a corrupted system datafile at next time the error occurs..

The system info:

Production
OS: Oracle Linux 5.8 64 bit
Kernel : 2.6.32.300.10.1.el5uek
File systems : ext3 , db is on non-asm ext3 filesystems.
Multipath was not configured.

Diagnostics:
  
INFO: task kjournald:2372 blocked for more than 120 seconds.
Jan 26 06:55:55 ermandb kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 26 06:55:55 ermandb kernel: kjournald     D 0000000000000000     0  2372      2 0x00000000
Jan 26 06:55:56 ermandb kernel:  ffff8802358ffc40 0000000000000046 ffff8802358ffc10 ffffffffac0a666e
Jan 26 06:55:56 ermandb kernel:  ffff88022ec74040 ffff8802372da880 ffff88022ec74410 ffff880028200000
Jan 26 06:55:56 ermandb kernel:  ffff8802358ffcd0 ffff88002801d0b0 ffff8802358ffc20 ffff88022ec74040

Seemed kjournald was in hung state.. kjournald is the deamon for the journaling feature of EXT 3 filesystem. Journaling in Linux works like redolog mechanism in Oracle.. Kjournal works like LGWR and writes the changes in to a special area—the journal— So After a crash, recovery simply involves reading the journal from the file system and replaying changes from this journal until the file system is consistent again.

It seemed like kjournald is not the only thing which is hung state; Here is an oracle process which was in hung state caused by exactly the same problem;

INFO: task oracle:5608 blocked for more than 120 seconds.
Jan 26 06:55:56 ermandb kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 26 06:55:56 ermandb kernel: oracle D ffff8802295fc7c0 0 5608 1 0x00000080
Jan 26 06:55:56 ermandb kernel: ffff880226eddb58 0000000000000086 ffff880226eddb18 ffffffffac0a666e
Jan 26 06:55:56 ermandb kernel: ffff880226b72300 ffff88001839a180 ffff880226b726d0 ffff880028200000
Jan 26 06:55:56 ermandb kernel: ffff880226eddbe8 ffff88002802a9e8 ffff880226eddb38 ffff880226b72300

Last executed routine was io_schedule. Here is the call trace of the hung processes;

Jan 26 06:55:56 ermandb kernel: [<ffffffff8145480c>] io_schedule+0x42/0x5c
Jan 26 06:55:56 ermandb kernel: [<ffffffff8114044c>] sync_buffer+0x2a/0x2e
Jan 26 06:55:56 ermandb kernel: [<ffffffff81454d3b>] __wait_on_bit+0x4a/0x7c
Jan 26 06:55:56 ermandb kernel: [<ffffffff81140422>] ? sync_buffer+0x0/0x2e
Jan 26 06:55:56 ermandb kernel: [<ffffffff81140422>] ? sync_buffer+0x0/0x2e
Jan 26 06:55:56 ermandb kernel: [<ffffffff81454de0>] out_of_line_wait_on_bit+0x73/0x80
Jan 26 06:55:56 ermandb kernel: [<ffffffff81077039>] ? wake_bit_function+0x0/0x2f
Jan 26 06:55:56 ermandb kernel: [<ffffffff811c1206>] ? __journal_temp_unlink_buffer+0x1a/0xf4
Jan 26 06:55:56 ermandb kernel: [<ffffffff8114039b>] __wait_on_buffer+0x24/0x26
Jan 26 06:55:56 ermandb kernel: [<ffffffff811c2fa5>] wait_on_buffer+0x31/0x35
Jan 26 06:55:56 ermandb kernel: [<ffffffff811c3555>] journal_commit_transaction+0x4d9/0xe38
Jan 26 06:55:56 ermandb kernel: [<ffffffff8104b334>] ? finish_task_switch+0x52/0xab
Jan 26 06:55:56 ermandb kernel: [<ffffffff8106511b>] ? lock_timer_base+0x2b/0x4f
Jan 26 06:55:56 ermandb kernel: [<ffffffff810651b2>] ? try_to_del_timer_sync+0x73/0x81
Jan 26 06:55:56 ermandb kernel: [<ffffffff811c7057>] kjournald+0xcd/0x211
Jan 26 06:55:56 ermandb kernel: [<ffffffff81076ffc>] ? autoremove_wake_function+0x0/0x3d
Jan 26 06:55:56 ermandb kernel: [<ffffffff811c6f8a>] ? kjournald+0x0/0x211
Jan 26 06:55:56 ermandb kernel: [<ffffffff81076c53>] kthread+0x6e/0x76
Jan 26 06:55:56 ermandb kernel: [<ffffffff81012dea>] child_rip+0xa/0x20
Jan 26 06:55:56 ermandb kernel: [<ffffffff81076be5>] ? kthread+0x0/0x76
Jan 26 06:55:56 ermandb kernel: [<ffffffff81012de0>] ? child_rip+0x0/0x20
Jan 26 06:55:56 ermandb kernel: INFO: task oracle:5608 blocked for more than 120 seconds.
Jan 26 06:55:56 ermandb kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 26 06:55:56 ermandb kernel: oracle D ffff8802295fc7c0 0 5608 1 0x00000080
Jan 26 06:55:56 ermandb kernel: ffff880226eddb58 0000000000000086 ffff880226eddb18 ffffffffac0a666e
Jan 26 06:55:56 ermandb kernel: ffff880226b72300 ffff88001839a180 ffff880226b726d0 ffff880028200000
Jan 26 06:55:56 ermandb kernel: ffff880226eddbe8 ffff88002802a9e8 ffff880226eddb38 ffff880226b72300

The cause have appeared in the following lines ;

Jan 26 06:55:59 ermandb kernel: sd 0:0:0:0: timing out command, waited 360s
Jan 26 06:55:59 ermandb kernel: sd 0:0:0:0: [sda] Unhandled error code
Jan 26 06:55:59 ermandb kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jan 26 06:55:59 ermandb kernel: sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 1c d1 bb 3d 00 00 20 00
Jan 26 06:55:59 ermandb kernel: end_request: I/O error, dev sda, sector 483507005
Jan 26 06:55:59 ermandb kernel: Buffer I/O error on device sda4, logical block 47383555
Jan 26 06:55:59 ermandb kernel: lost page write due to I/O error on sda4
Jan 26 06:55:59 ermandb kernel: Buffer I/O error on device sda4, logical block 47383556
Jan 26 06:55:59 ermandb kernel: lost page write due to I/O error on sda4
Jan 26 06:55:59 ermandb kernel: Buffer I/O error on device sda4, logical block 47383557
Jan 26 06:55:59 ermandb kernel: lost page write due to I/O error on sda4
Jan 26 06:55:59 ermandb kernel: Buffer I/O error on device sda4, logical block 47383558

So , it was obvious.. The problem was caused by an I/O error.
Strange thing was that ; we were seeing CIFS error in dmesg output.. 
These errors seemed to be encountered just before the I/O errors in /dev/sda..

CIFS VFS: No response for cmd 114 mid 8857
CIFS VFS: No response for cmd 114 mid 8878
CIFS VFS: No response for cmd 114 mid 8883
CIFS VFS: No response for cmd 114 mid 25488
CIFS VFS: No response for cmd 114 mid 25565
CIFS VFS: No response for cmd 114 mid 25583
CIFS VFS: No response for cmd 114 mid 25612
CIFS VFS: No response for cmd 114 mid 25629
CIFS VFS: No response for cmd 114 mid 25662
CIFS VFS: No response for cmd 114 mid 25667
CIFS VFS: No response for cmd 114 mid 25696
CIFS VFS: No response for cmd 114 mid 25729
CIFS VFS: No response for cmd 114 mid 25758
CIFS VFS: No response for cmd 114 mid 25799


There were bugs records about CIFS in the kernels below 2.6.32-400.33.1
Bug 18248478 : CIFS LEADS TO KERNEL PANIC --> Upgrade to UEK1 kernel-2.6.32-400.33.1 or later.
SAMBA DATA (CIFS MOUNT) CORRUPTION REPORTED ON UEK2 (Doc ID 1949715.1)

There were bugs but we did not have a kernel panic.. Besides; we had corruptions/IO errors in local disk partitions... So these CIFS problems were not related with the issue..

So what was the cause then?  Was it OS or SAN (the disks were on storage) ?

We have seen an issue record similar to ours, the issue record was saying that it was an OS - Filesystem problem...
File System issues lead to RAC Instance hangs (Doc ID 1510727.1)

On the other hand; according to that record, we had to see the following in /var/log/messages : kernel: attempt to access beyond end of device..  But we did not have such a line in /var/log/messages..

What we have seen actually was the Scsi errors in dmesg output above... They were important errors which have affected our decision about the cause of this problem..

Jan 26 06:55:59 ermandb kernel: sd 0:0:0:0: timing out command, waited 360s
Jan 26 06:55:59 ermandb kernel: sd 0:0:0:0: [sda] Unhandled error code
Jan 26 06:55:59 ermandb kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Jan 26 06:55:59 ermandb kernel: sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 1c d1 bb 3d 00 00 20 00
 
So actually we had a problem in scsi level.. We have started to concantrate on this one actually..
Here is a bug record for this. It was for Exadata but it did not matter , at the end of the day; it gave us the clue.
Exadata Compute Node RAID Controller Failed (Doc ID 1362174.1)..

So as we had errors is scsi level ; then we thought that we had the problem in the levels below scsi.. HBA driver, HBA ,storage controller or physical disks..


There was also an opportunity that the error could be caused by the scsi drivers, but most probably the cause was a non-responsive scsi target..
Anyways, at this point; we have redirected the problem to the SAN admins.. We have requested them to check the SAN itself and all the I/O Paths used for reaching the disks on it.

Conclusion: 

The error was caused by the SANs.. SANs were rebooting theirselves , and as there was no multipath configuration which could tolerate this , we ended up with the errors in filesystems, corruptions in Oracle's controlfiles and dead EBS application processes which were caused by the read only filesystems...

No comments :

Post a Comment