Saturday, August 8, 2015

ODA X4-2 -- Virtualized platform, ASMResilver running all the time, Bug 20438706, ASM, resilvering, asmResilver2

We may notice a performance problem caused by an asmResilver process that runs all the time.
The issue can be encountered on ODA X4-2 virtualized platforms, and it may start after an ODA node is restarted.
After the restart, we may see the asmResilver2 process start and run forever. It cannot complete its work and stays stuck.
As a result, the overall performance of the virtualized system decreases.

I have explained ASM resilvering in general in one of my previous posts: http://ermanarslan.blogspot.com.tr/2015/05/exadata-asm-resilvering-vs-asm-rebalance.html

So, in this post, I will focus on the resilvering bug in ODA rather than explaining what ASM resilvering is.

Two major threads take part in the ASM resilvering process. One of them is asmResilver1. This thread starts in a recovery situation, such as when a node was aborted and then started again. asmResilver1 checks whether a mirror recovery is required and, if so, begins the recovery.
asmResilver2 is the other thread that takes part in the ASM recovery/resilvering process. It is triggered when the cluster membership changes. Such changes may be the result of a node crash or an ASM instance crash.
asmResilver2 checks whether a volume needs to be recovered and performs the recovery if it is needed.
If there is an ongoing mirror recovery operation on the volume that needs to be recovered, that operation is aborted and this thread restarts the recovery process.
That is why, after a node failure, we can see asmResilver1 start immediately during the restart of the node and then get aborted, after which asmResilver2 starts. This is actually expected behaviour.

This type of recovery can be triggered by the following events:
1. System reboot/crash while volumes are mounted.
2. ASM shutdown with volumes mounted (shutdown abort/immediate)
3. Forced diskgroup dismount while volume is open.

The recovery can also be aborted by the following events:
1. Restarting an ASM instance anywhere in the cluster.
2. Rebooting a node anywhere in the cluster.
3. Unmounting any mirrored volume anywhere in the cluster.

Moreover, if an ongoing mirror recovery is aborted, the recovery will start from the beginning when it is restarted.

So, if we restart a node often, we may see asmResilver2 running all the time: it may not finish its work between two restarts, and because it has to start from the beginning each time, the recovery may take a very long time to complete.
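
The effect of "start from the beginning after every abort" can be illustrated with a small model. This is only a sketch: the function name, the 10-hour recovery time and the restart schedule are made-up values for illustration, not numbers taken from ODA.

```python
def recovery_finish_time(recovery_hours, restart_times_hours):
    """Return the hour at which a mirror recovery finally completes.

    recovery_hours: uninterrupted time one full recovery needs (assumed value).
    restart_times_hours: sorted times (in hours) at which a restart aborts
    any recovery that is still running.
    """
    start = 0.0  # the first recovery attempt begins at hour 0
    for restart in restart_times_hours:
        if start + recovery_hours <= restart:
            # The attempt finished before this restart happened.
            return start + recovery_hours
        # Aborted: all progress is lost and the recovery begins again.
        start = restart
    # No more restarts: the last attempt runs to completion.
    return start + recovery_hours


# A recovery that needs 10 hours, on a node restarted every 6 hours,
# only completes after the restarts stop.
print(recovery_finish_time(10, [6, 12, 18, 24]))
# With a 5-hour recovery the first attempt already succeeds.
print(recovery_finish_time(5, [6, 12, 18, 24]))
```

In this model, a recovery that needs longer than the gap between restarts never finishes while the restarts continue, which matches the behaviour described above.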

But the problem that made me write this blog post is not like that.
The problem I am talking about was that asmResilver2 kept running at 100% CPU usage all the time, although we had not rebooted a node for weeks.
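
A quick way to spot such a long-running, CPU-bound resilver process is to filter the output of `ps -eo pid,pcpu,etime,comm`. The following is a minimal sketch under some assumptions: the sample text is hypothetical (not captured from a real ODA), the thresholds are arbitrary, and I assume the process shows up in `ps` with a name containing "asmResilver". Note also that the %CPU column of `ps` is an average over the lifetime of the process.

```python
def parse_etime(etime):
    """Convert the ps ELAPSED format [[dd-]hh:]mm:ss into seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:  # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_stuck(ps_output, name="asmResilver", min_cpu=90.0, min_hours=12):
    """Return (pid, cpu, etime, command) for matching long-running processes."""
    stuck = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, pcpu, etime, comm = fields
        if (name.lower() in comm.lower()
                and float(pcpu) >= min_cpu
                and parse_etime(etime) >= min_hours * 3600):
            stuck.append((int(pid), float(pcpu), etime, comm.strip()))
    return stuck


# Hypothetical sample in the format of: ps -eo pid,pcpu,etime,comm
SAMPLE = """\
  PID %CPU     ELAPSED COMMAND
 4321 99.8 21-03:12:45 asmResilver2
 4400  0.1       05:10 sshd
"""

for entry in find_stuck(SAMPLE):
    print(entry)
```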

This kind of issue can be seen from the logs. The mirrored regions per second and the resilvering window can be estimated from them.
So, by considering the volume size and the total resilvered region size, we can decide whether the behaviour is normal or not.
For example:

If the entire volume is 4 TB, and we see that 1.07 TB of this 4 TB has been resilvered in 19 hours while the resilvering process is still running, then this is evidence of a hang or an endless loop, as the resilvering process cannot have this much data to work on within a normal DRL aging interval. (This is what we actually saw in our case.)
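
The arithmetic behind this example can be sketched as follows. The function name is my own, and I assume binary units (1 TB = 1024^4 bytes); the post's figures do not say which unit convention the logs use, so treat the results as rough estimates.

```python
TB = 1024 ** 4  # assumption: binary terabytes
MB = 1024 ** 2


def resilver_stats(volume_bytes, resilvered_bytes, elapsed_hours):
    """Summarise how much of a volume was resilvered and how fast."""
    seconds = elapsed_hours * 3600
    return {
        # Share of the whole volume that has been resilvered so far.
        "fraction_of_volume": resilvered_bytes / volume_bytes,
        # Average resilvering throughput over the elapsed time.
        "mb_per_sec": resilvered_bytes / MB / seconds,
    }


# Figures from the example: 4 TB volume, 1.07 TB resilvered in 19 hours.
stats = resilver_stats(4 * TB, 1.07 * TB, 19)
print(f"{stats['fraction_of_volume']:.1%} of the volume resilvered, "
      f"about {stats['mb_per_sec']:.1f} MB/s on average")
```

If more than a quarter of a multi-terabyte volume has been resilvered and the process is still running, the work is far larger than the set of regions that could have been dirty at the time of the failure, which is what points to a loop rather than a normal recovery.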

Anyway, I will keep it short: the hang or endless loop that I explained above was actually caused by a bug.
The bug was triggered for volumes greater than 2 TB. Because of this bug, the resilvering code was trying to recover the same volume regions repeatedly, causing an endless loop.
Because the same regions were resilvered over and over, the total resilvered region count kept increasing, and that is why we were seeing such huge total resilvered region counts.

So, for this kind of problem:
The cause is Bug 20438706, and the fix is to apply patch 20438706.
The patch should be applied on the ODA BASE nodes, where the GI homes are located.
The patch should be applied by following its readme file, and all the VMs in the ODA environment should be stopped before applying it.

In brief, if we have an ODA X4-2 virtualized platform, we should keep in mind that there is a bug in the resilvering code, and it must be fixed before we have volumes greater than 2 TB in size.

2 comments :

  1. Thanks for this excellent post. It's been very useful for me, as I cannot find specific information regarding ASM resilvering, especially for general ACFS filesystems without ODA or Exadata. Can you suggest any other place to look for deeper information about the asmresilver processes in ACFS?

  2. Hi Alberto,

    The documentation for this is not publicly available (at least, I could not find a detailed document explaining the asmresilver process).
    I suggest you make a list of the things you want to know about it and ask in Oracle Support, in the Oracle Community, or in my forum (I don't promise that I will give you all the answers :) but I can do some research on it).
