Well, I found this magic number, or let's say this magic percentage (15%), a little interesting, and that's why I want to share my thoughts on it with you.
Normally, we have a metric named USABLE_FILE_MB, as you may already know. It may depend on the version, but normally this metric gives us the size that can be safely allocated while still tolerating a disk failure. In the old versions, it reported the safe allocatable size as a value that could be taken as a reference for being safe even in a cell failure.
In simple logic, we can say that we have no risk as long as USABLE_FILE_MB has a positive value and we expect it to stay positive even after potential future allocations.
Moreover, USABLE_FILE_MB is derived from REQUIRED_MIRROR_FREE_MB, which is the space required for a rebalance operation to complete in the worst-case scenario.
The formulas are as follows:
Normal Redundancy
USABLE_FILE_MB = (FREE_MB – REQUIRED_MIRROR_FREE_MB) / 2
High Redundancy
USABLE_FILE_MB = (FREE_MB – REQUIRED_MIRROR_FREE_MB) / 3
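To make the two formulas concrete, here is a minimal Python sketch that just applies them. The function name, the redundancy labels and the sample numbers are my own assumptions for illustration; they are not read from a real diskgroup.

```python
def usable_file_mb(free_mb: float, required_mirror_free_mb: float, redundancy: str) -> float:
    """Apply the formulas above: divide by 2 for normal, by 3 for high redundancy."""
    divisors = {"normal": 2, "high": 3}  # labels are illustrative, not ASM keywords
    if redundancy not in divisors:
        raise ValueError(f"unsupported redundancy: {redundancy!r}")
    return (free_mb - required_mirror_free_mb) / divisors[redundancy]


# Made-up values, only to show the arithmetic.
print(usable_file_mb(1000, 400, "normal"))  # 300.0
print(usable_file_mb(1000, 400, "high"))    # 200.0
print(usable_file_mb(300, 400, "normal"))   # -50.0, i.e. a negative USABLE_FILE_MB
```

The last call shows how the value turns negative as soon as FREE_MB drops below REQUIRED_MIRROR_FREE_MB.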
If USABLE_FILE_MB is negative, we can directly say that normal redundancy environments are in danger, but in any case we can still check FREE_MB. If FREE_MB is bigger than the disk size (assuming the disks are all the same size; if they are not, FREE_MB should be bigger than the largest disk size), we can still rebalance in case of a disk failure.
So far so good. These are all related to disk failures. (As I mentioned earlier, we need to check the version and work out what USABLE_FILE_MB reports to us: the usable space that tolerates a disk failure, or the usable space that tolerates a cell failure.)
Of course, if we lose a cell and USABLE_FILE_MB considers only disk failures, the situation is different. We need to multiply USABLE_FILE_MB by the number of disks in the cell.
This is independent of the redundancy being normal or high. For instance, if USABLE_FILE_MB is 10, and it reports the usable space for the disk failure case, and we have 12 disks in a cell, then we have to multiply that value of 10 by 12. This makes 120, and that is the minimum value we need to see in USABLE_FILE_MB in order to be safe even in the case of a cell failure.
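The FREE_MB check from the paragraph above can be sketched in a few lines of Python. This is only the rule of thumb as I described it, not an Oracle-provided check; the function name and the sample sizes are assumptions.

```python
def can_still_rebalance(free_mb: float, disk_sizes_mb: list[float]) -> bool:
    """Rule of thumb: FREE_MB must exceed the largest disk size in the diskgroup."""
    if not disk_sizes_mb:
        raise ValueError("at least one disk size is required")
    return free_mb > max(disk_sizes_mb)


# Made-up sizes: three equal disks, then a mixed set.
print(can_still_rebalance(2500, [2000, 2000, 2000]))  # True
print(can_still_rebalance(2500, [2000, 3000, 2000]))  # False, largest disk is 3000
```

With equal disk sizes, max() simply returns the common disk size, so the same function covers both cases.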
At this point and in this context, the following article by Emre Baransel is worth reading.
https://www.doag.org/formes/pubfiles/8587254/2016-INF-Emre_Baransel-A_Deep_Dive_into_ASM_Redundancy_in_Exadata-Manuskript.pdf
In my opinion, it shouldn't be that way. I mean, there shouldn't be a 15% rule, and I think this subject is a little buggy.
Note that, at the moment, we still need to take the 15% rule into account and follow it!
Anyway, if we reserve 15% of the space, are we safe? Well, probably. But the following bug says that even if we have 15% reserved space, we may still have problems during a rebalance.
Bug 21083850 - ORA-15041 during rebalance despite having free space (Doc ID 21083850.8)
The cause of this bug is probably the imbalance that develops during the rebalance:
When a disk is force dropped, its partners lose a partner. As a result, the partners of its partners get more extents relocated to them, causing an imbalance.
This imbalance results in ORA-15041, because some disks run out of space faster than others.
This situation can also be explained by those disks being overloaded even before the rebalance. So, as you may guess, if ASM uses them aggressively during the rebalance, they fill up and the rebalance code returns an error.
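To illustrate why free space at the diskgroup level is not enough, here is a toy Python sketch. It is not how ASM chooses partners or relocates extents; the disk names, the free space figures and the "incoming" amounts are all invented, just to show one disk filling up while the total still looks fine.

```python
def disks_that_overflow(free_mb_by_disk: dict[str, float],
                        incoming_mb_by_disk: dict[str, float]) -> list[str]:
    """Return the disks whose incoming relocated data exceeds their own free space."""
    return sorted(
        name
        for name, incoming in incoming_mb_by_disk.items()
        if incoming > free_mb_by_disk.get(name, 0)
    )


# Hypothetical state after a force drop: D1 was already nearly full
# and also receives the largest share of relocated extents.
free = {"D1": 100, "D2": 500, "D3": 500}
incoming = {"D1": 150, "D2": 100, "D3": 100}

print(sum(free.values()) >= sum(incoming.values()))  # True, total free space is sufficient
print(disks_that_overflow(free, incoming))           # ['D1'], yet one disk runs out
```

In this toy case the diskgroup as a whole has 1100 MB free for 350 MB of relocated data, but D1 alone cannot take its share, which is the kind of situation where ORA-15041 would show up.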
Of course, imbalance may be normal in some cases, for instance when we have fail groups.
That is, when we have a fail group configuration, ASM has a more difficult job during the rebalance. I mean, with fail groups, ASM has fewer choices for distributing the mirror extents when a disk is dropped. Still, I don't think these kinds of causes should be enough to justify such a rule (the 15% rule).
Well, these are my thoughts on this subject. Please feel free to comment and correct me if I'm wrong. Please share your thoughts on this subject by commenting on this blog post.
While checking error ORA-15041, I found Doc ID 1367078.1. One sentence there caught my attention.
From the doc:
" If any one disk is short of free_mb, then the error might be seen, even if there is sufficient free space in the whole diskgroup."
This supports your thesis, I believe, since the expected behavior of ASM is to distribute extents evenly, and it seems ASM was not smart enough to do that until version 19c.
Another line from the doc:
"Starting 10.2, the total size of the disk is taken into consideration for allocations. So there will be imbalanced IO to disks. A future task would be to add/drop disks to have all the disks of same size."
So, starting with 10.2, the allocation method takes the total disk size into account, which apparently causes imbalanced disks. And the future task might have been completed in 19c.