Sunday, 13 December 2015

Dude, Where's My Datafile?

disclaimer: here

ASM can be too transparent at times.

I found that out when I was investigating read errors (ORA-19501 & ORA-15081 - which caused RMAN backup to fail) on a datafile.

For a reason that will be explained below, it was necessary to find out which ASM disk the datafile resides on, but you can't retrieve that information from V$ASM_ views, because they do not reveal the association between datafiles (i.e. the ASM Aliases) and the ASM disk itself.

Environment: "External" redundancy across ASM diskgroup, which is shared between six RAC DBs. GI and Databse are both 11.2.0.4, but the solutions applies to 12.1.0.2 as well.

unrecoverable error ORA-15188 raised in ASM I/O path


Considering the chances of logical corruption, and using the block number reported in ORA-19501, it is possible to find the affected segment (see Stefan Knecht's article) and take an action to prevent the same error from happening again during RMAN backup; such as deleting the segment.

But after deleting the segment, although I could run the backup successfully, validation of the datafile in RMAN failed, citing the exact same errors, on the exact same block.

This is due to the difference in the mechanism of backup compared to validation's: "When validating whole files, RMAN checks every block of the input file" while RMAN backup touches blocks that contain data, and skips the blocks that do not contain data.

Incidentally checking the v$ASM_disk.READ_ERRS revealed that one of the ASM disks has experienced read errors. Interestingly the errors could also be traced in ASM instance's alert.log, as well, citing information about the location of the issue:

WARNING: Read Failed. group:2 disk:2 AU:235530 offset:786432 size:131072
ERROR: unrecoverable error ORA-15188 raised in ASM I/O path; terminating process 28997

Enter The Headless Chicken!


That's me: totally ignoring the fact that although I don't know the solution, I SHOULD know what's NOT the solution!

Indications of the issue in ASM instance's alert.log (warnings similar to the one in the previous lines) dated back to the last time an ASM disk was added to the diskgroup, and rebalancing was triggered.

Rebalancing had made ASM go through a series of continuous start/failure/stop of ARB0 background process: The initial attempt for rebalancing (after adding a new ASM disk to the diskgroup) triggered the start of ARB0, failed with ORA-15118, which lead to ARB0 stopping, and restarting the whole process over and over again.

Yet, the headless chicken decided to drop the ASM disk from the diskgroup to rectify the problem!

Eventually the futility of the process dawned on me and I executed the following commands:

alter diskgroup DATA_01 undrop disks;
alter diskgroup DATA_01 rebalance power 0;

The solution


The key lies in the fact that both RMAN validation and ASM rebalacing deal with the same thing: the whole datafile.

In this case, the ORA-15118s were an indication that ASM was unable to access some non-data-holding parts of the datafile on the storage, so the solution was shrinking the datafile to a size as close as possible to the volume of the data; because RMAN backup is carried out with no errors.

But how can we make sure that datafiles of other DBs which are stored on the same diskgroup, are not affected? First we need to find out which datafiles from which DBs have bits and pieces on the same ASM disk, and then validate each one using RMAN.

Which gets us to the main point of this post (finally!): the missing logical link between data files and ASM disks can be found in X$KFFXP fixed table.

I found out about it from "Secrets of Oracle Database" (Chapter 12, Norbert Debes, Apress, 2009) but there are other sources which discuss the fixed tables in ASM instance, such as "Pro Oracle Database 10g RAC on Linux" and Luca Canali's article here.

The following query can be used for this purpose:

select *
from v$asm_alias
where SYSTEM_CREATED = 'N'
and (FILE_NUMBER, FILE_INCARNATION) in
(
  select distinct NUMBER_KFFXP, INCARN_KFFXP
  from x$kffxp
  where DISK_KFFXP = disk_number according to alert.log
  and LXN_KFFXP = 0
  and GROUP_KFFXP= group_number according to alert.log
);

The join columns reveal the purpose of NUMBER_KFFXP, and INCARN_KFFXP columns, GROUP_KFFXP and DISK_KFFXP are equal to v$asm_disk.GROUP_NUMBER and v$asm_disk.DISK_NUMBER, and LXN_KFFXP is the Logical Extent Number (0 = primary, 1 = mirrored copy).

Note for the proponents of set theory: Yes, I know that this query doesn't manifest an understanding of your beloved theory (and mine too, believe me!) but I found this syntax fast enough without requiring any hints to make sure the right join method is used.

Finally


The query revealed all the files (from different DBs) that were located on that specific ASM disk. Fortunately all of them were validated successfully by RMAN.

Eventually the segments of the problematic datafile were moved to a new tablespace, the data file was shrunk with no issues (and validated successfully by RMAN), and afterwards I was able to drop the ASM disk from the group.

Apparently some people have been living happily ever after.

p.s. I like to thank Martin Widlake and Neil Chandler for offering me advice for this post. However, I'm responsible for statements made here, so please do let me know if you notice any errors.