ASM nternals: tracking down failed ASM reads

On a live customer system, we've encountered repeated incidents of errors such as the following:
WARNING: Read Failed. group:1 disk:3 AU:86753 offset:524288 size:262144Since Oracle doesn't tell us what exactly is failing here, some research was in order. There's a few posts out there about mapping ASM allocation units (AU) to database extents. But I felt that some of them weren't entirely clear on what is being done, how and why. This prompted me to do some digging of my own. This is our starting point. We know that:
- The error happened on RAC instance 1 (since it was logged in the alert log of said instance).
- The ASM disk group number is 1.
- The ASM disk number is 3.
- The AU number is 86753.
- We can't read that AU.
- Database version is 11.2.0.4 on Linux.
- ASM disk group redundancy is external.
Note: you have to run this in an ASM instance. On a database instance, the table doesn't contain any rows (on the current version that I tested this on). The columns in this table aren't officially documented, but my own testing confirms that the information that can be found on google is fairly reliable in the current versions. What we used here is:
- GROUP_KFFXP - the ASM disk group number, 1 in our case.
- DISK_KFFXP - the ASM disk number, 3.
- AU_KFFXP - the AU number, 86753.
- NUMBER_KFFXP - the ASM file number (not to be confused with the Oracle data file number).
- PXN_KFFXP - the physical extent number in that file.
[ AU_SIZE ] * [ PXN ] / [ BLOCK_SIZE ]In our case, that becomes the following query: [code lang="sql"] select owner, segment_name, segment_type from dba_extents where file_id = 4 and 1048576 * 5526 / 8192 between block_id and block_id + blocks -1; OWNER SEGMENT_NAME SEGMENT_TYPE ------------------------------ ------------------------------- ------------------ FOO CUSTOMER_OLD TABLE [/code] We can also confirm that our result is correct by attempting to read it (note we are forcing a full scan to make sure we're actually reading the table segment): [code lang="sql"] select /*+ full(t) */ count(*) from FOO.CUSTOMER_OLD t * ERROR at line 1: ORA-03113: end-of-file on communication channel Process ID: 741 Session ID: 695 Serial number: 40797 [/code] And sure enough, we see our familiar error message in the alert log instantly:
Thu Dec 10 04:24:23 2015 WARNING: Read Failed. group:1 disk:3 AU:86753 offset:524288 size:262144 ERROR: unrecoverable error ORA-15188 raised in ASM I/O path; terminating process 741We found the affected segment, and can now proceed with the usual recovery scenarios that are available to us. In this particular case, the table can likely be dropped as it was a backup. Nonetheless, it is quite clear that the underlying disk (disk number 3 in group number 1) is faulty and must be replaced. There is one more thing, though, that we need to be mindful of. In order to replace the disk, Oracle has to be able to read all the allocated AUs on that disk as part of the re-balance process that is triggered when dropping/adding disks. How do we tell if there aren't any other segments that can't be read? We'd have to be able to retrieve a list of all extents that are located on the disk in question. Of course, we can simply go for it, and let the drop/re-balance operation fail, which would also tell us that there are additional areas with problems on that disk. Since this is production, I prefer to be in the know instead of running something blindly. Additionally, you may hit one error during the re-balance, correct that, re-try and then hit another one. Rinse and repeat. Doesn't sound too comforting, does it? So let's see how we can get that information together. There is but one problem we need to solve first. The data that we need is not available in the same place:
- X$KFFXP is only available on an ASM instance.
- DBA_EXTENTS is only available on a database instance.
- Tables can be full scanned.
- Indexes can either be rebuilt online, or also read with a fast full scan plan.
- Lobs can be read with a small PL/SQL block.
- Clusters should be okay as well if the contained tables are scanned. as that will read the respective blocks.
- Partitioned tables and indexes can be treated analogous to their non-partitioned counterparts.
- If undo segments are affected and can't be read, you may want to involve Oracle support at this point.