Blog | Pythian

ASM nternals: tracking down failed ASM reads

Written by Pythian Marketing | Dec 16, 2015 5:00:00 AM

 

On a live customer system, we've encountered repeated incidents of errors such as the following: WARNING: Read Failed. group:1 disk:3 AU:86753 offset:524288 size:262144.

Since Oracle doesn't tell us what exactly is failing here, some research was in order. There's a few posts out there about mapping ASM allocation units (AU) to database extents. But I felt that some of them weren't entirely clear on what is being done, how and why. This prompted me to do some digging of my own.

Establishing the Starting Point

This is our starting point. We know that:

  • The error happened on RAC instance 1 (since it was logged in the alert log of said instance).
  • The ASM disk group number is 1.
     
  • The ASM disk number is 3.
     
  • The AU number is 86753.
     
  • We can't read that AU.
     
  • Database version is 11.2.0.4 on Linux.
     
  • ASM disk group redundancy is external.
     

We can further tell, that the failed read was at byte offset 524288 (which is 512KB) into the AU, and it was a multi-block read of 32 blocks (262144 / 8192). Thus it was likely a full table scan. 

Mapping ASM Allocation Units to Database Extents

Disclaimer: what follows next is undocumented, and the usual disclaimers apply: check with Oracle support before running any of this against your production system. In an ASM instance, Oracle exposes the ASM AU map in the fixed table X$KFFXP. We can query that to get some additional details, using the information we already have:

select inst_id, group_kffxp, number_kffxp, pxn_kffxp  from x$kffxp  where group_kffxp=1 and disk_kffxp=3 and au_kffxp=86753;   INST_ID GROUP_KFFXP NUMBER_KFFXP PXN_KFFXP ---------- ----------- ------------ ---------- 1 1 287 5526  

Note: you have to run this in an ASM instance. On a database instance, the table doesn't contain any rows (on the current version that I tested this on). The columns in this table aren't officially documented, but my own testing confirms that the information that can be found on google is fairly reliable in the current versions. What we used here is:

 
  • GROUP_KFFXP - the ASM disk group number, 1 in our case.
     
  • DISK_KFFXP - the ASM disk number, 3.
     
  • AU_KFFXP - the AU number, 86753.
     

The view now tells us the first two pieces of the puzzle that we need to know:

  • NUMBER_KFFXP - the ASM file number (not to be confused with the Oracle data file number).
     
  • PXN_KFFXP - the physical extent number in that file.

Identifying the Damaged Database File

Armed with this information, we can now determine the details of the file that's experiencing read errors:

set lines 200 pages 999 col dg for a12 col name for a20 col fname for a40  select t.name, substr(f.name, instr(f.name,'/',-1) + 1) as fname, a.file_number, f.file# as DBFILE#, f.bytes/1024/1024 as file_mb  from v$datafile f, v$tablespace t, v$asm_diskgroup g, v$asm_alias a, v$asm_file af  where g.name(+) = substr(f.name,2,instr(f.name,'/')-2)  and a.name(+) = upper(substr(f.name, instr(f.name,'/',-1) + 1))  and a.file_number = af.file_number  and a.group_number = af.group_number  and f.ts# = t.ts#  and af.file_number = 287 /   NAME FNAME FILE_NUMBER DBFILE# FILE_MB -------------------- ------------------- ----------- ---------- ---------- USERS users.287.795706011 287 4 11895  

We can see that the file is a part of the USERS table-space, and has a data file ID of 4. Let's double check our environment: 

select allocation_unit_size from v$asm_diskgroup where group_number=1;  ALLOCATION_UNIT_SIZE -------------------- 1048576   select block_size from dba_tablespaces where tablespace_name='USERS';  BLOCK_SIZE ---------- 8192  

Finding the Specific Database Segment

Now we have all that we need to get the final piece of our puzzle. We can use the following formula to calculate the position of the extent in the file, and from there, hit DBA_EXTENTS to see what that is:

 
$$[ AU\_SIZE ] * [ PXN ] / [ BLOCK\_SIZE ]$$

In our case, that becomes the following query:

 
select owner, segment_name, segment_type  from dba_extents  where file_id = 4  and 1048576 * 5526 / 8192 between block_id and block_id + blocks -1;   OWNER SEGMENT_NAME SEGMENT_TYPE ------------------------------ ------------------------------- ------------------ FOO CUSTOMER_OLD TABLE  

We can also confirm that our result is correct by attempting to read it (note we are forcing a full scan to make sure we're actually reading the table segment):

select /*+ full(t) */ count(*) from FOO.CUSTOMER_OLD t * ERROR at line 1: ORA-03113: end-of-file on communication channel Process ID: 741 Session ID: 695 Serial number: 40797  

And sure enough, we see our familiar error message in the alert log instantly: Thu Dec 10 04:24:23 2015 WARNING: Read Failed. group:1 disk:3 AU:86753 offset:524288 size:262144 ERROR: unrecoverable error ORA-15188 raised in ASM I/O path; terminating process 741.

Recovery Strategy and Proactive Auditing

We found the affected segment, and can now proceed with the usual recovery scenarios that are available to us. In this particular case, the table can likely be dropped as it was a backup. Nonetheless, it is quite clear that the underlying disk (disk number 3 in group number 1) is faulty and must be replaced. 

Challenges in Disk Replacement

There is one more thing, though, that we need to be mindful of. In order to replace the disk, Oracle has to be able to read all the allocated AUs on that disk as part of the re-balance process that is triggered when dropping/adding disks.

How do we tell if there aren't any other segments that can't be read? We'd have to be able to retrieve a list of all extents that are located on the disk in question. Since this is production, I prefer to be in the know instead of running something blindly. So let's see how we can get that information together. There is but one problem we need to solve first. The data that we need is not available in the same place:

 
  1. X$KFFXP is only available on an ASM instance.
     
  2. DBA_EXTENTS is only available on a database instance.
     

Step 1: Exporting the ASM Map

I opted to go for the external table approach, and pull the data out of ASM first by creating the file /tmp/asm_map.sql with these contents: 

set echo off set feedback off set termout off set pages 0 spool /tmp/asm_map.txt  select x.number_kffxp || ',' || x.pxn_kffxp as data  from x$kffxp x  where x.group_kffxp=1 and x.disk_kffxp=3 and x.number_kffxp > 255 /  spool off  

Execute that script while connected to your ASM instance. Next, switch to the database instance, and run the following:

create directory tmp_asm as '/tmp' /  create table asm_map ( asm_file number, asm_pxn number )  organization external ( type oracle_loader default directory tmp_asm access parameters ( records delimited by newline fields terminated by ',' )  location ( 'asm_map.txt' ) ) /  

Step 2: Correlating Data Files and Tablespaces

Ensure that the data is properly readable:

select * from asm_map where rownum < 10 /   ASM_FILE ASM_PXN ---------- ---------- 256 4 256 7 256 21 256 28 256 35 256 49 256 52 256 75 256 88 9 rows selected.  

Now we can join into V$DATAFILE and DBA_EXTENTS to get the actual data we're after. Let's first build a list of table spaces and data files that are stored on this disk:

col ts_name for a30 col fname for a50 set lines 200 pages 999  select unique t.name as TS_NAME, substr(f.name, instr(f.name,'/',-1) + 1) as fname, a.file_number, f.file# as DBFILE#  from v$datafile f, v$tablespace t, v$asm_diskgroup g, v$asm_alias a, v$asm_file af, ( select distinct asm_file from asm_map ) m  where g.name(+) = substr(f.name,2,instr(f.name,'/')-2)  and a.name(+) = upper(substr(f.name, instr(f.name,'/',-1) + 1))  and a.file_number = af.file_number  and a.group_number = af.group_number  and f.ts# = t.ts#  and af.file_number = m.asm_file  order by 1,2 /  

Step 3: Generating the Complete Affected Segment List

Now let's expand that to also include DBA_EXTENTS. I am creating a copy of the contents of DBA_EXTENTS, which is known to often not perform in an optimal fashion on large databases.

create table tmp_extents tablespace users as select * from dba_extents /  

And now we're ready to get the list of all segments that would be affected by problems on this one disk:

col ts_name for a30 col obj for a100 set lines 200 pages 999 col segment_type for a18 set lines 200 pages 999  select unique t.name as TS_NAME, e.owner || '.' || e.segment_name as obj, e.segment_type, a.file_number, f.file# as DBFILE#  from v$datafile f, v$tablespace t, v$asm_diskgroup g, v$asm_alias a, v$asm_file af, asm_map m, tmp_extents e  where g.name(+) = substr(f.name,2,instr(f.name,'/')-2)  and a.name(+) = upper(substr(f.name, instr(f.name,'/',-1) + 1))  and a.file_number = af.file_number  and a.group_number = af.group_number  and f.ts# = t.ts#  and af.file_number = m.asm_file  and f.file# = e.file_id  and t.name = e.tablespace_name  and g.allocation_unit_size * m.asm_pxn / f.block_size between e.block_id and e.block_id + e.blocks -1  order by 1,3,2 /  

Conclusion

With this information, you can proceed to verify if any other segments exist which are unreadable:

  • Tables: Can be full scanned.
  • Indexes: Can be rebuilt online or read with a fast full scan.
  • Lobs: Can be read with a small PL/SQL block.
  • Clusters: Usually okay if the contained tables are scanned.
  • Undo Segments: If affected, involve Oracle support immediately.

By doing this, you ensure that potential problems are detected before applications are affected. Once the audit is complete, clean up the temporary objects:

drop table tmp_extents /  drop table asm_map /  drop directory tmp_asm /  

Discover more about our expertise in the world of Oracle.

Oracle Database Consulting Services

Ready to optimize your Oracle Database for the future?