Emergency Recovery or RMAN DUPLICATE Rerun Just Works!

Sep 12, 2010 / By Yury Velikanov

Tags: , , ,

Hello Dear Reader,

This weekend was most probably my longest oncall weekend ever. As part of our work at Pythian each team member covers weekends on regular basis. There are quiet weekends and weekends like this one ;)

One of the tasks that I was involved in was 700GB database emergency cloning for an important functional issue troubleshooting. It was 10.2.0.5 (10GR2) 64 Bit database on Linux. I used the following command to clone:

export NLS_DATE_FORMAT='YYYY-MM-DD:HH24:MI:SS'

rman target / nocatalog auxiliary sys/syspwd@TARGET  log=dup_TERGET.log

run {
set until time "to_date('10-JAN-2010 23:05:25', 'DD-MON-YYYY HH24:MI:SS')";
allocate auxiliary  channel c1 type disk;
allocate auxiliary  channel c2 type disk;
duplicate target database to TARGET;
}


To be honest with you I wasn’t a big fan of the DUPLICATE command in the past for several reasons. Some of them are:
-a- Reliability and transparency: As any DBA I want to control what is going on and troubleshoot it if necessary. What to do if the command fails? Should I rerun the whole process once again? In command by command method you just repeat the last command that failed after fixing the issue. What happens in DUPLICATE command case?
-b- Connection to the production database: The DUPLICATE commands requires connections to the productions database. However in certain configurations for security, performance and production isolation from Dev/Test environments we should avoid any unnecessary connections from/to production environment.   IMHO: The latest copy of a control file should be significant.
-c- Infrastructure must be configured in the certain way to be able to run the DUPLICATE command. Namely backups should be available in the same way in TARGET as in the source.

However, in this case the whole infrastructure supported the DUPLICATE process nicely. All backups have been centralized on NAS storage and have been shared across all hosts in the infrastructure. At the end of the day we shouldn’t use our old habits as a driving factor and our many years experience shouldn’t limit us from starting using proper process and new best practices.

So to make a long story short I gave the DUPLICATE command a try. I used the comand above to start the process. After 4h of the restore process I was horrified (you know how it is to work under time and management pressure) at seeing the following errors:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 09/12/2010 03:07:06
RMAN-03015: error occurred in stored script Memory Script
ORA-19870: error reading backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48952_1_729384010
ORA-19615: some files not found in backup set
ORA-19613: datafile 50 not found in backup set

RMAN threw the error stack (see the full error stack bellow) and died. From the first look I have the impression that there is some data missing in the backups sets and most probably we will need to rerun the recovery process after fixing the issue, if it is possbile to fix it at all without running additional backup (it would take another X hours).

After careful reviewing of the full stack it become clear that the issue is simple. We just ran out of space on the target node:

channel c2: reading from backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48951_1_729383994
ORA-19870: error reading backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48953_1_729384115
ORA-19502: write error on file "/u03/TRGD03/db/apps_st/data/undotbs1.362.667992025", blockno 631265 (blocksize=8192)
ORA-27061: waiting for async I/Os failed
<span style="color: #000080;">Linux-x86_64 Error: 28: No space left on device</span>
Additional information: -1
Additional information: 131072

This was a very good chance to test the DUPLICATE rerun feature. And you know what? It all worked very well. I have cleaned up space and rerun the comand without changing any single row.  After I reran, the following lines appear in the TARGET db alert log. Note how quickly the restore command finished for the files that have been restored during the first run:

Rerun the DUPLICATE command

...
Sun Sep 12 04:28:34 EDT 2010
Full restore complete of datafile 66 to datafile copy /u03/ERPD03/db/apps_st/data/undotbs1.362.667992025.  <span style="color: #000080;">Elapsed time: 0:03:47</span>
checkpoint is 9677001561711
last deallocation scn is 9677001521427
Sun Sep 12 04:29:30 EDT 2010
Full restore complete of datafile 94 to datafile copy /u03/ERPD03/db/apps_st/data/apps_ts_tx_data.303.721067151.  <span style="color: #000080;">Elapsed time: 0:04:43</span>
checkpoint is 9677001537802
last deallocation scn is 9677000724209
Sun Sep 12 04:30:55 EDT 2010
....

The DUPLICATE operation just simply worked and completed in next 90 minutes. It restored rest of the files and applied all archive logs (recreated control file and opened DB resetting logs).

The bottom line here is the DUPLICATE command did a very good job for me and saved 5-6 hours of my weekend time.

This is a good time to say that it looks like the DUPLICATE command is reliable in 10.2.0.5 version.

Thank you Oracle! Very Well Done!

PS It looks like I am getting older if I am giving these type of compliments … :)

Yury – a DBA from down under

.

.

.

.

The following is full RMAN error stack after first run:
restoring datafile 00044 to /u03/TRGD03/db/apps_st/data/apps_ts_tx_data.338.667991787
restoring datafile 00050 to /u03/TRGD03/db/apps_st/data/apps_ts_tx_idx.344.667991853
restoring datafile 00094 to /u03/TRGD03/db/apps_st/data/apps_ts_tx_data.303.721067151
channel c2: reading from backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48951_1_729383994
ORA-19870: error reading backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48953_1_729384115
ORA-19502: write error on file “/u03/TRGD03/db/apps_st/data/undotbs1.362.667992025″, blockno 631265 (blocksize=8192)
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 28: No space left on device
Additional information: -1
Additional information: 131072
ORA-19870: error reading backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48951_1_729383994
ORA-19502: write error on file “/u03/TRGD03/db/apps_st/data/apps_ts_tx_data.303.721067151″, blockno 39665 (blocksize=8192)
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 28: No space left on device
Additional information: -1
Additional information: 131072
Outer dbms_backup_restore.restoreCancel() failed
channel c2: reading from backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48952_1_729384010
channel c1: starting datafile backupset restore
channel c1: specifying datafile(s) to restore from backup set
restoring datafile 00027 to /u03/TRGD03/db/apps_st/data/apps_ts_summary.321.667991649
restoring datafile 00068 to /u03/TRGD03/db/apps_st/data/undotbs2.364.667992053
channel c1: reading from backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48954_1_729384190
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 09/12/2010 03:07:06
RMAN-03015: error occurred in stored script Memory Script
ORA-19870: error reading backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48952_1_729384010
ORA-19615: some files not found in backup set
ORA-19613: datafile 50 not found in backup set

Recovery Manager complete.

4 Responses to “Emergency Recovery or RMAN DUPLICATE Rerun Just Works!”

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>