Emergency Recovery or RMAN DUPLICATE Rerun Just Works!

Posted in: Technical Track

Hello Dear Reader,

This weekend was most probably my longest oncall weekend ever. As part of our work at Pythian each team member covers weekends on regular basis. There are quiet weekends and weekends like this one ;)

One of the tasks that I was involved in was 700GB database emergency cloning for an important functional issue troubleshooting. It was 10.2.0.5 (10GR2) 64 Bit database on Linux. I used the following command to clone:

export NLS_DATE_FORMAT='YYYY-MM-DD:HH24:MI:SS'

rman target / nocatalog auxiliary sys/syspwd@TARGET  log=dup_TERGET.log

run {
set until time "to_date('10-JAN-2010 23:05:25', 'DD-MON-YYYY HH24:MI:SS')";
allocate auxiliary  channel c1 type disk;
allocate auxiliary  channel c2 type disk;
duplicate target database to TARGET;
}


To be honest with you I wasn’t a big fan of the DUPLICATE command in the past for several reasons. Some of them are:
-a- Reliability and transparency: As any DBA I want to control what is going on and troubleshoot it if necessary. What to do if the command fails? Should I rerun the whole process once again? In command by command method you just repeat the last command that failed after fixing the issue. What happens in DUPLICATE command case?
-b- Connection to the production database: The DUPLICATE commands requires connections to the productions database. However in certain configurations for security, performance and production isolation from Dev/Test environments we should avoid any unnecessary connections from/to production environment.   IMHO: The latest copy of a control file should be significant.
-c- Infrastructure must be configured in the certain way to be able to run the DUPLICATE command. Namely backups should be available in the same way in TARGET as in the source.

However, in this case the whole infrastructure supported the DUPLICATE process nicely. All backups have been centralized on NAS storage and have been shared across all hosts in the infrastructure. At the end of the day we shouldn’t use our old habits as a driving factor and our many years experience shouldn’t limit us from starting using proper process and new best practices.

So to make a long story short I gave the DUPLICATE command a try. I used the comand above to start the process. After 4h of the restore process I was horrified (you know how it is to work under time and management pressure) at seeing the following errors:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 09/12/2010 03:07:06
RMAN-03015: error occurred in stored script Memory Script
ORA-19870: error reading backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48952_1_729384010
ORA-19615: some files not found in backup set
ORA-19613: datafile 50 not found in backup set

RMAN threw the error stack (see the full error stack bellow) and died. From the first look I have the impression that there is some data missing in the backups sets and most probably we will need to rerun the recovery process after fixing the issue, if it is possbile to fix it at all without running additional backup (it would take another X hours).

After careful reviewing of the full stack it become clear that the issue is simple. We just ran out of space on the target node:

channel c2: reading from backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48951_1_729383994
ORA-19870: error reading backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48953_1_729384115
ORA-19502: write error on file "/u03/TRGD03/db/apps_st/data/undotbs1.362.667992025", blockno 631265 (blocksize=8192)
ORA-27061: waiting for async I/Os failed
<span style="color: #000080;">Linux-x86_64 Error: 28: No space left on device</span>
Additional information: -1
Additional information: 131072

This was a very good chance to test the DUPLICATE rerun feature. And you know what? It all worked very well. I have cleaned up space and rerun the comand without changing any single row.  After I reran, the following lines appear in the TARGET db alert log. Note how quickly the restore command finished for the files that have been restored during the first run:

Rerun the DUPLICATE command

...
Sun Sep 12 04:28:34 EDT 2010
Full restore complete of datafile 66 to datafile copy /u03/ERPD03/db/apps_st/data/undotbs1.362.667992025.  <span style="color: #000080;">Elapsed time: 0:03:47</span>
checkpoint is 9677001561711
last deallocation scn is 9677001521427
Sun Sep 12 04:29:30 EDT 2010
Full restore complete of datafile 94 to datafile copy /u03/ERPD03/db/apps_st/data/apps_ts_tx_data.303.721067151.  <span style="color: #000080;">Elapsed time: 0:04:43</span>
checkpoint is 9677001537802
last deallocation scn is 9677000724209
Sun Sep 12 04:30:55 EDT 2010
....

The DUPLICATE operation just simply worked and completed in next 90 minutes. It restored rest of the files and applied all archive logs (recreated control file and opened DB resetting logs).

The bottom line here is the DUPLICATE command did a very good job for me and saved 5-6 hours of my weekend time.

This is a good time to say that it looks like the DUPLICATE command is reliable in 10.2.0.5 version.

Thank you Oracle! Very Well Done!

PS It looks like I am getting older if I am giving these type of compliments … :)

Yury – a DBA from down under

.

.

.

.

The following is full RMAN error stack after first run:
restoring datafile 00044 to /u03/TRGD03/db/apps_st/data/apps_ts_tx_data.338.667991787
restoring datafile 00050 to /u03/TRGD03/db/apps_st/data/apps_ts_tx_idx.344.667991853
restoring datafile 00094 to /u03/TRGD03/db/apps_st/data/apps_ts_tx_data.303.721067151
channel c2: reading from backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48951_1_729383994
ORA-19870: error reading backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48953_1_729384115
ORA-19502: write error on file “/u03/TRGD03/db/apps_st/data/undotbs1.362.667992025”, blockno 631265 (blocksize=8192)
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 28: No space left on device
Additional information: -1
Additional information: 131072
ORA-19870: error reading backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48951_1_729383994
ORA-19502: write error on file “/u03/TRGD03/db/apps_st/data/apps_ts_tx_data.303.721067151”, blockno 39665 (blocksize=8192)
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 28: No space left on device
Additional information: -1
Additional information: 131072
Outer dbms_backup_restore.restoreCancel() failed
channel c2: reading from backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48952_1_729384010
channel c1: starting datafile backupset restore
channel c1: specifying datafile(s) to restore from backup set
restoring datafile 00027 to /u03/TRGD03/db/apps_st/data/apps_ts_summary.321.667991649
restoring datafile 00068 to /u03/TRGD03/db/apps_st/data/undotbs2.364.667992053
channel c1: reading from backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48954_1_729384190
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 09/12/2010 03:07:06
RMAN-03015: error occurred in stored script Memory Script
ORA-19870: error reading backup piece /apps/backups/db/ebs-db-prd/dbbackup/bk_48952_1_729384010
ORA-19615: some files not found in backup set
ORA-19613: datafile 50 not found in backup set

Recovery Manager complete.

Interested in working with Yury? Schedule a tech call.

About the Author

Yury is a nice person who enjoys meeting and working with people on challenging projects in the Oracle space. He started to work as an Oracle DBA 14 years ago (1997). For the past 10 years (since 2001) his main area of expertise is Oracle e-Business Suite. Yury is an OCP 7,8,9,10g and OCM 9i,10g. He is a frequent presenter at Oracle related conferences such as Hotsos, UKOUG and AUOUG. Yury is a socially active person. Apart from Social Media (Twitter, Bloging, Facebook) he is the primary organizer of Sydney Oracle Meetup group (250 people). So if you happen to be in Sydney (Australia) drop Yury a message and stop by at one of his Meetups.

4 Comments. Leave new

Yury,

If you have backup optimization turned on in your controlfile on source database, then this setting is carried forward to auxillary database as well. And it results in not needing to already backed/restored files again.

CONFIGURE BACKUP OPTIMIZATION ON;

By default this setting is set to “OFF”.

http://download.oracle.com/docs/cd/B19306_01/backup.102/b14191/rcmconc1.htm#BRADV113

Reply
Yury - a DBA from down under
September 19, 2010 5:26 pm

Hi Vishal,

Thank you for your comment. Just wonder based on your experience does “BACKUP OPTIMIZATION” applies on recovery operations as well as on backups? From the quick look on the documentation specified it looks like it have an effect on backup.

The connected question is doest it have any impact on DUPLICATE command?

Thanks,
Yury

Reply

I met such Duplicate command behaviour all the time – after error I restarted command again and it skipped all completed operations. Seems that it was default behaviour, since we didn’t turn on any non-default settings (like Vishal said OPTIMIZATION ON).

Reply
Yury - a DBA from down under
September 23, 2010 10:20 pm

Thank you Viacheslav for your update.
I am glad that I am not along envying that feature nowadays :)

Y

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *