#!/bin/bash
export ORACLE_SID=PRIMARY
RMAN_COMMAND=restore_PRIMARY
EMAIL_RECIPIENT=no-reply@pythian.com
export NLS_DATE_FORMAT='YYYY-MM-DD:hh24:mi:ss'
. oraenv <<< $ORACLE_SID
TS=$(date +'%Y%m%d-%H.%M')
WORK_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
# Create the log directory next to the script, not in the current working directory
if [ ! -d "${WORK_DIR}/log" ]; then
  mkdir "${WORK_DIR}/log"
fi
RMAN_LOG=${WORK_DIR}/log/RMAN-${RMAN_COMMAND}_${TS}.log
date
echo "Begin RMAN ${RMAN_COMMAND} on ${ORACLE_SID}"
rman cmdfile=${WORK_DIR}/${RMAN_COMMAND}.rcv log=${RMAN_LOG}
echo "#"
date
echo "End RMAN ${RMAN_COMMAND} on ${ORACLE_SID}"
mail -s "RMAN script ${RMAN_COMMAND} completed on ${ORACLE_SID}" ${EMAIL_RECIPIENT} < ${RMAN_LOG}
exit 0
And this is the very basic RMAN script:
connect target /
connect catalog rman/XXXXXX@RMANCAT
RUN
{
SET UNTIL SCN 123456789;
ALLOCATE CHANNEL D1 TYPE DISK;
ALLOCATE CHANNEL D2 TYPE DISK;
ALLOCATE CHANNEL D3 TYPE DISK;
ALLOCATE CHANNEL D4 TYPE DISK;
ALLOCATE CHANNEL D5 TYPE DISK;
ALLOCATE CHANNEL D6 TYPE DISK;
ALLOCATE CHANNEL D7 TYPE DISK;
ALLOCATE CHANNEL D8 TYPE DISK;
ALLOCATE CHANNEL D9 TYPE DISK;
ALLOCATE CHANNEL D10 TYPE DISK;
ALLOCATE CHANNEL D11 TYPE DISK;
ALLOCATE CHANNEL D12 TYPE DISK;
ALLOCATE CHANNEL D13 TYPE DISK;
ALLOCATE CHANNEL D14 TYPE DISK;
ALLOCATE CHANNEL D15 TYPE DISK;
RESTORE DATABASE;
RECOVER DATABASE;
}
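For context, a SET UNTIL SCN value like the one above is typically derived from the surviving database before the restore starts. The exact method used in this run is not shown; one common approach, sketched here with standard views (the SCN in the script is only a placeholder anyway), is:

```sql
-- On the surviving database (the new primary), find the current SCN:
SELECT current_scn FROM v$database;

-- Or pick the highest SCN covered by archived logs that are known
-- to have at least one backup:
SELECT MAX(next_change#)
FROM   v$archived_log
WHERE  backup_count > 0;
```

Whichever SCN is chosen, it must be covered by the available backups, or the RESTORE/RECOVER pair in the script will fail partway through.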
After the restore completed, it was time for the recovery in which, yes, you guessed right, there was a missing archived log, and there was more fun to be had. Of course, there was a backup of the missing archived log on the site we were trying to rebuild, so we restored it and applied it to the standby. But then, surprise! See the annotated line in the snippet below:
SQL> recover standby database;
ORA-00279: change 14943816301 generated at 04/09/2018 00:02:23 needed for thread 1
ORA-00289: suggestion : +FRA/conpro_uk2/archivelog/2018_04_11/thread_1_seq_15211.2442.973138947
ORA-00280: change 14943816301 for thread 1 is in sequence #15211
Specify log: {=suggested | filename | AUTO | CANCEL}
AUTO
ORA-00279: change 14943827295 generated at 04/09/2018 00:07:18 needed for thread 1
ORA-00289: suggestion : +FRA/conpro_uk2/archivelog/2018_04_11/thread_1_seq_15212.458.973138947
ORA-00280: change 14943827295 for thread 1 is in sequence #15212
ORA-00278: log file '+FRA/conpro_uk2/archivelog/2018_04_11/thread_1_seq_15211.2442.973138947' no longer needed for this recovery
ORA-00279: change 14943841236 generated at 04/09/2018 00:13:56 needed for thread 1
ORA-00289: suggestion : +FRA
ORA-00280: change 14943841236 for thread 1 is in sequence #15213
ORA-00278: log file '+FRA/conpro_uk2/archivelog/2018_04_11/thread_1_seq_15212.458.973138947' no longer needed for this recovery
ORA-00308: cannot open archived log '+FRA'
ORA-17503: ksfdopn:2 Failed to open file +FRA <-- at this point logs from 15213 are gone from FRA
ORA-15045: ASM file name '+FRA' is not in reference form
ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01152: file 1 was not restored from a sufficiently old backup
ORA-01110: data file 1: '+DATA/conpro_uk2/datafile/system.298.973129969'
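The missing archived log was restored from backup before the recovery above; the exact command used is not shown in the original run, but with the sequence numbers reported by ORA-00280 it would look roughly like this hedged RMAN sketch:

```
RMAN> RESTORE ARCHIVELOG FROM SEQUENCE 15211 UNTIL SEQUENCE 15212 THREAD 1;
```

RMAN restores the logs into the FRA (or the configured archive destination), after which media recovery can pick them up from the suggested names.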
Enter Oracle bugs. Yes, ladies and gentlemen, upon restoring the missing archived log from the backup, the database started an FRA clean-up process that deleted the newest archived logs! For these, of course, there was no backup, as they had been created between the last backup and the crash. For those interested in the details, here is the bug we believe we may have swallowed while riding our bikes: "Bug 17370174 : NEW ARCHIVELOG FILES ARE DELETED FIRST FROM THE FRA".

So, here we are: the old-primary, now an almost-new-standby, is missing some archived logs needed to sync up with the current primary. Fortunately, these archived logs had been shipped and applied to the old-standby, now the primary, before the crash, so we were able to restore and ship them to the other side of the Atlantic, and everything was up and running again, including Data Guard Broker ... after a few hours.

And then, after applying some eighty archived logs, here we go again: ORA-00365.
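When the FRA starts deleting files behind your back, the standard views show what it holds and how much of it Oracle considers reclaimable. A quick sanity-check sketch (these are standard views; the output columns shown are the ones we care about, not output from our actual incident):

```sql
-- What the FRA contains, and how much of each file type
-- the database believes it can delete to reclaim space:
SELECT file_type, percent_space_used,
       percent_space_reclaimable, number_of_files
FROM   v$recovery_area_usage;

-- Overall FRA pressure:
SELECT name,
       ROUND(space_used / space_limit * 100, 1) AS pct_used
FROM   v$recovery_file_dest;
```

A high percent_space_reclaimable for ARCHIVED LOG is exactly what invites the clean-up process; with the bug above in play, the newest logs went first instead of the oldest.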
Media Recovery Log +FRA/PRIMARY/archivelog/2018_04_11/thread_1_seq_15291.2126.973169799
Errors with log +FRA/PRIMARY/archivelog/2018_04_11/thread_1_seq_15291.2126.973169799
MRP0: Background Media Recovery terminated with error 365
Errors in file /apps/oracle/diag/rdbms/PRIMARY/PRIMARY/trace/PRIMARY_pr00_13538.trc:
ORA-00365: the specified log is not the correct next log
Managed Standby Recovery not using Real Time Apply
Recovery interrupted!
MRP0: Background Media Recovery process shutdown (PRIMARY)
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE THROUGH ALL SWITCHOVER DISCONNECT USING CURRENT LOGFILE

This issue was more complicated. Another bug entered the game, "Bug 19181583 - Allow Recover Until Time for a Redo Log that has the activation bit set (Doc ID 19181583.8)", and a severity 1 SR was opened with Oracle. We also escalated internally to our IPC for him to take a look. Working around this one required creating a fresh standby controlfile from the current primary, shipping it over to the standby server and replacing the current controlfiles with it. Also, as ASM was involved, we had to do some RMAN magic to get the controlfile pointing to the right datafiles. Not the cleanest process; it may have been better to start a bit further down the ASM directory tree to avoid trying to catalog files like the ASM password file but, again, it did the job.
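Re-creating and shipping a standby controlfile roughly follows this pattern; a hedged sketch, with an illustrative staging path (the paths and exact sequence from our run are not shown in the original):

```
-- On the current primary:
SQL> ALTER DATABASE CREATE STANDBY CONTROLFILE AS '/tmp/standby_ctl.bkp';

-- Copy the file to the standby server, then, with the standby
-- instance started NOMOUNT:
RMAN> RESTORE STANDBY CONTROLFILE FROM '/tmp/standby_ctl.bkp';
RMAN> ALTER DATABASE MOUNT;
```

Because the datafiles live in ASM under OMF names, the freshly restored controlfile does not point at them yet; that is what the catalog and switch-to-copy steps below take care of.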
RMAN> catalog start with '+DATA';

Starting implicit crosscheck backup at 11-APR-18
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1037 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=3 device type=DISK
allocated channel: ORA_DISK_3
channel ORA_DISK_3: SID=118 device type=DISK
allocated channel: ORA_DISK_4
channel ORA_DISK_4: SID=234 device type=DISK
Crosschecked 1111 objects
Finished implicit crosscheck backup at 11-APR-18

Starting implicit crosscheck copy at 11-APR-18
using channel ORA_DISK_1
using channel ORA_DISK_2
using channel ORA_DISK_3
using channel ORA_DISK_4
Finished implicit crosscheck copy at 11-APR-18

searching for all files in the recovery area
cataloging files...
cataloging done

List of Cataloged Files
=======================
File Name: +FRA/PRIMARY/ARCHIVELOG/2018_04_11/thread_1_seq_15211.2442.973138947
File Name: +FRA/PRIMARY/ARCHIVELOG/2018_04_11/thread_1_seq_15212.458.973138947
File Name: +FRA/PRIMARY/ARCHIVELOG/2018_04_11/thread_1_seq_15291.2126.973169799
File Name: +FRA/PRIMARY/CONTROLFILE/current.256.922462525

searching for all files that match the pattern +DATA

List of Files Unknown to the Database
=====================================
File Name: +DATA/orapwasm
File Name: +DATA/PRIMARY/spfileCONPRO.ora
File Name: +DATA/PRIMARY/TEMPFILE/LMTEAMTEMP.275.973170913
(...)
File Name: +DATA/TEST/CONTROLFILE/Current.257.922321133
File Name: +DATA/ASM/ASMPARAMETERFILE/REGISTRY.253.922315255

Do you really want to catalog the above files (enter YES or NO)? YES
cataloging files...
cataloging done

List of Cataloged Files
=======================
File Name: +DATA/PRIMARY/TEMPFILE/LMTEAMTEMP.275.973170913
File Name: +DATA/PRIMARY/TEMPFILE/MGMTTEMP.274.973170913
File Name: +DATA/PRIMARY/TEMPFILE/LICMANTEMP.273.973170913
File Name: +DATA/PRIMARY/TEMPFILE/TEMP2.272.973170913
File Name: +DATA/PRIMARY/TEMPFILE/TEMP2.271.973170913
(...)
List of Files Which Where Not Cataloged
=======================================
File Name: +DATA/orapwasm
  RMAN-07517: Reason: The file header is corrupted
File Name: +DATA/PRIMARY/spfileCONPRO.ora
  RMAN-07518: Reason: Foreign database file DBID: 0 Database Name:
File Name: +DATA/PRIMARY/ONLINELOG/group_20.381.948561469
  RMAN-07529: Reason: catalog is not supported for this file type
File Name: +DATA/PRIMARY/ONLINELOG/group_21.380.948561487
(...)

RMAN> list copy of database;

List of Datafile Copies
=======================
Key     File S Completion Time Ckp SCN     Ckp Time
------- ---- - --------------- ----------- ---------------
248     1    A 11-APR-18       14947795005 09-APR-18
        Name: +DATA/PRIMARY/datafile/system.298.973129969
231     2    A 11-APR-18       14947795005 09-APR-18
        Name: +DATA/PRIMARY/datafile/sysaux.315.973127749
251     3    A 11-APR-18       14947795005 09-APR-18
(...)
        Name: +DATA/PRIMARY/datafile/sfsdata.302.973129281
246     93   A 11-APR-18       14947795005 09-APR-18
        Name: +DATA/PRIMARY/datafile/sfsdata.300.973129815

RMAN> switch database to copy;

datafile 1 switched to datafile copy "+DATA/PRIMARY/datafile/system.298.973129969"
datafile 2 switched to datafile copy "+DATA/PRIMARY/datafile/sysaux.315.973127749"
datafile 3 switched to datafile copy "+DATA/PRIMARY/datafile/bodata_new.295.973130253"
datafile 4 switched to datafile copy "+DATA/PRIMARY/datafile/users.358.973111005"
datafile 5 switched to datafile copy "+DATA/PRIMARY/datafile/wsidx.285.973130519"
datafile 6 switched to datafile copy "+DATA/PRIMARY/datafile/wsdata.284.973130527"
datafile 7 switched to datafile copy "+DATA/PRIMARY/datafile/gt_data.283.973130533"
(...)
RMAN> report schema;

RMAN-06139: WARNING: control file is not current for REPORT SCHEMA
Report of database schema for database with db_unique_name PRIMARY

List of Permanent Datafiles
===========================
File Size(MB) Tablespace           RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1    9000     SYSTEM               ***     +DATA/PRIMARY/datafile/system.298.973129969
2    27610    SYSAUX               ***     +DATA/PRIMARY/datafile/sysaux.315.973127749
3    1876     BODATA_NEW           ***     +DATA/PRIMARY/datafile/bodata_new.295.973130253
4    32715    USERS                ***     +DATA/PRIMARY/datafile/users.358.973111005
5    100      WSIDX                ***     +DATA/PRIMARY/datafile/wsidx.285.973130519
(...)
92   10240    SFSDATA              ***     +DATA/PRIMARY/datafile/sfsdata.302.973129281
93   10240    SFSDATA              ***     +DATA/PRIMARY/datafile/sfsdata.300.973129815

List of Temporary Files
=======================
File Size(MB) Tablespace           Maxsize(MB) Tempfile Name
---- -------- -------------------- ----------- --------------------
1    8000     TEMP2                8000        +DATA/STANDBY/tempfile/temp2.360.967677721
2    1000     LMTEAMTEMP           30000       +DATA/STANDBY/tempfile/lmteamtemp.318.967677721
3    500      MGMTTEMP             30000       +DATA/STANDBY/tempfile/mgmttemp.309.967677721
4    500      LICMANTEMP           30000       +DATA/STANDBY/tempfile/licmantemp.345.967
(...)

After this, a reboot of the database got it back up, working and applying archived logs again, although with a little noise in the alert log.
DGMGRL> show database old-primary

Database - old-primary

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   11 minutes 3 seconds <-- This means that something is not right, but let's hope that DG Broker takes care of it.
  Apply Lag:       0 seconds
  Real Time Query: OFF
  Instance(s):
    PRIMARY

Database Status:
SUCCESS

DGMGRL> switchover to old-primary;
Performing switchover NOW, please wait...
Error: ORA-16775: target standby database in broker operation has potential data loss <-- No, it doesn't
Failed.
Unable to switchover, primary database is still "new-primary"

It was the Broker changing log_archive_dest_1 again. We fixed it, the missing archived logs were shipped and now, yes, this is the one... no. The old-primary failed to shut down. We believe this was due to something like "Bug 20139391 - Dataguard Broker calls CRS to restart for Databases not being managed by CRS (Doc ID 20139391.8)", although this was not exactly our case. We were running out of time, and the maintenance window had already been stretched way too far. Our IPC got involved here again and helped complete a manual switchover after getting rid of the Broker, and all was fine.
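Getting rid of the Broker and switching over by hand follows the classic manual sequence. A condensed sketch, assuming redo transport is current and the standby is mounted (this is the generic procedure, not a transcript of our exact session):

```sql
-- On both databases, take the Broker out of the picture:
ALTER SYSTEM SET dg_broker_start=FALSE;

-- On the current primary:
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;
-- then SHUTDOWN IMMEDIATE and STARTUP MOUNT

-- On the current standby, once it reports TO PRIMARY / SESSIONS ACTIVE
-- in v$database.switchover_status:
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;
ALTER DATABASE OPEN;
```

Checking v$database.switchover_status on both sides before each COMMIT TO SWITCHOVER is what saves you from ORA-16775-style surprises when the Broker is no longer mediating.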
ALTER SYSTEM SET log_archive_trace=0 SCOPE=BOTH SID='PRIMARY';
ALTER SYSTEM SET log_archive_format='PRIMARY_%t_%r_%s.dbf' SCOPE=SPFILE SID='PRIMARY';
ALTER SYSTEM SET standby_file_management='AUTO' SCOPE=BOTH SID='*';
ALTER SYSTEM SET archive_lag_target=1800 SCOPE=BOTH SID='*';
ALTER SYSTEM SET log_archive_max_processes=4 SCOPE=BOTH SID='*';
ALTER SYSTEM SET log_archive_min_succeed_dest=1 SCOPE=BOTH SID='*';
ALTER SYSTEM SET fal_server='STANDBY' SCOPE=BOTH;
ALTER SYSTEM SET log_archive_dest_1='location="USE_DB_RECOVERY_FILE_DEST"',' valid_for=(STANDBY_LOGFILE,STANDBY_ROLE)' SCOPE=BOTH SID='PRIMARY';

So, it is the Broker that changes the parameters related to a standby configuration. But why these? According to the Oracle documentation:
Associated with each database are various properties that the DMON process uses to control the database's behavior. The properties are recorded in the configuration file as a part of the database's object profile that is stored there. Many database properties are used to control database initialization parameters related to the Data Guard environment. To ensure that the broker can update the values of parameters in both the database itself and in the configuration file, you must use a server parameter file to control static and dynamic initialization parameters. The use of a server parameter file gives the broker a mechanism that allows it to reconcile property values selected by the database administrator (DBA) when using the broker with any related initialization parameter values recorded in the server parameter file. When you set values for database properties in the broker configuration, the broker records the change in the configuration file and propagates the change to all of the databases in the Data Guard configuration.

So, we learned the hard way that once the Broker is in charge, we must use the Broker to change standby-related parameters, or it will revert the changes at the next startup. In this case, we were unlucky to have a setting in place that added even more noise to the overall situation.
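In practice, that means changing those same settings through DGMGRL properties instead of ALTER SYSTEM. A hedged example with generic property and database names (ArchiveLagTarget and LogArchiveMaxProcesses are standard Broker properties; 'PRIMARY' stands in for whatever db_unique_name your configuration uses):

```
DGMGRL> EDIT DATABASE 'PRIMARY' SET PROPERTY ArchiveLagTarget=1800;
DGMGRL> EDIT DATABASE 'PRIMARY' SET PROPERTY LogArchiveMaxProcesses=4;
DGMGRL> SHOW DATABASE 'PRIMARY' ArchiveLagTarget;
```

The Broker then records the values in its configuration file and applies the matching initialization parameters itself, so they survive restarts instead of being silently reverted.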
At the end of the day, even the best make assumptions and mistakes. The first assumption we made was that everything was properly set up and ready for a DR scenario. We hadn't had much opportunity to test it, so we had to work with that assumption. As it turned out, the setup was quite good, but not good enough. The main mistake here was not a mistake per se, in the sense that something was done the wrong way; the mistake was not questioning what we were doing and, more importantly, how we were doing it. We simply assumed, and here it is again, that a bequeath connection that allows us to change the configuration and even shut down the remote database would be valid for a switchover operation, which it was not. So, lessons learned: test your DR scenarios, never make assumptions, and question what you are doing, even more so if you are not doing it frequently or are following instructions given by others, and that includes Oracle support. There is always room to learn something new, and the bigger the failure, the better the experience.