In the last month, we have been hit with large-scale failures at two clients. The first involved network issues; the second, disk failures. Fun fun.
We have one production and two standby environments. In the case of the first failure, we went into our trusty Data Guard config and performed what can only be described as a minor miracle.
First we connected to the standby that was to become the primary.
Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys/xxxxx@rcatstdby2
Connected.
DGMGRL> show configuration;

Configuration
  Name:                rcat
  Enabled:             YES
  Protection Mode:     MaxPerformance
  Fast-Start Failover: DISABLED
  Databases:
    rcat_aussw10_db - Physical standby database
    rcat_nflsw01_db - Primary database
    rcat_aussw20_db - Physical standby database

Current status for "rcat":
^C
We had to control-C out because it hung: it couldn't connect to nflsw01, so we knew that, at least as far as Data Guard was concerned, there was a problem. Then we took the steps that separate the men from the boys.
oracle@aussw10-->(ORACLE_SID = rcat)
** prodstdby2, rcatstdby2, limtest, COREprod, **
/home/oracle> dgmgrl
DGMGRL for Solaris: Version 10.2.0.2.0 - 64bit Production
Copyright (c) 2000, 2005, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys/xxxxxxx@rcatstdby2
Connected.
DGMGRL> remove site rcat_nflsw01_db
you cannot remove the primary site
DGMGRL> failover to rcat_aussw10_db graceful;
Performing failover NOW, please wait...
Failover succeeded, new primary is "rcat_aussw10_db"
DGMGRL> show configuration

Configuration
  Name:                rcat
  Enabled:             YES
  Protection Mode:     MaxPerformance
  Fast-Start Failover: DISABLED
  Databases:
    rcat_aussw10_db - Primary database
    rcat_nflsw01_db - Physical standby database (disabled)
    rcat_aussw20_db - Physical standby database

Current status for "rcat":
SUCCESS
DGMGRL>
That was it. What a letdown! No magical status updates, no little "dots" marching across the screen. Just one line and 15 seconds. Oracle, get your act together! I even had to use bold text to make it stand out.
What happened, you may ask. We took one of the standbys and turned it into the primary. The other standby then automagically started recovering from the new primary! Wow! Maybe you don't know, but once upon a time this was a huge, complicated process. I remember doing manual failovers (ugh!) in the past, and now it just feels so... 2005. It's really amazing what Data Guard can do, isn't it?
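Under the covers, that one-line failover is the Broker orchestrating the same SQL we used to type by hand; you can see the statements fire in the alert log excerpt at the end of this article. A rough sketch of the sequence on a mounted physical standby (10.2 syntax; exact options can vary by version):

```sql
-- Sketch of what the Broker runs during a graceful failover to a
-- physical standby. Statement names match the 10.2 alert log output.

-- 1. Stop managed recovery, then apply all remaining redo, including
--    the standby redo logs (terminal recovery).
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE;

-- 2. Convert the standby into the primary role.
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN;

-- 3. Open the new primary.
ALTER DATABASE OPEN;
```

The Broker also handles the cleanup you'd otherwise do yourself: resetting the archive destinations and pointing the surviving standby at the new primary.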
Sadly, it can't take care of everything. The second failure, the disk one, managed to fubar Data Guard badly enough that we had to fall back on our trusty 2005 version of a manual failover. We were using standby redo logs, and when the failure hit, some of our archive logs hadn't made it over because we weren't using ASYNC. I'm still not sure how this happened, but when Data Guard tried to fail over, it found that we were missing archive logs and redo, and it wouldn't do a graceful failover.
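If you want to see what a standby is missing before attempting a failover, a quick sanity check along these lines helps (these are the standard 10g dynamic views; on RAC you'd look at each thread):

```sql
-- On the standby: report any gaps in the received archive log sequence.
SELECT thread#, low_sequence#, high_sequence#
  FROM v$archive_gap;

-- Compare the highest archive log sequence applied against the
-- highest sequence received, per thread.
SELECT thread#,
       MAX(CASE WHEN applied = 'YES' THEN sequence# END) AS last_applied,
       MAX(sequence#)                                    AS last_received
  FROM v$archived_log
 GROUP BY thread#;
```

A gap here is exactly the situation that makes a graceful failover refuse to proceed, as ours did below.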
Here's the attempt, along with the accompanying alert log from the standby we tried to activate.
DGMGRL> show configuration

Configuration
  Name:            prod
  Enabled:         YES
  Protection Mode: MaxPerformance
  Databases:
    prod_aussw20_db - Primary database
    prod_nflsw01_db - Physical standby database
    prod_aussw10_db - Physical standby database

Current status for "prod":
Error: ORA-16625: cannot reach the primary

DGMGRL> failover to prod_nflsw01_db graceful;
Performing failover NOW. Please wait...
Error: ORA-16771: Converting physical standby to primary failed in the failover operation.
Failed.
Can not proceed to failover.
DGMGRL>
And here is the output from the alert log:
ARCH: Connecting to console port...
Mon Oct 30 07:15:30 2006
Terminal Recovery: completion detected (prod)
Mon Oct 30 07:15:30 2006
ARCH: Connecting to console port...
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DI
Mon Oct 30 07:15:30 2006
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
Mon Oct 30 07:15:30 2006
ALTER DATABASE SWITCHOVER TO PRIMARY (prod)
ARCH: Connecting to console port...
If media recovery active, switchover will wait 900 seconds
Database not available for switchover
End-Of-REDO archived log file has not been recovered
Database not available for switchover
End-Of-REDO archived log file has not been recovered
Switchover: Media recovery required - standby not in limbo
ORA-16139 signalled during: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAI...
It's unfortunate that it wasn't smart enough to recognize that it had missing archive logs; I think it should have offered to perform a resetlogs or something. In the end, we found two of the three missing archive logs on the other standby. The third must be in ORA-00600 purgatory.
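For anyone else stuck doing it the 2005 way: once the recovered archive logs were copied into place, the manual failover was roughly the following (the logfile path is illustrative, not from our environment, and some versions want a bounce before the final open):

```sql
-- Sketch of a manual (pre-Broker) failover on a physical standby,
-- accepting the loss of any redo we never received.

-- Register archive logs copied over from elsewhere so recovery
-- can find them. (Path is hypothetical.)
ALTER DATABASE REGISTER PHYSICAL LOGFILE '/export/backup/arch/prod/arch_1_12345.arc';

-- Apply whatever redo we have.
RECOVER AUTOMATIC STANDBY DATABASE;

-- Force the standby into the primary role despite the missing redo.
ALTER DATABASE ACTIVATE PHYSICAL STANDBY DATABASE;
ALTER DATABASE OPEN;
```

Note that ACTIVATE, unlike a graceful failover, throws away unapplied redo, which is why the Broker refuses to do it for you silently.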
For reference, I've added the alert log from the successful Data Guard failover that I described in the first part of the article. It's neat-o. When it works, it works incredibly well.
MRP0: Background Media Recovery cancelled with status 16037
Thu Oct 19 15:07:28 2006
Errors in file /home/oracle/app/oracle/admin/rcat/bdump/rcat_mrp0_6688.trc:
ORA-16037: user requested cancel of managed recovery operation
Managed Standby Recovery not using Real Time Apply
Recovery interrupted!
Recovered data files to a consistent state at change 68214212
Thu Oct 19 15:07:29 2006
Errors in file /home/oracle/app/oracle/admin/rcat/bdump/rcat_mrp0_6688.trc:
ORA-16037: user requested cancel of managed recovery operation
Thu Oct 19 15:07:29 2006
MRP0: Background Media Recovery process shutdown (rcat)
Thu Oct 19 15:07:29 2006
Terminal Recovery: Stopped real time apply
Thu Oct 19 15:07:29 2006
Attempt to do a Terminal Recovery (rcat)
Thu Oct 19 15:07:29 2006
Media Recovery Start: Managed Standby Recovery (rcat)
Managed Standby Recovery not using Real Time Apply
parallel recovery started with 5 processes
Terminal Recovery timestamp is '10/19/2006 15:07:31'
Terminal Recovery: applying standby redo logs.
Terminal Recovery: thread 1 seq# 98673 redo required
Terminal Recovery: /home/oracle/oradata/rcat/sbylog1_g6.dbf
Identified End-Of-Redo for thread 1 sequence 98673
Thu Oct 19 15:07:31 2006
Incomplete recovery applied all redo ever generated.
Recovery completed through change 68214224
Thu Oct 19 15:07:31 2006
Media Recovery Complete (rcat)
Terminal Recovery: successful completion
Begin: Standby Redo Logfile archival
End: Standby Redo Logfile archival
Resetting standby activation ID 375195464 (0x165d0748)
Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE
Thu Oct 19 15:07:35 2006
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
Thu Oct 19 15:07:35 2006
ALTER DATABASE SWITCHOVER TO PRIMARY (rcat)
If media recovery active, switchover will wait 900 seconds
Standby terminal recovery start SCN: 68214212
SwitchOver after complete recovery through change 68214224
Online log /home/oracle/oradata/rcat/log1_g1.dbf: Thread 1 Group 1 was previously cleared
Online log /home/oracle/oradata/rcat/log2_g1.dbf: Thread 1 Group 1 was previously cleared
Online log /home/oracle/oradata/rcat/log1_g2.dbf: Thread 1 Group 2 was previously cleared
Online log /home/oracle/oradata/rcat/log2_g2.dbf: Thread 1 Group 2 was previously cleared
Online log /home/oracle/oradata/rcat/log1_g3.dbf: Thread 1 Group 3 was previously cleared
Online log /home/oracle/oradata/rcat/log2_g3.dbf: Thread 1 Group 3 was previously cleared
Standby became primary SCN: 68214211
Converting standby mount to primary mount.
Thu Oct 19 15:07:36 2006
Switchover: Complete - Database mounted as primary (rcat)
Thu Oct 19 15:07:36 2006
ARC0: STARTING ARCH PROCESSES
Thu Oct 19 15:07:36 2006
Completed: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN
Thu Oct 19 15:07:36 2006
ARC2: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
ARC0: Becoming the 'no SRL' ARCH
ARC2 started with pid=13, OS id=14326
Thu Oct 19 15:07:37 2006
ALTER SYSTEM SET standby_archive_dest='' SCOPE=BOTH SID='rcat';
Thu Oct 19 15:07:37 2006
ALTER SYSTEM SET log_archive_dest_1='location="/export/backup/arch/rcat"','valid_for=(ONLINE_LOGFILE,ALL_ROLES)' SCOPE=BOTH SID='rcat';
Thu Oct 19 15:07:37 2006
ALTER SYSTEM SET log_archive_dest_state_1='ENABLE' SCOPE=BOTH SID='rcat';
Thu Oct 19 15:07:37 2006
ALTER SYSTEM SET log_archive_dest_2='service="(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=aussw20)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=rcat_aussw20_db_XPT.aussw20.lim.com)(INSTANCE_NAME=rcat)(SERVER=dedicated)))"',' ARCH SYNC NOAFFIRM delay=0 OPTIONAL max_failure=0 max_connections=1 reopen=300 db_unique_name="rcat_aussw20_db" register net_timeout=180 valid_for=(online_logfile,primary_role)' SCOPE=BOTH;
Thu Oct 19 15:07:37 2006
ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH;
Thu Oct 19 15:07:37 2006
ALTER DATABASE OPEN
Thu Oct 19 15:07:38 2006
Assigning activation ID 374183684 (0x164d9704)
Thread 1 opened at log sequence 98674
Current log# 2 seq# 98674 mem# 0: /home/oracle/oradata/rcat/log1_g2.dbf
Current log# 2 seq# 98674 mem# 1: /home/oracle/oradata/rcat/log2_g2.dbf
Successful open of redo thread 1