How to test an Oracle database upgrade using a physical standby
It was a Thursday morning. I started my day at work and found out I had been tasked with running a test upgrade to 11.2.0.4 that very Friday. I mention this to make clear that I did not have much time for planning and testing, so there may be better options to complete this task, but this is the way I found that works.
Let's start with the setup. This is an 11.2.0.3 single-instance database with two physical standbys managed with Data Guard (DG). To add a bit of salt to the mix, Fast-Start Failover (FSFO) is enabled. Let's call these our new database friends: A (the primary), B (one physical standby) and C (the other physical standby). There is a fourth partner in the party: the DG Observer. It is part of the FSFO architecture, a process running, ideally, on a server that is not hosting any of the databases. The idea of the test was to remove B from the DG setup, upgrade it to 11.2.0.4, downgrade it to 11.2.0.3 and then put it back in the mix. Easily said, not so easily accomplished. This blog post is a simplification of the whole process, straight to the point and only showing the possible caveats in case someone faces a requirement like this in the future, myself included.
The plan
So, the Pythian machinery started to work. Yes, we are good, not only because we are individually good, but also because we collaborate internally to make things even better, hence "the machinery" term I've just coined. I started with a basic plan:
- Change the failover target in Data Guard to point to C
- Disable B physical standby in the Data Guard configuration
- Create a guaranteed restore point on B
- Activate the physical standby
- Upgrade to 11.2.0.4
- Downgrade to 11.2.0.3
- Flashback the database
- Enable it back into the Data Guard configuration
- Go get some rest
Get to the point
First things first, and this is a lesson I've learned the hard way: always use TNS to connect to the Data Guard CLI, dgmgrl. Why? Because most of the time you will only be changing or reviewing things, but when it comes to executing a switchover, a bequeath connection is tied to the local instance, which goes down during the operation, and the switchover fails.
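For illustration, and assuming the alias A resolves through your tnsnames.ora, this is the kind of connection I mean, as opposed to the local "dgmgrl /" bequeath form:
oracle@serverA> dgmgrl sys@A
Password:
Connected.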
Now, I start by changing the FSFO target, initially pointing to B, to point to the C database. This requires temporarily disabling FSFO, or you will face the following error:
Error: ORA-16654: fast-start failover is enabled
So, we disable FSFO, change the database configuration and enable FSFO back:
DGMGRL> disable fast_start failover;
Disabled.
DGMGRL> edit database 'A' set property 'FastStartFailoverTarget' = 'C';
Property "FastStartFailoverTarget" updated
DGMGRL> enable fast_start failover;
Error: ORA-16651: requirements not met for enabling fast-start failover
Oops!! What happened here? A quick review of the MOS document "Data Guard Broker - Configure Fast Start Failover, Data Protection Levels and the Data Guard Observer (Doc ID 1508729.1)" showed me that the LogXptMode of the C database was set to ASYNC, while it is required to be SYNC for the database to be eligible as an FSFO target. Let's fix that then:
DGMGRL> edit database C set property 'LogXptMode' = 'SYNC';
Property "LogXptMode" updated
DGMGRL> disable fast_start failover;
Disabled.
DGMGRL> edit database 'A' set property 'FastStartFailoverTarget' = 'C';
Property "FastStartFailoverTarget" updated
DGMGRL> enable fast_start failover;
Enabled.
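By the way, you can confirm a single property straight from DGMGRL with the same show syntax used later in this post; after the change, the output should look something like this:
DGMGRL> show database C LogXptMode;
LogXptMode = 'SYNC'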
Right, the first step completed. I will now send some archived logs to the standby databases, just to make sure that everything is up to date before I disable the B database.
And here comes another piece of advice: enable time and timing in SQL*Plus wherever you are working. It will make your notes more meaningful, and you can easily trace back your work in case something goes wrong. Yes, I learned this one the hard way, too.
sys@A> set time on timing on
01:38:01 sys@A> alter system archive log current;
System altered.
Elapsed: 00:00:00.29
01:38:13 sys@A> alter system archive log current;
System altered.
Elapsed: 00:00:00.92
01:38:14 sys@A> alter system archive log current;
System altered.
Elapsed: 00:00:01.19
01:38:15 sys@A> alter system checkpoint;
System altered.
Elapsed: 00:00:00.18
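Before moving on, it does no harm to confirm on the standby that the redo we just forced out has arrived and been applied. A quick check against the standard v$archived_log view (the actual sequence number is, of course, environment-specific):
sys@B> select max(sequence#) from v$archived_log where applied = 'YES';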
And now I modify the DG configuration.
oracle@serverA> dgmgrl
DGMGRL for Linux: Version 11.2.0.3.0 - 64bit Production
Copyright (c) 2000, 2009, Oracle. All rights reserved.
Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys@A
Password:
Connected.
DGMGRL> show database 'B';
Database - B
Role: PHYSICAL STANDBY
Intended State: APPLY-ON
Transport Lag: 0 seconds
Apply Lag: 0 seconds
Real Time Query: OFF
Instance(s):
B
Database Status:
SUCCESS
DGMGRL> EDIT DATABASE 'B' SET STATE='APPLY-OFF';
Succeeded.
DGMGRL> disable database 'B';
Disabled.
DGMGRL> show configuration
Configuration - fsfo_A
Protection Mode: MaxAvailability
Databases:
A - Primary database
C - (*) Physical standby database
Error: ORA-16820: fast-start failover observer is no longer observing this database
B - Physical standby database (disabled)
Fast-Start Failover: ENABLED
Configuration Status:
ERROR
Wait, what? Another problem? This one was harder to spot. It turned out to be a problem with the Observer: it was unable to connect to the C database due to a lack of proper credentials. An "ORA-01031: insufficient privileges" error in the log file was the tip-off. Simply adding the credentials to the Oracle Wallet used by the Observer fixed the issue, as I was able to verify from that very same connection:
oracle@observer> mkstore -wrl /home/oracle/wallet/.private -createCredential C sys ************
Oracle Secret Store Tool : Version 11.2.0.3.0 - Production
Copyright (c) 2004, 2011, Oracle and/or its affiliates. All rights reserved.
Enter wallet password:
Create credential oracle.security.client.connect_string7
oracle@observer> dgmgrl /@C (This is not a bequeath connection ;) )
DGMGRL for Linux: Version 11.2.0.3.0 - 64bit Production
Copyright (c) 2000, 2009, Oracle. All rights reserved.
Welcome to DGMGRL, type "help" for information.
Connected.
DGMGRL> disable fast_start failover
Disabled.
DGMGRL> enable fast_start failover
Enabled.
DGMGRL> show configuration verbose;
Configuration - fsfo_A
Protection Mode: MaxAvailability
Databases:
A - Primary database
C - (*) Physical standby database
B - Physical standby database (disabled)
(*) Fast-Start Failover target
Properties:
FastStartFailoverThreshold = '30'
OperationTimeout = '30'
FastStartFailoverLagLimit = '30'
CommunicationTimeout = '180'
FastStartFailoverAutoReinstate = 'TRUE'
FastStartFailoverPmyShutdown = 'TRUE'
BystandersFollowRoleChange = 'ALL'
Fast-Start Failover: ENABLED
Threshold: 30 seconds
Target: C
Observer: observer
Lag Limit: 30 seconds (not in use)
Shutdown Primary: TRUE
Auto-reinstate: TRUE
Configuration Status:
SUCCESS
DGMGRL> exit
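For completeness: for a wallet-based connection like /@C to work, the sqlnet.ora on the Observer host must point to the wallet and allow it to override password authentication. A minimal sketch, assuming the wallet location used in the mkstore command above:
WALLET_LOCATION =
  (SOURCE =
    (METHOD = FILE)
    (METHOD_DATA = (DIRECTORY = /home/oracle/wallet/.private))
  )
SQLNET.WALLET_OVERRIDE = TRUE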
At this point, we have B out of the Data Guard equation and can proceed with the upgrade/downgrade part.
Upgrade to 11.2.0.4 and downgrade to 11.2.0.3
In order to run the test, I have to activate the standby database so I can open it as a primary and execute the upgrade/downgrade process. This is why I removed it from the DG configuration as a first step: to avoid facing serious trouble with two primary databases enabled.
So I start by creating a Guaranteed Restore Point (GRP) to easily revert the database back to its standby role. To be able to create the GRP, the redo apply must be stopped, which I already did.
02:53:48 sys@B> CREATE RESTORE POINT BEFORE_UPGRADE_11204 GUARANTEE FLASHBACK DATABASE;
Restore point created.
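A quick sanity check that the restore point is really there and really guaranteed, using the standard v$restore_point view:
sys@B> select name, guarantee_flashback_database from v$restore_point;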
Now that the GRP has been created, I activate the standby and proceed with the tests. The DG broker process on the database must be stopped to avoid conflicts with the DG configuration. I also cleared the log_archive_config init parameter to be sure that no redo is getting out of this database.
02:54:00 sys@B> alter system set dg_broker_start=false scope=spfile;
System altered.
Elapsed: 00:00:00.04
02:54:24 sys@B> alter system set log_archive_config='' scope=both;
System altered.
Elapsed: 00:00:00.05
02:54:41 sys@B> shut immediate;
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.
02:55:01 sys@B> startup mount;
ORACLE instance started.
Total System Global Area 8551575552 bytes
Fixed Size 2245480 bytes
Variable Size 3607104664 bytes
Database Buffers 4932501504 bytes
Redo Buffers 9723904 bytes
Database mounted.
02:56:25 sys@B> alter database activate standby database;
Database altered.
Elapsed: 00:00:00.58
02:56:31 sys@B> alter database open;
Database altered.
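Before starting the upgrade, it is worth double-checking that the database now believes it is a standalone primary. Again, a simple query against v$database does the trick; it should report PRIMARY and READ WRITE at this point:
sys@B> select database_role, open_mode from v$database;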
It is time to upgrade and downgrade the database now. There is plenty of documentation about the process, and I encountered no issues, so there is nothing about it worth including here.
I'll post the script I used to run the catupgrd.sql script, just in case it is useful to someone in the future. As you know, this script modifies the data dictionary to adjust it to the new version. Depending on the gap between versions and the size of the data dictionary, it may run for quite a long time. Combine this with the risk of a remote session dropping over VPN or the like, and you will want to make sure that your database session is still there when you come back. There are tools like screen or tmux, but they may not be available or usable, so I usually rely on nohup and a simple bash script. This particular version is meant to be run after loading the proper Oracle environment variables with oraenv, but you can easily modify it to include that step if you want to schedule the script with cron or similar.
#!/bin/bash
sqlplus /nolog << EOF
conn / as sysdba
set time on timing on trimspool on pages 999 lines 300
alter session set nls_date_format='dd-mm-yyyy hh24:mi:ss';
spool /covisint/user/a_catbpd/11204Upgrade/upgrade_dryrun_output.log
@$ORACLE_HOME/rdbms/admin/catupgrd.sql
spool off;
exit
EOF
Once the script is ready and has execution permissions, simply run it in nohup mode with some logging in place:
nohup ./run_upgrade.sh > run_upgrade_`date "+%F_%H-%M"`.log 2>&1 &
The same script can be used to run the catdwgrd.sql and catrelod.sql scripts for the downgrade by simply changing the relevant lines.
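While catupgrd.sql runs, I like to keep an eye on the nohup log, and once it finishes, confirm the component versions registered in the data dictionary. Two quick checks (the log file name is the one generated by the nohup command above):
tail -f run_upgrade_*.log
sys@B> select comp_id, version, status from dba_registry;
After the upgrade the components should show 11.2.0.4 and VALID; after the downgrade and catrelod.sql, they should be back to 11.2.0.3.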
Back to the start point
After the upgrade and downgrade tests are successfully completed, it is time to bring everything back to what it looked like when we started this exercise.
The very first step is to flash back the B standby database to the GRP I created before. This is done while still using the 11.2.0.4 binaries.
05:03:19 sys@B> flashback database to restore point before_upgrade_11204;
Flashback complete.
Elapsed: 00:00:09.92
05:03:34 sys@B> alter database convert to physical standby;
Database altered.
Elapsed: 00:00:00.39
05:04:06 sys@B> shutdown immediate
ORA-01507: database not mounted
ORACLE instance shut down.
After the flashback is complete, we mount the database again, but now with the 11.2.0.3 binaries, and start the DG broker process:
05:05:22 > startup nomount
ORACLE instance started.
Total System Global Area 8551575552 bytes
Fixed Size 2245480 bytes
Variable Size 1358957720 bytes
Database Buffers 7180648448 bytes
Redo Buffers 9723904 bytes
05:05:30 > alter system set dg_broker_start=true scope=Both;
System altered.
Elapsed: 00:00:00.01
05:05:38 > alter database mount standby database;
Database altered.
Elapsed: 00:00:05.25
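Before handing the database back to the broker, one last sanity check that the role conversion stuck; v$database should now report PHYSICAL STANDBY and MOUNTED:
sys@B> select database_role, open_mode from v$database;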
Once the database is back up and ready to apply redo, I can enable it back in the DG configuration:
oracle@serverA> dgmgrl
DGMGRL for Linux: Version 11.2.0.3.0 - 64bit Production
Copyright (c) 2000, 2009, Oracle. All rights reserved.
Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys@A
Password:
Connected.
DGMGRL> show configuration
Configuration - fsfo_A
Protection Mode: MaxAvailability
Databases:
A - Primary database
C - (*) Physical standby database
B - Physical standby database (disabled)
Fast-Start Failover: ENABLED
Configuration Status:
SUCCESS
DGMGRL> enable database B
Enabled.
DGMGRL> EDIT DATABASE 'B' SET STATE='APPLY-ON';
Succeeded.
DGMGRL> show database B
Database - B
Role: PHYSICAL STANDBY
Intended State: APPLY-ON
Transport Lag: 0 seconds
Apply Lag: 1 hour(s) 11 minutes 49 seconds <== We have some lag here
Real Time Query: OFF
Instance(s):
B
Database Status:
SUCCESS
After a few minutes, the standby is back in sync with the primary:
DGMGRL> show database B
Database - B
Role: PHYSICAL STANDBY
Intended State: APPLY-ON
Transport Lag: 0 seconds
Apply Lag: 0 seconds <== It caught up quite quickly
Real Time Query: OFF
Instance(s):
B
Database Status:
SUCCESS
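If you prefer to verify the lag from SQL rather than from the broker, the standard v$dataguard_stats view on the standby tells the same story:
sys@B> select name, value from v$dataguard_stats where name in ('transport lag', 'apply lag');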
I now set B again as the FSFO target and validate the final setup.
DGMGRL> DISABLE FAST_START FAILOVER
Disabled.
DGMGRL> edit database 'A' set property 'FastStartFailoverTarget' = 'B';
Property "FastStartFailoverTarget" updated
DGMGRL> edit database 'B' set property 'FastStartFailoverTarget' = 'A';
Property "FastStartFailoverTarget" updated
DGMGRL> edit database 'C' set property 'FastStartFailoverTarget' = '';
Property "FastStartFailoverTarget" updated
DGMGRL> ENABLE FAST_START FAILOVER
Enabled.
DGMGRL> show database A FastStartFailoverTarget;
FastStartFailoverTarget = 'B'
DGMGRL> show database B FastStartFailoverTarget;
FastStartFailoverTarget = 'A'
DGMGRL> show database C FastStartFailoverTarget;
FastStartFailoverTarget = ''
DGMGRL> show fast_start failover
Fast-Start Failover: ENABLED
Threshold: 30 seconds
Target: B
Observer: observer
Lag Limit: 30 seconds (not in use)
Shutdown Primary: TRUE
Auto-reinstate: TRUE
Configurable Failover Conditions
Health Conditions:
Corrupted Controlfile YES
Corrupted Dictionary YES
Inaccessible Logfile NO
Stuck Archiver NO
Datafile Offline YES
Oracle Error Conditions:
(none)
DGMGRL>
DGMGRL> show configuration
Configuration - fsfo_A
Protection Mode: MaxAvailability
Databases:
A - Primary database
B - (*) Physical standby database
C - Physical standby database
Fast-Start Failover: ENABLED
Configuration Status:
SUCCESS
Don't forget to drop the GRP on B, or you may have a little alarm to deal with later :)
DROP RESTORE POINT BEFORE_UPGRADE_11204;
Final thoughts
This was an interesting exercise for several reasons.
First, it gave me the chance to test a procedure before it is actually executed in a production environment. This is always a good thing. No matter how many times you have done something, there is always a slight change, something done differently in a given installation, or a bug hiding behind the bushes. Testing will help you prepare for what may come ahead and reduce the surprises to a minimum.
It will also give you deeper familiarity with the environment you are working on, and confidence while running the process in production. As consultants, we may not work frequently with a given customer, and getting to know the environment we are working in eases the task.
Another reason I liked this exercise is that I, once more, got support from my co-workers here at Pythian (special thanks to my team members). Throw an idea into a Slack channel and they will come back with more ideas, experiences, caveats and whatnot, making the task more enjoyable and better executed.
If four eyes see better than two, imagine sixteen or twenty. There will still be room for mistakes and issues, but they will surely be quite rare.