How to accurately measure data guard lag events
If you work with Data Guard, the lag between the primary and standby database is an informative but difficult metric to measure. I've encountered some "solutions" that use the
COMPLETION_TIME
in
(G)V$ARCHIVED_LOG
for the standby destination as the time of apply, but this isn't accurate.
COMPLETION_TIME
on the standby represents the time the log was written to the filesystem, not when it was applied. To verify this, compare a query of
V$ARCHIVED_LOG
to the alert log entries on the standby:
SQL> select sequence#, dest_id, applied, completion_time from v$archived_log where sequence# = 47 order by 1, 2;
SEQUENCE# DEST_ID APPLIED COMPLETION_TIME
---------- ---------- --------- -------------------
47 1 NO 2019-03-13 21:13:02
47 2 YES 2019-03-13 21:13:02
[oracle@ol7-183-dg2 trace]$ egrep -B1 'Media Recovery Log.*o1\_mf\_1\_47' alert_cdb1.log
2019-03-13T21:21:15.148922+00:00
Media Recovery Log /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
--
2019-03-13T21:22:22.658726+00:00
PR00 (PID:4198): Media Recovery Log /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
--
2019-03-13T21:24:24.180440+00:00
Media Recovery Log /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
[oracle@ol7-183-dg2 trace]$
[oracle@ol7-183-dg2 trace]$ ls -l /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
-rw-r-----. 1 oracle oinstall 79360 Mar 13 21:13 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
[oracle@ol7-183-dg2 trace]$
According to
V$ARCHIVED_LOG
, this log "completed" at 21:13:02. However, the last entry for this redo log in the alert log is timestamped 21:24:24. That's a difference of over 11 minutes! (The multiple entries in the alert log are due to interruptions in transport and failed apply. The timestamp of the last entry will be used later as the "apply date" for the log.) The timestamp of the log on the filesystem matches
COMPLETION_TIME
. I wanted a way to report the true lag—the delta between the primary's completion time and the apply time in the standby's alert log. The ideal solution would be accessible to a client and allow reporting on multiple Data Guard configurations. Surely Oracle wouldn't make this too difficult, would they? It turns out the solution is not that simple. Testing was done using Oracle 18c on Oracle Linux 7.6. This solution should work for other versions and operating systems, as well. I used Tim Hall's helpful
Vagrant Builds to quickly set up and configure my test environment. This method requires that both the primary and standby database are available. History can only be reported for events present in the current alert log. Keep this in mind if ADR or a similar process is used to rotate the standby's alert log. The
V$ARCHIVED_LOG
view on the primary is aware of logs on both the primary and standby databases. The same view on the standby database doesn't know about logs on the primary. The content of standby entries differs slightly, too. On the primary, the
NAME
field shows the file name for files on the primary. It's unaware of file names on the standby, which will be different with Oracle Managed Files. I get information on the archive log files from both primary and standby via a database link:
create database link cdb1_stby connect to MYUSER identified by "MyPassword" using 'cdb1_stby';
The meat of this technique takes advantage of the
X$DBGALERTEXT
table.
X$DBGALERTEXT
is an in-memory view of the database alert log.
V$DIAG_ALERT_EXT
may also be used, but this view sometimes exhibits performance problems. But X$ tables can only be read by privileged accounts and SELECT can't be granted:
SQL> conn myuser/MyPassword@cdb1
Connected.
SQL> select count(*) from sys.X$DBGALERTEXT;
select count(*) from sys.X$DBGALERTEXT
*
ERROR at line 1:
ORA-00942: table or view does not exist
SQL> conn / as sysdba
Connected.
SQL> grant select on sys.X$DBGALERTEXT to MYUSER;
grant select on sys.X$DBGALERTEXT to MYUSER
*
ERROR at line 1:
ORA-02030: can only select from fixed tables/views
Fortunately, there's a workaround. Create a view on the table and grant select on the view:
create view v_X$DBGALERTEXT as select * from sys.X$DBGALERTEXT;
select count(*) from v_X$DBGALERTEXT;
grant select on v_X$DBGALERTEXT to MYUSER;
If performance isn't a concern and the existing view is acceptable, grant SELECT privileges on V$DIAG_ALERT_EXT:
grant select on V$DIAG_ALERT_EXT to MYUSER;
The next step is to get records from the standby's alert log associated with redo log apply. These include "Media Recovery Log" in the
MESSAGE_TEXT
columns of both
X$DBGALERTEXT
and
V$DIAG_ALERT_EXT
. The alert log record needs to be joined to a matching row in the standby's
V$ARCHIVED_LOG
table. It might seem obvious to extract the thread and sequence directly but this may not always be straightforward. It would depend on the value of
LOG_ARCHIVE_FORMAT
and could change over the life of the alert log. This gets into potentially complex parsing of the message. Instead, I chose to extract the file name using
substr(a.message_text, instr(a.message_text, 'Log ') + 4)
. This pulls everything after the word "Log " in the message:
Media Recovery Log /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
That's the same value found for
NAME
in the standby
V$ARCHIVED_LOG
view. Or, nearly so. There's an extra character at the end:
SQL> select distinct length(substr(a.message_text, instr(a.message_text, 'Log ') + 4)) from sys.v_X$DBGALERTEXT@cdb1_stby a where a.message_text like '%Media Recovery Log %';
LENGTH(SUBSTR(A.MESSAGE_TEXT,INSTR(A.MESSAGE_TEXT,'LOG')+4))
------------------------------------------------------------
73
SQL> select distinct length(name) from v$archived_log@cdb1_stby where dest_id = 1;
LENGTH(NAME)
------------
72
That non-printing character can be removed with
REGEXP_REPLACE
:
regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '')
. The timestamp in the alert log also must be cast as a date and any time zone differences accounted for. There may be multiple entries for the same file so only the last (MAX) timestamp is needed. I now have a query to extract the actual apply time and redo log name from the standby alert log:
col name for a72
alter session set nls_date_format='YYYY-MM-DD HH24:MI:SS';
select max(cast((from_tz(cast(a.originating_timestamp as timestamp), 'GMT') at time zone DBTIMEZONE) as date)) as apply_time
, regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '') as name
from sys.v_X$DBGALERTEXT@cdb1_stby a
where a.message_text like '%Media Recovery Log %'
group by regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '');
APPLY_TIME NAME
------------------- ------------------------------------------------------------------------
2019-03-13 21:24:24 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_49_g8lwo8xr_.arc
2019-03-13 21:24:24 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_51_g8lwoct0_.arc
2019-03-13 21:24:24 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_48_g8lwo81n_.arc
2019-03-13 21:24:24 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
2019-03-13 21:24:24 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_50_g8lwocsv_.arc
...
The redo log name can be used to join to
V$ARCHIVED_LOG
on the standby database:
select max(cast((from_tz(cast(a.originating_timestamp as timestamp), 'GMT') at time zone DBTIMEZONE) as date)) as apply_time
, s.sequence#
, s.thread#
, regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '') as name
from sys.v_X$DBGALERTEXT@cdb1_stby a
, v$archived_log@cdb1_stby s
where a.message_text like '%Media Recovery Log %'
and s.name = regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '')
group by s.sequence#
, s.thread#
, regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '')
order by s.sequence#, 1;
APPLY_TIME SEQUENCE# THREAD# NAME
------------------- ---------- ---------- ------------------------------------------------------------------------
2019-03-13 21:24:24 47 1 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
2019-03-13 21:24:24 48 1 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_48_g8lwo81n_.arc
2019-03-13 21:24:24 49 1 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_49_g8lwo8xr_.arc
2019-03-13 21:24:24 50 1 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_50_g8lwocsv_.arc
2019-03-13 21:24:24 51 1 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_51_g8lwoct0_.arc
...
(If the standby's alert log is rotated or truncated, this query can be used as a basis to periodically insert historic redo timestamps into a separate history table.) The thread and sequence numbers are then joined to
V$ARCHIVED_LOG
on the primary database and use the primary
COMPLETION_TIME
to calculate the true lag between primary and standby for each log:
with lag as (
select p.thread#
, p.sequence#
, x.applied
, p.first_time
, p.next_time
, p.completion_time as primary_apply_time
, decode(x.applied, 'YES', x.completion_time, sysdate) as standby_completion_time
, x.standby_apply_time
, round(to_number(decode(x.applied, 'YES', x.standby_apply_time, sysdate) - p.completion_time) * 1440, 2) as apply_delta_minutes
from v$archived_log p
, v$archive_dest pd
,(select max(cast((from_tz(cast(a.originating_timestamp as timestamp), 'GMT') at time zone DBTIMEZONE) as date)) as standby_apply_time
, s.sequence#
, s.thread#
, s.applied
, s.completion_time
from sys.v_X$DBGALERTEXT@cdb1_stby a
, v$archived_log@cdb1_stby s
where a.message_text like '%Media Recovery Log %'
and s.name = regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '')
group by s.sequence#
, s.thread#
, s.applied
, s.completion_time
, regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '')
) x
where p.thread# = x.thread#
and p.sequence# = x.sequence#
and p.dest_id = pd.dest_id
and pd.target = 'PRIMARY'
and p.standby_dest = 'NO'
order by 9 desc, p.sequence#)
select *
from lag
order by apply_delta_minutes desc;
THREAD# SEQUENCE# APPLIED FIRST_TIME NEXT_TIME PRIMARY_APPLY_TIME STANDBY_COMPLETION_ STANDBY_APPLY_TIME APPLY_DELTA_MINUTES
---------- ---------- --------- ------------------- ------------------- ------------------- ------------------- ------------------- -------------------
1 47 YES 2019-03-13 21:10:36 2019-03-13 21:13:02 2019-03-13 21:13:02 2019-03-13 21:13:02 2019-03-13 21:24:24 11.37
1 48 YES 2019-03-13 21:13:02 2019-03-13 21:14:15 2019-03-13 21:14:15 2019-03-13 21:14:16 2019-03-13 21:24:24 10.15
1 49 YES 2019-03-13 21:14:15 2019-03-13 21:14:16 2019-03-13 21:14:16 2019-03-13 21:14:16 2019-03-13 21:24:24 10.13
1 50 YES 2019-03-13 21:14:16 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:24:24 10.08
1 51 YES 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:24:24 10.08
If
APPLIED
for the log on the standby is
NO
,
SYSDATE
is substituted to show the current delay. Out of curiosity, I checked the lag reported by the standby's
COMPLETION_TIME
. I added
round(to_number(decode(x.applied, 'YES', x.completion_time, sysdate) - p.completion_time) * 1440, 2) as completion_delta_minutes
. It demonstrated this isn't an accurate way to report lag!:
THREAD# SEQUENCE# APPLIED FIRST_TIME NEXT_TIME PRIMARY_APPLY_TIME STANDBY_COMPLETION_ STANDBY_APPLY_TIME COMPLETION_DELTA_MINUTES APPLY_DELTA_MINUTES
---------- ---------- --------- ------------------- ------------------- ------------------- ------------------- ------------------- ------------------------ -------------------
1 47 YES 2019-03-13 21:10:36 2019-03-13 21:13:02 2019-03-13 21:13:02 2019-03-13 21:13:02 2019-03-13 21:24:24 0 11.37
1 48 YES 2019-03-13 21:13:02 2019-03-13 21:14:15 2019-03-13 21:14:15 2019-03-13 21:14:16 2019-03-13 21:24:24 .02 10.15
1 49 YES 2019-03-13 21:14:15 2019-03-13 21:14:16 2019-03-13 21:14:16 2019-03-13 21:14:16 2019-03-13 21:24:24 0 10.13
1 50 YES 2019-03-13 21:14:16 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:24:24 0 10.08
1 51 YES 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:24:24 0 10.08
I even saw examples of negative lag where the log was written earlier on the standby than the primary!:
THREAD# SEQUENCE# APPLIED FIRST_TIME NEXT_TIME PRIMARY_APPLY_TIME STANDBY_COMPLETION_ STANDBY_APPLY_TIME COMPLETION_DELTA_MINUTES APPLY_DELTA_MINUTES
---------- ---------- --------- ------------------- ------------------- ------------------- ------------------- ------------------- ------------------------ -------------------
1 33 YES 2019-03-13 14:28:08 2019-03-13 14:28:13 2019-03-13 14:28:38 2019-03-13 14:28:13 2019-03-13 14:31:28 -.42 2.83
I must give credit and thanks to Franck Pachot, whose post
Data Guard Gap History inspired me to find a way to retrieve this information from a client using only SQL. This avoids the need to be on the host and doesn't require access to the filesystem. It also shouldn't be difficult to extend this technique to centralize reporting and monitoring for multiple Data Guard configurations.
Share this
You May Also Like
These Related Stories
A review of Tungsten Replicator: Part 1 - Installation
A review of Tungsten Replicator: Part 1 - Installation
Sep 1, 2011
5
min read
7 Best Practices for Preparing Your Database for Black Friday
7 Best Practices for Preparing Your Database for Black Friday
Oct 30, 2013
3
min read
What can your Google Workspace reseller do for you?
What can your Google Workspace reseller do for you?
Dec 12, 2022
3
min read
No Comments Yet
Let us know what you think