Blog | Pythian

What's eating my database? Where to focus performance efforts using Oracle AWR

Written by Jared Still | Sep 6, 2016 4:00:00 AM

Defining the Gold Standard: Average Active Sessions (AAS)

Traditional methods include looking at CPU, load, and memory statistics from sar or other sources. A better method, however, is to let the Oracle database itself tell you where to concentrate your efforts. The data is all available in AWR; all that is necessary is to go find it.


The gold standard for determining database activity is Average Active Sessions (AAS), a metric that appeared when Oracle introduced ASH and AWR as performance monitoring tools. AAS measures just how active the database is, and is defined as the "number of sessions that are either working or waiting" at a point in time.
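Before mining history, the current AAS can be spot-checked from the in-memory metric view v$sysmetric. This is a sketch, not from the original article; group_id = 2 is commonly the 60-second "long duration" metric group, but verify that against v$metricgroup on your version:

```sql
-- Current AAS from the in-memory metrics (illustrative sketch).
-- group_id = 2 is typically the 60-second metric group; confirm via v$metricgroup.
select begin_time, end_time, round(value, 2) aas
from v$sysmetric
where metric_name = 'Average Active Sessions'
and group_id = 2;
```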

Armed with that information, you can dig into AWR and find the most active times from an AAS perspective. An initial attempt to find the top 10 most active periods in a database looked like this; the following SQL ranks AWR snapshots by the maximum AAS value.

--aas-1.sql

with aas as (
    select snap_id, value, dbid, instance_number, rownum my_rownum from (
        select distinct snap_id, dbid, instance_number,
            max(value) over (partition by snap_id order by snap_id) value
        from dba_hist_sysmetric_history
        where metric_name = 'Average Active Sessions'
        order by value desc
    )
),
top10aas as (
    select sn.begin_interval_time begin_time, aas.instance_number,
        aas.snap_id, aas.value, aas.dbid
    from aas
    join dba_hist_snapshot sn on sn.snap_id = aas.snap_id
        and sn.instance_number = aas.instance_number
    where my_rownum <= 10
    order by value desc
)
select * from top10aas

Validating Data: Moving from Metrics to ASH Calculations

That seemed to work well until some apparently spurious values showed up. A server with 32 cores showed an AAS of 3000+ during one AWR snapshot period. As there was no other evidence (CPU, memory, IO) to support this, the value was likely in error.

So rather than relying on the AAS metric, a different measure was chosen. The number of ASH samples per snapshot where a session is on CPU or waiting on a non-idle event is counted as DB Time (dba_hist_active_sess_history retains one sample every 10 seconds, which is why the sample count is multiplied by 10 in the query below to approximate seconds of DB time). The ratio of db_time to elapsed_time is then used to determine the most active periods.

--aas-2.sql

with aas as (
  select snap_id, value, dbid, instance_number, rownum my_rownum from (
    select snap_id, round(db_time / elapsed_time,1) value, dbid, instance_number from (
      select distinct h.snap_id, h.instance_number, h.dbid,
        count(*) over (partition by h.snap_id, h.instance_number) * 10 db_time,
        (extract( day from (s.end_interval_time - s.begin_interval_time) )*24*60*60)+
        (extract( hour from (s.end_interval_time - s.begin_interval_time) )*60*60)+
        (extract( minute from (s.end_interval_time - s.begin_interval_time) )*60)+
        (extract( second from (s.end_interval_time - s.begin_interval_time))) elapsed_time
      from dba_hist_active_sess_history h
      join dba_hist_snapshot s on s.snap_id = h.snap_id
        and s.instance_number = h.instance_number
      where (wait_class is null or wait_class != 'Idle')
    )
    order by 2 desc
  )
),
top10aas as (
    select sn.begin_interval_time begin_time, aas.instance_number,
        aas.snap_id, aas.value, aas.dbid
    from aas
    join dba_hist_snapshot sn on sn.snap_id = aas.snap_id
        and sn.instance_number = aas.instance_number
    where my_rownum <= 10
    order by value desc
)
select * from top10aas

Comparative Results: Metric History vs. Active Session History

In hindsight, it might be better to rename the top10aas subquery factoring clause to something like topactivity. On the other hand, the two methods can be switched back and forth without affecting any dependent scripts.

Here are examples from an active test database. The first example is using the query on dba_hist_sysmetric_history:

SQL> @aas-1

BEGIN_TIME                     INSTANCE_NUMBER    SNAP_ID      VALUE       DBID
------------------------------ --------------- ---------- ---------- ----------
15-JUL-16 09.00.43.624 AM                    1      18087 294.076413 4107786006
21-JUL-16 11.00.59.752 AM                    1      18233 126.210386 4107786006
09-AUG-16 05.00.40.556 AM                    1      18683 77.0418398 4107786006
...

Next are the results of the query that is considering DB Time:

SQL> @aas-2

BEGIN_TIME                     INSTANCE_NUMBER    SNAP_ID      VALUE       DBID
------------------------------ --------------- ---------- ---------- ----------
12-AUG-16 01.00.52.943 AM                    1      18751       27.7 4107786006
28-JUL-16 01.00.28.734 AM                    1      18391       26.4 4107786006
05-AUG-16 01.00.43.076 AM                    1      18583       25.4 4107786006
...

You probably noticed there is not a 1:1 match between the periods chosen by these two methods. Due to the difference in the underlying metrics there will never be a 1:1 match; however, both methods are valid and will find periods of high activity.

Automation: Preserving Evidence with AWR Baselines

Now that you can identify the top 10 activity periods, what is the next step? You might run AWR or ADDM reports, but first you must ensure the snapshots are not deleted by the AWR retention policy.
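Once a begin/end snapshot pair is known, an AWR text report can be pulled directly from SQL*Plus with dbms_workload_repository.awr_report_text. A sketch, using the dbid and a snapshot id from the sample output above; the end snapshot id of 18752 is assumed for illustration:

```sql
-- Generate an AWR text report for one of the identified periods.
-- The snap_id pair 18751/18752 is illustrative only.
select output
from table(
    dbms_workload_repository.awr_report_text(
        l_dbid     => 4107786006,
        l_inst_num => 1,
        l_bid      => 18751,
        l_eid      => 18752
    )
);
```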

The gold standard here is to create an AWR baseline for each period. The following SQL fragment, based on aas-2.sql, identifies those snapshots:

with aas as (
    select begin_snap_id, end_snap_id, value, dbid, instance_number, rownum my_rownum
    from (
        select begin_snap_id,
            lead(begin_snap_id,1) over (partition by dbid, instance_number order by begin_snap_id) end_snap_id,
            round(db_time / elapsed_time,1) value, dbid, instance_number
        from (
            select distinct h.snap_id begin_snap_id, h.instance_number, h.dbid,
                count(*) over (partition by h.snap_id, h.instance_number) * 10 db_time,
                (extract( day from (s.end_interval_time - s.begin_interval_time) )*24*60*60)+...
            from dba_hist_active_sess_history h
            join dba_hist_snapshot s on s.snap_id = h.snap_id
                and s.instance_number = h.instance_number
                and (h.wait_class is null or h.wait_class != 'Idle')
                and s.end_interval_time > systimestamp - 5
        )
        order by value desc
    )
),
top10 as (
    select sn.begin_interval_time begin_time, aas.instance_number,
        aas.begin_snap_id, aas.end_snap_id, aas.value, aas.dbid
    from aas
    join dba_hist_snapshot sn on sn.snap_id = aas.begin_snap_id
        and sn.dbid = aas.dbid and sn.instance_number = aas.instance_number
    where my_rownum <= 10
    order by value desc
)

Scripting Mass Baseline Creation with PL/SQL

Using the top10 logic above, we can now run create-baselines.sql. This script calls dbms_workload_repository.create_baseline to preserve the data automatically.

set serveroutput on size unlimited

var n_expire_days number
exec :n_expire_days := 1 -- change to something sensible

declare
    v_baseline_pfx varchar2(30) := 'dw';
    v_baseline_name varchar2(128);
    i_expire_days integer := :n_expire_days;
    e_baseline_exists exception;
    pragma exception_init(e_baseline_exists, -13528);
begin
    for aasrec in (
        @@top10
        select begin_time, instance_number, begin_snap_id, end_snap_id, value, dbid from top10
    )
    loop
        begin
            v_baseline_name := v_baseline_pfx || '_' || to_char(aasrec.begin_snap_id)
                || '_' || to_char(aasrec.begin_time,'yyyymmdd-hh24:mi:ss');
            dbms_workload_repository.create_baseline(
                start_snap_id => aasrec.begin_snap_id,
                end_snap_id   => aasrec.end_snap_id,
                baseline_name => v_baseline_name,
                dbid          => aasrec.dbid,
                expiration    => i_expire_days
            );
        exception
            when e_baseline_exists then
                dbms_output.put_line('!!Baseline ' || v_baseline_name || ' already exists');
        end;
    end loop;
end;
/

Final Verification and Retention Strategy

Once the PL/SQL block completes, you can confirm the preserved snapshots in dba_hist_baseline:

SQL> @show-awr-baselines

BASELINE_NAME                BASELINE TYPE  START_SNAP_ID START_SNAP_TIME      EXPIRATION
---------------------------  -------------  ------------- -------------------- --------------------
dw_31888_20160723-08:00:10   STATIC                 31888 2016-07-23 09:00:24  2016-08-31 12:45:03
dw_31946_20160725-18:00:12   STATIC                 31946 2016-07-25 19:00:26  2016-08-31 12:45:03
...
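The show-awr-baselines script itself is not shown in this article; a minimal, hypothetical version might query dba_hist_baseline directly along these lines:

```sql
-- A sketch of what show-awr-baselines could look like (not the author's script).
-- Column names are from dba_hist_baseline; the expiration column holds days,
-- so adding it to creation_time approximates the expiration timestamp shown above.
select baseline_name,
    baseline_type,
    start_snap_id,
    start_snap_time,
    creation_time + expiration expiration_time
from dba_hist_baseline
order by start_snap_id;
```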

The AWR snapshots preserved by the new baselines will not be aged out by the standard AWR retention policy, which allows ample time to run reports and drill down to the root cause of performance issues. The code from this article can also be found at https://github.com/jkstill/oracle-baselines.
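When the analysis is finished, the baselines can be removed so the snapshots age out normally. A sketch using dbms_workload_repository.drop_baseline; the baseline name is illustrative, taken from the sample output above:

```sql
-- Drop a baseline once it is no longer needed.
-- cascade => false leaves the underlying snapshots to normal retention;
-- cascade => true drops the snapshots as well.
begin
    dbms_workload_repository.drop_baseline(
        baseline_name => 'dw_31888_20160723-08:00:10',
        cascade       => false
    );
end;
/
```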
