Don’t forget to sign up for more updates here.
Statspack vs. AWR: Wrong number of SQL Executions
This post is about a Statspack anomaly I discovered while analyzing execution trends of a particular SQL on one of our clients’ databases. I decided to share it, since its appearance can be misleading. The database in question was an Oracle EE version 11.2.0.3. However, as Statspack is no longer maintained, the same problem can be reproduced on 19c. In this post, I’ll also verify whether the same problem shows up in AWR (spoiler: it doesn’t).
Problem description
Mining the Statspack repository for a certain SQL_ID using this simple script revealed a sharp increase in its number of executions —a SQL which is usually executed once or twice per snapshot and has apparently been executed 409 times:
SNAP_ID EXECUTIONS ELAPSED_TIME_SEC_TOTAL DISK_READS_PER_EXEC BUFFER_GETS_PER_EXEC ROWS_PROCESSED_PER_EXEC -------- ---------- ---------------------- ------------------- -------------------- ----------------------- 31451 2 1126.48 1767896 1768738 10904478 31452 2 1579.11 1760511.5 1761343.5 10904478 31453 1 498.76 1767896 1758346 10904478 31455 409 481113.8 1752410.67 2256430.3 10886122.5
A proper Statspack report reported the same numbers. Cross-checking with the application side, there were indications that this couldn’t be true.
Looking closer into the suspiciously high execution count
As Statspack’s code is available in $ORACLE_HOME/rdbms/admin, we can easily verify how snapshots are created (spcpkg.sql) and also how reports are generated (sprepins.sql).
Snapshot creation
The PL/SQL procedure “SNAP_SQL” located in STATSPACK’s package body contains the INSERT statement which captures SQL execution statistics. It keeps only the relevant parts for this investigation:
insert into stats$sql_summary select l_snap_id , p_dbid , p_instance_number , max(substrb(sql_text,1,31)) text_subset , max(sql_id) sql_id . . , sum(executions) executions , sum(px_servers_executions) px_servers_executions . . from v$sql sql where is_obsolete = 'N' and sql_id in (select sql_id from stats$v$sqlstats_summary sqlstats where ( buffer_gets > l_buffer_gets_th or disk_reads > l_disk_reads_th or parse_calls > l_parse_calls_th or executions > l_executions_th or sharable_mem > l_sharable_mem_th or version_count > l_version_count_th ) ) group by old_hash_value, address;
Notice that the SQL aggregates and stores execution statistics for a particular by “old_hash_value, address”:
- “old_hash_value” is the same for all child cursors of a particular SQL (similar as SQL_ID) – Tanel Poder explains it in details in this post https://tanelpoder.com/2009/02/22/sql_id-is-just-a-fancy-representation-of-hash-value/ .
- Same goes for “address”, as it represents the address of the handle to the parent for this cursor.
Since the “group by” doesn’t also include the CHILD_ADDRESS column, the SQL execution statistics are aggregated across all of the child cursors. It’s also worth noting that the values in stats$sql_summary are cumulative. This means we have to calculate the difference between two snapshots to obtain the actual delta for a particular execution statistic.
Report generation
This brings us to the code responsible for calculating the number of executions between two snapshots in a Statspack report, available in $ORACLE_HOME/rdbms/admin/sprepins.sql. Here again, is only the relevant part:
. . select old_hash_value , text_subset , sum(case when snap_id = &&begin_snap and prev_snap_id = -1 then 0 else case when (address != prev_address) or (executions < prev_executions) then executions else executions - prev_executions end end) delta_executions . . ) delta_rows_processed from (select snap_id , old_hash_value , text_subset . . , executions ,(lag(executions, 1, 0) over (partition by old_hash_value , dbid , instance_number order by snap_id)) prev_executions ) prev_rows_processed from stats$sql_summary s where s.snap_id between &&begin_snap and &&end_snap and s.dbid = &&dbid and s.instance_number = &&inst_num ) group by old_hash_value , text_subset , module
The relevant part for us is the logic in the ELSE part of the CASE statement. It returns the “executions” value stored in stats$sql_summary when:
- “address != prev_address“: this means that a new parent cursor was created for a specific SQL between the analyzed snapshots and there’s no delta to be computed. This could be due to the previous parent being flushed out of the library cache for various reasons (e.g. shared pool pressure, explicit purge of the cursor, cursor obsoleting) or an instance restart. The instance restart case is actually not relevant here, as the Statspack report generation script checks whether both snapshots have the same instance startup time.
- (executions < prev_executions)“: if the number of executions in the final snapshot is lower than in the previous one, a similar event as described above could have happened to the cursor—a child cursor could be flushed out and reloaded. This isn’t a problem with one child cursor. However, we’ll see what can, and has, happened when there are more.
In all other cases the “executions – prev_executions” figure is reported, which means we simply get a delta between two snapshots. This is the most common scenario.
The above logic works until there’s a scenario like this:
- There’s more than one child cursor for a SQL_ID.
- One (or more), but not all of the child cursors get flushed out the library cache. If there are no new executions of the observed SQL, this effectively reduces the number of executions summed over all of the children.
And this is exactly what happened on the database I was analyzing. Checking the contents of stats$sql_summary for this particular SQL, we can see that SNAP_ID=31454 reports 410 executions, however the next snapshot reports 409 executions:
SQL> select snap_id, executions from stats$sql_summary where sql_id = '1v5wcft51pmy1' order by snap_id; SNAP_ID EXECUTIONS ---------- ---------- 31451 407 31452 409 31453 410 31454 410 <-- 31455 409 <-- 31456 409 31457 410
Knowing how Statspack collects and reports data from stats$sql_summary, I checked the status in v$sql and there were various child cursors for this particular SQL:
SQL> SELECT child_number, old_hash_value, address, child_address, first_load_time, last_load_time, last_active_time, executions, invalidations, parse_calls FROM v$sql WHERE sql_id = '1v5wcft51pmy1' ORDER BY 1; CHILD_NUMBER OLD_HASH_VALUE ADDRESS CHILD_ADDRESS FIRST_LOAD_TIME LAST_LOAD_TIME LAST_ACTIVE_TIME EXECUTIONS INVALIDATIONS PARSE_CALLS ------------ -------------- ---------------- ---------------- ------------------- ------------------- -------------------- ---------- ------------- ----------- 0 1140771677 000000D13F779218 000000D139974CF8 2021-05-20/23:11:33 2021-09-01/07:21:49 01.sep.2021 21:13:29 29 0 339 1 1140771677 000000D13F779218 000000D139E94B40 2021-05-20/23:11:33 2021-09-01/21:41:37 02.sep.2021 08:03:50 7 0 85 4 1140771677 000000D13F779218 000000D139E0EC88 2021-05-20/23:11:33 2021-08-26/22:11:38 27.aug.2021 09:43:43 10 0 123 7 1140771677 000000D13F779218 000000D139C1B398 2021-05-20/23:11:33 2021-08-28/07:26:53 28.aug.2021 19:43:24 26 0 304 10 1140771677 000000D13F779218 000000D13FAA3E40 2021-05-20/23:11:33 2021-08-29/21:41:34 30.aug.2021 07:25:36 6 0 69 11 1140771677 000000D13F779218 000000D13838C840 2021-05-20/23:11:33 2021-08-30/08:38:21 30.aug.2021 21:44:38 29 0 317 12 1140771677 000000D13F779218 000000D137BDC5E0 2021-05-20/23:11:33 2021-08-31/08:28:23 31.aug.2021 21:13:29 29 0 338 7 rows selected.
My theory about why the execution count decreased from 410 to 409 in stats$sql_summary is that one or more of the child cursors was purged from the library cache between snapshots 31454 and 31455. When taking snapshot 31455, the “sum(executions)” over all child cursors returned 409 and that was stored in stats$sql_summary. A further indication is that the values in the CHILD_NUMBER column have some missing values, meaning child cursors were purged from the library cache. When generating the Statspack report, the “(executions < prev_executions)” condition became true, so the 409 executions were reported.
Note: The values from v$sql displayed above were not taken immediately after SNAP_ID=31455 was taken, so the “sum(executions)” can’t be directly compared. The important takeaway here is the number of child cursors and the missing values in the child_number column.
Reproducing the problem
The main motivation for reproducing the problem is to verify the above theory and perhaps even more importantly, check how AWR behaves in such situations.
The plan to reproduce the issue was relatively simple; create a parent cursor with many child cursors, flush only some of them from the library cache and check what Statspack and AWR report.
Purging only some of the cursors turned out as the challenging part. At first I tried those techniques:
- dbms_shared_pool.purge: As explained by Jonathan Lewis in this post https://jonathanlewis.wordpress.com/2019/12/20/purge_cursor/, dbms_shared_pool.purge purges all of the child cursors.
- Setting “_cursor_obsolete_threshold” to say, 100. When the threshold is reached, a new parent cursor with the same SQL_ID is generated, then when no session is using the child cursors under the “old” parent, it gets flushed out the library cache along with its child cursors. I tried to keep one child open in a session by pausing the process before completing all of the FETCH calls, but that didn’t help.
Eventually, I opted for an unscientific approach— generating space pressure on the shared pool by heavily hard parsing and waiting for some of my observed child cursors to be purged out automatically by Oracle.
Test execution and observation of results
The script I used to execute the whole test is available here and the full log of the test is here. (Note that if you intend to re-run it, it might need a bit of tweaking depending on the sizes of your SGA and shared pool. The test was executed on a Oracle 19.5 EE running on Linux x64. Also, because of the heavy hard parsing, you should run it in an isolated database.) Finally, the observed SQL was selected from a real table because I wanted to check how other statistics behave. Here’s what we could use to speed up the test:
qtext="select /*"$(date)"*/ 1 from dual"
Description of the most relevant steps in the test script:
- Generate a distinct SQL text used to generate many child cursors by including a comment with the current date
- “Color” the SQL using dbms_workload_repository.add_colored_sql in order to guarantee it’s captured by AWR
- Flush the shared pool and create an initial Statspack (SNAP_ID=571) and AWR (SNAP_ID=311) snapshot
- Generate 200 child cursors by changing the optimizer environment parameter “optimizer_index_cost_adj” to a different value for each new child. After the code runs, this is the status from v$sql:
CHILD_NUMBER EXECUTIONS ------------ ---------- 0 100 1 100 2 100 3 100 . . 197 100 198 100 199 100 200 rows selected. CNT SUM(EXECUTIONS) ---------- --------------- 200 20000
At this point there are 200 child cursors, each with 100 executions. Next, a Statspack (SNAP_ID=572) and AWR (SNAP_ID=312) snapshot is created.
5. Flood the shared pool by hard parsing in a loop. This is achieved by generating and executing unique SQLs in a loop. Exit the loop as soon as the number of child cursors for our observed SQL decreases. After that, check the number of child cursors in v$sql:
** Child cursor status from v$sql after the number of child cursors decreased: ** COUNT(*) SUM(EXECUTIONS) ---------- --------------- 41 4100
At this point, 159 child cursors were purged and there are only 41 left in the library cache.
6. Set optimizer_index_cost_adj=1000 (a value which wasn’t used yet) in order to force a new child cursor creation, and execute the observed SQL only once. The number of child cursors increases from 41 to 42 and the total number of executions increases by one to 4101:
** Child cursor status from v$sql: ** COUNT(*) SUM(EXECUTIONS) ---------- --------------- 42 4101
7. Generate another set of Statspack (SNAP_ID = 573) and AWR (SNAP_ID=313) snapshots.
Note: On some occasions, the act of creating Statspack and AWR snapshots purges some of the observed child cursors out of the library cache. We need to check the figures reported in the Statspack and AWR reports so they match with the above output, otherwise the test needs to be repeated.
Result analysis
Finally, we can verify the data captured in AWR and Statspack repositories.
Statspack
SNAP_ID EXECUTIONS ---------- ---------- 572 20000 573 4101
Excerpt from the Statspack report for snapshots 572-573:
CPU CPU per Elapsd Old Time (s) Executions Exec (s) %Total Time (s) Buffer Gets Hash Value ---------- ------------ ---------- ------ ---------- --------------- ---------- 16.28 4,101 0.00 366.4 16.32 6,393,462 1915833037 Module: SQL*Plus select /*Wed Sep 22 17:33:31 CEST 2021*/ max(t_1.object_id) from t1 t_1
Observation: Even though there was only one execution of our observed SQL between snapshots 572 and 573, Statspack reports 4101 executions. This is what most likely happened also on the production database, and we just reproduced it.
AWR
AWR doesn’t have the same problem as Statspack, as it correctly reports only one execution for SNAP_ID=313:
SNAP_ID NODE BEGIN_INTERVAL_TIME SQL_ID PLAN_HASH_VALUE EXECS ROWS_PROCESSED ROWS_PROCESSED_PER_EXEC AVG_ETIME AVG_LIO AVG_PIO ---------- ------ ------------------------------ ------------- --------------- ------------ -------------- ----------------------- ------------ -------------- ---------- 312 1 22-SEP-21 05.33.32.745 PM g96vqh5wyw7q9 3724264953 20,000 20000 1 .004 1,559.0 0 313 1 22-SEP-21 05.34.51.241 PM g96vqh5wyw7q9 1 1 1 .005 1,562.0 0
The code to capture AWR snapshots isn’t as easily accessible as Statspack’s. Suffice to say that when researching how AWR captures the snapshots while inserting into the WRH$_SQLSTAT table (which is the table underlying the DBA_HIST_SQLSTAT view), it does so by running an INSERT INTO SELECT from X$KGLCURSOR_CHILD_SQLIDPH, which stores data for each child cursor. Therefore, it’s likely that runtime statistics at the child cursor level are taken into account when generating the AWR snapshots.
All of the files generated during this test are uploaded to the GitHub repository.
Summary
In cases where only some of the child cursors are purged from library cache, Statspack can report an erroneously high number of executions for a particular SQL. AWR doesn’t seem to suffer from the same problems.
Final notes
Be aware that in case you try to re-run the test with a higher number of child cursors, you might get “strange” results in AWR. Below is what AWR reports when running the test with 500 child cursors (which is still below the _cursor_obsolete_threshold=1024 limit in 19c). Despite there were 500 child cursors with 50000 executions in v$sql, AWR didn’t capture their runtime statistics:
SQL> SELECT ss.snap_id, begin_interval_time, plan_hash_value, executions_delta, rows_processed_delta, buffer_gets_delta, disk_reads_delta, elapsed_time_delta FROM dba_hist_sqlstat s, dba_hist_snapshot ss WHERE sql_id = 'fs76gd8ck6gft' AND ss.snap_id = S.snap_id AND ss.instance_number = S.instance_number ORDER BY 1, 2, 3; SNAP_ID BEGIN_INTERVAL_TIME PLAN_HASH_VALUE EXECUTIONS_DELTA ROWS_PROCESSED_DELTA BUFFER_GETS_DELTA DISK_READS_DELTA ELAPSED_TIME_DELTA ---------- ---------------------------------------- --------------- ---------------- -------------------- ----------------- ---------------- ------------------ 341 24-SEP-21 01.53.41.341 PM 0 199321839 342 24-SEP-21 01.57.03.885 PM 0 6858
This behavior is described in My Oracle Support note: “Why “Executions” and “Elap per Exec(s)” are ‘Blank’ in AWR for Some SQL Statements (Doc ID 1522547.1)”:
“This is expected behavior. Since 10.2, “Executions” and “Elap per Exec(s)” statistics of high version count cursor(> 200 by default) will no longer be collected.”
Share this
You May Also Like
These Related Stories
No Comments Yet
Let us know what you think