I was recently looking at an issue where a large database server was running out of temp space. The issue had occurred in the past, so the goal was to identify what had been causing the system to run out of temp. Although ORA-1652 messages will appear in the database alert log (with query text) when a query runs out of temporary space, they only identify the query that was “unlucky” enough to be requesting space when the tablespace was full, not the sessions that were actually consuming the temp space.
A quick web search shows that most people seem to be querying current tables. Coskan Gundogar blogged about a historical monitoring script he wrote but only drilled down into a specific session and serial number.
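For reference, the usual current-state approach joins v$tempseg_usage to v$session. A sketch along these lines (column names as documented for 11gR2; untested here) shows who holds temp space right now:

```sql
-- Sketch: who is holding temp space at this moment?
-- v$tempseg_usage reports usage in blocks; dba_tablespaces supplies the block size.
SELECT s.sid,
       s.username,
       u.tablespace,
       u.segtype,
       u.sql_id,
       Round(u.blocks * t.block_size / 1024 / 1024) mb_used
FROM   v$tempseg_usage u
       JOIN v$session s ON u.session_addr = s.saddr
       JOIN dba_tablespaces t ON u.tablespace = t.tablespace_name
ORDER  BY u.blocks DESC;
```

The catch, of course, is that this only shows current usage: once the spike has passed, the evidence is gone, which is exactly the problem here.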
A quick look at AWR showed that v$tempseg_usage has no direct historical view. There is dba_hist_tbspc_space_usage information for the temp tablespace, but it only shows aggregate usage.
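To illustrate the aggregate-only limitation, a query along these lines (assuming the 11gR2 dba_hist_tbspc_space_usage columns; sizes are in database blocks, and the tablespace name 'TEMP' is a placeholder for your own) charts total temp usage per snapshot, with no session or SQL breakdown at all:

```sql
-- Sketch: aggregate temp tablespace usage per AWR snapshot.
-- tablespace_usedsize and tablespace_size are in blocks; rtime is the snapshot time.
SELECT u.rtime,
       u.tablespace_usedsize,
       u.tablespace_size
FROM   dba_hist_tbspc_space_usage u
       JOIN v$tablespace t ON u.tablespace_id = t.ts#
WHERE  t.name = 'TEMP'
ORDER  BY u.rtime;
```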
However, dba_hist_active_sess_history, the AWR samples of active session history data, has a new column in 11gR2: TEMP_SPACE_ALLOCATED. From the docs:
Amount of TEMP memory (in bytes) consumed by this session at the time this sample was taken.
Having temp usage in a table like ASH is actually much more useful than a v$tempseg_usage capture since it lets us drill down not only to the query, but also to the exact SQL plan and plan step where the temp space is being used.
Querying out the data requires a bit of elbow grease, though: there is no delta column as there is for the I/O and time model stats, so we need some fancy analytic functions to compute the delta values ourselves.
dba_hist_active_sess_history itself is a view composed of multiple joins on the back end, so it takes quite a while to query. I generally create a fixed copy of the table covering the timestamps I want to analyze, in order to speed up the queries later. And since this is an Exadata system, I’ll take advantage of hybrid columnar compression. Adjust the begin_interval_time range to match the date range you’re looking at.
CREATE TABLE marc_ash COMPRESS FOR QUERY LOW AS
  SELECT *
  FROM   dba_hist_active_sess_history
  WHERE  dbid = (SELECT dbid FROM v$database)
         AND snap_id IN (SELECT snap_id
                         FROM   dba_hist_snapshot
                         WHERE  dbid = (SELECT dbid FROM v$database)
                                AND begin_interval_time BETWEEN
                                      To_date('20-DEC-2012 05:00', 'dd-mon-yyyy hh24:mi')
                                      AND To_date('20-DEC-2012 10:00', 'dd-mon-yyyy hh24:mi'));
(Side note: I’m trying out an online SQL beautifier tool. Readers, do you find this format easier to read?)
Once the base table is created, I can query it:
(Thanks to Randolf Geist for pointing out a missing grouping for parallel execution)
SELECT *
FROM   (SELECT Rank() over (ORDER BY SUM(Nvl(temp_space_delta, 0)) DESC) position,
               sql_id,
               sql_plan_hash_value,
               sql_plan_operation,
               sql_plan_line_id,
               Count(DISTINCT sql_exec_id) total_execs,
               Trunc(SUM(Nvl(temp_space_delta, 0)) / 1024 / 1024)
                 || 'm' temp_usage
        FROM   (SELECT sql_exec_id,
                       sql_id,
                       sql_plan_hash_value,
                       sql_plan_operation,
                       sql_plan_line_id,
                       temp_space_allocated
                         - Nvl(Lag(temp_space_allocated, 1)
                                 over (PARTITION BY sql_exec_id, sql_id, session_id
                                       ORDER BY sample_id), 0) temp_space_delta
                FROM   marc_ash)
        GROUP  BY sql_id, sql_plan_operation, sql_plan_hash_value, sql_plan_line_id)
WHERE  position <= 10
ORDER  BY position;
A bit of explanation is in order: The core of the query selects from marc_ash using the lag() analytic function to get the previous temp space allocation for the same execution, SQL statement, and parallel query process, and subtracts it to get the difference in temp space usage for the current sample. I then take this result, plus some information about the SQL step executing, and group it by SQL ID, plan operation, and line ID. And finally, I sort the results, keeping the top 10, and apply some formatting to the result.
And as is usual with ASH, since I’m only looking at 10-second samples of data, the results don’t reflect 100% of what really happened. But the longer an execution step runs, the more accurate the end results will be.
Here’s a sample result (with a bit of cleanup of the column headers):
POSITION SQL_ID          SQL_PLAN_HASH  SQL_PLAN_OP  SQL_PLAN_LINE_ID  TOTAL_EXECS  TEMP_USAGE
-------- --------------- -------------- ------------ ----------------- ------------ ----------
       1 a9hrk7g2ahbtk       1062456949 HASH JOIN                   15            1    142607m
       2 b40ahcxu8nxdq       2741224596 HASH JOIN                   21            1    139959m
       3 b62gn3yqyau4k       3673174370 SORT                         1            4     92378m
       4 5rf3btrz44d6v       3879859951 SORT                         3            1     35128m
       5 5rf3btrz44d6v       3879859951 SORT                         6            1     34095m
       6 8ucjdxhxwz93g       2599798119 HASH                        15            1     10873m
       7 901mxasy6gu2j        801798936 SORT                         2           28      6268m
       8 4b7jatzjaf4z3        270793325 SORT                         3            3      5876m
       9 0h5cvsbs92cyp       1935762178 SORT                         1            3      4522m
      10 84z212am4j987       3445415588 SORT                         1            2      4182m
During this time period, there was high temp usage from the two plan steps of SQL ID 5rf3btrz44d6v. Even though it only ran once, it consumed close to 70G of temp space during this period. The benefit of using ASH data, though, is that I not only know which SQL statement is responsible, but I also have an idea of which plan steps are responsible: two sort operations.
We can plug the SQL ID and plan hash value from the output into dbms_xplan.display_awr to see the plan itself. The results show the full query text as well as the execution plan involved.
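If you haven’t used it before, the call looks roughly like this, with the SQL ID and plan hash value of the top consumer substituted in:

```sql
-- Pull the AWR-captured plan for the heavy statement identified above.
SELECT *
FROM   TABLE(dbms_xplan.display_awr('5rf3btrz44d6v', 3879859951, NULL, 'ALL'));
```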
A short snippet of the plan:
------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |          |       |       |       |   100K(100)|          |       |       |        |      |            |
|   1 |  PX COORDINATOR                           |          |       |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (ORDER)                      | :TQ10009 |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,09 | P->S | QC (ORDER) |
|   3 |    SORT ORDER BY                          |          |    76M|    18G|    21G|   100K  (1)| 00:20:04 |       |       |  Q1,09 | PCWP |            |
|   4 |     PX RECEIVE                            |          |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,09 | PCWP |            |
|   5 |      PX SEND RANGE                        | :TQ10008 |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,08 | P->P | RANGE      |
|   6 |       SORT GROUP BY                       |          |    76M|    18G|    21G|   100K  (1)| 00:20:04 |       |       |  Q1,08 | PCWP |            |
|   7 |        PX RECEIVE                         |          |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,08 | PCWP |            |
|   8 |         PX SEND HASH                      | :TQ10007 |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,07 | P->P | HASH       |
|   9 |          HASH GROUP BY                    |          |    76M|    18G|    21G|   100K  (1)| 00:20:04 |       |       |  Q1,07 | PCWP |            |
|  10 |           VIEW                            | VW_DAG_0 |    76M|    18G|       | 67638   (1)| 00:13:32 |       |       |  Q1,07 | PCWP |            |
|  11 |            HASH GROUP BY                  |          |    76M|    16G|    17G| 67638   (1)| 00:13:32 |       |       |  Q1,07 | PCWP |            |
|  12 |             HASH JOIN                     |          |    76M|    16G|       | 37552   (2)| 00:07:31 |       |       |  Q1,07 | PCWP |            |
|  13 |              BUFFER SORT                  |          |       |       |       |            |          |       |       |  Q1,07 | PCWC |            |
|  14 |               PX RECEIVE                  |          | 2709K |   67M |       |  5287   (1)| 00:01:04 |       |       |  Q1,07 | PCWP |            |
|  15 |                PX SEND HASH               | :TQ10004 | 2709K |   67M |       |  5287   (1)| 00:01:04 |       |       |        | S->P | HASH       |
|  16 |                 TABLE ACCESS STORAGE FULL | TAB1     | 2709K |   67M |       |  5287   (1)| 00:01:04 |       |       |        |      |            |
A very quick look at this execution plan seems to indicate that it’s returning an absolutely massive dataset to the database client and sorting it twice. It would benefit from more conditions to restrict results to what the client is actually looking for.
Any comments, readers? How are you monitoring for temp space usage? Anyone using resource management to control temp usage?