Sleuthing for Temporary Space Spendthrifts

5 min read
Dec 20, 2012 12:00:00 AM

I was recently looking at an issue where a large database server was running out of temp space. The issue had occurred in the past, so the goal was to identify what had been causing the system to run out of temp. Although ORA-1652 messages will appear in the database alert log (with query text) when a query runs out of temporary space, they only identify the query that was “unlucky” enough to be requesting space when the tablespace was full, not the sessions that were actually consuming the temp space.

The Hunt for Historical Temp Data

A quick web search shows that most people seem to be querying current tables. Coskan Gundogar blogged about a historical monitoring script he wrote but only drilled down into a specific session and serial number.

Limitations of Standard Oracle Views

A quick look at AWR showed that v$tempseg_usage has no direct historical equivalent. There is dba_hist_tbspc_space_usage info for the temp tablespace, but it only shows aggregate usage per snapshot.
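For reference, here's a minimal sketch of what that aggregate history gives you. (The tablespace name 'TEMP' and the 8k block size are assumptions for illustration; to my understanding, tablespace_usedsize is expressed in database blocks, so adjust the conversion for your block size.)

SELECT su.snap_id,
       su.rtime,
       ROUND(su.tablespace_usedsize * 8192 / 1024 / 1024) used_mb  -- assumes 8k blocks
FROM   dba_hist_tbspc_space_usage su
JOIN   v$tablespace ts ON ts.ts# = su.tablespace_id
WHERE  ts.name = 'TEMP'
ORDER  BY su.snap_id;

Useful for spotting when the tablespace filled up, but it can't tell you who filled it.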

Leveraging ASH for Granular Monitoring

However, dba_hist_active_sess_history, the AWR samples of active session history data, has a new column in 11gR2: TEMP_SPACE_ALLOCATED. From the docs:

Amount of TEMP memory (in bytes) consumed by this session at the time this sample was taken.

Having temp usage in a table like ASH is actually much more useful than a v$tempseg_usage capture since it lets us drill down not only to the query, but also to the exact SQL plan and plan step where the temp space is being used.
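For recent activity, you don't even need AWR: the in-memory ASH view carries the same column. A quick sketch (11gR2+; the 5-minute window is arbitrary):

SELECT session_id,
       session_serial#,
       sql_id,
       ROUND(temp_space_allocated / 1024 / 1024) temp_mb
FROM   v$active_session_history
WHERE  sample_time > SYSDATE - 5 / 1440  -- last 5 minutes
       AND temp_space_allocated > 0
ORDER  BY temp_space_allocated DESC;

The rest of this post applies the same idea to the AWR-persisted samples, which survive instance restarts and cover a much longer window.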

Creating the Analysis Base Table

Querying out the data requires a bit of elbow grease, though: there is no delta column as there is for the I/O and time model stats, so we need some fancy analytic functions to compute the delta values ourselves.

dba_hist_active_sess_history itself is a view composed of multiple joins on the back end, so it takes quite a while to query. I generally create a fixed version of the table with the timestamp I want to analyze, in order to speed up the queries later. And since this is an Exadata system, I’ll take advantage of hybrid columnar compression. Adjust the begin_interval_time range to match the date range you’re looking at.

CREATE TABLE marc_ash COMPRESS FOR QUERY LOW AS
SELECT *
FROM   dba_hist_active_sess_history
WHERE  dbid = (SELECT dbid FROM v$database)
       AND snap_id IN (SELECT snap_id
                       FROM   dba_hist_snapshot
                       WHERE  dbid = (SELECT dbid FROM v$database)
                              AND begin_interval_time BETWEEN
                                  To_date('20-DEC-2012 05:00', 'dd-mon-yyyy hh24:mi')
                                  AND
                                  To_date('20-DEC-2012 10:00', 'dd-mon-yyyy hh24:mi'));

(Side note: I’m trying out an online SQL beautifier tool. Readers, do you find this format easier to read?)

Calculating Temp Usage Deltas

Once the base table is created, I can query it (thanks to Randolf Geist for pointing out a missing grouping for parallel execution):

SELECT *
FROM   (SELECT Rank()
                 over (
                   ORDER BY SUM(Nvl(temp_space_delta, 0)) DESC) position,
               sql_id,
               sql_plan_hash_value,
               sql_plan_operation,
               sql_plan_line_id,
               Count(DISTINCT sql_exec_id)                        total_execs,
               Trunc(SUM(Nvl(temp_space_delta, 0)) / 1024 / 1024)
               || 'm'                                             temp_usage
        FROM   (SELECT sql_exec_id,
                       sql_id,
                       sql_plan_hash_value,
                       sql_plan_operation,
                       sql_plan_line_id,
                       temp_space_allocated - Nvl(Lag(temp_space_allocated, 1)
                         over (
                           PARTITION BY sql_exec_id, sql_id, session_id
                           ORDER BY sample_id), 0) temp_space_delta
                FROM   marc_ash)
        GROUP  BY sql_id,
                  sql_plan_operation,
                  sql_plan_hash_value,
                  sql_plan_line_id)
WHERE  position <= 10
ORDER  BY position;

A bit of explanation is in order: The core of the query selects from marc_ash using the lag() analytic function to get the previous temp space allocation for the same execution, SQL statement, and parallel query process, and subtracts it to get the difference in temp space usage for the current sample. I then take this result, plus some information about the SQL step executing, and group it by SQL ID, plan operation, and line ID. And finally, I sort the results, keeping the top 10, and apply some formatting to the result.

Interpreting the Results

And as is usual with ASH: since I'm only looking at 10-second samples of data, the results won't reflect exactly what happened. The longer an execution step runs, though, the more accurate the end results will be.

Here’s a sample result (with a bit of cleanup of the column headers):

POSITION SQL_ID          SQL_PLAN_HASH   SQL_PLAN_OP     SQL_PLAN_LINE_ID TOTAL_EXECS     TEMP_USAGE
-------- --------------- --------------- --------------- ---------------- --------------- ----------
1        a9hrk7g2ahbtk   1062456949      HASH JOIN       15               1               142607m
2        b40ahcxu8nxdq   2741224596      HASH JOIN       21               1               139959m
3        b62gn3yqyau4k   3673174370      SORT            1                4               92378m
4        5rf3btrz44d6v   3879859951      SORT            3                1               35128m
5        5rf3btrz44d6v   3879859951      SORT            6                1               34095m
6        8ucjdxhxwz93g   2599798119      HASH            15               1               10873m
7        901mxasy6gu2j   801798936       SORT            2                28              6268m
8        4b7jatzjaf4z3   270793325       SORT            3                3               5876m
9        0h5cvsbs92cyp   1935762178      SORT            1                3               4522m
10       84z212am4j987   3445415588      SORT            1                2               4182m

Drill-Down into Problematic SQL

During this time period, there was high temp usage from the two plan steps of SQL ID 5rf3btrz44d6v. Even though it only ran once, it consumed close to 70G of temp space during this period. The benefit of using ASH data, though, is that I not only know which SQL statement is responsible, but I also have an idea of which plan steps are responsible: two sort operations. 
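To see when that statement ramped up its allocation, a quick follow-up against the same marc_ash copy shows temp growth over time (a sketch; summing across parallel query processes per sample):

SELECT sample_id,
       MIN(sample_time)                                 sample_time,
       ROUND(SUM(temp_space_allocated) / 1024 / 1024)   temp_mb
FROM   marc_ash
WHERE  sql_id = '5rf3btrz44d6v'
GROUP  BY sample_id
ORDER  BY sample_id;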

We can plug the SQL ID and plan hash value from the output into dbms_xplan.display_awr to see the plan itself. The results show the full query text as well as the execution plan involved.
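For example, for the statement identified above:

SELECT *
FROM   TABLE(dbms_xplan.display_awr('5rf3btrz44d6v', 3879859951));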

A short snippet of the plan:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                         | Name          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                  |               |       |       |       |   100K(100)|          |       |       |        |      |            |
|   1 |  PX COORDINATOR                                   |               |       |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (ORDER)                              | :TQ10009      |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,09 | P->S | QC (ORDER) |
|   3 |    SORT ORDER BY                                  |               |    76M|    18G|    21G|   100K  (1)| 00:20:04 |       |       |  Q1,09 | PCWP |            |
|   4 |     PX RECEIVE                                    |               |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,09 | PCWP |            |
|   5 |      PX SEND RANGE                                | :TQ10008      |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,08 | P->P | RANGE      |
|   6 |       SORT GROUP BY                               |               |    76M|    18G|    21G|   100K  (1)| 00:20:04 |       |       |  Q1,08 | PCWP |            |
|   7 |        PX RECEIVE                                 |               |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,08 | PCWP |            |
|   8 |         PX SEND HASH                              | :TQ10007      |    76M|    18G|       |   100K  (1)| 00:20:04 |       |       |  Q1,07 | P->P | HASH       |
|   9 |          HASH GROUP BY                            |               |    76M|    18G|    21G|   100K  (1)| 00:20:04 |       |       |  Q1,07 | PCWP |            |
|  10 |           VIEW                                    | VW_DAG_0      |    76M|    18G|       | 67638   (1)| 00:13:32 |       |       |  Q1,07 | PCWP |            |
|  11 |            HASH GROUP BY                          |               |    76M|    16G|    17G| 67638   (1)| 00:13:32 |       |       |  Q1,07 | PCWP |            |
|  12 |             HASH JOIN                             |               |    76M|    16G|       | 37552   (2)| 00:07:31 |       |       |  Q1,07 | PCWP |            |
|  13 |              BUFFER SORT                          |               |       |       |       |            |          |       |       |  Q1,07 | PCWC |            |
|  14 |               PX RECEIVE                          |               |  2709K|    67M|       |  5287   (1)| 00:01:04 |       |       |  Q1,07 | PCWP |            |
|  15 |                PX SEND HASH                       | :TQ10004      |  2709K|    67M|       |  5287   (1)| 00:01:04 |       |       |        | S->P | HASH       |
|  16 |                 TABLE ACCESS STORAGE FULL         | TAB1          |  2709K|    67M|       |  5287   (1)| 00:01:04 |       |       |        |      |            |

Conclusion and Recommendations

A very quick look at this execution plan suggests that it's returning an absolutely massive dataset to the database client and sorting it twice. It would benefit from additional filter predicates to restrict the results to what the client is actually looking for.

Any comments, readers? How are you monitoring for temp space usage? Anyone using resource management to control temp usage?
