Case Study: DBA Under Pressure for Performance Troubleshooting on Friday 13

Jul 22, 2012 / By Yury Velikanov

Tags: , , ,

On Friday 13 of June 2012, I was an on-call Oracle DBA. I got to work on a high priority client’s request. One of our client’s financial team was executing final steps of a year closing process in their ERP production environment and had a problem. The step that I received a complain about took 2 hours and still didn’t complete. I was told that last year the same step completed in 20 minutes. Looking back, I think it may be beneficial for some of you to read how the issue was troubleshooted and resolved. It may give you a chance to learn something or share some better diagnosing techniques with me and others.  Please do not hesitate to use the comments section of this blog post.

Pressure

I should mention that the whole troubleshooting story from the first call to the final resolution took ~3 hours. I bet I spent 2 hours out of 3 on communications. There were 4 different persons (including 2 managers) I had to discuss with and inform on the investigation progress. One of the managers was on a conference call with several other senior people from our client’s financial team. At some point in time, the manager was asking for a status update on 2-5 minute intervals. At this stage, you should get an idea of the amount of pressure I had on me while working on the incident. This was one of those cases where it doesn’t matter how technically correct/beautiful/precise your actions or solutions are. The only thing that matters is how quickly you resolve the issue.

Initial Call

It all started with a page from the client. A person asked to call him to discuss an issue. From the call, I learned the following:

  • There is a batch job (EBS – Concurrent Request) executing in the production environment.
  • The last time the job executed successfully was a year ago (read = no execution history, should start the troubleshooting from scratch).
  • It previously took 20 minutes to execute.
  • At the time, it had been running 2+ hours and still didn’t complete.
  • As usual in such cases, I was asked to first check if the job wasn’t blocked by anything, if there was enough temp space, etc.

I promised to have an initial look and get back to the person with the case assessment. It wasn’t too difficult to find a database session associated with the Concurrent Request. (I used the run_cr.sql script from my personal SQL library.)

Troubleshooting

Next thing that I typically check in such cases is SEQ# and some others fields from V$SESSION view to get some initial idea of what we are dealing with. I got the following initial information:

SQL> select SEQ#, SQL_ID, PREV_SQL_ID, MODULE, LAST_CALL_ET, EVENT, P1, WAIT_TIME, SECONDS_IN_WAIT, STATE, SQL_TRACE  from v$session where sid=2542;

           SEQ# SQL_ID        PREV_SQL_ID   MODULE                                                              LAST_CALL_ET
--------------- ------------- ------------- ---------------------------------------------------------------- ---------------
EVENT                                                                         P1       WAIT_TIME SECONDS_IN_WAIT STATE               SQL_TRAC
---------------------------------------------------------------- --------------- --------------- --------------- ------------------- --------
           3814               48tu5cy75d7z4 UPGLLEZL                                                                    9106
latch: cache buffers chains                                          17408691056              -1               5 WAITED SHORT TIME   DISABLED

1 row selected.

SQL> /

           SEQ# SQL_ID        PREV_SQL_ID   MODULE                                                              LAST_CALL_ET
--------------- ------------- ------------- ---------------------------------------------------------------- ---------------
EVENT                                                                         P1       WAIT_TIME SECONDS_IN_WAIT STATE               SQL_TRAC
---------------------------------------------------------------- --------------- --------------- --------------- ------------------- --------
           3814               48tu5cy75d7z4 UPGLLEZL                                                                    9203
latch: cache buffers chains                                          17408691056              -1             102 WAITED SHORT TIME   DISABLED

1 row selected.

Notes: No diagnostic pack licence was available in this environment. During the research activities, I had a quick check with “Beginning Performance Tuning” article from Arup Nanda (Tweeter, Blog) to quickly validate my understanding of V$SESSION view’s fields.

We can learn quite a bit from the two lines above. Some of you would notice that there was a 97 second interval between the two V$SESSION snapshots (LAST_CALL_ET 9203 – 9106). The session was executing a single of call for 2h and 33 minutes (LAST_CALL_ET 9203 seconds). It was working (running on CPU) and not waiting on anything, and the session didn’t call any other SQL between the V$SESSION snapshots (SEQ# is the same). However, there was one unusual thing in this output: No value was provided for the SQL_ID field (currently executing SQL). What was the session executing? Was the session executing an SQL? If it was, what’s the SQL?

How to find the current SQL if SQL_ID is null

The first idea I have tried out was switching on 10046 event (SQL trace) and see if I can figure out something from there. This is that I found in a trace file:

*** 2012-07-12 20:39:49.065
WAIT #47325300207976: nam='db file sequential read' ela= 9835 file#=156 block#=930982 blocks=1 obj#=393791 tim=1342139989065730
WAIT #47325300207976: nam='db file sequential read' ela= 621 file#=156 block#=930983 blocks=1 obj#=393791 tim=1342139989066806
WAIT #47325300207976: nam='db file sequential read' ela= 24 file#=156 block#=930984 blocks=1 obj#=393791 tim=1342139989067076

*** 2012-07-12 20:40:30.282
WAIT #47325300207976: nam='latch: cache buffers chains' ela= 84 address=17411387456 number=155 tries=0 obj#=393791 tim=1342140030282022

*** 2012-07-12 20:41:32.846
WAIT #47325300207976: nam='db file sequential read' ela= 6258 file#=156 block#=930869 blocks=1 obj#=393791 tim=1342140092846169

*** 2012-07-12 20:44:50.771
WAIT #47325300207976: nam='db file sequential read' ela= 6512 file#=146 block#=1023949 blocks=1 obj#=393791 tim=1342140290771848

*** 2012-07-12 20:44:51.785
WAIT #47325300207976: nam='db file sequential read' ela= 5782 file#=156 block#=931047 blocks=1 obj#=393791 tim=1342140291785475

*** 2012-07-12 20:45:49.957
WAIT #47325300207976: nam='db file sequential read' ela= 6098 file#=156 block#=930851 blocks=1 obj#=393791 tim=1342140349957004
WAIT #47325300207976: nam='db file sequential read' ela= 3614 file#=156 block#=930853 blocks=1 obj#=393791 tim=1342140349961811

As you see, it didn’t give me any clue on what SQL the session was executing. Let me ask you: What bits of useful information can we get from the output above? First of all, there were relatively rare and short wait events (tim, ela). Any single wait took a tiny fraction of time (max ela = 9835 = 9ms). File#=156 for all db file sequential read events was the same. I found that the file associated with file#=156 was an index file using the SQL bellow and that the other obj#=393791 parameter (it was an 11.2.0.2 database) pointed to one of the DB object the session was working with.

SQL> select name from v$datafile where file#=156;

NAME
--------------------------------------------------------------------------------
/u01/.path./APPS_TS_TX_IDX19.dbf
SQL> select owner, object_name, object_type from dba_objects where object_id=393791;

OWNER
------------------------------
OBJECT_NAME
--------------------------------------------------------------------------------
OBJECT_TYPE
-------------------
GL
GL_CODE_COMBINATIONS_N7
INDEX

At this stage, everything indicated that the session was probably running a single SQL. (It was using a lot of CPU and from time-to-time read data from an INDEX’s blocks on disk, etc.) However, for one reason or another, Oracle failed to show the current SQL under V$SESSION view.

I was running out of ideas. (Keep in mind that I was under pressure.) What else can I quickly try out without any additional research? The client was demanding results, so I decided to use a “Phone A Friend” option (c) “Who Wants to Be a Millionaire?” :) At Pythian, we have a very good option to escalate a problem to a “sword” team’s representatives in case of emergencies. At that time, I called Gleb Otochkin (Twitter, Blog). Gleb is a Cluster Technical Lead at Pythian (one position below STO). He is very knowledgeable and easy to talk to. We had just a 3-minute conversation, and it was enough to generate a few good ideas.

Have a look at the following output:

SQL> set lines 180 pages 1000 numwidth 15
select a.sql_id, a.executions, a.CPU_TIME, a.DISK_READS, a.BUFFER_GETS from V$SQLAREA a
where a.sql_id in (select o.sql_id from V$OPEN_CURSOR o where o.sid=2542) order by a.BUFFER_GETS;
SQL>   2
SQL_ID          EXECUTIONS        CPU_TIME   DISK_READS   BUFFER_GETS
------------- ------------ --------------- ------------ -------------
3sypbzc2t2pg8          338            9998            0             0
...
48tu5cy75d7z4           28         4489318        30139        420631
ggvskj3ku51s7        59872         4119356           86        538879
a9811wy55zu3u       170601         6818972         1731        703579
fmfdkztk8vx2        502530        20736841         3458       1926394
2tqd4wxg82v80      1921256       519361033         1352     125590226
8q6xn4yqwpr2u            1     14930380237         3208    1897893773

35 rows selected.

The output pointed me to the SQL with SQL_ID=8q6xn4yqwpr2u. I would call this method an “educated guess”. The V$SQLAREA and V$OPEN_CURSOR views did not point to the SQL that the session was executing at the moment. There may have been others sessions that executed the same SQLs from the list provided. However, in this particular case, I was lucky. The output clearly indicated the SQL that had just 1 execution, utilized a lot of CPU resources, and read though 1897893773*8/1024/1024/1024 = 14TB of data! In addition, I executed the SQL several times and noticed that the CPU utilization and logical read statistics steadily increased.

The mystery resolved

The rest was more or less straight forward. I found an execution plan for the SQL and DB objects involved.

SQL> select * from table(dbms_xplan.DISPLAY_CURSOR('8q6xn4yqwpr2u'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  8q6xn4yqwpr2u, child number 0
-------------------------------------
select int.rowid  ,  decode(int.SEGMENT1|| int.SEGMENT2||
int.SEGMENT3|| int.SEGMENT4|| int.SEGMENT5|| int.SEGMENT6||
int.SEGMENT7 , '',   decode (ccid_cc.code_combination_id, NULL, NULL,
replace(ccid_cc.SEGMENT4,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT1,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT2,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT3,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT5,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT6,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT7,'-',' ')  ), replace(int.SEGMENT4,'-',' ') ||
'-' || replace(int.SEGMENT1,'-',' ') || '-' ||
replace(int.SEGMENT2,'-',' ') || '-' || replace(int.SEGMENT3,'-',' ')
|| '-' || replace(int.SEGMENT5,'-',' ') || '-' ||
replace(int.SEGMENT6,'-',' ') || '-' || replace(int.SEGMENT7,'-',' ') )
flexfield , nvl(flex_cc.code_combination_id,
nvl(int.code_combination_id, -4))  , decode(int.SEGMENT1||
int.SEGMENT2|| int.SEGMENT3|| int.SEGMENT4|| int.SEGMENT5||
int.SEGMENT6|| int.SEGMENT7 , '', decode(ccid_cc.code_combination_id,
null

Plan hash value: 1344705414

----------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                         |       |       |     5 (100)|          |
|   1 |  SORT ORDER BY                 |                         |     1 |   171 |     5  (20)| 00:00:01 |
|   2 |   NESTED LOOPS OUTER           |                         |     1 |   171 |     4   (0)| 00:00:01 |
|   3 |    NESTED LOOPS OUTER          |                         |     1 |   112 |     2   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS BY INDEX ROWID| GL_INTERFACE            |     1 |    53 |     1   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN          | GL_INTERFACE_N1         |     1 |       |     1   (0)| 00:00:01 |
|*  6 |     TABLE ACCESS BY INDEX ROWID| GL_CODE_COMBINATIONS    |     1 |    59 |     1   (0)| 00:00:01 |
|*  7 |      INDEX UNIQUE SCAN         | GL_CODE_COMBINATIONS_U1 |     1 |       |     1   (0)| 00:00:01 |
|*  8 |    TABLE ACCESS BY INDEX ROWID | GL_CODE_COMBINATIONS    |     1 |    59 |     2   (0)| 00:00:01 |
|*  9 |     INDEX RANGE SCAN           | GL_CODE_COMBINATIONS_N7 |   417 |       |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter("INT"."STATUS"<>'PROCESSED')
   5 - access("INT"."USER_JE_SOURCE_NAME"='Invest' AND "INT"."SET_OF_BOOKS_ID"=1 AND "GROUP_ID"=2)
   6 - filter(("CCID_CC"."TEMPLATE_ID" IS NULL AND "CCID_CC"."CHART_OF_ACCOUNTS_ID"=101))
   7 - access("CCID_CC"."CODE_COMBINATION_ID"="INT"."CODE_COMBINATION_ID")
   8 - filter(("FLEX_CC"."TEMPLATE_ID" IS NULL AND "FLEX_CC"."CHART_OF_ACCOUNTS_ID"=101 AND
              "FLEX_CC"."SEGMENT1"="INT"."SEGMENT1" AND "FLEX_CC"."SEGMENT2"="INT"."SEGMENT2" AND
              "FLEX_CC"."SEGMENT3"="INT"."SEGMENT3" AND "FLEX_CC"."SEGMENT4"="INT"."SEGMENT4" AND
              "FLEX_CC"."SEGMENT5"="INT"."SEGMENT5" AND "FLEX_CC"."SEGMENT6"="INT"."SEGMENT6"))
   9 - access("FLEX_CC"."SEGMENT7"="INT"."SEGMENT7")

52 rows selected.

SQL>

The SQL was based on two tables: GL_INTERFACE (31MB big) and GL_CODE_COMBINATIONS (506MB). Did you notice the size of the tables and the amount of LIO the SQL read though? It seems a bit funny how SQL based on tables with a total size of 537MB can read though 15TB of data. Obviously, the nested loops weren’t the best way to join the tables in this case. The first thing I tried is to gather statistics on the GL_INTERFACE table.

SQL> begin sys.dbms_stats.create_stat_table(ownname=>'SYSTEM',stattab=>'cr884406stats',tblspace=>'users'); end;
/
  2
PL/SQL procedure successfully completed.

SQL> begin DBMS_STATS.export_table_stats(ownname=>'GL',tabname=>'GL_INTERFACE',stattab=>'cr884406stats',statid=>'y01',cascade=>true,statown=>'SYSTEM');
end;
/
2 3 4
PL/SQL procedure successfully completed.
SQL> alter session set workarea_size_policy=manual;
alter session set sort_area_size=1000000000;

Session altered.

SQL>
Session altered.

SQL> exec dbms_stats.gather_table_stats('GL',tabname=>'GL_INTERFACE',estimate_percent=>100,cascade=>TRUE);
/

PL/SQL procedure successfully completed.

SQL>
Session altered.

SQL> begin
DBMS_STATS.export_table_stats(ownname=>'GL',tabname=>'GL_INTERFACE',stattab=>'cr624406stats',statid=>'y02',cascade=>true,statown=>'SYSTEM');
end;
/
  2    3    4
PL/SQL procedure successfully completed.

SQL>

To be on the safe side, I backed up the table statistics to a separate table. I know that starting from version 10G, DBMS_STATS package makes a backup of any statistic it overwrites. However, I still took a precautionary measure, just in case my activity made the other SQLs’ execution worse. After the stats were updated, the SQL reduced its total execution time to 1.4 seconds and changed its execution plan to the following:

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 4195742536

---------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                         | 90953 |    21M| 18816   (2)| 00:03:46 |
|   1 |  SORT ORDER BY                |                         | 90953 |    21M| 18816   (2)| 00:03:46 |
|   2 |   NESTED LOOPS OUTER          |                         | 90953 |    21M| 18811   (2)| 00:03:46 |
|*  3 |    HASH JOIN OUTER            |                         | 90953 |    17M| 18810   (2)| 00:03:46 |
|*  4 |     TABLE ACCESS FULL         | GL_INTERFACE            | 90953 |    13M|  1062   (2)| 00:00:13 |
|*  5 |     TABLE ACCESS FULL         | GL_CODE_COMBINATIONS    |  2863K|   106M| 17730   (2)| 00:03:33 |
|*  6 |    TABLE ACCESS BY INDEX ROWID| GL_CODE_COMBINATIONS    |     1 |    45 |     1   (0)| 00:00:01 |
|*  7 |     INDEX UNIQUE SCAN         | GL_CODE_COMBINATIONS_U1 |     1 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("FLEX_CC"."SEGMENT1"(+)="INT"."SEGMENT1" AND
              "FLEX_CC"."SEGMENT2"(+)="INT"."SEGMENT2" AND "FLEX_CC"."SEGMENT3"(+)="INT"."SEGMENT3" AND
              "FLEX_CC"."SEGMENT4"(+)="INT"."SEGMENT4" AND "FLEX_CC"."SEGMENT5"(+)="INT"."SEGMENT5" AND
              "FLEX_CC"."SEGMENT6"(+)="INT"."SEGMENT6" AND "FLEX_CC"."SEGMENT7"(+)="INT"."SEGMENT7")
   4 - filter("INT"."SET_OF_BOOKS_ID"=1 AND "INT"."USER_JE_SOURCE_NAME"='Invest' AND
              "GROUP_ID"=2 AND "INT"."STATUS"<>'PROCESSED')
   5 - filter("FLEX_CC"."TEMPLATE_ID"(+) IS NULL AND "FLEX_CC"."CHART_OF_ACCOUNTS_ID"(+)=101)
   6 - filter("CCID_CC"."TEMPLATE_ID"(+) IS NULL AND "CCID_CC"."CHART_OF_ACCOUNTS_ID"(+)=101)
   7 - access("CCID_CC"."CODE_COMBINATION_ID"(+)="INT"."CODE_COMBINATION_ID")

27 rows selected.

SQL>

We killed the GL process and re-executed it. The process finished in less than 20 minutes, and I never heard back from the client. :)

Summary

I hope that you learned something  from my Friday the 13th story. The following are several points I would like to highlight.

  • It doesn’t matter how knowledgeable you think you are. Sometimes you just need to use the “Phone A Friend” option to get some more ideas quickly. Gleb Otochkin (TwitterBlog) made my day.
  • It’s always handy to have good references for performance analysis basics. The Beginning Performance Tuningarticle from Arup Nanda (TwitterBlog) is one to have in your bookmarks. :)
  • You may need to be creative and use methods that may not seem right for the occasion at first glance (e.g. V$SQLAREA and V$OPEN_CURSOR views join). All that matters during emergencies is a result, and if the method helps to address the problem then who cares? We should use it. Right? :)
  • I did a follow up with the problem and discussed it with another smart Oracle DBA from Pythian – Maris Elsins (Twitter, Blog). He pointed out that “oradebug dump errorstack 1” is another method to keep in an Oracle DBA toolbox. It dumps the session’s state, including the current SQL.
  • Friday the 13th stories don’t always end horribly. :)

10 Responses to “Case Study: DBA Under Pressure for Performance Troubleshooting on Friday 13”

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>