Case Study: DBA under pressure or performance troubleshooting on Friday 13

Jul 22, 2012 / By Yury Velikanov

Tags:

Friday 13 of June 2012 I was oncall  Oracle DBA. I got to work on a high priority client’s request. A financial team of one of my team’s clients was executing final steps of a year closing process in their ERP production environment and had a problem. The step that I received a complain about took 2 hours and still didn’t complete. I was told that last year the same step completed in 20 minutes. Looking back I think it may be beneficial for someone of you to read how the issue has been troubleshooted and resolved. It may give you a chance to learn something or share some better diagnosing techniques with me and others.  Please do not hesitate to use comments section of this blog post.

Pressure

I should mention that the whole troubleshooting story from the first call to the final resolution took ~3 hours. I bet I spent 2 hours out of 3 on communications. There have been 4 different persons (including 2 managers) I had to discussed and inform them on the investigation progress. One of the managers was on a conference call with several other senior people from client’s financial team. At some point in time the manager was asking for a status update on 2-5 minute intervals. At this stage you should get some feeling of  an amount of pressure I have got working on the incident. This was one of those cases where it doesn’t matter how technically correct/beautiful/precise your actions/solutions are. The only thing that matters is how quickly you resolve the issue.

Initial Call

It all started with a page from the client. A person asked to call him to discuss an issue. From the call I learned the following

  • there is a batch job (EBS – Concurrent Request) executing in the production environment
  • the last time the job executed successfully a year ago (read = no execution history, should start the troubleshooting from scratch)
  • Last time it took 20 minutes to execute
  • This time it was running 2+ hours and still didn’t complete
  • As usual in such cases first of all I was asked to check if the job ins’t blocked by anything, if there is enough temp space etc.

I promised to have an initial look and get back to the person with the case assessment. It wasn’t too difficult to find a database session associated with the Concurrent Request (I used the run_cr.sql script from my personal SQL library).

Troubleshooting

Next thing that I typically check in such cases is SEQ# and some others fields from V$SESSION view to get some initial idea what we are dealing with. I have got the following initial information:

SQL> select SEQ#, SQL_ID, PREV_SQL_ID, MODULE, LAST_CALL_ET, EVENT, P1, WAIT_TIME, SECONDS_IN_WAIT, STATE, SQL_TRACE  from v$session where sid=2542;

           SEQ# SQL_ID        PREV_SQL_ID   MODULE                                                              LAST_CALL_ET
--------------- ------------- ------------- ---------------------------------------------------------------- ---------------
EVENT                                                                         P1       WAIT_TIME SECONDS_IN_WAIT STATE               SQL_TRAC
---------------------------------------------------------------- --------------- --------------- --------------- ------------------- --------
           3814               48tu5cy75d7z4 UPGLLEZL                                                                    9106
latch: cache buffers chains                                          17408691056              -1               5 WAITED SHORT TIME   DISABLED

1 row selected.

SQL> /

           SEQ# SQL_ID        PREV_SQL_ID   MODULE                                                              LAST_CALL_ET
--------------- ------------- ------------- ---------------------------------------------------------------- ---------------
EVENT                                                                         P1       WAIT_TIME SECONDS_IN_WAIT STATE               SQL_TRAC
---------------------------------------------------------------- --------------- --------------- --------------- ------------------- --------
           3814               48tu5cy75d7z4 UPGLLEZL                                                                    9203
latch: cache buffers chains                                          17408691056              -1             102 WAITED SHORT TIME   DISABLED

1 row selected.

notes: No diagnostic pack licence was available in this environment. During the research activities I had a quick check with “Beginning Performance Tuning” article from Arup Nanda (Tweeter, Blog) to quickly validate my understanding of V$SESSION view’s fields.

We can learn quite a bit from those 2 lines above. Some of you would notice that there was a 97 seconds interval in between those 2 V$SESSION snapshots (LAST_CALL_ET 9203 – 9106). The session was executing a single of call for 2h and 33 minutes (LAST_CALL_ET 9203 seconds). It is working (running on CPU) and is not waiting on anything. The session didn’t call any other SQL in between those to V$SESSION snapshots (SEQ# is the same). However there is one unusual thing in this output. There is no value provided for SQL_ID field (currently executing SQL). What was the session executing? Was the session executing an SQL? It it was what is the SQL?

How to find the current SQL if SQL_ID is null?

The very first idea that I have tried out was to switch on 10046 event (SQL trace) and see if we can figure out something from there. This is that I found in a trace file:

*** 2012-07-12 20:39:49.065
WAIT #47325300207976: nam='db file sequential read' ela= 9835 file#=156 block#=930982 blocks=1 obj#=393791 tim=1342139989065730
WAIT #47325300207976: nam='db file sequential read' ela= 621 file#=156 block#=930983 blocks=1 obj#=393791 tim=1342139989066806
WAIT #47325300207976: nam='db file sequential read' ela= 24 file#=156 block#=930984 blocks=1 obj#=393791 tim=1342139989067076

*** 2012-07-12 20:40:30.282
WAIT #47325300207976: nam='latch: cache buffers chains' ela= 84 address=17411387456 number=155 tries=0 obj#=393791 tim=1342140030282022

*** 2012-07-12 20:41:32.846
WAIT #47325300207976: nam='db file sequential read' ela= 6258 file#=156 block#=930869 blocks=1 obj#=393791 tim=1342140092846169

*** 2012-07-12 20:44:50.771
WAIT #47325300207976: nam='db file sequential read' ela= 6512 file#=146 block#=1023949 blocks=1 obj#=393791 tim=1342140290771848

*** 2012-07-12 20:44:51.785
WAIT #47325300207976: nam='db file sequential read' ela= 5782 file#=156 block#=931047 blocks=1 obj#=393791 tim=1342140291785475

*** 2012-07-12 20:45:49.957
WAIT #47325300207976: nam='db file sequential read' ela= 6098 file#=156 block#=930851 blocks=1 obj#=393791 tim=1342140349957004
WAIT #47325300207976: nam='db file sequential read' ela= 3614 file#=156 block#=930853 blocks=1 obj#=393791 tim=1342140349961811

As you see it didn’t give me any clue on what SQL the session was executing. Let me ask you, what bits of useful information we can get from in the output above? First of all there were relatively rare and short wait events (tim, ela) . Any single wait took a tiny fraction of time (max ela = 9835 = 9ms). file#=156 for all db file sequential read events is the same. I found that the file associated with file#=156 is an index file using the SQL bellow. The other obj#=393791 parameter (this was an 11.2.0.2 database) pointed to one of the DB object the session was working with.

SQL> select name from v$datafile where file#=156;

NAME
--------------------------------------------------------------------------------
/u01/.path./APPS_TS_TX_IDX19.dbf
SQL> select owner, object_name, object_type from dba_objects where object_id=393791;

OWNER
------------------------------
OBJECT_NAME
--------------------------------------------------------------------------------
OBJECT_TYPE
-------------------
GL
GL_CODE_COMBINATIONS_N7
INDEX

At this stage everything pointed that the session most probably was running a single SQL (it using a lot of CPU, time to time reads data from an INDEX’s blocks on disk, etc). However for one reason or another Oracle failed to show the current SQL under V$SESSION view. At that stage I was running out of ideas (keep in mind that I was under pressure). What else can I quickly try out without an additional research? The client was demanding results and I decided to use a “Phone A Friend” option (c) “Who Wants to Be a Millionaire?” :) At Pythian we have a very good option to escalate a problem to a “sword” team’s representatives in case of emergencies. At this time I was calling Gleb Otochkin (Twitter, Blog). Gleb is a Cluster Technical Lead at Pythian(one position below STO). He is very knowledgeable and an easy to talk to person. We had just a 3 minutes conversation and it was enough to generate a few good ideas what to try next.

Have a look on the following output:

SQL> set lines 180 pages 1000 numwidth 15
select a.sql_id, a.executions, a.CPU_TIME, a.DISK_READS, a.BUFFER_GETS from V$SQLAREA a
where a.sql_id in (select o.sql_id from V$OPEN_CURSOR o where o.sid=2542) order by a.BUFFER_GETS;
SQL>   2
SQL_ID          EXECUTIONS        CPU_TIME   DISK_READS   BUFFER_GETS
------------- ------------ --------------- ------------ -------------
3sypbzc2t2pg8          338            9998            0             0
...
48tu5cy75d7z4           28         4489318        30139        420631
ggvskj3ku51s7        59872         4119356           86        538879
a9811wy55zu3u       170601         6818972         1731        703579
fmfdkztk8vx2        502530        20736841         3458       1926394
2tqd4wxg82v80      1921256       519361033         1352     125590226
8q6xn4yqwpr2u            1     14930380237         3208    1897893773

35 rows selected.

The output pointed me to the SQL with SQL_ID=8q6xn4yqwpr2u. I would call this method an “educated guess”. The V$SQLAREA and V$OPEN_CURSOR views join doesn’t point you to the SQL that the session is executing at the moment. There may be others sessions that execute the same SQLs from the list provided. However in this particular case I was lucky. The output clearly indicated the SQL that had just 1 execution, utilized a lot of CPU resources and read though 1897893773*8/1024/1024/1024 = 14TB of data! In addition I have executed the SQL several times and noticed that the CPU utilization and logical reads statistics steadily increased.

The mystery resolved

The rest was more or less straight forward. I found an execution plan for the SQL and DB objects involved.

SQL> select * from table(dbms_xplan.DISPLAY_CURSOR('8q6xn4yqwpr2u'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  8q6xn4yqwpr2u, child number 0
-------------------------------------
select int.rowid  ,  decode(int.SEGMENT1|| int.SEGMENT2||
int.SEGMENT3|| int.SEGMENT4|| int.SEGMENT5|| int.SEGMENT6||
int.SEGMENT7 , '',   decode (ccid_cc.code_combination_id, NULL, NULL,
replace(ccid_cc.SEGMENT4,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT1,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT2,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT3,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT5,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT6,'-',' ') || '-' ||
replace(ccid_cc.SEGMENT7,'-',' ')  ), replace(int.SEGMENT4,'-',' ') ||
'-' || replace(int.SEGMENT1,'-',' ') || '-' ||
replace(int.SEGMENT2,'-',' ') || '-' || replace(int.SEGMENT3,'-',' ')
|| '-' || replace(int.SEGMENT5,'-',' ') || '-' ||
replace(int.SEGMENT6,'-',' ') || '-' || replace(int.SEGMENT7,'-',' ') )
flexfield , nvl(flex_cc.code_combination_id,
nvl(int.code_combination_id, -4))  , decode(int.SEGMENT1||
int.SEGMENT2|| int.SEGMENT3|| int.SEGMENT4|| int.SEGMENT5||
int.SEGMENT6|| int.SEGMENT7 , '', decode(ccid_cc.code_combination_id,
null

Plan hash value: 1344705414

----------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                         |       |       |     5 (100)|          |
|   1 |  SORT ORDER BY                 |                         |     1 |   171 |     5  (20)| 00:00:01 |
|   2 |   NESTED LOOPS OUTER           |                         |     1 |   171 |     4   (0)| 00:00:01 |
|   3 |    NESTED LOOPS OUTER          |                         |     1 |   112 |     2   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS BY INDEX ROWID| GL_INTERFACE            |     1 |    53 |     1   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN          | GL_INTERFACE_N1         |     1 |       |     1   (0)| 00:00:01 |
|*  6 |     TABLE ACCESS BY INDEX ROWID| GL_CODE_COMBINATIONS    |     1 |    59 |     1   (0)| 00:00:01 |
|*  7 |      INDEX UNIQUE SCAN         | GL_CODE_COMBINATIONS_U1 |     1 |       |     1   (0)| 00:00:01 |
|*  8 |    TABLE ACCESS BY INDEX ROWID | GL_CODE_COMBINATIONS    |     1 |    59 |     2   (0)| 00:00:01 |
|*  9 |     INDEX RANGE SCAN           | GL_CODE_COMBINATIONS_N7 |   417 |       |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter("INT"."STATUS"<>'PROCESSED')
   5 - access("INT"."USER_JE_SOURCE_NAME"='Invest' AND "INT"."SET_OF_BOOKS_ID"=1 AND "GROUP_ID"=2)
   6 - filter(("CCID_CC"."TEMPLATE_ID" IS NULL AND "CCID_CC"."CHART_OF_ACCOUNTS_ID"=101))
   7 - access("CCID_CC"."CODE_COMBINATION_ID"="INT"."CODE_COMBINATION_ID")
   8 - filter(("FLEX_CC"."TEMPLATE_ID" IS NULL AND "FLEX_CC"."CHART_OF_ACCOUNTS_ID"=101 AND
              "FLEX_CC"."SEGMENT1"="INT"."SEGMENT1" AND "FLEX_CC"."SEGMENT2"="INT"."SEGMENT2" AND
              "FLEX_CC"."SEGMENT3"="INT"."SEGMENT3" AND "FLEX_CC"."SEGMENT4"="INT"."SEGMENT4" AND
              "FLEX_CC"."SEGMENT5"="INT"."SEGMENT5" AND "FLEX_CC"."SEGMENT6"="INT"."SEGMENT6"))
   9 - access("FLEX_CC"."SEGMENT7"="INT"."SEGMENT7")

52 rows selected.

SQL>

The SQL have been based on two tables: GL_INTERFACE (31MB big) and GL_CODE_COMBINATIONS (506MB). Did you notice the size of the tables and the amount of LIO the SQL read though? It seems a bit funny how SQL based on tables with a total size 537MB can read though 15TB of data. Obviously the nested loops wasn’t the best way to join the tables in this case. The very fist thing we have tried is to gather statistics on the GL_INTERFACE table.

SQL> begin sys.dbms_stats.create_stat_table(ownname=>'SYSTEM',stattab=>'cr884406stats',tblspace=>'users'); end;
/
  2
PL/SQL procedure successfully completed.

SQL> begin

DBMS_STATS.export_table_stats(ownname=>’GL’,tabname=>’GL_INTERFACE’,stattab=>’cr884406stats’,statid=>’y01′,cascade=>true,statown=>’SYSTEM’);
end;
/
2 3 4
PL/SQL procedure successfully completed.

SQL> alter session set workarea_size_policy=manual;
alter session set sort_area_size=1000000000;

Session altered.

SQL>
Session altered.

SQL> exec dbms_stats.gather_table_stats('GL',tabname=>'GL_INTERFACE',estimate_percent=>100,cascade=>TRUE);
/

PL/SQL procedure successfully completed.

SQL>
Session altered.

SQL> begin
DBMS_STATS.export_table_stats(ownname=>'GL',tabname=>'GL_INTERFACE',stattab=>'cr624406stats',statid=>'y02',cascade=>true,statown=>'SYSTEM');
end;
/
  2    3    4
PL/SQL procedure successfully completed.

SQL>

To be on the safe side I have backed up the table statistics to a separate table. I know that starting from 10G version DBMS_STATS package makes a backup of any statistic it overwriting. However I still took a precautionary measure, just in case if my activity would make the other SQLs’ execution worse. After the stats got updated the SQL changed it’s execution plan to the following and the total execution time reduced to 1.4 seconds :)

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 4195742536

---------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                         | 90953 |    21M| 18816   (2)| 00:03:46 |
|   1 |  SORT ORDER BY                |                         | 90953 |    21M| 18816   (2)| 00:03:46 |
|   2 |   NESTED LOOPS OUTER          |                         | 90953 |    21M| 18811   (2)| 00:03:46 |
|*  3 |    HASH JOIN OUTER            |                         | 90953 |    17M| 18810   (2)| 00:03:46 |
|*  4 |     TABLE ACCESS FULL         | GL_INTERFACE            | 90953 |    13M|  1062   (2)| 00:00:13 |
|*  5 |     TABLE ACCESS FULL         | GL_CODE_COMBINATIONS    |  2863K|   106M| 17730   (2)| 00:03:33 |
|*  6 |    TABLE ACCESS BY INDEX ROWID| GL_CODE_COMBINATIONS    |     1 |    45 |     1   (0)| 00:00:01 |
|*  7 |     INDEX UNIQUE SCAN         | GL_CODE_COMBINATIONS_U1 |     1 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("FLEX_CC"."SEGMENT1"(+)="INT"."SEGMENT1" AND
              "FLEX_CC"."SEGMENT2"(+)="INT"."SEGMENT2" AND "FLEX_CC"."SEGMENT3"(+)="INT"."SEGMENT3" AND
              "FLEX_CC"."SEGMENT4"(+)="INT"."SEGMENT4" AND "FLEX_CC"."SEGMENT5"(+)="INT"."SEGMENT5" AND
              "FLEX_CC"."SEGMENT6"(+)="INT"."SEGMENT6" AND "FLEX_CC"."SEGMENT7"(+)="INT"."SEGMENT7")
   4 - filter("INT"."SET_OF_BOOKS_ID"=1 AND "INT"."USER_JE_SOURCE_NAME"='Invest' AND
              "GROUP_ID"=2 AND "INT"."STATUS"<>'PROCESSED')
   5 - filter("FLEX_CC"."TEMPLATE_ID"(+) IS NULL AND "FLEX_CC"."CHART_OF_ACCOUNTS_ID"(+)=101)
   6 - filter("CCID_CC"."TEMPLATE_ID"(+) IS NULL AND "CCID_CC"."CHART_OF_ACCOUNTS_ID"(+)=101)
   7 - access("CCID_CC"."CODE_COMBINATION_ID"(+)="INT"."CODE_COMBINATION_ID")

27 rows selected.

SQL>

We killed the GL process and re-executed it. The process finished in less than 20 minutes and I never have heard back from the client :)

Summary

I hope that you learned something  from my Friday’s the 13th story. The following are several points I would like to highlight

  • It doesn’t matter how knowledgeable you think you are. Sometimes you just need to use a “Phone A Friend” option to get some more ideas quickly. Gleb Otochkin (TwitterBlog) made my day.
  • It always handy to have good references to performance analysis basics. The Beginning Performance Tuningarticle from Arup Nanda (Tweeter, Blog) is one to have in bookmarks :)
  • Sometimes you may need to be creative and use methods that from the first look may not be right for the occasion (e.g. V$SQLAREA and V$OPEN_CURSOR views join). The only matter during emergencies alike is a result and if the method helps to address the problem then who cares? We should use it. Right ? :)
  • I did a follow up in the problem and discuss it with another smart Oracle DBA from Pythian – Maris Elsins (Tweeter, Blog). He pointed out that “oradebug dump errorstack 1” is another method to keep in an Oracle DBA toolbox. It dumps session’s state including the current SQL.
  • Sometimes Friday’s the 13th stories don’t have a horrible end :)

9 comments on “Case Study: DBA under pressure or performance troubleshooting on Friday 13

  1. Joao Paulo Lopes on said:

    Cool article, learned a lot from it. Thank you :)

  2. GregG on said:

    Thanks Yury for sharing this .
    Regards
    GregG

  3. Paul Vallee on said:

    Great story Yury, thanks for sharing it.

    • Yury Velikanov on said:

      Wow! Folks,

      Thank you for your support. It keep me going.
      Honestly I had big doubts on publishing it. I was wondering if it would add any value in the format chosen. It looks like it does.

      Thanks Paul, GregG and Joao Paulo for your feedback,

      Yury

  4. Asif Momen on said:

    Nice post.

    Did you try querying V$SESSION_LONGOPS view?

  5. Navdeep on said:

    Nice post, i have faced issues like this related to stats, this post will help me investigate further down to tables instead of regenerating stats for the schema :)

  6. Jared on said:

    Nice job Yury.

    I must say that it is quite unreasonable for a client to request an update every 2-5 minutes.

    If they want a problem resolved, they need to allow for greater periods of concentration.

  7. Fahd Mirza on said:

    Snapper by Tanel Poder is extremely handy in such pressure spots. It has saved my day several times. Thanks Yury for this story, which truly is a true blog post.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

HTML tags are not allowed.