Post-Mortem Analysis Tools: Systemstate Dump

Editor’s note: This “post-mortem analysis” series consists of four posts in total. Make sure you also check out:
- Post-Mortem Analysis Tools: ASH Dump
- Post-Mortem Analysis Tools: Hanganalyze
- Post-Mortem Analysis Tools: Using Preliminary Connection (publishing Thursday, April 22)
Now, onto today’s post:
Imagine this situation: The database is hanging and you can’t figure out what’s going on with the required timing. So to avoid affecting any SLA you decide to restart the database OR your leader/boss instructed you to do it OR you know the database is going to go down, anyway …
Every good DBA has faced this kind of situation at least once. Bottom line, after restarting everything was OK and the “problem” was solved. But now you’re being asked about RCA (root cause analysis—aka what caused this situation?). Let’s say the database was hanging, so no snap was closed for AWR (Automatic Workload Repository) and you lost the ASH (Active Session History) data due to restart. What can you do?
Well, there are a few tools that I’ll be covering in further posts. The first one and our subject for today is the systemstate dump.
Systemstate is basically composed of the process state for all processes in the instance (or instances) at the time it’s called. Through a systemstate it’s possible to identify enqueues, rowcache locks, mutexes, library cache pins and locks, latch free situations and other kinds of chains.
Systemstate is a good tool to add in an SR (systematic review), but it’s quite hard to habituate on reading/interpreting the file. To understand exactly how to read a systemstate the best recommendation I can make is, read the manual! The doc Reading and Understanding Systemstate Dumps (Doc ID 423153.1) has a very good explanation with examples; I’m not able to do it better.
I feel it’s necessary to share some information about the systemstate levels, though, as I had some difficulty finding it myself. SAFE HARBOR STATEMENT: It’s important to mention that Oracle does not recommend you to use “numeric events’’ without an SR (MOS), according to Note: 75713.1.
So, systemstate dump has the following levels:
Level | Content |
2 | dump (not including the lock element) |
10 | dump |
11 | dump + global cache of RAC |
256 | short stack (function stack) |
258 | 256 + 2 -> short stack + dump (not including the lock element) |
266 | 256 + 10 -> short stack + dump |
267 | 256 + 11 -> short stack + dump + global cache of RAC |
Levels 11 and 267 will dump the global cache and will generate a large trace file—under normal circumstances these are not recommended.
Generally, if there aren’t too many processes, 266 is the recommended level once it can dump out the processes’ functions stack. The dump can be used to analyze the process in detail. However, even to generate a short stack, if there are too many processes—e.g. 2000+—it may take more than 30 minutes.
In this case, you can generate a level 10 or level 258 dump. Level 258 will collect more than a level 10 short stack, but less than a regular level 10, including some lock element data.
If using a RAC system, please pay attention to Bug 11800959 – A SYSTEMSTATE dump with level> = 10 in RAC dumps huge BUSY GLOBAL CACHE ELEMENTS – can hang / crash instances (Doc ID 11800959.8). The bug is fixed in 11.2.0.3. For versions <= 11.2.0.2 of the RAC, when the system locks an element for a length of time, when executing level 10, 266 or 267 of systemstate dump, it can cause database hang or crash. It may be solved by using level 258.
To generate it in the local node:
oradebug setmypid; oradebug unlimit; oradebug dump systemstate 266 oradebug tracefile_name
OR, for all nodes in the cluster:
oradebug setmypid; oradebug unlimit; oradebug -g all dump systemstate 266 oradebug tracefile_name
An example of execution:
SQL> oradebug setmypid; Statement processed. SQL> oradebug unlimit; Statement processed. SQL> oradebug dump systemstate 266 Statement processed. SQL> oradebug tracefile_name /db/oracle/diag/rdbms/pythiandb/pythiandb/trace/pythiandb_ora_18256.trc
If you cannot create a connection to the database with SQL*Plus (even as SYSDBA), because of a hang situation, you can use a preliminary connection.
I’ll explain more about this in a further post. But if you need help ASAP, I’d recommend you read:
- How to Collect Systemstate Dumps When you Cannot Connect to Oracle (Doc ID 121779.1)
- Important Customer information about using Numeric Events (Doc ID 75713.1)
An example of a systemstate level 266 dumpfile:
*** 2016-06-15 16:59:00.180 Processing Oradebug command 'dump systemstate 266' =================================================== SYSTEM STATE (level=10, with short stacks) ------------ System global information: processes: base 0xbb5b6850, size 1000, cleanup 0xbb63b780 allocation: free sessions 0xbba6fb48, free calls (nil) control alloc errors: 0 (process), 0 (session), 0 (call) PMON latch cleanup depth: 0 seconds since PMON's last scan for dead processes: 39 system statistics: 0 OS CPU Qt wait time 727657760 Requests to/from client 7106305 logons cumulative 274 logons current 231675649 opened cursors cumulative 1737 opened cursors current 43388218 user commits 1682106 user rollbacks 835078962 user calls 752889391 recursive calls 12720247 recursive cpu usage 150 pinned cursors current 15100586388 session logical reads 0 session logical reads in local numa group 0 session logical reads in remote numa group [...] PROCESS 1: ---------------------------------------- SO: 0xbb63a6d0, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0xbb63a6d0, name=process, file=ksu.h LINE:12616, pg=0 (process) Oracle pid:1, ser:0, calls cur/top: (nil)/(nil) flags : (0x20) PSEUDO flags2: (0x0), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 0 0 0 last post received-location: No post last process to post me: none last post sent: 0 0 0 last post sent-location: No post last process posted by me: none (latch info) wait_event=0 bits=0 O/S info: user: , term: , ospid: (DEAD) OSD pid info: Unix process pid: 0, image: PSEUDO ---------------------------------------- SO: 0x6000c838, type: 5, owner: 0xbb63a6d0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=(nil), name=kss parent, file=kss2.h LINE:138, pg=0 PSO child state object changes : Dump of memory from 0x00000000BB5BA8E8 to 0x00000000BB5BAAF0 0BB5BA8E0 00000000 00000000 [........] 0BB5BA8F0 00000000 00000000 00000000 00000000 [................] Repeat 31 times PROCESS 2: PMON ---------------------------------------- SO: 0xbb63b780, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0xbb63b780, name=process, file=ksu.h LINE:12616, pg=0 (process) Oracle pid:2, ser:1, calls cur/top: 0xbaea88a8/0xbaea88a8 flags : (0xe) SYSTEM flags2: (0x0), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 0 0 16 last post received-location: ksu.h LINE:13945 ID:ksupsc last process to post me: bb75a490 25 0 last post sent: 0 0 19 last post sent-location: ksu.h LINE:13957 ID:ksuxfd last process posted by me: bb63b780 1 14 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: 0xbba5e848 O/S info: user: oracle, term: UNKNOWN, ospid: 14438 OSD pid info: Unix process pid: 14438, image: oracle@devdb09 (PMON) Short stack dump: [...]
I hope this was helpful! If you have questions or thoughts, please leave them in the comments.
You can find other posts in this series here:
- Post-Mortem Analysis Tools: ASH Dump
- Post-Mortem Analysis Tools: Hanganalyze
- Post-Mortem Analysis Tools: Using Preliminary Connection (publishing Thursday, April 22)