Traceless DB Time Diagnosis
Apr 5, 2006 / By Christo Kutrovsky
“It’s the database, I am telling you!”
How many times have you heard this?
Here’s a quick way to diagnose where the database is spending its time for a specific process/operation.
This is very similar to what statspack is doing, except it’s targeted towards a specific process, and does include the time where the database waited for the application.
To utilise this method reliably, you need to compare snapshots taken during the specific operation or process that you are trying to diagnose. This works in connection pooling environments too, which is sometime a great bonus as tracing an application that uses multiple or pooled connections can be “a pain”.
Wait for your test to start. At least a couple of seconds within your test create a snapshot with a command similar to this one:
create table k_sysevnt1a as select t.*, sysdate as dt from v$session_event t where sid in (select sid from v$session where program like 'ent@%');
In that case I was only interested in the “ent” program, so that’s what I put. The table k_sysevnt1a is a temporary object created solely for the purpose of this analysis.
Towards the end of the test, but not after it has finished, take another snapshot (notice the table name changing):
create table k_sysevnt1b as select t.*, sysdate as dt from v$session_event t where sid in (select sid from v$session where program like 'ent@%');
All tables can be dropped, once the performance analysis is complete. You could keep them for a while as a reference, but once you have the results, their job is done. You can create as many snapshots as you need, and compare any two of them. I usually do 4-5 so that I can see if the average numbers change during the test duration.
Now all you need to do is compare them via the following query. Remember to always replace the tables in the from section:
select sid,event,waits, round(time_micro/1000000,1) as sec, round(time_micro/nullif(waits,0)/1000,3) as avg_ms, round(ratio_to_report(time_micro) over (partition by sid)*100,1) as P, round(per_sec,2) as per_sec from (select a.sid, decode(grouping_id(a.event),1,'CPU (~)',a.event) as event, to_number(decode(grouping_id(a.event),1, null,sum(b.total_waits - a.total_waits))) as waits, to_number(decode(grouping_id(a.event),1, max((b.dt-a.dt)*86400*1000000)- sum(b.time_waited_micro-a.time_waited_micro), sum(b.time_waited_micro-a.time_waited_micro))) as time_micro, decode(grouping_id(a.event),0, sum(b.total_waits - a.total_waits)/ nullif( sum(b.time_waited_micro-a.time_waited_micro),0)*1000000 ) as per_sec from k_sysevnt1a a, k_sysevnt1b b where a.event = b.event and a.sid=b.sid /*and a.sid in(116,111,145,103)*/ group by a.sid, rollup((a.event)) ) t order by time_micro desc
In this case the result looks like this:
|103||SQL*Net message from client||84762||59.1||0.697||48.8||1434.03|
|145||SQL*Net message from client||85492||58.7||0.687||48.5||1455.93|
|111||SQL*Net message from client||85572||58.7||0.686||48.5||1458.24|
|116||SQL*Net message from client||83520||58.1||0.695||48||1438.33|
|111||log file sync||12692||20.4||1.61||16.9||621.03|
|145||log file sync||12692||20.3||1.599||16.8||625.38|
|103||log file sync||12509||20.1||1.605||16.6||623.04|
|116||log file sync||12317||19.8||1.607||16.4||622.22|
|111||db file sequential read||1811||14.2||7.854||11.8||127.32|
|145||db file sequential read||1830||14||7.656||11.6||130.61|
|116||db file sequential read||1701||13.9||8.187||11.5||122.14|
|103||db file sequential read||1711||13.5||7.919||11.2||126.28|
|145||enq: TX – row lock contention||1334||8.4||6.329||7||158|
|111||enq: TX – row lock contention||1351||8.2||6.046||6.8||165.39|
|116||enq: TX – row lock contention||1305||8||6.118||6.6||163.45|
|103||enq: TX – row lock contention||1256||7.8||6.194||6.4||161.45|
|103||enq: TX – index contention||18||0.2||9.33||0.1||107.18|
|111||enq: TX – index contention||17||0.1||8.008||0.1||124.87|
|111||events in waitclass Other||118||0.1||1.049||0.1||953.37|
|116||enq: TX – index contention||13||0.1||8.851||0.1||112.98|
|116||read by other session||28||0.1||4.052||0.1||246.76|
|103||SQL*Net message to client||84762||0.1||0.001||0.1||754875.9|
|116||SQL*Net message to client||83520||0.1||0.001||0.1||751674|
|103||read by other session||20||0.1||5.209||0.1||191.96|
|145||enq: TX – index contention||12||0.1||8.28||0.1||120.78|
|145||SQL*Net message to client||85492||0.1||0.001||0.1||882744.1|
The “P” column is percentage of time for the session.
The “per_sec” column is how many times this event fired per second. You can use it to gauge how many IO/s per second your process is doing or how many packets are flying back and forth between your application and the database.
Note: the “sid” clause is in comments. Usually I run the analysis query, then note the highest counts of SQLNET activity, and then limit my scope to only those SIDs.
Now you have a nice breakdown of where the time was spent. In this case, the interesting parts are:
- log file sync – average time is ~1.5 ms – this is rather high, modern SANs can achieve 0.5 ms
- db file sequential read – average time is ~8ms – rather high, usually it’s in the 4 ms range on fast disks
- SQL*Net message from client – do not ignore this event. The database waited for the application to respond about half the time. You can read this the other way, saying that the application waited for the database about half the time. So even if your database was infinitely fast, you would only get twice the speed.
- SQL*Net message from client – 1400 per second – the application is communicating with the server almost twice every ms. Rather high activity, might consider batching a bit more
- Remember CPU time is an assumed value. It usually very close to actual CPU time, but if it’s representing a significant amount of time it’s worth double checking.
What you get from this method is a breakdown of where time was spent, without using sql_trace which may change your response time. You can use it to assess whichever area of your system needs improvement. The most important part of this breakdown is how much time the database waited for your application.
One Response to “Traceless DB Time Diagnosis”
Leave a Reply