Traceless DB Time Diagnosis

Apr 5, 2006 / By Christo Kutrovsky

Tags: ,

“It’s the database, I am telling you!”

How many times have you heard this?

Here’s a quick way to diagnose where the database is spending its time for a specific process/operation.

This is very similar to what statspack is doing, except it’s targeted towards a specific process, and does include the time where the database waited for the application.

To utilise this method reliably, you need to compare snapshots taken during the specific operation or process that you are trying to diagnose. This works in connection pooling environments too, which is sometime a great bonus as tracing an application that uses multiple or pooled connections can be “a pain”.

Wait for your test to start. At least a couple of seconds within your test create a snapshot with a command similar to this one:

create table k_sysevnt1a as select t.*, sysdate as dt from v$session_event t where sid in (select sid from v$session where program like 'ent@%');

In that case I was only interested in the “ent” program, so that’s what I put. The table k_sysevnt1a is a temporary object created solely for the purpose of this analysis.

Towards the end of the test, but not after it has finished, take another snapshot (notice the table name changing):

create table k_sysevnt1b as select t.*, sysdate as dt from v$session_event t where sid in (select sid from v$session where program like 'ent@%');

All tables can be dropped, once the performance analysis is complete. You could keep them for a while as a reference, but once you have the results, their job is done. You can create as many snapshots as you need, and compare any two of them. I usually do 4-5 so that I can see if the average numbers change during the test duration.

Now all you need to do is compare them via the following query. Remember to always replace the tables in the from section:

select sid,event,waits,
	round(time_micro/1000000,1) as sec,
	round(time_micro/nullif(waits,0)/1000,3) as avg_ms,
	round(ratio_to_report(time_micro) 
		over (partition by sid)*100,1) as P,
	round(per_sec,2) as per_sec
from
	(select a.sid,
		decode(grouping_id(a.event),1,'CPU (~)',a.event) as event,
		to_number(decode(grouping_id(a.event),1,
		null,sum(b.total_waits - a.total_waits))) as waits,
		to_number(decode(grouping_id(a.event),1,
		max((b.dt-a.dt)*86400*1000000)-
		sum(b.time_waited_micro-a.time_waited_micro),
		sum(b.time_waited_micro-a.time_waited_micro))) as time_micro,
		decode(grouping_id(a.event),0,
		sum(b.total_waits - a.total_waits)/
		nullif(
			sum(b.time_waited_micro-a.time_waited_micro),0)*1000000
			) as per_sec
	from k_sysevnt1a a, k_sysevnt1b b
	where a.event = b.event
		and a.sid=b.sid
		/*and a.sid in(116,111,145,103)*/
		group by a.sid, rollup((a.event))
	) t
order by time_micro desc

In this case the result looks like this:

SID EVENT WAITS SEC AVG_MS P PER_SEC
103 SQL*Net message from client 84762 59.1 0.697 48.8 1434.03
145 SQL*Net message from client 85492 58.7 0.687 48.5 1455.93
111 SQL*Net message from client 85572 58.7 0.686 48.5 1458.24
116 SQL*Net message from client 83520 58.1 0.695 48 1438.33
116 CPU (~) 20.8 17.2
111 log file sync 12692 20.4 1.61 16.9 621.03
145 log file sync 12692 20.3 1.599 16.8 625.38
103 log file sync 12509 20.1 1.605 16.6 623.04
103 CPU (~) 20 16.5
116 log file sync 12317 19.8 1.607 16.4 622.22
145 CPU (~) 19.1 15.8
111 CPU (~) 19 15.7
111 db file sequential read 1811 14.2 7.854 11.8 127.32
145 db file sequential read 1830 14 7.656 11.6 130.61
116 db file sequential read 1701 13.9 8.187 11.5 122.14
103 db file sequential read 1711 13.5 7.919 11.2 126.28
145 enq: TX – row lock contention 1334 8.4 6.329 7 158
111 enq: TX – row lock contention 1351 8.2 6.046 6.8 165.39
116 enq: TX – row lock contention 1305 8 6.118 6.6 163.45
103 enq: TX – row lock contention 1256 7.8 6.194 6.4 161.45
103 enq: TX – index contention 18 0.2 9.33 0.1 107.18
111 enq: TX – index contention 17 0.1 8.008 0.1 124.87
111 events in waitclass Other 118 0.1 1.049 0.1 953.37
116 enq: TX – index contention 13 0.1 8.851 0.1 112.98
116 read by other session 28 0.1 4.052 0.1 246.76
103 SQL*Net message to client 84762 0.1 0.001 0.1 754875.9
116 SQL*Net message to client 83520 0.1 0.001 0.1 751674
103 read by other session 20 0.1 5.209 0.1 191.96
145 enq: TX – index contention 12 0.1 8.28 0.1 120.78
145 SQL*Net message to client 85492 0.1 0.001 0.1 882744.1

The “P” column is percentage of time for the session.

The “per_sec” column is how many times this event fired per second. You can use it to gauge how many IO/s per second your process is doing or how many packets are flying back and forth between your application and the database.

Note: the “sid” clause is in comments. Usually I run the analysis query, then note the highest counts of SQLNET activity, and then limit my scope to only those SIDs.

Now you have a nice breakdown of where the time was spent. In this case, the interesting parts are:

- log file sync – average time is ~1.5 ms – this is rather high, modern SANs can achieve 0.5 ms

- db file sequential read – average time is ~8ms – rather high, usually it’s in the 4 ms range on fast disks

- SQL*Net message from client – do not ignore this event. The database waited for the application to respond about half the time. You can read this the other way, saying that the application waited for the database about half the time. So even if your database was infinitely fast, you would only get twice the speed.

- SQL*Net message from client – 1400 per second – the application is communicating with the server almost twice every ms. Rather high activity, might consider batching a bit more

- Remember CPU time is an assumed value. It usually very close to actual CPU time, but if it’s representing a significant amount of time it’s worth double checking.

What you get from this method is a breakdown of where time was spent, without using sql_trace which may change your response time. You can use it to assess whichever area of your system needs improvement. The most important part of this breakdown is how much time the database waited for your application.

One Response to “Traceless DB Time Diagnosis”

  • Graham Wood says:

    Looks like you have invented session level BSTAT/ESTAT. ;-)
    Please also note that as you are including idle events this is NOT DB Time analysis. The definition of DB Time excludes idle events.

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>