Traceless DB Time Diagnosis


“It’s the database, I am telling you!”

How many times have you heard this?

Here’s a quick way to diagnose where the database is spending its time for a specific process/operation.

This is very similar to what statspack does, except that it's targeted at a specific process, and it includes the time the database spent waiting for the application.

To use this method reliably, you need to compare snapshots taken during the specific operation or process you are trying to diagnose. It also works in connection pooling environments, which is sometimes a great bonus, since tracing an application that uses multiple or pooled connections can be "a pain".

Wait for your test to start. Once it has been running for at least a couple of seconds, create a snapshot with a command similar to this one:

create table k_sysevnt1a as select t.*, sysdate as dt from v$session_event t where sid in (select sid from v$session where program like 'ent@%');

In this case I was only interested in the "ent" program, so that's what I filtered on. The table k_sysevnt1a is a temporary object created solely for the purpose of this analysis.
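If your sessions can't be singled out by program name, any other v$session column works just as well. A sketch with made-up filter values (APPUSER and appserver are placeholders for your own user/host):

create table k_sysevnt1a as
select t.*, sysdate as dt
from v$session_event t
where sid in (select sid
	from v$session
	where username = 'APPUSER'       /* placeholder schema name */
		or machine like 'appserver%'); /* placeholder host name   */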

Towards the end of the test, but before it has finished, take another snapshot (note the change in table name):

create table k_sysevnt1b as select t.*, sysdate as dt from v$session_event t where sid in (select sid from v$session where program like 'ent@%');

All tables can be dropped once the performance analysis is complete. You could keep them for a while as a reference, but once you have the results, their job is done. You can create as many snapshots as you need and compare any two of them. I usually take 4-5 so that I can see whether the average numbers change over the duration of the test.
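For instance, a third snapshot simply continues the naming scheme (k_sysevnt1c here), and the cleanup afterwards is just a set of drops:

create table k_sysevnt1c as select t.*, sysdate as dt from v$session_event t where sid in (select sid from v$session where program like 'ent@%');

-- once the analysis is complete:
drop table k_sysevnt1a;
drop table k_sysevnt1b;
drop table k_sysevnt1c;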

Now all you need to do is compare two snapshots with the following query. Remember to replace the table names in the from clause with the pair of snapshots you want to compare:

select sid, event, waits,
	round(time_micro/1000000,1) as sec,
	round(time_micro/nullif(waits,0)/1000,3) as avg_ms,
	round(ratio_to_report(time_micro) over (partition by sid)*100,1) as P,
	round(per_sec,2) as per_sec
from
	(select a.sid,
		decode(grouping_id(a.event),1,'CPU (~)',a.event) as event,
		decode(grouping_id(a.event),1,
			null,sum(b.total_waits - a.total_waits)) as waits,
		decode(grouping_id(a.event),1,
			(max(b.dt)-max(a.dt))*86400*1000000
				- sum(b.time_waited_micro-a.time_waited_micro),
			sum(b.time_waited_micro-a.time_waited_micro)) as time_micro,
		decode(grouping_id(a.event),1,
			null,
			sum(b.total_waits - a.total_waits)/
				nullif(sum(b.time_waited_micro-a.time_waited_micro)/1000000,0)
			) as per_sec
	from k_sysevnt1a a, k_sysevnt1b b
	where a.event = b.event
		and a.sid=b.sid
		/*and a.sid in(116,111,145,103)*/
	group by a.sid, rollup((a.event))
	) t
order by time_micro desc
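If you want to compare a different pair of snapshots, only the from clause changes; for example, this would compare the second and third snapshots (assuming a k_sysevnt1c table as above):

	from k_sysevnt1b a, k_sysevnt1c b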

In this case, the result of the comparison query looks like this:

SID  EVENT                          WAITS  SEC   AVG_MS  P     PER_SEC
103  SQL*Net message from client    84762  59.1  0.697   48.8  1434.03
145  SQL*Net message from client    85492  58.7  0.687   48.5  1455.93
111  SQL*Net message from client    85572  58.7  0.686   48.5  1458.24
116  SQL*Net message from client    83520  58.1  0.695   48    1438.33
116  CPU (~)                               20.8          17.2
111  log file sync                  12692  20.4  1.61    16.9  621.03
145  log file sync                  12692  20.3  1.599   16.8  625.38
103  log file sync                  12509  20.1  1.605   16.6  623.04
103  CPU (~)                               20            16.5
116  log file sync                  12317  19.8  1.607   16.4  622.22
145  CPU (~)                               19.1          15.8
111  CPU (~)                               19            15.7
111  db file sequential read        1811   14.2  7.854   11.8  127.32
145  db file sequential read        1830   14    7.656   11.6  130.61
116  db file sequential read        1701   13.9  8.187   11.5  122.14
103  db file sequential read        1711   13.5  7.919   11.2  126.28
145  enq: TX - row lock contention  1334   8.4   6.329   7     158
111  enq: TX - row lock contention  1351   8.2   6.046   6.8   165.39
116  enq: TX - row lock contention  1305   8     6.118   6.6   163.45
103  enq: TX - row lock contention  1256   7.8   6.194   6.4   161.45
103  enq: TX - index contention     18     0.2   9.33    0.1   107.18
111  enq: TX - index contention     17     0.1   8.008   0.1   124.87
111  events in waitclass Other      118    0.1   1.049   0.1   953.37
116  enq: TX - index contention     13     0.1   8.851   0.1   112.98
116  read by other session          28     0.1   4.052   0.1   246.76
103  SQL*Net message to client      84762  0.1   0.001   0.1   754875.9
116  SQL*Net message to client      83520  0.1   0.001   0.1   751674
103  read by other session          20     0.1   5.209   0.1   191.96
145  enq: TX - index contention     12     0.1   8.28    0.1   120.78
145  SQL*Net message to client      85492  0.1   0.001   0.1   882744.1

The "P" column is the percentage of time for that session.

The "per_sec" column shows how many times the event fired per second. You can use it to gauge how many IOs per second your process is doing, or how many packets are flying back and forth between your application and the database.

Note: the "sid" filter is commented out. Usually I run the analysis query first, note the SIDs with the highest counts of SQL*Net activity, and then limit my scope to only those SIDs.

Now you have a nice breakdown of where the time was spent. In this case, the interesting parts are:

– log file sync – the average time is ~1.6 ms – this is rather high; modern SANs can achieve 0.5 ms

– db file sequential read – the average time is ~8 ms – rather high; it's usually in the 4 ms range on fast disks

– SQL*Net message from client – do not ignore this event. The database waited for the application to respond about half the time. You can read this the other way around: the application waited for the database about half the time. So even if your database were infinitely fast, you would only get twice the speed.

– SQL*Net message from client – 1400 per second – the application is communicating with the server almost twice every millisecond. That's rather high activity; you might consider batching a bit more (see the first sketch after this list)

– Remember that CPU time is an assumed value (elapsed time minus the accounted wait time). It is usually very close to the actual CPU time, but if it represents a significant amount of time, it's worth double-checking (see the second sketch after this list).
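On the batching point: one minimal illustration of cutting SQL*Net round trips is fetching rows in larger arrays. In SQL*Plus, for example (100 is just an illustrative value; the default is 15):

set arraysize 100

Application-side drivers expose the same knob, e.g. the fetch size in JDBC (Statement.setFetchSize).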
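To double-check the assumed CPU figure, you can snapshot the "CPU used by this session" statistic (reported in centiseconds) the same way as the wait events. A sketch reusing the naming scheme from above (the k_syscpu1a/k_syscpu1b tables are hypothetical additions, not part of the original method):

create table k_syscpu1a as
select s.sid, s.value, sysdate as dt
from v$sesstat s, v$statname n
where s.statistic# = n.statistic#
	and n.name = 'CPU used by this session'
	and s.sid in (select sid from v$session where program like 'ent@%');

-- take k_syscpu1b the same way towards the end of the test, then:
select a.sid, (b.value - a.value)/100 as cpu_sec
from k_syscpu1a a, k_syscpu1b b
where a.sid = b.sid;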

What you get from this method is a breakdown of where time was spent, without using sql_trace, which can itself change your response time. You can use it to assess which area of your system needs improvement. The most important part of this breakdown is how much time the database waited for your application.


About the Author

An Oracle ACE with a deep understanding of databases, application memory, and input/output interactions, Christo is an expert at optimizing the performance of the most complex infrastructures. Methodical and efficiency-oriented, he equates the role of an ATCG Principal Consultant in many ways to that of a data analyst: both require a rigorous sifting-through of information to identify solutions to often large and complex problems. A dynamic speaker, Christo has delivered presentations at the IOUG, the UKOUG, the Rocky Mountain Oracle Users Group, Oracle Open World, and other industry conferences.

1 Comment

Looks like you have invented session level BSTAT/ESTAT. ;-)
Please also note that as you are including idle events this is NOT DB Time analysis. The definition of DB Time excludes idle events.

