Using the Oracle Wait Interface to Troubleshoot I/O Issues

Oct 30, 2006 / By Shakir Sadikali

Tags: , ,

Or, Why I Like Broccoli and Wait Events

There’s been a fair bit of discussion with respect to whose vegetables are the freshest in Ottawa. From one DBA to another, I’m here to tell you that St. Laurent Fruit and Vegetable (beside the Mideast Meat Market) is probably the place to go for the best prices on… well… fruits and vegetables. They also carry lots of bags, and they’re strategically placed so you’ll never need old men to help you get more :-)

What’s more interesting about grocery shopping is that it lets you daydream and go over all the good work you’ve done over the last week (assuming you go weekly and a certain amount good work done). Last week was particularly tough. One of our more stable clients had some serious “mystery” production problems. To paraphrase, their app that was running “OK” starting running “sucky”. I knew that nothing had changed from our end and so being the intrepid super-sleuths that we are, we determined that the problems were related to I/O. How? Here’s how (forgive me, but I haven’t really gone into massive amounts of detail on this one, it’s more of a general framework).

Step 1

Run a handy query originally written by Steve Adams. I use this as my very first step to troubleshooting anything that resembles an Oracle performance issue, and work from there.

-----------------------------------------------------------
--
-- Script:      waiters.sql
-- Purpose:     to count the waiters for each event type
-- For:         8.0 and higher
--
-- Copyright:   (c) 2000 Ixora Pty Ltd
-- Author:      Steve Adams
--
-----------------------------------------------------------
select /*+ ordered */
  substr(n.name, 1, 29)  event,
  t0,
  t1,
  t2,
  t3,
  t4,
  t5,
  t6,
  t7,
  t8,
  t9
from
  sys.v_$event_name  n,
  (select event e0, count(*)  t0 from sys.v_$session_wait group by event),
  (select event e1, count(*)  t1 from sys.v_$session_wait group by event),
  (select event e2, count(*)  t2 from sys.v_$session_wait group by event),
  (select event e3, count(*)  t3 from sys.v_$session_wait group by event),
  (select event e4, count(*)  t4 from sys.v_$session_wait group by event),
  (select event e5, count(*)  t5 from sys.v_$session_wait group by event),
  (select event e6, count(*)  t6 from sys.v_$session_wait group by event),
  (select event e7, count(*)  t7 from sys.v_$session_wait group by event),
  (select event e8, count(*)  t8 from sys.v_$session_wait group by event),
  (select event e9, count(*)  t9 from sys.v_$session_wait group by event)
where
  n.name != 'Null event' and
  n.name != 'null event' and
  n.name != 'rdbms ipc message' and
  n.name != 'pipe get' and
  n.name != 'virtual circuit status' and
  n.name not like '%timer%' and
  n.name not like '%slave wait' and
  n.name not like 'SQL*Net message from %' and
  n.name not like 'io done' and
  n.name != 'queue messages' and
  e0 (+) = n.name and
  e1 (+) = n.name and
  e2 (+) = n.name and
  e3 (+) = n.name and
  e4 (+) = n.name and
  e5 (+) = n.name and
  e6 (+) = n.name and
  e7 (+) = n.name and
  e8 (+) = n.name and
  e9 (+) = n.name and
  nvl(t0, 0) + nvl(t1, 0) + nvl(t2, 0) + nvl(t3, 0) + nvl(t4, 0) +
  nvl(t5, 0) + nvl(t6, 0) + nvl(t7, 0) + nvl(t8, 0) + nvl(t9, 0) > 0
order by
  nvl(t0, 0) + nvl(t1, 0) + nvl(t2, 0) + nvl(t3, 0) + nvl(t4, 0) +
  nvl(t5, 0) + nvl(t6, 0) + nvl(t7, 0) + nvl(t8, 0) + nvl(t9, 0)
/

It’s really a nifty “group by” from v$session_wait. So, if you really want, you can go straight there. This query showed me the following:

SQL*Net more data from client 1 1 1 1 1 1 1 1 1 1
latch free   2 2 2 2 2 2 2 2 2 2
db file sequential read  2 2 2 2 2 2 2 2 2 2
db file scattered read  3 3 3 3 3 3 3 3 3 3
enqueue    8 8 8 8 8 8 8 8 8 8
buffer busy waits  15 15 15 15 15 15 15 15 15 15

Ok, so now I know we definitely had a problem with enqueues and buffer_busy_waits. In addition, when I ran this a few times, I also encountered significant latch free waits. So what’s the deal? Poking around the web, you’ll come up with a few standard answers for each of these. To summarize, here’s a great paper on resolving latch free waits.

Looking in v$session_wait told me the latch I was waiting on was the kghalo latch for the shared_pool. This was expected. Our client was known for not binding any SQL, and so we were parsing everything. This wasn’t unusual.

The P2 column in v$session_wait tells you the latch# it’s waiting for, which can be looked up in v$latchname. In addition, for some system stats, you can run this:

select * from v$latch_misses order by 5 desc;

Step 2

What was unusual, however, were the absolute number of enqueues and buffer_busy_waits. Linking the sid from v$session_wait to v$session and then to v$sqlarea, a trend emerged. All the buffer_busy_waits were against the same object (in memory). In addition, the enqueues were also on the same object. How did I know this? Well, the waiting queries were one indication, however, I also looked at v$session_wait to get the information I needed.

When you look at buffer_busy waits, the P1 column gives you the file_name and the P2 column gives you the block#. P3 gives you the reason you’re waiting. Run the following to get the object you’re waiting on.

select * from dba_extents where file_id = [[your_file] and [your_block] between block_id and block_id + blocks;

Typically, you’re hitting hot blocks. You can try to use some of the techniques discussed here.

Step 3

Troubleshooting enqueue uses a similar strategy. For enqueues, the P2 column gives you the object_id of the object, and from there you figure out what’s going on.

select * from dba_objects where object_id = [your_P2_object];

In 10G, the interface is a little more useful, giving you the actual type of enqueue and the reason. In 8173, we’re not so fortunate. Either way, we managed to isolate a number of bad running queries this way, and it was obvious that we were hitting an I/O bottleneck based on one table in the database. My recommendation to the client was to check the disks. After they assured us that everything was okay, I went ahead and looked for query plan flips, etc.

At the end of 3 days of query tuning, one of the clients’ sys admins finally discovered that a battery had failed on the disk controller, causing all write cacheing to no longer be available. I’m sure there must have been a way to see this, but I just don’t know how.

To summarize folks, always check your batteries. But if you don’t, and your DBA starts lifting the hood and poking around, they’re sure to find tons of other stuff that can be fixed and that came to light only when the hardware didn’t behave nicely. Kinda like a nice stress test, eh?

What my team and I discovered was that the object in question needed some serious love, since the client does plan on growing their operation. Because of all the queries we fixed, their app is that much faster. So in the long run, this was good for us. Just like broccoli.

I’m off to dinner.

13 Responses to “Using the Oracle Wait Interface to Troubleshoot I/O Issues”

  • Kevin says:

    Good post. I actually like the fact that modern storage arrays are capable of doing the right thing in the event of a failure like that. But you are right, with 8iR3 there is only so much info available… seems sar or iostat (what OS?) would have showed helpfull information with regard to increase write service times.

    Surprised there wasn’t a spike in free buffer waits and log file sync… apparently a very read-intensive system…

  • Shakir Sadikali says:

    It’s true, we did notice an increase in write times. That’s why our initial assessment was to ask the client to check the disks. It was only after they “confirmed” that everything looked ok that we started looking for DB related causes for higher i/o. Our gut reaction was to check stats across the board and to look for query plan flips.

    There was a spike in buffer busy waits and enqueues. There was also an increase in log file sync as you suggest. I didn’t mention it in the article, however. Thanks for bringing it up!

  • Nigel Thomas says:

    Is there any particular reason why you like to see every wait stat 10 times? Why not throw away t1 to t9 and stick with results:

    SQL*Net more data from client 1
    latch free 2
    db file sequential read 2
    db file scattered read 3
    enqueue 8
    buffer busy waits 15

    Or is this an edited version of the query, with something significant removed?

    Regards Nigel

  • Paul Vallee says:

    Nigel – sometimes the real underlying issues are transient and as a result might only be caught on one or two of the observations. Latch free is the typical culprit in those circumstances.

    I dug up an old observation where the underlying problem is not latch free, however the latch free events exhibits the behaviour I’m talking about:

    EVENT T0 T1 T2 T3 T4 T5 T6 T7 T8 T9
    log file sync 1
    SQL*Net more data from client 1
    SQL*Net break/reset to client 1
    db file parallel write 1 1 1 1 1 1 1 1 1 1
    enqueue 1 1 1 1 1 1 2 2 3 1
    SQL*Net message to client 1 2 2 1 2 1 1 1 4
    db file scattered read 4 4 3 3 3 4 4 4 4 3
    db file sequential read 21 21 24 23 23 23 22 19 17 21
    latch free 1 5 23 41 46 57 66 6
    buffer busy waits 180 180 181 181 181 181 181 181 181 181
    file open 196 200 199 199 194 193 192 191 190 194

    See how for some observations, it doesn’t appear at all? Meanwhile, for another observation only microseconds later, a full 66 sessions are simultaneously waiting on a latch? That’s what doing 10 observations allows you to catch when doing only one you might miss it and come to the wrong conclusion.

    Cheers

    Paul

  • Paul Vallee says:

    Nigel –
    I found an example where latch free is the actual underlying problem, but a single observation would likely have masked that:

    EVENT T0 T1 T2 T3 T4 T5 T6 T7 T8 T9
    row cache lock 1
    enqueue 1 1
    SQL*Net more data from client 3 2 1 1 1
    db file scattered read 1 1 1 1 1 1 1 2
    SQL*Net more data to client 1 1 5 2 2 1 1
    io done 2 1 1 2 1 2 2 2 2
    file open 8 4 1 6 6 1 10 3 3
    slave wait 5 5 5 5 5 5 5 5 5 5
    buffer busy waits 19 4 5 12 6 10 14 5 4
    SQL*Net message to client 38 13 29 37 39 24 36 32 11 10
    log file sync 27 7 84 5 11 36 71 35 14 4
    db file sequential read 59 37 31 44 46 29 51 37 25 30
    latch free 249 250 355 340 178 284 126 220 13 17

    See how, seen as together, the latch free stnads out as being clearly the problem here. In this case, I believe we were living with the old library cache hashing approach and we had a versioning problem. In any event, imagine if you only ran one observation and ended up with the one with 13… you might easily conclude this is a disk IO issue, what with the log file sync and db file seq read trumping your latch free.

    Cheers
    Paul

  • Shakir Sadikali says:

    Hi Nigel,

    Paul illustrates the use of query perfectly. It’s important to get frequent observations, and querying v$session_wait manually a few times in a row really leaves a lot of room to miss quick transient events. An alternative could be to look at v$latch_misses (for example) to see an overall picture, but it’s not really granular in that you see cumulative stats and not a picture of what’s happening “now”. The other obvious limitation is that it’s looking at just the latches and nothing else. Likewise, you could query all the other dictionary tables individually to look for other waits. Steve’s query nicely points you in the right direction from the get-go.

    Cheers,
    Shakir

  • Thanks for the great article!
    This article helped me to find out, which objects are causing a lot of latch free waits.

    Cheers

    Dejan

  • sadikali says:

    Thanks, Dejan! I’m always glad to help!

  • [...] found lately that munching on carrots with French dressing is more satisfying than broccoli. Maybe it’s the tang-and-crunch combination. In any case, I was crunching away yesterday [...]

  • Lou Avrami says:

    Thanks for the very informative article.
    It appears that the ‘hot blocks” link it no longer valid … could you suggest another reference?
    Thanks.

  • Amit says:

    hi below is the o/p am geting showing prob with Stream AQ.
    NAME,EVENT,T0,T1,T2,T3,T4,T5,T6,T7,T8,T9
    SQL*Net more data to client,SQL*Net more data to client,,,,1,1,1,,,,
    Streams AQ: qmn slave idle wait,Streams AQ: qmn slave idle wa,1,1,1,1,1,1,1,1,1,1
    Streams AQ: waiting for time management or cleanup tasks,Streams AQ: waiting for time ,1,1,1,1,1,1,1,1,1,1
    Streams AQ: qmn coordinator idle wait,Streams AQ: qmn coordinator i,1,1,1,1,1,1,1,1,1,1
    wait for unread message on broadcast channel,wait for unread message on br,4,4,4,4,4,4,4,4,4,4
    Streams AQ: waiting for messages in the queue,Streams AQ: waiting for messa,8,8,8,8,8,8,8,8,8,8
    How to deal with this?

  • Shakir Sadikali says:

    Hi Amit, what is your aq_tm_processes parameter set to?

  • Carrots improve night vision, a key in tuning Oracle.

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>