Author Archive

Log Buffer #89: a Carnival of the Vanities for DBAs

By Shakir March 21st, 2008 at 11:52 am
Posted in Group Blog PostsLog BufferMySQLOraclePostgreSQLSQL Server
Tags:

Welcome to the 89th edition of Log Buffer, the weekly review of database blogs.

Welcome, welcome everyone.

In writing this week’s Log Buffer, I’ve had a chance to sit down and read some excellent posts on all sorts of platforms. The depth and breadth of what’s available to house and retrieve data is astonishing.

Many of you who have read my posts will know that I’m a fan of vegetables. They are something most of us don’t eat enough of. Come on DBAs! I think we need to make a collective effort to get healthy. We need you to keep all these systems alive. I say this because I have a new found appreciation for the work we do day in and day out.

Six months ago my wife and I said hello to our baby girl for the first time. I don’t say this to elicit any type of congratulations, but to illustrate something entirely different. If you have ever been to a hospital for any reason — to celebrate, to hope, or to say goodbye — you know the sheer complexity of the vast numbers of systems that need to interact. Daily, these systems save lives and help bring new ones to this world. I saw first-hand how the work I do on a everyday keeps the wheels turning.

Some of our customers run systems used by hospitals and I saw them in action. In a simple world, treating people can be done without technology, but this is an issue of scale, and our involvement directly affects the sheer masses of people whose lives are better because of our behind-the-scenes support. It’s true here, and it’s true for the most serious, most mission-critical systems, to the least critical and most trivial systems. The work done by DBAs from all platforms should be recognized for what it is.

I’m proud of what I do for a living and happy that I get to work in an industry filled with so many savvy folks. Oracle, Microsoft, MySQL, Postgres, IBM, and countless other organizations, and the people involved in them have together created an industry filled with opportunities and challenges, and above all, they have together elevated our ability to communicate and share. It’s in this spirit that Log Buffer was created, so let us proceed!

Since I’m an Oracle guy, we’ll let Oracle go first this time.

(more…)

How To Build Scalable Database Architectures

By Shakir March 12th, 2008 at 3:03 pm
Posted in Group Blog PostsOracle
Tags:

I’ve found lately that munching on carrots with French dressing is more satisfying than broccoli. Maybe it’s the tang-and-crunch combination. In any case, I was crunching away yesterday while thinking about how to answer a question one of our newer start-up clients asked me.

No one has ever come out and formally asked me for a document that states “Best Practices to Scale Application X”. It is an unusual demand, since it’s something many of us at Pythian have implemented, but it’s been more of an ad hoc, iterative process — and rightly so, since architectures must be so organic, and so tailored to the application. What’s more, no one has ever brought us on board so early in the game that we have a hand in actually — gasp! — doing the design and data-model from the get-go. Woo hoo!

Now, a little background. I have built and maintained a few systems. Some of them even supported over 100k concurrent users. These databases didn’t run RAC either (although I do support two very high profile RAC environments now). So, having been in the trenches and knowing what it takes to make a DB move, I got to thinking about some of the basic fundamentals. There are always rules of thumb, right? This is what you need to know to start with building a scalable high-performance system based on stuff that I’ve seen. Obviously, this assumes a database-centric app. Let’s start with the first ten principles.

(more…)

Good Database Design is Mightier than Hardware

By Shakir March 7th, 2008 at 10:03 am
Posted in Group Blog PostsOracle
Tags:

Have you ever heard the one about throwing hardware at a software problem?

In one of my previous blog posts, I mentioned something along the lines of—well I’ll just cut and paste . . .

In my experience, the solution to most problems (the ones the caller refers to as “it’s running slow”) are not rooted in hardware, because hardware problems generally cause things to not run at all. It always baffles me when developers and architects prescribe hardware upgrades to make things run faster, because about 80% of the performance-related problems and subsequent solutions I’ve dealt with were resolved by tuning the application.

Well yes, I know you can buy new hardware, and it’s easier. But when it comes to hardware, how many of you have a ten-node RAC cluster running Enterprise Edition with 8GB of RAM on each node, running off a massive SAN?

I’ve been on so many systems that have been running for years–poorly–the way they were, and in a week we can take them apart and have them running without a hitch. We’ve even managed to fix problems that turned out to be the business case to go from RAC back down to a single instance. How much did those customers save on licensing costs?

Back to the example at hand. I have this nifty RAC system that supports some very public and very mission-critical apps, and one day (it was Sunday night) it starts choking. We’re getting enqueues. Slowly they start climbing. Ten nodes came to a crashing halt. I have now seen a ten-node RAC cluster come to crashing halt and completely lock up.

Why, you ask? (more…)

Bug in Oracle DST Patch Rollback

By Shakir February 26th, 2007 at 5:58 pm
Posted in Group Blog PostsOracle

I’ve done a few patches now — on Windows, Linux, Solaris and on many platforms for both 9i and 10g, some with physical standbys and others without. The lesson that I’ve taken away from them is that rolling back is twice the fun of actually applying the patch.

Last week, we discovered a bug related to the rollback of the Java DST patch for 9206 on Linux x86. If you’re in the unfortunate position of needing to roll this patch back, you won’t be able to. It turns out that the patch instructions omit telling you to also back up the classes.bin file in addition to
$ORACLE_HOME/javavm/admin/libjox9java_util.so.

When you rollback, you need to restore both of these files, and then do a create or replace java system. Otherwise you risk hitting a dreaded ORA-03114 End of file on communication channel.

In my case, when I followed the patch instructions, the Java test case given to us by Oracle still returned a 1. Those of you familiar with DST patching will have seen this, but for those who haven’t, here are the scripts you need to check for
a successful Java implementation:

(more…)

Oracle Security vs Microsoft Security

By Shakir November 30th, 2006 at 12:56 pm
Posted in Group Blog PostsNon-Tech ArticlesNot on Homepage

I always get a chuckle… well not always… but often, when I read anything to do with Microsoft vs. “the flavour of the month”. In this case, it looks like the flavour of the month is Oracle. I’m sure most of you in this space have already read much of this, but if you haven’t, here’s the link, and here’s the white paper that started it all.

The articles simple imply that SQL Server is more secure than Oracle. They go on to count the number of vulnerabilities identified over the last few years. It’s scary stuff and I’m sure we’re all running out now to ask our architects and DBAs how quickly we can port over to SQL Server.

I say I get a chuckle with these things because I actually read one of the use-cases from the author.

First off, I need to make a confession. I like Microsoft. They generally make the only software I find practical for my home use. And the keyboard and mice they make are unmatched. The XBox is great too. I also really like Linux and Solaris. Frankly, all the database that I work on, work best on them. Frankly, Oracle also makes the best database software for what it’s built for, and for what most of our clients use it for.

All of these companies are leaders in one way or the other and they have their own faults. I always get that chuckle when I read about Windows flaws. What do researchers expect? I liken it to Michael Jordan. See, I played high school basketball, and I dreamt of making the NBA someday, but then I went to university and common sense kicked in. I still play regularly. And what strikes me is this, when you’re the best player in the league, the opposing team always puts up their best defender against you. MJ became the best in the league, a god–among–ballers if you will, because the other teams consistently made him improve his game. It logically follows that if you have a dominant position in anything, everyone will come after you and expose every flaw in your armour, no matter how insignificant, or complex, or impractical that vulnerability is.

Obviously Microsoft has to work harder to fix its vulnerabilities on the OS front — it’s the undisputed champ. Well, I think the same can be said for Oracle. I think the more important question to be asked is, how many people are actually at risk of being exploited?

A case in point here. It’s the same as the use-case I mentioned above. I’m a developer and a DBA. I’ve written a couple of apps myself. Will anyone ever be able to use this exploit against me? I doubt it. The reason I doubt it is that security isn’t built only in the database, it’s built at the firewall, it’s built into the app, it’s built into the DB. Can you imagine how far a person has to infiltrate into my system before this exploit becomes practical?

At least Oracle came through with the logical response to the threat of having its security vulnerabilities unleashed on the masses: “(We) do not credit security researchers who disclose the existence of vulnerabilities before a fix is available. We consider such practices, including disclosing ‘zero day’ exploits, to be irresponsible as they can result in needlessly exposing customers to risk of attack.”

I like this. This shows more commons sense than that displayed by self-trumpeting security analysts threatening the company and putting users at risk. Yee-haw.

I once went to a basketball camp with Leo Rautins, one of the few Canadians to have made it to the NBA and played against Jordan and Larry Bird. (I played against him, and he was awesome, which is kinda sad, because in the NBA, he was actually the worst–performing Canadian player. So it was obvious why I needed to move my career in another direction.) He once told me that the best basketball player in the world is probably rotting away in prison (a reference to the huge numbers of African–American men incarcerated, hugely disproportional to the actual number of African–Americans in the US).

I don’t want to get into a philosophical discussion, since demographics, social policy, stats and crime are some of my hobbies and outside the scope of this article. But the point he was making is that if you don’t see something, how do you know it exists? And by extension, if fewer people are looking, does that mean something isn’t there? There are no scouts scouting American prisons, but he personally knew people with exceptional skill that were playing the penitentiary leagues. The problem is that these people existed but were unknown because no one was looking there.

The same is true with software. If no one is looking for vulnerabilities on a platform, does that inherently mean it is more secure? Is a Mac more secure because it’s better built, or is it more secure because there are fewer threats against it?

I know there’s a risk with these flaws, just as there are risks with using IE. But I still surf using it. Hell, I still use Windows 98 at home on an old AMD k62-450 machine that runs just as well today as it did five years ago. I haven’t been “identity thieved” yet. I have resigned myself to the fact that it’s just a matter of time, and if someone really wants to get to me, they’ll find a way no matter how non-secure or secure my database is. What all these security experts fail to mention is that the vast majority of security exploits are conducted by insiders.

Finally, it comes down to broken trust. That’s something you can’t guard against. Chuckle chuckle… sigh.

Cut Oracle Costs on Mainframe

By Shakir November 13th, 2006 at 5:18 pm
Posted in Group Blog Posts

I’m a simple guy. I like things simple. I once had a consulting gig with a client who wanted to do something that I thought was quite simple.

But before I tell you what this simple thing is, I respectfully ask that you be seated.

The client entrusted me with the migration of one of their mission critical apps running 8i on HPUX to — drum role please — 9i … on … IBM Mainframe running os390 (z/os). For those who aren’t familiar with this battle–tested industrial–strength hardware, read here. For your added pleasure, read this article about the the mainframe world that came out today on CNN. You can see that it is indeed alive and well.

I had never worked on a mainframe before. Heck, I’m a young lad, and I’ve been brought up to laugh at these old hunks of metal. I was quickly put through mainframe initiation, which is a little odd, because you need to learn that os390 is a Dr. Jekyl/Mr. Hyde kinda apparatus. It’s something like two OSes running simultaneously together (as opposed to simultaneously on the same box but separately). It’s just odd and it took me a while to wrap my mind around it. But I did, and now I am a member–for–life of the reclusive club of mainframers. We wave at each other as we pass on the street.

In any case, the current production environment was composed of HPUX boxes running 8i and 9i. I can’t imagine how expensive the Oracle licensing must have been, but based on the sheer number of machines, I know Oracle was a happy supplier. Which brings me to the central point — unless you’re the one actually paying the bills, you don’t really realize that Oracle is quite expensive. While you do get what you pay for in terms of robust infrastructure, there are ways to lower the cost.

For this client, we were able to shave $600,000 (not just the Oracle license, but also the external costs of maintaining the environment) by migrating just one HPUX machine over to the mainframe with maybe 30 or more to go. All of them were scheduled to move over to the one mainframe with one Oracle license and one environment to manage.

I’m not here to convince you that mainframe is the solution to the world’s Oracle licensing cost problems, but if you’re in the “we have 30 machines each running Oracle” space, then I think it makes sense for you to seriously consider it. Add to that the benefits of all the control you get over the environment, it makes for a compelling story.

P.S.: In case any of you are planning such a migration, one thing I should point out about it: if you use import, then there’s a brutal bug that creates the initial extent of every object in the database with a minimum of 1MB. So, if you have scads of tiny objects taking up a few kilobytes, be prepared to account for this in your space requirement calculations.

P.P.S.: Trivia: Did you know that the government of Canada runs the busiest Oracle–on–mainframe environment in the world? It does!

Cheers,
Shakir

Speeding-Up Oracle Export/Import Migration

By Shakir November 10th, 2006 at 5:16 pm
Posted in Group Blog Posts

Or, How To Become an Export/Import Migration Superstar!

We’ve already established that I like broccoli. I get the taste from my dad who takes it with a glob a salad dressing and munches on it. It beats munching on Doritos. He and I both also do this with cucumbers. I don’t get it, but it tastes good and I know I’ll get a real kick out of it if I see my kids doing this someday too. In any case, one thing my dad doesn’t do is work with Oracle, and sadly, because of this, he’ll never get to add the Export/Import superstar trophy to his collection.

Over the last weekend I had the pleasure (I say pleasure because this was actually a very smooth operation) to do an 8i -> 10g, Solaris -> Linux migration. Talk about going in head–first. Now, whenever I run into a situation like this, we generally recommend a two–step process so that we can iron out bugs and be able to isolate causes. However, we were under serious time constraints, and we decided to just go with the following simple plan and move straight there. I had plenty of salad dressing to keep me company overnight.

Step 1 - export
Step 2 - import data
Step 3 - import everything else

That was the plan. We tried this a few times and when I finally had all the indexes etc. where I wanted them on the new server, and since we did some minor storage rejigging, I actually took a norows there and used that on the object import.

Now for the fun part. Typically, if you stick with a standard export/import, it’ll take you say… 10 hours. That’s what it usually took me. 30GB of data and 30GB of indexes. Now, you may ask, what makes me a superstar? Well, the fact that I dropped this down to about 5 hours, that’s what. This assumes you have big pages set up and Oracle able to address the space for a massive SGA.

The tricks:

  1. Assuming a full export, use direct=y if you’re not using any predicates in the export.
  2. Set your buffer to be big (10MB at least).
    Now the good stuff:
  3. Alter all your constraints novalidate prior to export if you can bring the app down, and if you can take a consistent export. This helps in reinstating the constraints instantly without forcing Oracle to validate constraints on massive tables
  4. I set workarea_size_policy=manual
  5. Set sort_area_size=6GB (yes, 6 GB). Combined with 4, it let Oracle build indexes with more space in memory for all the sorts without the need to spill to disk.
  6. Set massive online redo logs at 2GB each, 2 members each, 6 groups.

After the import, I reset everything back down to “normal”. The beauty of this migration was that the the client gave us a significant window of downtime. I cannot stress how much this actually reduces the complexity of the whole operation, and reduces the time needed to verify that everything works. Yay! for clients who provide realistic migration windows, and yay! for less complexity and effort (which equates to cost for all you business–y people) in migration windows.

I’m happy, and the client is happy, and I still have some cucumbers and broccoli left for breakfast.

P.S.: To add to the list of export/import deficiencies and 10.2.0.2 problems — you should note that function–based indexes need to be created manually.

Failovers with Oracle Dataguard

By Shakir November 3rd, 2006 at 3:27 pm
Posted in Group Blog Posts

In the last month, we have been hit with two clients’ large-scale failures. The first involved network issues; the second, disk failures. Fun fun.

We have one production and two standby environments. In the case of the first failure we went into our trusty dataguard config and performed what can only be described as a minor miracle.

First we connected to the standby that was to become the primary.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys/xxxxx@rcatstdby2
Connected.
DGMGRL> show configuration;

Configuration
Name: rcat
Enabled: YES
Protection Mode: MaxPerformance
Fast-Start Failover: DISABLED
Databases:
rcat_aussw10_db - Physical standby database
rcat_nflsw01_db - Primary database
rcat_aussw20_db - Physical standby database

Current status for "rcat":
^C

We had to control-c out because it hung: it could not connect to nflsw01 and we knew that at least dataguard thought there was a problem. Then we took the steps that separate the men from the boys.

(more…)

Using the Oracle Wait Interface to Troubleshoot I/O Issues

By Shakir October 30th, 2006 at 3:56 pm
Posted in Group Blog Posts

Or, Why I Like Broccoli and Wait Events

There’s been a fair bit of discussion with respect to whose vegetables are the freshest in Ottawa. From one DBA to another, I’m here to tell you that St. Laurent Fruit and Vegetable (beside the Mideast Meat Market) is probably the place to go for the best prices on… well… fruits and vegetables. They also carry lots of bags, and they’re strategically placed so you’ll never need old men to help you get more :-)

What’s more interesting about grocery shopping is that it lets you daydream and go over all the good work you’ve done over the last week (assuming you go weekly and a certain amount good work done). Last week was particularly tough. One of our more stable clients had some serious “mystery” production problems. To paraphrase, their app that was running “OK” starting running “sucky”. I knew that nothing had changed from our end and so being the intrepid super-sleuths that we are, we determined that the problems were related to I/O. How? Here’s how (forgive me, but I haven’t really gone into massive amounts of detail on this one, it’s more of a general framework).

Step 1

Run a handy query originally written by Steve Adams. I use this as my very first step to troubleshooting anything that resembles an Oracle performance issue, and work from there.

-----------------------------------------------------------
--
-- Script:      waiters.sql
-- Purpose:     to count the waiters for each event type
-- For:         8.0 and higher
--
-- Copyright:   (c) 2000 Ixora Pty Ltd
-- Author:      Steve Adams
--
-----------------------------------------------------------
select /*+ ordered */
  substr(n.name, 1, 29)  event,
  t0,
  t1,
  t2,
  t3,
  t4,
  t5,
  t6,
  t7,
  t8,
  t9
from
  sys.v_$event_name  n,
  (select event e0, count(*)  t0 from sys.v_$session_wait group by event),
  (select event e1, count(*)  t1 from sys.v_$session_wait group by event),
  (select event e2, count(*)  t2 from sys.v_$session_wait group by event),
  (select event e3, count(*)  t3 from sys.v_$session_wait group by event),
  (select event e4, count(*)  t4 from sys.v_$session_wait group by event),
  (select event e5, count(*)  t5 from sys.v_$session_wait group by event),
  (select event e6, count(*)  t6 from sys.v_$session_wait group by event),
  (select event e7, count(*)  t7 from sys.v_$session_wait group by event),
  (select event e8, count(*)  t8 from sys.v_$session_wait group by event),
  (select event e9, count(*)  t9 from sys.v_$session_wait group by event)
where
  n.name != 'Null event' and
  n.name != 'null event' and
  n.name != 'rdbms ipc message' and
  n.name != 'pipe get' and
  n.name != 'virtual circuit status' and
  n.name not like '%timer%' and
  n.name not like '%slave wait' and
  n.name not like 'SQL*Net message from %' and
  n.name not like 'io done' and
  n.name != 'queue messages' and
  e0 (+) = n.name and
  e1 (+) = n.name and
  e2 (+) = n.name and
  e3 (+) = n.name and
  e4 (+) = n.name and
  e5 (+) = n.name and
  e6 (+) = n.name and
  e7 (+) = n.name and
  e8 (+) = n.name and
  e9 (+) = n.name and
  nvl(t0, 0) + nvl(t1, 0) + nvl(t2, 0) + nvl(t3, 0) + nvl(t4, 0) +
  nvl(t5, 0) + nvl(t6, 0) + nvl(t7, 0) + nvl(t8, 0) + nvl(t9, 0) > 0
order by
  nvl(t0, 0) + nvl(t1, 0) + nvl(t2, 0) + nvl(t3, 0) + nvl(t4, 0) +
  nvl(t5, 0) + nvl(t6, 0) + nvl(t7, 0) + nvl(t8, 0) + nvl(t9, 0)
/

It’s really a nifty “group by” from v$session_wait. So, if you really want, you can go straight there. This query showed me the following:

SQL*Net more data from client 1 1 1 1 1 1 1 1 1 1
latch free   2 2 2 2 2 2 2 2 2 2
db file sequential read  2 2 2 2 2 2 2 2 2 2
db file scattered read  3 3 3 3 3 3 3 3 3 3
enqueue    8 8 8 8 8 8 8 8 8 8
buffer busy waits  15 15 15 15 15 15 15 15 15 15

Ok, so now I know we definitely had a problem with enqueues and buffer_busy_waits. In addition, when I ran this a few times, I also encountered significant latch free waits. So what’s the deal? Poking around the web, you’ll come up with a few standard answers for each of these. To summarize, here’s a great paper on resolving latch free waits.

Looking in v$session_wait told me the latch I was waiting on was the kghalo latch for the shared_pool. This was expected. Our client was known for not binding any SQL, and so we were parsing everything. This wasn’t unusual.

The P2 column in v$session_wait tells you the latch# it’s waiting for, which can be looked up in v$latchname. In addition, for some system stats, you can run this:

select * from v$latch_misses order by 5 desc;

Step 2

What was unusual, however, were the absolute number of enqueues and buffer_busy_waits. Linking the sid from v$session_wait to v$session and then to v$sqlarea, a trend emerged. All the buffer_busy_waits were against the same object (in memory). In addition, the enqueues were also on the same object. How did I know this? Well, the waiting queries were one indication, however, I also looked at v$session_wait to get the information I needed.

When you look at buffer_busy waits, the P1 column gives you the file_name and the P2 column gives you the block#. P3 gives you the reason you’re waiting. Run the following to get the object you’re waiting on.

select * from dba_extents where file_id = [[your_file] and [your_block] between block_id and block_id + blocks;

Typically, you’re hitting hot blocks. You can try to use some of the techniques discussed here.

Step 3

Troubleshooting enqueue uses a similar strategy. For enqueues, the P2 column gives you the object_id of the object, and from there you figure out what’s going on.

select * from dba_objects where object_id = [your_P2_object];

In 10G, the interface is a little more useful, giving you the actual type of enqueue and the reason. In 8173, we’re not so fortunate. Either way, we managed to isolate a number of bad running queries this way, and it was obvious that we were hitting an I/O bottleneck based on one table in the database. My recommendation to the client was to check the disks. After they assured us that everything was okay, I went ahead and looked for query plan flips, etc.

At the end of 3 days of query tuning, one of the clients’ sys admins finally discovered that a battery had failed on the disk controller, causing all write cacheing to no longer be available. I’m sure there must have been a way to see this, but I just don’t know how.

To summarize folks, always check your batteries. But if you don’t, and your DBA starts lifting the hood and poking around, they’re sure to find tons of other stuff that can be fixed and that came to light only when the hardware didn’t behave nicely. Kinda like a nice stress test, eh?

What my team and I discovered was that the object in question needed some serious love, since the client does plan on growing their operation. Because of all the queries we fixed, their app is that much faster. So in the long run, this was good for us. Just like broccoli.

I’m off to dinner.

Tuning 101, Part 1: Egads! Why is it So Slow?

By Shakir August 30th, 2006 at 11:06 am
Posted in Group Blog Posts

I’ve received the call more times than I can count. You know, the one that arrives ten minutes before you pack up for the trip home to see the kids, play ball, and generally not be at the office. The caller is normally your boss/administrator/guy in charge of all things not working. And when the caller reaches you, you’re the guy in charge of making it work again.

In my experience, the solution to most problems (the ones the caller refers to as “it’s running slow”) are not rooted in hardware, because hardware problems generally cause things to not run at all. It always baffles me when developers and architects prescribe hardware upgrades to make things run faster, because about 80% of the performance-related problems and subsequent solutions I’ve dealt with were resolved by tuning the application. In the hands of an experienced tuner, the cost of getting more out of the whole system does not compare with the cost of a hardware upgrade. It’s just apples and oranges… and my apples are much cheaper than their oranges.

This first installment will deal with the biggest contributor to bad application design — the dreaded bad query. I’ve actually had busy systems crash because a developer unleashed a query that performed a full tablescan too many.

This approach can be used both for troubleshooting in a crisis, or preferably, pro-actively to help you identify potential bottlenecks before they cause problems. It might not be the most efficient, and I can’t guarantee that the query is bug-free, but it works. If you find any problems, please let me know. I always appreciate efforts to make my work better.

You need to know what’s running on the system, and how “expensive” it is. That’s the only way to know what you should attack, and in my experience it’s the hardest part of this making-things-run-faster thing.

On Oracle systems we can identify the SQL that’s been running by querying the v$sqlarea/v$sqltext views. You can also do this with statspack using a level 6 or higher snap. These methods work superbly under ideal conditions. However, real-life is rarely ideal. We flush the shared_pool every now and then; some of the systems have a cold backup taken; some of them crash; and last, some of them obfuscate the SQL by not using bind_variables, making it difficult to isolate queries.

Running the following always works as a good starting point:
select executions, buffer_gets, sql_text from v$sqlarea;

74522 166474 INSERT INTO UNIQUE_OBJECT VALUES ( :b1,:b2,:b3,:b4,:b5,:b6,:b7  )
35263 71972 INSERT INTO UNIQUE_USER_ACCOUNT VALUES ( :b1,:b2,:b3  )
953 4105 INSERT INTO UQ_USER_SNAPSHOT VALUES ( SEQ_UQ_USER_SNAPSHOT.NEXTVAL,:b1,:b2,SYSDATE  )
953 16415 INSERT INTO UQ_USER_SNAPSHOT_ENTRY VALUES ( :b1,:b2,:b3  )
1 4 SELECT  MODIFIED_BY “modified_by”, MODIFY_DT “modify_dt” FROM UQ_ANSWER where (ID=12899182)
1 4 SELECT  MODIFIED_BY “modified_by”, MODIFY_DT “modify_dt” FROM UQ_ANSWER where (ID=12899184)
1 4 SELECT  MODIFIED_BY “modified_by”, MODIFY_DT “modify_dt” FROM UQ_ANSWER where (ID=12899192)
1 4 SELECT  MODIFIED_BY “modified_by”, MODIFY_DT “modify_dt” FROM UQ_ANSWER where (ID=12899194)

You’ll notice that some of the queries — the ones highlighted in bold — have literal values. These literal values get in the way of figuring out how expensive a query really is. For instance, I could try doing a group by summing the buffer_gets, but those pesky literals just get in the way. They prevent the underlying query from getting grouped properly because Oracle thinks they’re all different! Also, these buffer_gets and executions are relevant only since the time this shared_pool was last flushed or bounced, or if the query hadn’t been aged out. Are we really getting the information we need to make a good judgement?

I found this function from Tom Kyte’s webspace, and I haven’t touched it in any way. It takes unbound SQL and strips away the bind variables. Thank you Tom, it was a life saver.

create or replace function remove_constants( p_query in varchar2 ) return varchar2 as l_query long;
l_char varchar2(1);
l_in_quotes boolean default FALSE;
begin for i in 1 .. length( p_query ) loop l_char := substr(p_query,i,1);
if ( l_char = '''' and l_in_quotes ) then l_in_quotes := FALSE;
elsif ( l_char = '''' and NOT l_in_quotes ) then l_in_quotes := TRUE;
l_query := l_query || '''#';
end if;
if ( NOT l_in_quotes ) then l_query := l_query || l_char;
end if;
end loop;
l_query := translate( l_query, '0123456789', '@@@@@@@@@@' );
for i in 0 .. 8 loop l_query := replace( l_query, lpad('@',10-i,'@'), '@' );
l_query := replace( l_query, lpad(' ',10-i,' '), ' ' );
end loop;
return upper(l_query);
end;
/

The next section is where we get creative. Because we’re flushing the shared_pool, and crashing, and taking cold backups, we need to periodically keep track of the sqlarea. I do this by scheduling the following procedure that takes the shared_pool and simply stuffs it into a holding table. You can schedule it at whatever interval you want. Here’s the syntax for the underlying tables.

drop table bob_sqlarea;
create table bob_sqlarea tablespace INDX as select hash_value
     , sql_text
     , executions, buffer_gets
     , parsing_user_id from v$sqlarea;
alter table avail.bob_sqlarea add sql_text_wo_constants varchar2(4000);
alter table avail.bob_sqlarea add timestamp date;
create or replace procedure pop_bob_sqlarea as begin insert into bob_sqlarea select sysdate timestamp
     , hash_value, sql_text
     , executions , buffer_gets
     , parsing_user_id from v$sqlarea;
commit; end;
/  

exec sys.dbms_job.submit(:x,'pop_bob_sqlarea;',sysdate+1/24,'sysdate + 1/24');

Now that we’re catching the entire shared_pool, we stand a chance of getting meaningful cumulative stats for the DB. Here I have scheduled my job to run every hour.

When you’re ready to look at the data after a few minutes/hours/days/years of gathering it, you simply do this:

update bob_sqlarea set sql_text_wo_constants = remove_constants(sql_text); commit;

Now, run this query and look at the report. It’s ugly because it takes into account both the SQL getting flushed in/out and also the time of the DB bounce. You do lose some data on a bounce, but if you’re taking stats every x minutes, it’s better than what statspack gives you.

select sum(total_execs), sum(total_bg), case when sum(total_execs) > 0 then round(sum(total_bg)/sum(total_execs),0) else sum(total_bg) end ratio, sql_text
from (
select
username,
substr(sql_text_wo_constants,0,900) sql_text,
hash_value,
executions,
buffer_gets,
count(*) over (partition by hash_value) num_hashes,
executions - lag(executions,1,0) over (partition by hash_value order by timestamp) diff,
case
when count(*) over (partition by hash_value) = 1 then executions
when executions - lag(executions,1,0) over (partition by hash_value order by timestamp asc) = 0 then 0
when executions - lag(executions,1,0) over (partition by hash_value order by timestamp asc) > 0 then executions - lag(executions,1,0) over (partition by hash_value order by timestamp asc)
when count(*) over (partition by hash_value) = 2 then executions
when executions - lag(executions,1,0) over (partition by hash_value order by timestamp asc) < 0 then executions
else executions - lag(executions,1,0) over (partition by hash_value order by timestamp)
end total_execs,
case
when count(*) over (partition by hash_value) = 1 then buffer_gets
when buffer_gets - lag(buffer_gets,1,0) over (partition by hash_value order by timestamp asc) = 0 then 0
when buffer_gets - lag(buffer_gets,1,0) over (partition by hash_value order by timestamp asc) > 0 then buffer_gets - lag(buffer_gets,1,0) over (partition by hash_value order by timestamp asc)
when count(*) over (partition by hash_value) = 2 then buffer_gets
when buffer_gets - lag(buffer_gets,1,0) over (partition by hash_value order by timestamp asc) < 0 then buffer_gets
else buffer_gets - lag(buffer_gets,1,0) over (partition by hash_value order by timestamp)
end total_bg
from bob_sqlarea, dba_users
where bob_sqlarea.parsing_user_id = dba_users.user_id)
group by sql_text
order by case when sum(total_execs) > 0 then round(sum(total_bg)/sum(total_execs),0) else sum(total_bg) end desc;

With this query, you can see a truer picture of what SQL your database is actually running, despite the bounces, crashes, and general badness that the app is throwing at it. It should save you some time in identifiying heavy resource-hogging queries, because you have the queries grouped together despite having unbound literals, and the numbers will make sense even if the DB has been bounced.

This is what you’ll see (indented for readability):

3 12 4 SELECT MUMP_CREATE_DT "CREATEDT", MUMP_MODIFY_DT "MODIFYDT"
    , MUMP_BREAKFAST "MUMP_BREAKFAST", MUMP_DAY_NUMBER "MUMP_DAY_NUMBER"
    , MUMP_DINNER "MUMP_DINNER", MUMP_ID "MUMP_ID"
    , MUMP_LUNCH "MUMP_LUNCH" , MUMP_SNACK_@ "MUMP_SNACK_@"
    , MUMP_SNACK_@ "MUMP_SNACK_@" , MUMP_SNACK_@ "MUMP_SNACK_@"
    , MUMP_STATUS "MUMP_STATUS" , MUMP_USER_ID "MUMP_USER_ID"
    FROM MP_USER_MEAL_PLAN WHERE ((MUMP_ID=@))
11 44 4 SELECT MODIFY_DT "MODIFY_DT" FROM UT_DATAPOINT WHERE (ID=@)
139 558 4 SELECT MODIFIED_BY “MODIFIED_BY”
, MODIFY_DT “MODIFY_DT” FROM UQ_ANSWER WHERE (ID=@)
1916 6792 4 SELECT ID FROM UQ_USER_SNAPSHOT WHERE USER_ID = :B@ AND QUESTIONNAIRE_ID = :B@
1449 5850 4 SELECT ID FROM UQ_RESPONSE_PROFILE WHERE USER_ID = @ 

In case you’re wondering who the “bob” is for “bob_sqlarea”, feel free to shoot me an email :-)

In the second part of this series, I’ll be highlighting techniques that use the Oracle Wait interface to identify performance issues.