Good Database Design is Mightier than Hardware
Have you ever heard the one about throwing hardware at a software problem?
In one of my previous blog posts, I mentioned something along the lines of—well I’ll just cut and paste . . .
In my experience, the solution to most problems (the ones the caller refers to as “it’s running slow”) are not rooted in hardware, because hardware problems generally cause things to not run at all. It always baffles me when developers and architects prescribe hardware upgrades to make things run faster, because about 80% of the performance-related problems and subsequent solutions I’ve dealt with were resolved by tuning the application.
Well yes, I know you can buy new hardware, and it’s easier. But when it comes to hardware, how many of you have a ten-node RAC cluster running Enterprise Edition with 8GB of RAM on each node, running off a massive SAN?
I’ve been on so many systems that have been running for years–poorly–the way they were, and in a week we can take them apart and have them running without a hitch. We’ve even managed to fix problems that turned out to be the business case to go from RAC back down to a single instance. How much did those customers save on licensing costs?
Back to the example at hand. I have this nifty RAC system that supports some very public and very mission-critical apps, and one day (it was Sunday night) it starts choking. We’re getting enqueues. Slowly they start climbing. Ten nodes came to a crashing halt. I have now seen a ten-node RAC cluster come to crashing halt and completely lock up.
Why, you ask?
A simple SQL statement: DELETE FROM a WHERE b=c AND d=e;.
That’s it. It was going off indexes and in itself was rather boring. Yes, indeed. So why was this statement the one waiting on all the enqueues?
This table was at the top of a four-levels-deep parent-child relationship, and all the foreign keys were un-indexed. For every delete, Oracle was forced to do four full tablescans on parent tables to make sure no child records were around to be orphaned. This oft-overlooked foundation of data-modeling was overlooked on this app, causing some serious downtime.
I first learned of this issue (well, not really an issue but a design oversight) when reading something by Steve Adams over at Ixora back in the day. (If anyone has a link to this or any article illustrating this, I’d appreciate it.) Back to basics, folks! We know Oracle is good, but that doesn’t mean we can forget the core of application design.
The point to take away from this is that we had a massive amount of hardware crippled by a design flaw that was easily avoidable. At this point, the work that I and a few of the other professionals at Pythian have done has reduced the overall load on this cluster to justify permanently reducing it. This is on an app that peaked at over 3000 queries per second prior to anything being done on it. It’s still doing the same amount of work, although I think we can count on this and a number of other changes that we’ve made at the design level, to drop it down from ten nodes (which is obviously overkill for most apps) to a more reasonable two.
There is a time and place for hardware upgrades. The problem is that most people find hardware easier and just aren’t aware of how expensive their app logic really is. I find tuning easier and more cost effective.
Adieu.















March 7th, 2008 at 10:09 am
Hi Shakir
The article and query at ixora’s site is here - the year was 2000. We miss you Steve. :-)
Paul
March 7th, 2008 at 2:11 pm
I’m running a 250GB sized PG DB on a celeron 1.7G box and 768MB of ram which never really sees it’s load average hit < 3 and due to the design of it, it’s handling the load pretty well and users are happier with the response compared to the company’s own OLTP styled DB on mssql
right tools for the job.
March 8th, 2008 at 6:55 am
Other references: Tom Kyte and Connor McDonald (on Jonathan Lewis’s site.
As well as full table scans, there’s a locking - sometimes deadlocking - issue (which full table scans are more likely to reveal, of course). See OraStory for one anecdote. The FK locking behaviour has changed over time (sorry, can’t find a good reference or version number quickly).
March 11th, 2008 at 3:41 pm
[…] point you at an excellent blog post by Shakir Sadikali at the Pythian Group which shows off a ten-node RAC cluster brought to its knees […]
March 14th, 2008 at 11:56 am
[…] SQL is 5 LIOs per row per row source.” He cites a post by Pythian’s Shakir Sadikali: Good Database Design is Mightier than Hardware. (Incidentally, Shakir will be handling next week’s […]
April 30th, 2008 at 11:28 pm
Big hardware breeds lazy programmers…