Recently we quickly and efficiently resolved a major performance issue with one of our New York clients. In this blog, I will discuss about this performance issue and its solution.
Problem statement
The client’s central database was intermittently freezing because of high CPU usage, and their business severely affected. They had already worked with vendor support and the problem was still unresolved.
Symptoms
Intermittent High Kernel mode CPU usage was the symptom. The server hardware was 4 dual-core CPUs, hyperthreading enabled, with 20GB of RAM, running a Red Hat Linux OS with a 2.6 kernel.
During this database freeze, all CPUs were using kernel mode and the database was almost unusable. Even log-ins and simple SQL such as SELECT * from DUAL; took a few seconds to complete. A review of the AWR report did not help much, as expected, since the problem was outside the database.
Analyzing the situation, collecting system activity reporter (sar) data, we could see that at 08:32 and then at 8:40, CPU usage in kernel mode was almost at 70%. It is also interesting to note that, SADC (sar data collection) also suffered from this CPU spike, since SAR collection at 8:30 completed two minutes later at 8:32, as shown below.
I’ve found lately that munching on carrots with French dressing is more satisfying than broccoli. Maybe it’s the tang-and-crunch combination. In any case, I was crunching away yesterday while thinking about how to answer a question one of our newer start-up clients asked me.
No one has ever come out and formally asked me for a document that states “Best Practices to Scale Application X”. It is an unusual demand, since it’s something many of us at Pythian have implemented, but it’s been more of an ad hoc, iterative process — and rightly so, since architectures must be so organic, and so tailored to the application. What’s more, no one has ever brought us on board so early in the game that we have a hand in actually — gasp! — doing the design and data-model from the get-go. Woo hoo!
Now, a little background. I have built and maintained a few systems. Some of them even supported over 100k concurrent users. These databases didn’t run RAC either (although I do support two very high profile RAC environments now). So, having been in the trenches and knowing what it takes to make a DB move, I got to thinking about some of the basic fundamentals. There are always rules of thumb, right? This is what you need to know to start with building a scalable high-performance system based on stuff that I’ve seen. Obviously, this assumes a database-centric app. Let’s start with the first ten principles.
I gave this talk at the UKOUG, and I have received a few requests to post the slides online. Instead of just posting the PowerPoint I took some time to give the presentation again (internally here at Pythian) and this time we recorded the session and we are posting it here in a variety of formats. This is a bit of a departure from the typical Pythian Goodies, in that it is scripted, and there is a lot of content here in the whitepaper, but there hasn’t been a Goodie in a while so why not!
I’d love to hear from you, so please feel free to ask any follow-up questions to this post in the comments.
Abstract
Do I have enough memory? Why is my free memory so low? Am I swapping to disk? Can I increase my SGA (db cache) size? Can I add another instance to this server? Are my system resources used optimally? These are all questions that often haunt DBAs. This presentation is The Answer. It covers in detail the different types of memory, how to monitor memory, and how to optimally use it with Oracle. Multiple examples in the presentation demonstrate how certain actions on the database side cause different memory areas to be allocated and used on the OS side. Key underlying differences in operating systems approaches to managing memory will be highlighted, with special attention given to Linux, Solaris, and Windows. Using Linux as an example throughout, this presentation explains how to effectively use tools such as “top”, “vmstat” and “/proc/meminfo” to look into into a system’s allocation and use of memory.
Below you should see a flash video with me giving the session.
And below you will find the complete contents of the whitepaper. This is intended to be a good overall reference resource for how memory works in Oracle, using Linux as an example.