Looking for Missing Memory Using R
Mar 20, 2012 / By Gwen Shapira
The best thing about HotSOS is the participants. It seems like everyone I meet there is very smart and very educated. This can be intimidating for presenters, but it also makes for good hallway conversations. I had several conversations in HotSOS where I expressed my delight in using R to explore data and assist me in troubleshooting. It sounded like many people are interested in a quick introduction to R and in learning how to use R to visually explore data. My colleague Marc Fielding forwarded me an interesting issue he encountered on a customer site and suggested that this may be a good case for visualization.
The forwarded case didn’t have many details. Its a 9i system that started experiencing ORA-4031 errors sporadically over the last few weeks. Marc suggested using R to explore how the various SGA components changed sizes over the last few weeks and see whether we can find any interesting trends that can indicate memory leaks. So we did just that.
The first step is to get the actual data out of the system, and we used the following query:
select name, ss.snap_time,sga.bytes from perfstat.stats$snapshot ss, perfstat.stats$sgastat sga where ss.snap_id = sga.snap_id and ss.snap_time > sysdate-90
We didn’t have 90 days of snapshots in that system, only 10, so we’ll have to work with that.
The next problem is the format of the data. The results of the query looked similar to:
"NAME","SNAP_TIME","BYTES" "free memory",12-03-09 00:00:00,645935368 "KGH: NO ACCESS",12-03-09 00:00:00,325214880 "db_block_hash_buckets",12-03-09 00:00:00,186650624 "free memory",12-03-09 00:00:00,134211304 "shared_io_pool",12-03-09 00:00:00,536870912 "log_buffer",12-03-09 00:00:00,16924672 "buffer_cache",12-03-09 00:00:00,21676163072 "fixed_sga",12-03-09 00:00:00,2238472 "JOXLE",12-03-10 04:00:01,27349056 "free memory",12-03-10 04:00:01,105800192 "free memory",12-03-10 04:00:01,192741376 "PX msg pool",12-03-10 04:00:01,8192000
I prefer to work with a row for each snapshot and the measurements in separate columns. I could use SQL “pivot” to transform the data set, but pivot requires a list of all columns involved and with 14 columns and rather long names and I preferred not to type that query.
Instead, I used R to transform the data. I used “SQL Developer” to save the results of the query to CSV and loaded the CSV into an R data frame:
The easiest way I found to pivot data in R is by using the “reshape” library. I didn’t have “reshape” installed on my laptop, but installing new libraries in R is easy:
while I’m here, lets also install a library for date/time manipulation, we’ll need it soon:
Now we just need to load the library and pivot our data:
library(reshape) max_shared_pool<-cast(shared_pool,SNAP_TIME ~ NAME,max)
This will keep SNAP_TIME as the rows, turn individual stat NAME into columns and if a row/column combination has more than one value, we’ll keep the highest.
Now its time to explore our new data frame, max_shared_pool. Since we have 14 different variables to explore, I want to be able to figure out which pools are most interesting. In this context, most interesting means having the most variability. If the buffer cache size never changed during those two weeks, there will be no need to explore this variable in detail.
To explore the variability of the data, I decided to use a box plot to show the spread of the data in each field. A boxplot is a compact and convenient way to compare distributions. The box depicts the data between 25 to 75 percentile, a line in the middle of the box indicates the median. Whiskers are drawn to the highest and lowest non-outlier values, and small circles indicate the outliers. The interesting variables will have larger boxes, indicating more variability of the data, or more outliers.
Since I’m not looking at the data over time, there is no need to include the SNAP_TIME column, so we only look at columns 2-14:
This plot has two major problems:
- The first variable, buffer cache, is so much larger from every other pool (as it should be in a properly configured database) that it distorts the chart and we can’t see any interesting details about any other field.
- The X-axis numbers are not very user friendly.
Lets give this another try:
boxplot((max_shared_pool[,3:14])/1024/1024,xlab="Size in MBytes",horizontal=TRUE,las=1,par(mar=c(4,6,2,1)))
Now we clearly see that the interesting pools are the SQLArea, KGHLD, KGHL0 and obviously the free memory. The boxes for those variables are larger and many outliers are shown.
Now I’d like plot those over time and see where the outliers are. The first step in plotting a timeline is preparing an chron object with the times we want to draw. This turned out to be more difficult than I expected. First, I created a table with the dates in SNAP_TIME split into date and time, because chron expects them separately:
dtparts = t(as.data.frame(strsplit(as.character(max_shared_pool$SNAP_TIME),' '))).
Then I used the new table to generate the chron object:
dtchron = chron(dates=dtparts[,1],times=dtparts[,2],format=c('y-m-d','h:m:s'))
Now that the dates are ready, lets plot the timelines of SQLArea, KGHLD, KGHL0 and free memory:
plot(x=dtchron,y=max_shared_pool[,14]/1024/1024,ylim=c(0,1200),type='l') points(x=dtchron,y=max_shared_pool[,10]/1024/1024,type='l',col='green') points(x=dtchron,y=max_shared_pool[,9]/1024/1024,type='l',col='blue') points(x=dtchron,y=max_shared_pool[,5]/1024/1024,type='l',col='red')
Two interesting things stand out from the chart:
- The size of KGHL0 and free memory mirror each other, showing KGHL0 flunctuations to have the largest impact on free SGA memory in this system.
- More important, there are two significant drops in size of the SQL Area. With corresponding increase in free memory. Other than that the sizes of all pools are relatively stable. SQL Area size never dropped below 1G other than those two events.
I wanted to look more closely at those two events, so I selected only the data that matches the times where SQLA size dropped below 1G:
subset(max_shared_pool,SQLA/1024/1024<1000) SNAP_TIME buffer_cache db_block_hash_buckets fixed_sga 63 12-03-08 04:00:03 21676163072 186650624 2238472 64 12-03-08 05:00:01 21676163072 186650624 2238472 87 12-03-09 04:00:02 21676163072 186650624 2238472 88 12-03-09 05:00:04 21676163072 186650624 2238472 free.memory JOXLE KGH..NO.ACCESS KGLDA KGLH0 KGLHD 63 1003574152 27349056 325214880 -Inf 590769624 93064312 64 707855048 27349056 325214880 -Inf 617980608 97541736 87 1056780816 27349056 325214880 -Inf 588999216 94175784 88 701223888 27349056 325214880 -Inf 630100520 101227720 log_buffer PX.msg.pool shared_io_pool SQLA 63 16924672 8192000 536870912 716303928 64 16924672 8192000 536870912 954248096 87 16924672 8192000 536870912 660155416 88 16924672 8192000 536870912 94151836
The drop in SQLA size occurred at the exact same times on two consecutive days, and there was no corresponding increase in any other memory area to explain what happened. It is likely that this is related to a job that runs at these times.
This is just a small example, but I hope you got a taste of how R can be used as a troubleshooting tool to explore the available data and gain insights that are difficult to find when you just use SQL. What I didn’t demonstrate in this example is how to use statistical functions such as regression and smoothing to find even more meaning in data, and how to create more impressive visualizations for presentations. I’ll save those for future blog posts.