Basic IO Monitoring on Linux
This is my fourth week at Pythian and in Canada and I’m starting to get back to my normal life cycle – my personal things are getting sorted and my working environment is set. Here at Pythian I’m in a team of four people together with Christo, Joe, and Virgil. (I should write another post about beginning at Pythian – will do one day.)
Yesterday, I asked Christo to show me how he monitors IO on Linux. I needed to collect statistics on a large Oracle table on a production box, and wanted to keep an eye on the impact. So we grabbed Joe as well, and all three of us sat around my PC. While we were talking, Paul came by and showed some interest in the topic – otherwise, why would all three of us be involved? Anyway, Dave and Paul thought that this would be a nice case for a blog post. So here we are…
Indeed, while the technique we discuss here is basic, it gives a good overview and is very easy to use. So let’s get focused… We will use the iostat utility. In case you need to know where to find more about it: right, the man pages.
So we will use the following form of the command:
iostat -x [-d] <interval>
- -x option displays extended statistics. You definitely want it.
- -d is optional. It removes CPU utilization to avoid cluttering the output. If you leave it out, you will get the following couple of lines in addition:
avg-cpu:  %user   %nice    %sys %iowait   %idle
           6.79    0.00    3.79   16.97   72.46
- <interval> is the number of seconds iostat waits between each report. Without a specified interval, iostat displays statistics since the system came up and then exits, which is not useful in our case. Specifying the number of seconds causes iostat to print periodic reports where IO statistics are averaged over the time since the previous report. For example, specifying 5 makes iostat dump 5 seconds of average IO characteristics every 5 seconds until it’s stopped.
If you have many devices and you want to watch only some of them, you can also specify device names on the command line:
iostat -x -d sda 5
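If you want to launch the same command from a script, a minimal Python sketch might look like this. The helper names build_iostat_command and run_iostat are made up for this illustration. The flags are the ones described above, plus the optional trailing report count that iostat accepts after the interval (not used elsewhere in this post).

```python
import subprocess


def build_iostat_command(interval, devices=(), show_cpu=False, count=None):
    """Build the argument list for `iostat -x [-d] [devices] <interval> [count]`."""
    if interval <= 0:
        raise ValueError("interval must be a positive number of seconds")
    cmd = ["iostat", "-x"]
    if not show_cpu:
        cmd.append("-d")  # leave out the avg-cpu lines
    cmd.extend(devices)
    cmd.append(str(interval))
    if count is not None:
        cmd.append(str(count))  # stop after this many reports
    return cmd


def run_iostat(interval, devices=(), count=1):
    """Run iostat and return its text output (assumes sysstat is installed)."""
    cmd = build_iostat_command(interval, devices, count=count)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_iostat() on a box that has iostat.
    print(build_iostat_command(5, ["sda"]))
```

Building the argument list separately from running it keeps the sketch easy to check on a machine without iostat.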
Now let’s get to the most interesting part – what those cryptic extended statistics are. (For readability, I formatted the report below so that the last two lines are in fact a continuation of the first two.)
Device:  rrqm/s  wrqm/s    r/s   w/s  rsec/s  wsec/s  rkB/s
sda        0.00   12.57  10.18  9.78  134.13  178.84  67.07
         wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
         89.42     15.68      0.28  14.16   8.88  17.72
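To work with these numbers in a script, one option is to pair each column name with its value. The sketch below assumes the older sysstat column layout shown in this report (newer versions print different columns), and the function name parse_extended_report is hypothetical.

```python
def parse_extended_report(header_line, value_line):
    """Pair `iostat -x` column names with the values for one device."""
    names = header_line.split()
    values = value_line.split()
    if not names or names[0] != "Device:" or len(names) != len(values):
        raise ValueError("header and value lines do not match")
    # The first field is the device name; the rest are numeric statistics.
    stats = {name: float(value) for name, value in zip(names[1:], values[1:])}
    return values[0], stats


# The report from this post, joined back into one header and one value line.
HEADER = ("Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s "
          "wkB/s avgrq-sz avgqu-sz await svctm %util")
SAMPLE = ("sda 0.00 12.57 10.18 9.78 134.13 178.84 67.07 "
          "89.42 15.68 0.28 14.16 8.88 17.72")

device, stats = parse_extended_report(HEADER, SAMPLE)
print(device, stats["await"], stats["svctm"], stats["%util"])
```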
- r/s and w/s – respectively, the number of read and write requests per second issued by processes to the OS for a device.
- rsec/s and wsec/s – sectors read/written per second (each sector is 512 bytes).
- rkB/s and wkB/s – kilobytes read/written per second.
- avgrq-sz – average sectors per request (for both reads and writes). Do the math – (rsec + wsec) / (r + w) = (134.13 + 178.84) / (10.18 + 9.78) = 15.6798597. If you want it in kilobytes, divide by 2. If you want it separately for reads and writes, do your own math using rkB/s and wkB/s.
- avgqu-sz – average queue length for this device.
- await – average response time (ms) of IO requests to a device. The name is a bit confusing, as this is the total response time, including the wait time in the request queue (let’s call it qutim) and the service time the device spent servicing the requests (see the next column, svctm). So the formula is await = qutim + svctm.
- svctm – average time (ms) a device was servicing requests. This is a component of the total response time of IO requests.
- %util – this is a pretty confusing value. The man page defines it as “Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.” A bit difficult to digest. Perhaps it’s better to think of it as the percentage of time the device was servicing requests as opposed to being idle. To understand it better, here is the formula:
utilization = ( (read requests + write requests) * service time in ms / 1000 ms ) * 100%
or
%util = ( r + w ) * svctm / 10 = ( 10.18 + 9.78 ) * 8.88 / 10 = 17.72448
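The arithmetic for the columns above is easy to check in a few lines of Python. This is only a sketch of the formulas as given in this post. The function name derived_metrics and the dictionary keys are made up for the illustration, and qutim is the post's own shorthand rather than a real iostat column.

```python
def derived_metrics(r_s, w_s, rsec_s, wsec_s, await_ms, svctm_ms):
    """Recompute avgrq-sz, qutim and %util from the raw iostat columns."""
    requests = r_s + w_s
    if requests == 0:
        raise ValueError("no requests in this interval")
    avgrq_sectors = (rsec_s + wsec_s) / requests
    return {
        "avgrq_sz_sectors": avgrq_sectors,
        "avgrq_sz_kb": avgrq_sectors / 2,    # 512-byte sectors, so 2 per KB
        "qutim_ms": await_ms - svctm_ms,     # await = qutim + svctm
        "util_pct": requests * svctm_ms / 10,
    }


# Values for sda taken from the report in this post.
metrics = derived_metrics(10.18, 9.78, 134.13, 178.84, 14.16, 8.88)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With the sda figures this reproduces the 15.68 and 17.72 that iostat printed, and shows that about 5.28 ms of the 14.16 ms response time was spent in the queue.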
Traditionally, it’s common to assume that the closer to 100% utilization a device is, the more saturated it is. This might be true when the system device corresponds to a single physical disk. However, with devices representing a LUN of a modern storage box, the story might be completely different.
Rather than looking at device utilization, there is another way to estimate how loaded a device is. Look at the non-existent column I mentioned above, qutim: the average time a request spends in the queue. If it’s insignificant compared to svctm, the IO device is not saturated. When it becomes comparable to svctm and goes above it, requests are queued longer, and a major part of the response time is actually time spent waiting in the queue.
The figure in the await column should be as close to that in the svctm column as possible. If await goes much above svctm, watch out! The IO device is probably overloaded.
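As a rough sketch, this rule of thumb can be turned into a small check. The function names and the default ratio of 1.0 (queue time at least equal to service time) are assumptions made for this illustration only; pick a threshold that suits your own storage.

```python
def queue_share(await_ms, svctm_ms):
    """Fraction of the average response time spent waiting in the queue."""
    if await_ms <= 0:
        return 0.0
    qutim_ms = max(await_ms - svctm_ms, 0.0)
    return qutim_ms / await_ms


def looks_overloaded(await_ms, svctm_ms, ratio=1.0):
    """True when queue time reaches `ratio` times the service time.

    The default ratio is a hypothetical threshold, not a documented rule.
    """
    qutim_ms = await_ms - svctm_ms
    return qutim_ms >= ratio * svctm_ms


# sda from the report in this post: await 14.16 ms, svctm 8.88 ms.
print(f"queue share: {queue_share(14.16, 8.88):.0%}")
print("overloaded:", looks_overloaded(14.16, 8.88))
```

For the sda sample, a bit over a third of the response time is queue time, which stays under the threshold assumed here.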
There is much to say about IO monitoring and interpreting results. Perhaps this is only the first of a series of posts about IO statistics. At Pythian we often come across different environments with specific characteristics and various requirements that our clients have. So stay tuned – more to come.
Update 12-Feb-2007: You might also find the Oracle Disk IO Basics session of Pythian Goodies useful.
Sep 18, 2006
Category: Group Blog Posts
Hi Alex,
I would be interested in hearing more about your experience at Pythian. I heard it is a great place to work at.
Cheers
Hi Tobias,
It is a great place to work indeed. I plan to post a bit on this topic soon. Stay tuned! ;-)
Cheers,
Alex
Keep up the good work, Alex.
If anyone wants to load their iostat data into Oracle, there’s a script to massage it into sqlldr format at http://preferisco.blogspot.com/2006/09/loading-iostat-output-into-oracle.html.
Regards Nigel
Great article.
[...] Jeremy Cole shows how to get a visual take on MySQL and I/O statistics on Linux. (Something Pythian’s Alex Gorbachev looked at in an older article on basic IO Monitoring on Linux). [...]
[...] sar, sysstat. I made serious progress last week, when Dushyanth from my team shared this post on IO Monitoring on Linux, by the folks over at Pythian, on our internal mailing list. Here are my notes on the [...]
I loved this post so much, it prompted me to write one on my blog, using reference material from here and performing some analysis of my own. I have been gazing at iostat outputs for the past few days and I am a little confused about the explanation you have given above. For instance, here is an iostat output from one of my servers -
iostat -dkx 10
Device:  rrqm/s  wrqm/s   r/s    w/s  rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
dm-0       0.00    0.00  8.79  46.95  71.13  187.81      9.29      0.34   6.18   0.85   4.77
Now my confusion is that there were only about 55 IO requests per second issued to the disk, and clearly the disk is not at all utilized. Despite that, how come the await time is so much higher than svctm? Technically, none of the IO requests should have had to wait during that time, since each IO request was taking only 0.85 ms to process.
While you may put this down to requests issued prior to this 10-second interval etc., I have seen this type of stats in my continuous monitoring, and it does not seem to make sense, i.e. a very high await time in comparison to svctm even when the number of requests is low and the disk is not utilized.
Bhavin,
Don’t forget that these are *averaged* results. It’s very easy to draw wrong conclusions from averaged data. One example is that your requests are coming as a bunch at once and clearly wait in the queue.
What is the IO response time from the applications (database?) telling you?
You might try to reduce the period to one second and see if you see the spikes.
Cheers,
Alex
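The point about averages can be illustrated with a toy model. The sketch below assumes a single device serving requests one at a time in arrival order, with a whole burst arriving at the same instant. Both the model and the name simulate_burst are made up for this illustration and are not taken from real iostat data.

```python
def simulate_burst(n_requests, service_ms, interval_ms=1000.0):
    """All requests arrive at t=0 and are served one by one (FIFO).

    Returns (average response time, service time, utilization %) as an
    iostat-like average over the reporting interval.
    """
    if n_requests <= 0:
        raise ValueError("need at least one request")
    # Request i waits for the i requests ahead of it, then is serviced.
    response_times = [(i + 1) * service_ms for i in range(n_requests)]
    await_ms = sum(response_times) / n_requests
    util_pct = n_requests * service_ms / interval_ms * 100
    return await_ms, service_ms, util_pct


await_ms, svctm_ms, util_pct = simulate_burst(10, 1.0)
print(f"await={await_ms:.1f} ms, svctm={svctm_ms:.1f} ms, util={util_pct:.1f}%")
# prints "await=5.5 ms, svctm=1.0 ms, util=1.0%"
```

Ten requests of 1 ms each keep the device busy for only 1% of a one-second interval, yet the average response time is several times the service time because the requests queue behind each other.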
[...] iostat is a popular tool amongst the database crowd, so not surprisingly you’ll find a lot of great discussions documenting the use. Depending on your application you will need to focus on different metrics, but [...]