Determining I/O Throughput for a System

Jul 29, 2010 / By Sheeri Cabral

Tags: , ,

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. In that session, there was a section on how to determine I/O throughput for a system, because in data warehousing I/O per second (iops) is less important than I/O throughput (how much actual data goes through, not just how many reads/writes).

The section contained an Oracle-specific in-database tool, and a standalone tool that can be used on many operating systems, regardless of whether or not a database exists:

If Oracle is installed, run DBMS_RESOURCE_MANAGER.CALIBRATE_IO:

SET SERVEROUTPUT ON
DECLARE
lat INTEGER;
iops INTEGER;
mbps INTEGER;
BEGIN
-- DBMS_RESOURCE_MANAGER.CALIBRATE_IO(<DISKS>, <MAX_LATENCY>,iops,mbps,lat);
DBMS_RESOURCE_MANAGER.CALIBRATE_IO (2, 10, iops, mbps, lat);
dbms_output.put_line('max_mbps = ' || mbps);
end;

For us MySQL folks, or even the Drizzle or NoSQL folks, Oracle offers a free standalone tool called Orion. The example given in the slides was:

./orion –run advanced –testname mytest –num_small 0 –size_large 1024 –type rand –simulate contact –write 0 –duration 60 –matrix column

-num_small is 0 because you don’t usually do small transactions in a dw.
-type rand for random I/O’s because data warehouse queries usually don’t do sequential reads
-write 0 – no writes, because you do not write often to the dw, that is what the ETL is for.
-duration is in seconds
-matrix column shows you how much you can sustain

I would be interested to see how other folks measure I/O throughput, and maybe even do a side-by-side comparison of different tools. Orion is available for:

Linux (x86, x86-64, Itanium, Power)
Solaris (SPARC64)
AIX (PPC64)
zLinux
HPUX (PA RISC, Itanium)
Windows

I am working on a larger write-up of the session itself, which had many concise descriptions of data warehousing issues, but I thought that this merited its own post.

8 Responses to “Determining I/O Throughput for a System”

  • Do you any objective evidence that a MySQL based database warehouse benefits from throughput more than IOPS?

    Because of nested loops, a MySQL DW will often have to perform many more random operations than a similar Oracle DB on the same schema/data. This is because Oracle can switch to a hash join, and sequentially scan large chunks of data. MySQL can not do this.

  • Justin — thank you! I have a longer post I’m in the middle of on more stuff in that presentation where I take some of the points raised and compare Oracle to MySQL — this was harder to ferret out.

    Can MySQL use the sequential scanning more if it’s looking at MyISAM data? After all, MyISAM is sequential by nature; and the underlying [default] BTREE structure of data and indexes lends well to sequential scans….

    Also, how does something like page size come into play? It’s not really changeable with MyISAM (unless you change the OS page size) but with InnoDB you can change the page size….

  • Also, Justin, I’d be interested to think if, in general, you think that I/O throughput is not useful for MySQL, that iops is always better. Is there any case in which I/O throughput is a better metric than iops?

  • Sheeri,

    It depends on the schema and the workload. For example, you might have a table such as the ‘Ontime Flightstats’ data, which is really a star schema but it is flattened into a single table.

    This type of schema might benefit from throughput more than IOPS, particularly if covering indexes can be scanned sequentially. In fact, both MyISAM and InnoDB can take advantage of covering indexes to approximate the performance of a column store when scanning only the columns in the index. If MySQL has to get the whole row from the table, there is likely going to be a random element (since keys are not in natural table order in most cases) and then throughput can become less important than IOPS.

    For tables with joins, if you use FORCE INDEX to force a scan of the fact table, and the dimension tables fit into memory, then throughput may be more important than IOPS. This is helped by a split LRU, since hopefully the scan of the fact table doesn’t evict the dimensions.

    Oracle also supports parallel query for partitioned tables. This makes a tremendous difference in overall performance. If you have lots of throughput, Oracle can sequentially scan more than one partition in parallel, leading to serious performance improvements. My Google code project ‘Shard-Query’ shows that MySQL could be significantly faster if it supported this rather simple optimization.

  • Justin, thanx! I think you’ll have a lot to say about the blog post I just published too — http://www.pythian.com/news/15157

    I look forward to your comments there. I have tried in that post to specifically say what optimizations are Oracle-specific.

  • Greg Rahn says:

    Another IO tool that may be useful is Iometer.

  • Oracle Clinic says:

    IO Issues in Data Warehouse Environment…

    Client IO Performance Review Description of the physical hardware configuration including CPU, Memory, Disk, HBA,Storage Type, etc? Interview technical contacts to get a description of the problems. Measure IO using OS and Oracle tools available? Use R…

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>