<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Pythian Blog &#187; Christo Kutrovsky</title>
	<atom:link href="http://www.pythian.com/news/author/kutrovsky/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.pythian.com/news</link>
	<description>News and views from Pythian DBAs</description>
	<lastBuildDate>Mon, 15 Mar 2010 21:40:17 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>RAC+ASM 3 years in production. Stories to share (slides from RMOUG10)</title>
		<link>http://www.pythian.com/news/9055/oracle-rac-asm-3-years-in-production-stories-to-share-slides-from-rmoug10/</link>
		<comments>http://www.pythian.com/news/9055/oracle-rac-asm-3-years-in-production-stories-to-share-slides-from-rmoug10/#comments</comments>
		<pubDate>Tue, 02 Mar 2010 03:57:24 +0000</pubDate>
		<dc:creator>Christo Kutrovsky</dc:creator>
				<category><![CDATA[Group Blog Posts]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Pythian Appearances]]></category>
		<category><![CDATA[Technical Blog]]></category>
		<category><![CDATA[ASM]]></category>
		<category><![CDATA[RAC]]></category>

		<guid isPermaLink="false">http://www.pythian.com/news/?p=9055</guid>
		<description><![CDATA[Here are the slides from my presentation at RMOUG 2010. 
I am not sure how much sense all this will make without my comments. We may do it in a webinar if there is sufficient interest. Regardless I will probably be doing it again at some point in the future.
RAC+ASM: Stories to Share
View more presentations [...]]]></description>
			<content:encoded><![CDATA[<p>Here are the slides from my presentation at RMOUG 2010. </p>
<p>I am not sure how much sense all this will make without my comments. We may do it in a webinar if there is sufficient interest. Regardless I will probably be doing it again at some point in the future.</p>
<div style="width:425px" id="__ss_3311981"><strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/kutrovsky/racasm-stories-to-share" title="RAC+ASM: Stories to Share">RAC+ASM: Stories to Share</a></strong><object width="425" height="355"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=racasm-100301213454-phpapp02&#038;stripped_title=racasm-stories-to-share" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=racasm-100301213454-phpapp02&#038;stripped_title=racasm-stories-to-share" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object>
<div style="padding:5px 0 12px">View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/kutrovsky">kutrovsky</a>.</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.pythian.com/news/9055/oracle-rac-asm-3-years-in-production-stories-to-share-slides-from-rmoug10/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Oracle Parallel Query Sorting and Index Creation Performance Problems</title>
		<link>http://www.pythian.com/news/5379/oracle-parallel-query-sorting-performance-problems/</link>
		<comments>http://www.pythian.com/news/5379/oracle-parallel-query-sorting-performance-problems/#comments</comments>
		<pubDate>Mon, 16 Nov 2009 19:34:10 +0000</pubDate>
		<dc:creator>Christo Kutrovsky</dc:creator>
				<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Technical Blog]]></category>
		<category><![CDATA[11.2]]></category>
		<category><![CDATA[order by]]></category>
		<category><![CDATA[PQ sort]]></category>

		<guid isPermaLink="false">http://www.pythian.com/news/?p=5379</guid>
		<description><![CDATA[Ever wondered why recreating certain indexes takes forever, even when you do so in parallel? Ever wondered why certain PQ queries just don&#8217;t run that fast?
Here&#8217;s a serious performance bug that&#8217;s been in Oracle for a while, and finally there are hints of it being fixed, but only partially.
The bug happens when performing sorting operations [...]]]></description>
			<content:encoded><![CDATA[<p>Ever wondered why recreating certain indexes takes forever, even when you do so in parallel? Ever wondered why certain PQ queries just don&#8217;t run that fast?</p>
<p>Here&#8217;s a serious performance bug that&#8217;s been in Oracle for a while, and finally there are hints of it being fixed, but only partially.</p>
<p>The bug happens when performing sorting operations in parallel on source data that is already well sorted. The &#8220;ranger&#8221; doesn&#8217;t do a good job of assigning row ranges to sorter processes, and ~90% of the rows end up being sent to the same parallel process, regardless of the degree of parallelism. So even if you have 256 CPUs, running the query in parallel achieves only about a 10% performance improvement, instead of a speed-up proportional to your degree of parallelism.</p>
<p>For example, if the non-parallel sort/index creation took 45 minutes, running with parallel 32 will take 41 minutes, instead of the possible 1.4 minutes (assuming you have sufficient horsepower).</p>
<p><span id="more-5379"></span></p>
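<p>To make the arithmetic behind those figures explicit, here is a rough sketch. It assumes elapsed time is simply proportional to the share of rows handled by the busiest sorter, which ignores I/O and inter-process overhead:</p>
<pre class="brush: sql;">
-- Rough model only: elapsed time ~ share of rows handled by the busiest consumer
select round(45 / 32, 1)   as ideal_minutes,   -- even split across 32 sorters
       round(45 * 0.90, 1) as skewed_minutes   -- one sorter receives ~90% of the rows
from dual;
</pre>
<p>This gives about 1.4 minutes for an even split and about 40.5 minutes when one process receives 90% of the rows, which is close to the 41 minutes quoted above.</p>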
<p>When running a &#8220;Sort&#8221; operation in parallel, there are two sets of parallel processes: producers and consumers. The number of producer/consumer pairs depends on your parallelism settings, so with parallel 32 there are 32 producers and 32 consumers. This is well illustrated in the Oracle documentation HERE (Oracle web-based documentation is down; will update later). As each &#8216;producer&#8217; reads data, it sends each row to the appropriate consumer for that &#8220;range&#8221;. For example, consumer 1 takes A-B, consumer 2 takes C-F, consumer 3 takes G-L, and so on. The exact split is dynamically calculated by the &#8220;Ranger&#8221; process. Unfortunately, it doesn&#8217;t work well with already-sorted data.</p>
<p>The same applies to index creation. Index creation is basically a big sort, followed by writing out the result set into a B-Tree structure. Index creation suffers from the exact same ranging issues, at least until 11.2.</p>
<p>Here&#8217;s an example: </p>
<pre class="brush: sql;">
-- Create mini-sample table
create table mytest_s as select rownum r from dual connect by level &lt;=400000;

-- Fetch only 1 row, no need to fetch all
begin
  for c in ( select /*+PARALLEL(t,4)*/ * from mytest_s t order by 1) loop
    exit;
  end loop;
end;
/
select dfo_number &quot;d&quot;, tq_id as &quot;t&quot;, server_type, num_rows,rpad('x',round(num_rows*10/nullif(max(num_rows) over (partition by dfo_number, tq_id, server_type),0)),'x') as &quot;pr&quot;, round(bytes/1024/1024) mb,  process, instance i,round(ratio_to_report (num_rows) over (partition by dfo_number, tq_id, server_type)*100) as &quot;%&quot;, open_time, avg_latency, waits, timeouts,round(bytes/nullif(num_rows,0)) as &quot;b/r&quot;
from v$pq_tqstat order by dfo_number, tq_id, server_type desc, process;

d t SERVER_TYPE   NUM_ROWS pr           MB PROCESS I    %  OPEN_TIME AVG_LATENCY   WAITS   TIMEOUTS  b/r
- - ----------- ---------- ----------- --- ------- - ---- ---------- ----------- ------- ---------- ----
1 0 Ranger             372 xxxxxxxxxx    0 QC      1  100          0           0       0          0   11

1 0 Producer        126144 xxxxxxxxxx    1 P012    1   32          0           0      19          2    6
1 0 Producer         47304 xxxx          0 P013    1   12          0           0       8          1    6
1 0 Producer        110376 xxxxxxxxx     1 P014    1   28          0           0      17          2    6
1 0 Producer        116176 xxxxxxxxx     1 P015    1   29          0           0      16          0    6

1 0 Consumer          7885               0 P008    1    2          0           0       6          1    5
1 0 Consumer          7884               0 P009    1    2          0           0       6          1    6
1 0 Consumer          7884               0 P010    1    2          0           0       7          2    6
1 0 Consumer        376347 xxxxxxxxxx    2 P011    1   94          0           0      16          4    6

1 1 Producer          5536 xxxxxxxxxx    0 P008    1   55          0           0     577        568    3
1 1 Producer          4508 xxxxxxxx      0 P011    1   45          0           0     587        571    4

1 1 Consumer           100 xxxxxxxxxx    0 QC      1  100          0           0       1          0  161
</pre>
<p>As you can see from this test case, the sorter processes (4 consumers) had a very uneven split, with 94% of the rows being sent to only one consumer. I tested this case with parallel 64, and in that case 90% gets sent to one consumer, with the other 10% evenly distributed among the remaining ones.</p>
<p>This essentially reduces your execution time by at most 10%.</p>
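<p>If only the headline skew is of interest, the same view can be summarised per table queue. This is a minimal sketch, not part of the original test. It must run in the same session, right after the parallel statement, because <code>v$pq_tqstat</code> only reports on that session&#8217;s most recent parallel operation:</p>
<pre class="brush: sql;">
-- Share of rows received by the busiest consumer in each table queue
select dfo_number, tq_id,
       count(*)      as consumers,
       max(num_rows) as busiest_rows,
       round(max(num_rows) * 100 / nullif(sum(num_rows), 0)) as busiest_pct
from v$pq_tqstat
where server_type = 'Consumer'
group by dfo_number, tq_id
order by dfo_number, tq_id;
</pre>
<p>For the output shown above, table queue 0 would report a busiest_pct of 94 (376347 of 400000 rows).</p>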
<p>A very similar thing happens if you create an index. I did, however, catch an anomaly in the test case. With parallel 4, the distribution is 66/34/0/0, while with parallel 8, it&#8217;s 100/0/0/0/0/0/0/0. That is terrible: all the work is performed by one process, exactly as in serial execution, only slightly worse because of the inter-process communication.</p>
<pre class="brush: sql;">
-- CREATE INDEX requires an index name; mytest_s_idx is an illustrative choice
create index mytest_s_idx on mytest_s (r) parallel 4;
select dfo_number &quot;d&quot;, tq_id as &quot;t&quot;, server_type, num_rows,rpad('x',round(num_rows*10/nullif(max(num_rows) over (partition by dfo_number, tq_id, server_type),0)),'x') as &quot;pr&quot;, round(bytes/1024/1024) mb,  process, instance i,round(ratio_to_report (num_rows) over (partition by dfo_number, tq_id, server_type)*100) as &quot;%&quot;, open_time, avg_latency, waits, timeouts,round(bytes/nullif(num_rows,0)) as &quot;b/r&quot;
from v$pq_tqstat order by dfo_number, tq_id, server_type desc, process;

SERVER_TYPE   NUM_ROWS pr          MB PROCESS  I   %  OPEN_TIME AVG_LATENCY      WAITS   TIMEOUTS        b/r
----------- ---------- ----------- -- -------- - --- ---------- ----------- ---------- ---------- ----------
Ranger              12 xxxxxxxxxx   0 QC       1 100          0           0          1          0       3974
Producer        132601 xxxxxxxxxx   2 P004     1  33          0           0         19          1         18
Producer         95265 xxxxxxx      2 P005     1  24          0           0         14          2         18
Producer         95265 xxxxxxx      2 P006     1  24          0           0         14          1         18
Producer         79497 xxxxxx       1 P007     1  20          0           0         11          0         18

Consumer        262308 xxxxxxxxxx   4 P000     1  66          0           0         76         73         18
Consumer           164              0 P001     1   0          0           0         77         74         19
Consumer           164              0 P002     1   0          0           0         76         73         19
Consumer        137364 xxxxx        2 P003     1  34          0           0         76         73         18

Producer             1 xxxxxxxxxx   0 P000     1  25          0           0          0          0        322
Producer             1 xxxxxxxxxx   0 P001     1  25          0           0          0          0        322
Producer             1 xxxxxxxxxx   0 P002     1  25          0           0          0          0        322
Producer             1 xxxxxxxxxx   0 P003     1  25          0           0          0          0        322
Consumer             4 xxxxxxxxxx   0 QC       1 100          0           0          1          0        322

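-- Drop the index from the parallel 4 run first, then rebuild with parallel 8 (index name is illustrative)
create index mytest_s_idx on mytest_s (r) parallel 8;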

SERVER_TYPE   NUM_ROWS pr            MB PROCESS  I   %  OPEN_TIME AVG_LATENCY      WAITS   TIMEOUTS        b/r
----------- ---------- ------------- -- -------- - --- ---------- ----------- ---------- ---------- ----------
Ranger               0                0 QC       1              0           0          6          2 

Producer         47304 xxxxxxxx       1 P008     1  12          0           0         14          2         18
Producer         51246 xxxxxxxxx      1 P009     1  13          0           0         15          1         18
Producer         47304 xxxxxxxx       1 P010     1  12          0           0         15          2         18
Producer         45220 xxxxxxxx       1 P011     1  11          0           0         15          2         18
Producer         59130 xxxxxxxxxx     1 P012     1  15          0           0         17          3         18
Producer         35478 xxxxxx         1 P013     1   9          0           0         10          3         18
Producer         55188 xxxxxxxxx      1 P014     1  14          0           0         16          1         18
Producer         59130 xxxxxxxxxx     1 P015     1  15          0           0         17          1         18

Consumer        400000 xxxxxxxxxx     7 P000     1 100          0           0         52         49         18
Consumer             0                0 P001     1   0          0           0         52         49
Consumer             0                0 P002     1   0          0           0         52         49
Consumer             0                0 P003     1   0          0           0         52         49
Consumer             0                0 P004     1   0          0           0         52         49
Consumer             0                0 P005     1   0          0           0         52         49
Consumer             0                0 P006     1   0          0           0         52         49
Consumer             0                0 P007     1   0          0           0         52         49 

Producer             1 xxxxxxxxxx     0 P000     1  13          0           0          0          0        322
Producer             1 xxxxxxxxxx     0 P001     1  13          0           0          0          0        322
Producer             1 xxxxxxxxxx     0 P002     1  13          0           0          0          0        322
Producer             1 xxxxxxxxxx     0 P003     1  13          0           0          0          0        322
Producer             1 xxxxxxxxxx     0 P004     1  13          0           0          0          0        322
Producer             1 xxxxxxxxxx     0 P005     1  13          0           0          0          0        322
Producer             1 xxxxxxxxxx     0 P006     1  13          0           0          0          0        322
Producer             1 xxxxxxxxxx     0 P007     1  13          0           0          0          0        322

Consumer             8 xxxxxxxxxx     0 QC       1 100          0           0          2          1        322
</pre>
<p>To further explore the implications of this bug, I created a more elaborate test case. I created several types of data, and tested ordering against each &#8220;class.&#8221;</p>
<pre class="brush: sql;">
create table mytest as
select rownum pk, trunc(dbms_random.value(0,400000)) rnd,
floor(rownum/1000) type_1000, floor(rownum/5) type_5,
mod(rownum,1000) type_1000mod,mod(rownum,5) type_5mod,mod(rownum,10009) type_10000mod,
sysdate-rownum/100 as dt,
rpad('x',10,'x') pad from dual connect by level &lt;=400000;
</pre>
<p>To spare you some of the testing, here are my results. The query I ran:<br />
<code>select /*+PARALLEL(t,[DEGREE])*/ * from mytest t order by [ORDER BY];</code></p>
<table>
<tr>
<th><strong>Degree</strong></th>
<th><strong>Order by</strong></th>
<th><strong>Distribution(%)</strong></th>
</tr>
<tr>
<td>4</td>
<td>pk</td>
<td>94/2/2/2</td>
</tr>
<tr>
<td>64</td>
<td>pk</td>
<td>93/0&#8230; </td>
</tr>
<tr>
<td>4</td>
<td>rnd</td>
<td>28/21/24/27</td>
</tr>
<tr>
<td>64</td>
<td>rnd</td>
<td>2/1/2/1&#8230;</td>
</tr>
<tr>
<td>4</td>
<td>type_1000</td>
<td>94/2/2/2</td>
</tr>
<tr>
<td>4</td>
<td>type_5</td>
<td>94/2/2/2</td>
</tr>
<tr>
<td>64</td>
<td>type_5</td>
<td>93/0/0&#8230;.</td>
</tr>
<tr>
<td>4</td>
<td>type_5mod</td>
<td>40/20/20/20</td>
</tr>
<tr>
<td>8</td>
<td>type_5mod</td>
<td>20/20/0/20/20/0/20/0</td>
</tr>
<tr>
<td>8</td>
<td>type_1000mod</td>
<td>17/13/8/6/4/4/4/43</td>
</tr>
<tr>
<td>64</td>
<td>type_1000mod</td>
<td>4/3/1/4/2/1&#8230;.</td>
</tr>
<tr>
<td>4</td>
<td>dt</td>
<td>96/2/2/0</td>
</tr>
<tr>
<td>64</td>
<td>dt</td>
<td>93/0/0/0&#8230;</td>
</tr>
<tr>
<td>4</td>
<td>type_5,pk</td>
<td>2/2/2/94</td>
</tr>
<tr>
<td>4</td>
<td>type_5,rnd</td>
<td>2/2/2/94</td>
</tr>
<tr>
<td>4</td>
<td>type_5mod,pk</td>
<td>20/20/20/39</td>
</tr>
<tr>
<td>64</td>
<td>type_5mod,pk</td>
<td>19/19/19/19/19/0/0&#8230;</td>
</tr>
<tr>
<td>4</td>
<td>type_5mod,rnd</td>
<td>26/24/26/24</td>
</tr>
</table>
<p>A few quick conclusions:</p>
<ul>
<li>The first column of the <em>order by</em> is what drives the distribution.</li>
<li>If the <em>order by</em> column has repeated values, PQ sort will be limited by the number of distinct values, but only if they are not grouped together.</li>
<li>Index creation on time series (log table, stock table) is slow, and &#8220;type&#8221; indexes are slow too.</li>
<li>If you reorder the keys of an index, you may affect the time it takes to create it.</li>
<li>Following on from the previous point, this is especially true if you put low-cardinality columns first to improve compression.</li>
</ul>
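<p>As a rough pre-check before a parallel sort or index build, one can compare the number of distinct values in the leading column with the intended degree of parallelism. A minimal sketch against the <code>mytest</code> table above (the choice of columns is only an illustration):</p>
<pre class="brush: sql;">
-- Distinct values in candidate leading columns; compare against the planned degree
select count(distinct type_5mod)    as ndv_type_5mod,
       count(distinct type_1000mod) as ndv_type_1000mod,
       count(distinct type_5)       as ndv_type_5
from mytest;
</pre>
<p>With only 5 distinct values in type_5mod, at most 5 sorters can receive rows, which matches the 20/20/0/20/20/0/20/0 split seen at degree 8.</p>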
<p>One could argue about when data in a table counts as ordered, but it is surprising how many cases qualify:</p>
<ul>
<li>Time series data &mdash; ever-growing data. The PK is ordered; the &#8220;insert date&#8221; is somewhat ordered.</li>
<li>Data warehouses &mdash; bulk load files are often ordered via some conditions.</li>
<li>Sometimes it is good to reorder a table, to improve data locality and compression in data warehouses. This can, however, have negative effects on index build time.</li>
<li>Sometimes one of the intermediate steps will return an ordered set for the final processing.</li>
</ul>
<p>One example of the last type is analytics. But that&#8217;s for a separate blog post.</p>
<p>And finally, to end on an optimistic note, it appears that <strong>11.2 has the index creation issue resolved</strong>, but the <em>order by</em> in queries is still bad.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pythian.com/news/5379/oracle-parallel-query-sorting-performance-problems/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Adding Columns with Default Values and Not Null in Oracle 11g</title>
		<link>http://www.pythian.com/news/1660/adding-columns-with-default-values-and-not-null-in-oracle-11g/</link>
		<comments>http://www.pythian.com/news/1660/adding-columns-with-default-values-and-not-null-in-oracle-11g/#comments</comments>
		<pubDate>Fri, 20 Mar 2009 20:09:44 +0000</pubDate>
		<dc:creator>Christo Kutrovsky</dc:creator>
				<category><![CDATA[Oracle]]></category>
		<category><![CDATA[11g]]></category>
		<category><![CDATA[adding columns]]></category>
		<category><![CDATA[not null]]></category>

		<guid isPermaLink="false">http://www.pythian.com/news/?p=1660</guid>
		<description><![CDATA[Oracle 11g has a new performance enhancement when adding columns. In the pre-11g releases, adding a new not null column with a default value would have caused a massive update on the entire table, locking it for the operation and generating tons of undo and redo. I&#8217;ve seen this happening in production.
Oracle 11g has improved [...]]]></description>
			<content:encoded><![CDATA[<p>Oracle 11g has a new performance enhancement when adding columns. In the pre-11g releases, adding a new not null column with a default value would have caused a massive update on the entire table, locking it for the operation and generating tons of undo and redo. I&#8217;ve seen this happening in production.</p>
<p>Oracle 11g has improved this behaviour by storing the default value in the metadata and making the column addition instantaneous. An example of this feature at work can be seen in <a href="http://tonguc.wordpress.com/2008/09/28/11g-enhancement-for-alter-table-add-column-functionality/">11g Enhancement for ALTER TABLE .. ADD COLUMN Functionality</a>  and <a href="https://metalink.oracle.com/CSP/main/article?cmd=show&#038;type=NOT&#038;id=602327.1">some bugs regarding sysdate</a>, as pointed out in the comments.</p>
<p>Although this is a welcome enhancement, there are some unexpected aspects beyond the basic operations.</p>
<p>First, we know default values for new columns are stored in the metadata, but what happens when you <em>change</em> the default?</p>
<p><span id="more-1660"></span></p>
<pre>
create table mytest (a number); -- instant
insert into mytest (a) select rownum from dba_objects; -- 10 seconds
alter table mytest add b varchar2(2000) default rpad('x',1000,'x') not null; --instant</pre>
<p>And now the test:</p>
<pre>alter table mytest modify b  default rpad('Z',1000,'Z') ;-- instant !</pre>
<p>Instant! If we query the table what are we going to see? The correct, &#8216;xxxxx&#8217; value or the new &#8216;ZZZZZ&#8217; value? How does it work?</p>
<p>Obviously Oracle has to behave properly, so querying the table correctly shows the <em>already-initialized</em> &#8220;xxxx&#8221; value. Through experimentation, I have determined that this enhancement works as follows:</p>
<p>There are two default values for each column:</p>
<ul>
<li>one default expression for &#8220;not null with default value&#8221; columns that do not have a value at the block level</li>
<li>another default value for all new inserted records</li>
</ul>
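<p>A minimal way to check this, continuing from the statements above (the value -1 is just an arbitrary marker row chosen for illustration):</p>
<pre>
-- Existing rows: expected to still show the original 'x' default
select substr(b,1,5) as b_start, count(*) from mytest group by substr(b,1,5);

-- A newly inserted row that omits b: expected to pick up the new 'Z' default
insert into mytest (a) values (-1);
select substr(b,1,5) as b_start from mytest where a = -1;
rollback;
</pre>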
<p>I&#8217;ll spare you all the test cases, but based on the above mechanics, we can observe the following not-so-obvious behaviours:</p>
<ul>
<li>Rebuilding the table with &#8220;alter table move&#8221; can (and probably will) make it bigger, as all the defaults are initialized at the block level.</li>
<li>Updating other columns will not initialize the default value, and thus does not grow the table.</li>
<li>Changing the column constraint to &#8220;null&#8221; (from not null) will initialize the default values, causing a massive update and locking of the table.</li>
<li>Some &#8220;magic&#8221; can be applied to large fact tables by adding columns with default values for very popular values, instead of creating the table with those columns, but that can be a nightmare to maintain.</li>
</ul>
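<p>The first point can be checked by comparing the segment size before and after the move. A sketch, assuming the <code>mytest</code> table from above lives in your own schema:</p>
<pre>
-- Size before the rebuild
select blocks, round(bytes/1024/1024) as mb from user_segments where segment_name = 'MYTEST';
alter table mytest move;
-- Size after the rebuild, once the defaults have been written into the blocks
select blocks, round(bytes/1024/1024) as mb from user_segments where segment_name = 'MYTEST';
</pre>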
<p>Virtual columns may present an innovative space-saving approach by using nvl for default values. Of course, it works differently, but for very large tables, it may be of significant advantage. For example:</p>
<pre>
create table mytest ( colA_data varchar2(30), colA as (nvl(colA_data, 'DEFAULT VALUE')));
</pre>
<p>The default value will not consume any space (except 1 byte for &#8216;null&#8217; value), ever.</p>
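<p>A short usage sketch, with the hypothetical table name <code>mytest_v</code> so that it does not collide with the earlier example:</p>
<pre>
create table mytest_v ( colA_data varchar2(30), colA as (nvl(colA_data, 'DEFAULT VALUE')));
-- Virtual columns cannot be inserted into, so list only the real column
insert into mytest_v (colA_data) values (null);
insert into mytest_v (colA_data) values ('explicit');
select colA_data, colA from mytest_v;
</pre>
<p>The row with a null colA_data returns 'DEFAULT VALUE' in colA while storing only the null; the other row returns 'explicit'.</p>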
]]></content:encoded>
			<wfw:commentRss>http://www.pythian.com/news/1660/adding-columns-with-default-values-and-not-null-in-oracle-11g/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>RMOUG: Day 2</title>
		<link>http://www.pythian.com/news/1489/rmoug-day-2-2/</link>
		<comments>http://www.pythian.com/news/1489/rmoug-day-2-2/#comments</comments>
		<pubDate>Fri, 13 Feb 2009 20:22:56 +0000</pubDate>
		<dc:creator>Christo Kutrovsky</dc:creator>
				<category><![CDATA[Non-Tech Articles]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[RMOUG]]></category>
		<category><![CDATA[Rocky Mountain Oracle Users Group]]></category>
		<category><![CDATA[user group]]></category>

		<guid isPermaLink="false">http://www.pythian.com/blogs/1489/rmoug-day-2-2</guid>
		<description><![CDATA[Day 2 finished yesterday. It was quite a busy day, with some excellent sessions.
Battle of the Nodes: RAC Performance Myths &#8212; Riyaj Shamsudeen
A great presentation on popular RAC myths, with some great examples. Excellent visuals that made complex processes look simple. I really liked this one.
Getting the Most Out of AWR &#8212; Tim Gorman
A first-rate [...]]]></description>
			<content:encoded><![CDATA[<p>Day 2 finished yesterday. It was quite a busy day, with some excellent sessions.</p>
<p><strong>Battle of the Nodes: RAC Performance Myths &#8212; <a href="http://orainternals.wordpress.com/">Riyaj Shamsudeen</a></strong><br />
A great presentation on popular RAC myths, with some great examples. Excellent visuals that made complex processes look simple. I really liked this one.</p>
<p><strong>Getting the Most Out of AWR &#8212; <a href="http://www.linkedin.com/in/timgorman">Tim Gorman</a></strong><br />
A first-rate session attended by a lot of the conference. It went into detail on what scripts are available to extract AWR information without needing Grid Control or Database Control. For command-line lovers, it&#8217;s great.</p>
<p><strong>The SAN is guilty&#8230; until proven otherwise &#8212; <a href="http://www.linkedin.com/in/gajakrishnavaidyanatha">Gaja Krishna Vaidyanatha</a></strong><br />
A very important session for all DBAs, showing the end-to-end components involved in database I/O. There are so many more components that can cause problems between the database and the physical spindles. Concepts, case studies, plenty of information.</p>
<p><strong>Understanding Oracle Execution Plans: How SQL is Really Executed &#8211; <a href="http://tanelpoder.com">Tanel Poder</a></strong><br />
One of those eye-opening sessions, starting with how to read SQL Execution Plans, and moving on to showing stack traces and mapping function calls to Execution Plan steps. A must-see for everyone tuning SQL.</p>
<p>And that is it. As exhausting as conferences are, I always wish they had been longer. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.pythian.com/news/1489/rmoug-day-2-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>RMOUG: Day 1</title>
		<link>http://www.pythian.com/news/1482/rmoug-day-1/</link>
		<comments>http://www.pythian.com/news/1482/rmoug-day-1/#comments</comments>
		<pubDate>Thu, 12 Feb 2009 18:32:57 +0000</pubDate>
		<dc:creator>Christo Kutrovsky</dc:creator>
				<category><![CDATA[Non-Tech Articles]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[photos]]></category>
		<category><![CDATA[pictures]]></category>
		<category><![CDATA[RMOUG]]></category>
		<category><![CDATA[Rocky Mountain Oracle Users Group]]></category>
		<category><![CDATA[user group]]></category>

		<guid isPermaLink="false">http://www.pythian.com/blogs/1482/rmoug-day-1</guid>
		<description><![CDATA[Day One at RMOUG in Denver is now over. 
There were quite a few interesting presentations. Unfortunately, the very first I went to was canceled due to car trouble.  I also found that several sessions of similar interest to me overlapped, so I had to choose my spots.
Advanced Oracle Troubleshooting
This presentation was particularly good. [...]]]></description>
			<content:encoded><![CDATA[<p>Day One at <a href="http://www.rmoug.org/training.htm">RMOUG</a> in Denver is now over. </p>
<p>There were quite a few interesting presentations. Unfortunately, the very first I went to was canceled due to car trouble.  I also found that several sessions of similar interest to me overlapped, so I had to choose my spots.</p>
<p><strong>Advanced Oracle Troubleshooting</strong><br />
This presentation was particularly good. Tanel goes into detail on how to quickly assess a situation without going through a number of &#8220;health checks&#8221; and still being nowhere near solving the problem. His approach is to look directly at what a &#8220;hanging&#8221; session is waiting on, and to systematically determine the cause of the problem, with no time wasted.</p>
<p><strong>Putting your database on a Diet: Oracle&#8217;s Data compression</strong><br />
A short overview of table compression. I found that even though the presenter obviously had some experience with compression, there were hardly any examples, and nothing was mentioned about how to determine proper re-ordering to improve compression.</p>
<p><strong>All About Oracle&#8217;s In-Memory Undo</strong><br />
An unusual topic&#8212;something that works so well that no one really talks about it. The presentation, however, was very short, and provided little new information. There was only one demonstrated test case. Although it went into detail about the difference between in-memory and standard undo, the other-than-obvious effects were omitted.</p>
<p>During lunch I took a picture that shows the entire RMOUG crowd:<br />
<a href='http://www.pythian.com/blogs/wp-content/uploads/rmoug-lunch.JPG' title='RMOUG Day 1  Lunch'><img src='http://www.pythian.com/blogs/wp-content/uploads/rmoug-lunch.thumbnail.JPG' alt='RMOUG Day 1  Lunch' /></a></p>
<p>Tomorrow is Day 2, and I will be posting about it here.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pythian.com/news/1482/rmoug-day-1/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Christo Kutrovsky Presenting at RMOUG</title>
		<link>http://www.pythian.com/news/1476/christo-kutrovsky-presenting-at-rmoug/</link>
		<comments>http://www.pythian.com/news/1476/christo-kutrovsky-presenting-at-rmoug/#comments</comments>
		<pubDate>Mon, 09 Feb 2009 20:44:06 +0000</pubDate>
		<dc:creator>Christo Kutrovsky</dc:creator>
				<category><![CDATA[Non-Tech Articles]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Christo Kutrovsky]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[presentation]]></category>
		<category><![CDATA[RMOUG]]></category>
		<category><![CDATA[Rocky Mountain Oracle Users Group]]></category>
		<category><![CDATA[user group]]></category>

		<guid isPermaLink="false">http://www.pythian.com/blogs/1476/christo-kutrovsky-presenting-at-rmoug</guid>
		<description><![CDATA[I am back on the road, going to RMOUG Training Days to present The Answer to Free Memory, Swap, Oracle, and Everything.
I am quite excited, as the RMOUG schedule (PDF) looks quite promising, especially these presentations:

Further RMAN Optimizations in 11g &#8212; Stephan Haisley
Advanced Oracle Troubleshooting: No Magic is Needed &#8212; Tanel Poder
Understanding Oracle Execution Plans: [...]]]></description>
			<content:encoded><![CDATA[<p>I am back on the road, going to <a href="http://www.rmoug.org/training.htm">RMOUG Training Days</a> to present <a href="http://www.pythian.com/blogs/741/pythian-goodies-free-memory-swap-oracle-and-everything">The Answer to Free Memory, Swap, Oracle, and Everything</a>.</p>
<p>I am quite excited, as <a href="http://www.rmoug.org/images/SAG.pdf">the RMOUG schedule</a> (PDF) looks quite promising, especially these presentations:</p>
<ul>
<li>Further RMAN Optimizations in 11g &#8212; <a href="http://www.linkedin.com/pub/5/426/470">Stephan Haisley</a></li>
<li>Advanced Oracle Troubleshooting: No Magic is Needed &#8212; <a href="http://blog.tanelpoder.com/">Tanel Poder</a></li>
<li>Understanding Oracle Execution Plans: How SQL is Really Executed &#8212; Tanel Poder</li>
<li>The SAN is Guilty until proven otherwise &#8212; <a href="http://www.linkedin.com/in/gajakrishnavaidyanatha">Gaja Krishna Vaidyanatha</a></li>
</ul>
<p>Some of these overlap, so I guess I will have to make a difficult choice.</p>
<p>I hope to see you all in Denver.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pythian.com/news/1476/christo-kutrovsky-presenting-at-rmoug/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Interview: Kevin Closson on the Oracle Exadata Storage Server</title>
		<link>http://www.pythian.com/news/1267/interview-kevin-closson-on-the-oracle-exadata-storage-server/</link>
		<comments>http://www.pythian.com/news/1267/interview-kevin-closson-on-the-oracle-exadata-storage-server/#comments</comments>
		<pubDate>Tue, 30 Sep 2008 19:48:09 +0000</pubDate>
		<dc:creator>Christo Kutrovsky</dc:creator>
				<category><![CDATA[Oracle]]></category>
		<category><![CDATA[HP Oracle Database Machine]]></category>
		<category><![CDATA[interview]]></category>
		<category><![CDATA[Kevin Closson]]></category>
		<category><![CDATA[ODS]]></category>
		<category><![CDATA[OESS]]></category>
		<category><![CDATA[Oracle Exadata Storage Server]]></category>

		<guid isPermaLink="false">http://www.pythian.com/blogs/1267/interview-kevin-closson-on-the-oracle-exadata-storage-server</guid>
		<description><![CDATA[Last Friday (September 26), Paul Vall&#233;e and I were lucky enough to interview Kevin Closson about the Oracle Exadata Storage Server.  A tidied-up stream of the audio is here: closson-interview.m3u.
The audio quality is a little spotty here and there, so you might like to follow the transcription below.
Paul gets the interview started.
Paul Vall&#233;e (PV): [...]]]></description>
			<content:encoded><![CDATA[<p>Last Friday (September 26), Paul Vall&eacute;e and I were lucky enough to interview Kevin Closson about the Oracle Exadata Storage Server.  A tidied-up stream of the audio is here: <a href='http://www.pythian.com/blogs/wp-content/uploads/closson-interview.m3u' title='closson-interview.m3u'>closson-interview.m3u</a>.</p>
<p>The audio quality is a little spotty here and there, so you might like to follow the transcription below.</p>
<p>Paul gets the interview started.</p>
<p><strong>Paul Vall&eacute;e (PV):</strong> Christo Kutrovsky and myself, Paul Vall&eacute;e.  We&#8217;re on the line with <a href="http://kevinclosson.wordpress.com/">Kevin Closson</a> of Oracle  (and prior to that with Hewlett-Packard, and prior to that with Polyserve, and prior to that with Sequent).  A giant of our industry, and I&#8217;m honoured to be speaking to him.  Kevin, hello.</p>
<p><strong>Kevin Closson (KC):</strong> Well, they always say that flattery gets you nowhere, but apparently it&#8217;ll get you on the phone.</p>
<p><strong>PV:</strong> [laughs] Very nice!</p>
<p><strong>KC:</strong> No seriously, it&#8217;s more than a pleasure to be here.  I like what you guys do, so this is good.</p>
<p><strong>PV:</strong> Thank you, Kevin. So, we are here to talk about the work that Larry Ellison announced yesterday, specifically the work around the Oracle Database Machine and the Exadata Storage Server.  Kevin, can you  just quickly introduce yourself and how you came to be involved in the project?</p>
<p><strong>KC:</strong> Right.  So, I&#8217;m a performance architect with Oracle, and the project that I&#8217;m stationed on, if you will, is the development team for  Oracle Exadata Storage Server.  And the way I came to Oracle is, quite a few of the folks who are involved with the very genesis of  Exadata are people that I&#8217;ve known and worked with closely dating back to the early &#8217;90s. And after a fruitful endeavour as the chief software architect for Oracle solutions at Polyserve, it became an opportunity <del datetime="2009-03-13T22:06:36+00:00">to latch onto Oracle, because we sold our company to them</del>. So there we are.</p>
<p><strong>PV:</strong> How exciting!  Congratulations!  So I noticed that there&#8217;s still a little, I guess a diversion in terms of the branding.  Larry definitely introduced it as the Exadata <em>Programmable</em> Storage Server, and I double-checked the video. But in your blog, you&#8217;re calling it, for sure, just the Exadata Storage Server.  Just how recently was the marketing/messaging developed for this?</p>
<p><strong>KC:</strong> You know, I&#8217;m not a part of the Go-To-Market (GTM) efforts, but, you know, honestly, the way these things are brought to market&nbsp;.&nbsp;.&nbsp;.&nbsp; They&#8217;re developed under a project name, and the project name remains the same for years.  It was over the last few months that Marketing began cooking the name and what-have-you.  Now, if you&#8217;re referring to something that Larry said in his keynotes,  I have to admit I didn&#8217;t commit to photographic memory all the slides.  And certainly, if he used the term &#8220;programmable&#8221;, I&#8217;m not going to correct Larry Ellison.</p>
<p><strong>PV:</strong> [laughs] That would be risky.</p>
<p><span id="more-1267"></span></p>
<p><strong>KC:</strong> Having said that, I&#8217;m here to tell you that if somehow the word &#8220;programmable&#8221; sticks and becomes pervasive, it&#8217;ll be misleading.  Because the connotation of &#8220;programmable&#8221; is akin to what&#8217;s possible on products like Netezza where there&#8217;s real programmable gate arrays that on-site people can fiddle with.  And that&#8217;s not the nature of the Exadata Storage Server. The Exadata Storage Server is programmable by <em>us</em>, the folks on the development team, and it comes to you in the form of software, and occasional lower-level firmware fixes from HP.  I hope that answers the question&#8212;safely for me&#8212;and for you as well.</p>
<p><strong>PV:</strong> Sounds good.  So, I must say&#8212;just to go on the record&#8212;that this is not at all what I thought you might be cooking up, but by the same token &#8212; and obviously the evidence for that is all over the Pythian blog, where we had a speculative blog entry about what might be in-store.  But that being said, I am tremendously impressed and really excited, and already getting to work on identifying which Pythian customers have a use-case for the technology, and trying to get a good early-adopter case to co-sell with Oracle. So I&#8217;m that jazzed about it, and definitely, congratulations are in order.</p>
<p><strong>KC:</strong> Well, I&#8217;m glad to hear that, and I appreciate congratulations.  You know, when it comes down to it, this is a new product that solves problems that cannot be solved with any other storage technology. Of course, everybody&#8217;s already crafted their position papers and will argue the fact that perhaps the speeds and feeds are different than them.  But what I just said, I stand by, full-stop.  You cannot solve this problem with any other technology.  And what I mean by that is, in order to have Oracle as your commercial store for all of EP or ERP (which, you know &#8212; nobody&#8217;s going to argue [about] our footprint in that space), in order to do EWBI on that with someone else&#8217;s offering &#8212; you now have two vendors, you have two different types of technology. Exadata is a unifying storage platform for Oracle. And now Oracle is all.</p>
<p><strong>PV:</strong> I would like to introduce Christo Kutrovsky, who is one of Pythian&#8217;s leading experts especially on how Oracle talks to storage &#8212; certainly not our only such expert given that Alex G. [Gorbachev] presented on &#8220;Under the Hood of the Oracle Clusterware&#8221;, just at Open World earlier this week.  But that being said, this is a definite subject matter of interest for Christo,  and he has prepared a couple questions for you Kevin.</p>
<p><strong>KC:</strong> Oh good!</p>
<p><strong>Christo Kutrovsky (CK):</strong> Hi Kevin, how are you?</p>
<p><strong>KC:</strong> Hi Christo.  Just fine.  You and I have met in passing, so this is not a first meeting by any means.</p>
<p><strong>CK:</strong> Oh yeah, absolutely.  I admire your presentations &#8212; they are always to-the-point.  I hope you keep up with that.</p>
<p><strong>KC:</strong> Well thank you.</p>
<p><strong>CK:</strong> So back to the questions and to the nitty-gritty stuff. I want to start with a couple of questions [about] the Infiniband implementation.  Each Exadata instance is linked via 40Gb Infiniband link to the switch. Correct?</p>
<p><strong>KC:</strong> Partially.  So the numbers are off.  Each Exadata cell&#8212;as well as all of the RDBMS hosts in the HP Oracle Database Machine&#8212;all of them are connected with <em>two</em> 20Gb paths to the switch. </p>
<p><strong>CK:</strong> Okay.  So does that mean that each database server can only accept 2*20 Gb of data?</p>
<p><strong>KC:</strong> No.  We have two paths of 20Gb of bandwidth. They are joined together in a bonded relationship, therefore only one path is active at one time.  We don&#8217;t need more than one path, because arithmetically, over 2 giga<em>bytes</em>-per-second is not only more than a single two-socket,  eight-core Xeon server can ingest on the RDBMS host, it&#8217;s most certainly more than a single Exadata  storage cell can produce.  So from a produce-and-consume perspective, we&#8217;re not starving anybody. So that&#8217;s failover if, for instance, the HCA happens to fail.  It&#8217;s a redundancy purpose for the dual paths, but as the numbers work out, it&#8217;s over 2 giga<em>bytes</em>-per-second.  With 20Gb paths, we&#8217;re not feeding something like a 128-core Superdome or something like that. In that particular case, the plumbing would be slightly different.</p>
<p>But look at what we&#8217;ve built.  The HP Oracle Database Machine consists of DL360s.  Each of those has the traditional Xeon processors in them, and they don&#8217;t have unlimited bandwidth, so they can ingest as much as even one single 20Gb path can provide.</p>
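<p>To make that bandwidth arithmetic concrete, here is a minimal Python sketch of the conversion from a 20Gb path to gigabytes per second. It is an illustration only: the function names are invented for this example, and the 8b/10b line-encoding factor of 0.8 is an assumption about the InfiniBand links that was not stated in the interview.</p>
<pre>
```python
BITS_PER_BYTE = 8


def link_gbytes_per_sec(signal_gbit, encoding_efficiency=1.0):
    """Convert a link rate in Gbit/s into GBytes/s of payload."""
    return signal_gbit * encoding_efficiency / BITS_PER_BYTE


def bonded_gbytes_per_sec(path_gbit, paths=2, active_paths=1, encoding_efficiency=1.0):
    """Usable bandwidth of bonded paths when only some of them carry traffic."""
    if active_paths not in range(1, paths + 1):
        raise ValueError("active_paths must be between 1 and paths")
    return active_paths * link_gbytes_per_sec(path_gbit, encoding_efficiency)


# Two 20Gb paths bonded so that one is active at a time, as described above.
raw_rate = bonded_gbytes_per_sec(20)

# Assumption (not from the interview): 8b/10b encoding leaves 80% for data.
data_rate = bonded_gbytes_per_sec(20, encoding_efficiency=0.8)

print(f"raw: {raw_rate} GB/s, after assumed encoding overhead: {data_rate} GB/s")
```
</pre>
<p>Under these assumptions a single active path works out to 2.5 gigabytes per second raw, or about 2 gigabytes per second of data, which is in line with the figure Kevin quotes. The second path adds redundancy, not bandwidth.</p>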
<p><strong>CK:</strong> Okay.  And, just to clarify on the [subject of] the paths internals &#8212; it&#8217;s like a switch, basically.  Meaning that any two pairs can talk at full bandwidth without affecting other talking pairs.  Is this correct?</p>
<p><strong>KC:</strong> Yes. There&#8217;s no bottlenecks in the point-to-point manner. So if a cell is delivering I/Os to three or four of the RDBMS hosts, that&#8217;s happening concurrently in-flight with, let&#8217;s say, I/O requests from RDBMS hosts 6 and 7 talking [to yet] other cells. So there&#8217;s no queuing.</p>
<p><strong>CK:</strong> Basically what I was trying to clarify is that &#8212; imagine you have one database server running one query, and you have two Exadata storage systems, that will pretty much saturate your bandwidth, assuming no filtering is happening on the Exadata.  That means, if you have <em>two</em> database servers, each database server can talk at 20 Gigabits to two separate Exadata servers.</p>
<p><strong>KC:</strong> Yes.  Every point is 20Gb.</p>
<p><strong>CK:</strong> Alright, I just wanted to clarify this.  Those were my questions [about] Infiniband.</p>
<p><strong>KC:</strong> Oh good, and along those lines, I should hope that that puts to rest any concerns about&#8212;as I always say&#8212;&#8221;plumbing.&#8221;  We didn&#8217;t build any bottlenecks into this system.</p>
<p><strong>PV:</strong> Just a comment for our listeners who aren&#8217;t familiar with Infiniband as a networking protocol&#8212;not only as a networking protocol, but also as a disk-access protocol&#8212;Infiniband has not only  very high  bandwidth characteristics, but also ultra-low latency characteristics, and that&#8217;s why it&#8217;s such a good choice for cluster solutions like this.</p>
<p><strong>KC:</strong> And Infiniband supports multiple communications protocols, and to that end, we&#8217;ve developed and brought to market the lightest&#8212;at least in our assessment&#8212;the lightest and most adaptable of all of them, which is Reliable Datagram Sockets [RDS]. Sure, you can do IP over Infiniband, and that starts to chew into some of the value propositions involving Infiniband, but  we&#8217;ve done none of the sort.  We are fully Remote Direct Memory Access (RDMA) from point to point over RDS.</p>
<p><strong>CK:</strong> Alright.  So continuing onto more details of how the system works, a question on OLTP. For OLTP systems, and considering Database Machines and Exadata, does OLTP benefit in any way other than a best-practices-built system?</p>
<p><strong>KC:</strong> That&#8217;s such a fair question, but let me answer it this way.  The design center for Exadata is providing uninhibited utilization of all of the bandwidth that all of the disks are capable of delivering.  Workloads that require that generally look a lot more like Business Intelligence Data Warehousing.  It&#8217;s not that common to see, strictly speaking, OLTP-style workloads suffering bottlenecks like traditional storage arrays&#8212;fibre-channel SANs, iSCSI, or even high-performance NAS.</p>
<p>That doesn&#8217;t mean, however, that there are <em>no</em> benefits for OLTP using Exadata, because indeed there are. Probably the most substantial benefit that OLTP-style workloads will derive from being serviced by Exadata is the fact that I/Os from the Oracle kernel during OLTP will no longer have to interface through standard C libraries and making system calls to instantiate and report completions of I/Os.  Doing that has always been entirely too costly, in processor-cycles terms.  All of the requests for I/O from all Oracle processes to Exadata are done from userland using Remote Direct Memory Access (RDMA) directly into the processes of the Oracle Storage Server. So there&#8217;s no queuing up and de-queuing as far as the sending and the delivery of I/O requests.</p>
<p>Now, what does all that mean to us? It means that if you&#8217;re doing thousands-upon-thousands of I/Os-per-second of OLTP size&#8212;let&#8217;s say 8 kilobytes&#8212;you&#8217;ll see a substantial reduction in processor cycles lost just doing I/O.  You can do well over 8,000 [or] 10,000 random I/Os per second with much less than five percent of all processor cycles spent in kernel-mode.  So we start freeing-up substan&#8212; Oh, and doing so with traditional SAN host bus adapters can often cost as much as twenty percent of all processor cycles spent just doing I/O &#8212; not doing anything with the <em>results</em> of the I/O, which is why somebody bought the computer in the first place. [13:44]</p>
<p>We&#8217;re talking about relief on processor cycles lost to I/O&#8212;which I think should be substantial for a lot of people&#8212;and, as well as, we have a balanced system.  You don&#8217;t have to worry about collision of applications, for instance, on the same fibre channel arbitrated loop, where your disks for OLTP reside. And those sorts of balance aspects.  They do pay off.</p>
<p><strong>CK:</strong> Excellent, very cool.  Okay&#8212;moving on now, this brings me to my next question.  What&#8217;s the 8 gigabyte memory on the Exadata cells?  Is this useful [for] caching, or is it just working memory to manage the software running on the Exadata?</p>
<p><strong>KC:</strong> Well, that&#8217;s a great question. And people keep saying &#8220;cache cache cache&#8221;, although unless you have as much cache as you have dataset, cache actually gets in the way. I&#8217;m talking specifically&#8212;in that case&#8212;about, the types of work, the types of I/O profiles you see with DW/BI. To that end, we do configure 8 gigabytes of RAM per Exadata cell, and your question was, what is that for.  Well, each Exadata cell has an operating system kernel, and that is Oracle Enterprise Linux.  And that shouldn&#8217;t surprise anybody, because every storage device that&#8217;s out there&#8212;all SAN arrays, all NAS filers&#8212;they all have operating systems in them as well.  And that takes up a portion&#8212;let&#8217;s say for instance 1GB&#8212;and that leaves us something on the order of 7 gigabytes of working memory. The primary value proposition of Exadata is to service scans, the type of I/O profiles you see with DW/BI, and we scan using 1 Megabyte reads.</p>
<p>So, let&#8217;s say for instance, we&#8217;re trying to push through a gigabyte per second. Depending on how long the I/Os take, we need to be able to buffer on the order of 5,000 of those I/Os per second. So 5,000 1 Meg buffers is 5 gigabytes. That leaves us about 2 gigabytes of cushioning. I think you can see where I&#8217;m going with this. In order to handle thousands of I/Os per second at that size requires just buffering, and that&#8217;s different from cache, right? Buffering is the memory that you pin down so that the disks, through the drivers, can DMA to memory. And then from there, of course, Exadata will RDMA that result over Infiniband with RDS directly into the address space of the database server process. But you have to have some holding space. Buffers are reused immediately, so in that case, it&#8217;s not cache. Does that answer your question?</p>
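<p>The memory budget Kevin walks through can be written down as a short Python sketch. It only restates the round numbers from the answer above (8 gigabytes per cell, roughly 1 gigabyte for the operating system, 5,000 buffers of 1 megabyte each). The function name and the use of decimal units are assumptions made for this illustration, not anything from the Exadata software.</p>
<pre>
```python
MB_PER_GB = 1000  # decimal units, matching the round numbers in the interview


def buffer_budget(total_ram_gb, os_reserved_gb, buffers_in_flight, buffer_mb):
    """Split a cell's RAM into working memory, pinned I/O buffers, and cushion."""
    working_gb = total_ram_gb - os_reserved_gb
    pinned_gb = buffers_in_flight * buffer_mb / MB_PER_GB
    return {
        "working_gb": working_gb,
        "pinned_gb": pinned_gb,
        "cushion_gb": working_gb - pinned_gb,
    }


budget = buffer_budget(
    total_ram_gb=8,
    os_reserved_gb=1,
    buffers_in_flight=5000,
    buffer_mb=1,
)
print(budget)
```
</pre>
<p>With these inputs the sketch gives 7 gigabytes of working memory, 5 gigabytes pinned as I/O buffers, and 2 gigabytes of cushion, the same split described in the answer.</p>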
<p><strong>CK:</strong> Absolutely. I was kind of suspecting this is the case, but I just wanted to hear it from you. And I assume this is also used to facilitate filtering and joining and extrapolation of query data?</p>
<p><strong>KC:</strong> We all love the fact that Exadata is brute-force, but it&#8217;s also brainy. So, if we&#8217;re pushing through a gigabyte of I/O per second, from disk through the Exadata server out onto Infiniband, in the meantime, we&#8217;re also applying our intelligence. And our intelligence is, we perform predicate filtration.  So if you&#8217;re querying for rows of HR records where &#8220;salary is greater than a million dollars&#8221;&#8212;and I&#8217;m sure there&#8217;s a lot of folks [garbled] like that.  In order to filter through that data, that buffer has to be held down by the Exadata server for the amount of time it takes to rip through there looking for the rows that match. And then after that, whatever columns are cited in that query, we have to walk through each of the rows and pick out the columns and send them back. [17:43]</p>
<p>So the buffer stays pinned-down  while we&#8217;re doing intelligent processing of the contents of the buffer. And then we wipe the buffer out and hammer it with another I/O.  Did that make any sense? [17:53]</p>
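<p>As a rough illustration of the two steps described here (filter the rows in a buffer by the predicate, then return only the columns the query names), consider the Python sketch below. It is a conceptual model only, not how the Exadata software is implemented: the row layout, column names, and function names are all hypothetical.</p>
<pre>
```python
import operator


def scan_buffer(rows, predicate, columns):
    """Keep rows that match the predicate, then return only the named columns."""
    return [{col: row[col] for col in columns} for row in rows if predicate(row)]


def salary_over(threshold):
    """Build a predicate equivalent to: salary greater than threshold."""
    return lambda row: operator.gt(row["salary"], threshold)


# Hypothetical contents of one I/O buffer, already decoded into rows.
buffer_rows = [
    {"emp_id": 1, "name": "Ada", "dept": "ENG", "salary": 1_500_000},
    {"emp_id": 2, "name": "Bob", "dept": "OPS", "salary": 90_000},
    {"emp_id": 3, "name": "Cyd", "dept": "ENG", "salary": 2_000_000},
]

# Predicate filtering first, then column projection.
result = scan_buffer(buffer_rows, salary_over(1_000_000), ["emp_id", "salary"])
print(result)

# Once its rows are processed, the buffer is emptied and reused for the next I/O.
buffer_rows.clear()
```
</pre>
<p>Only the matching rows, trimmed to the requested columns, would travel back to the database host. The rest of the buffer never leaves the cell.</p>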
<p><strong>CK:</strong> Absolutely. Now that we&#8217;ve touched on the subject of filtering &#8212; do only parallel queries benefit from this filtering push down to the Exadata cells?</p>
<p><strong>KC:</strong> Right. So I think what you&#8217;re asking is whether or not, somehow, OLTP operations will benefit from what we currently offload to storage. The answer is no, but if that wasn&#8217;t your question, go ahead and state it.</p>
<p><strong>CK:</strong> Yeah. No, the question would be, can you run Standard Edition on the database servers?</p>
<p><strong>KC:</strong> No. That&#8217;s a topic regarding licensing, but I happen to know that this is Enterprise Edition only.</p>
<p><strong>CK:</strong> Okay. Maybe you know the answer to this: does the cost that Larry included on one of his slides include the RAC <em>and</em> partitioning options?</p>
<p><strong>KC:</strong> I do recall that he put up a slide that compared on a per-terabyte basis the HP Oracle Database Machine to Teradata and Netezza.</p>
<p><strong>KC:</strong> And built into that cost-per-terabyte was the cost for software. Because this is a pre-packaged and ready-to-go deal, and because you have to have Real Application Clusters and you have to have partitioning, the answer to that is, yeah it&#8217;s built-in to that cost. What that number is, I wouldn&#8217;t quote.  I don&#8217;t deal with the money.</p>
<p><strong>CK:</strong> Yeah, absolutely. And talking further about the cells. In one of the datasheets it&#8217;s mentioned that backups benefit from this, and that sheet mentions specifically incremental backups being processed and filtered by the cells.  Which is pretty cool. Now the question is, is the cell  CPU used for compression for the backups &#8212; when you do full backups, for example?</p>
<p><strong>KC:</strong> That&#8217;s an excellent question.  And there&#8217;s a lot of other functionality that&#8217;s offloaded to the cells&#8212;and I&#8217;m sure we&#8217;ll talk about that&#8212;specific to backup. The answer to your [question about the] compression aspect of backup is, no &#8212; cells do not currently do the compression.  Could they? Sure. We&#8217;ll talk about that in the future. So you still use host CPU to do the compression. Will that mean that, when it comes time to access compressed data, Exadata is helpless?  The answer is no, because Exadata is able to do filtering and projection on data that is compressed. It understands enough about compressed data to be able to do that.</p>
<p>But back to the backup issue. The value proposition for backup is, if you&#8217;re doing something like, let&#8217;s say, an incremental backup, and you have to go through terabytes of data to find several blocks that have been changed before the beginning of the backup, they have to be backed up. Instead of troubling the RDBMS hosts to go looking for those blocks, what they do is send off a smart operation to Exadata cells, who then go and look for the blocks that are old enough to need to be backed up. So it&#8217;s offloading finding the blocks that need to be backed up. Did that make any sense?</p>
<p><strong>CK:</strong> Absolutely. So, what other operations do we have? Incremental backups, tablespace creation, joining&nbsp;.&nbsp;.&nbsp;.&nbsp; Does sorting also count? Can you offload sorting, grouping by?  These are data warehouse operations that usually consume a lot of CPU. Can those be assisted by the CPU power from the Exadata cells?</p>
<p><strong>KC:</strong> The answer to sorting and joining and grouping and aggregation&nbsp;.&nbsp;.&nbsp;.&nbsp; The only joining that we do is very good technology. It&#8217;s very beneficial on so many queries (I intend to blog exactly about that). It&#8217;s a bloom filter join, and we can discuss that some other time.  We don&#8217;t do hash joins in storage cells because cells would have to see all of the other cells&#8217; data, and filter idempotent storage&#8212;pods, if you will&#8212;they don&#8217;t know about each other.  But as far as storage sorting and aggregation &#8212; that&#8217;s too far upstream in the Oracle kernel. It&#8217;s too far up above where Exadata adapts. Exadata is a storage that happens to be able to do a few things that are intelligent. In future, perhaps &#8212; but at this point the answer to that is, no. [22:44]</p>
<p><strong>CK:</strong> It seems like there is a lot of extra things that can be added to the cells &#8212; software, et cetera. Is the roadmap for cell upgrade similar to database? Like every two years or something like that? Has it been discussed at all?</p>
<p><strong>KC:</strong> Although I know a little bit about the product, what I don&#8217;t know is what the release schedules are. That would be committed by someone clear outside of my group, for certain. Do I think that the software will rapidly evolve?   I would say most certainly.  This is the initial release of a product that has been in development for three years. What are the odds that we knew everything over three years that we&#8217;ll find in just a short period of time [garbled]? Pretty slim. We intend to be very aggressive&#8212;not chaotic, but aggressive&#8212;in enhancing Exadata.</p>
<p><strong>CK:</strong> Disk failures. How are those handled? I know that disk failures will not affect data or running queries or anything like that. But I&#8217;m curious &#8212; when you lose a disk, a single disk in a single cell, where is this handled? Is this still something that is happening on the ASM level, is it something that is happening on the cell/disk level? Where does this happen? Who notices this and who cleans up?</p>
<p><strong>KC:</strong> That&#8217;s an excellent question. Disks in storage cells are treated by ASM really no differently than disks out in a fibre channel SAN. So ASM will respond to failure the same way it does to a single-disk failure in a fibre channel SAN. So that&#8217;s a two-part answer. ASM will shield the database processes from knowing anything about that disk failure, as we expect it to because it runs on a fibre channel SAN.</p>
<p>Now, we haven&#8217;t [garbled] about the fact that you&#8217;ve got a physical disk failure, and a human being has to get involved at some point.  Exadata is physical storage management. Gone are the days when you&#8217;d have to do fdisk and all of that sort of stuff. So from soup to nuts, we manage the physical disks. And that means, when a physical disk fails, you can get an alert via email (I suppose that could be an email to an SMS or what-have-you). You&#8217;ll know the disk has failed, and there will be enough information regarding the disk failure so that you can attach to the correct cell and interface with the cell command-line interface to execute a command that will vacate that disk from Exadata ownership. You put another physical disk in there, and there&#8217;s a very short command to bring that disk online. And, just like you would in a fibre channel SAN environment, you go and ask ASM to re-balance. [25:45]</p>
<p><strong>CK:</strong> So ASM actually sees each individual disk on the Exadata cells?</p>
<p><strong>KC:</strong> Yes. Every disk in Exadata is an ASM disk. In fact, and nobody&#8217;s been talking about this yet because we have to roll this out &#8212; we can&#8217;t just turn on the garden hose and hose everybody down.</p>
<p>The way this goes is, physical disks become, conceptually, to Exadata a cell disk, and so that&#8217;s a single physical disk and a single logical management-level disk. So a single physical disk is a single cell disk. You can take cell disks and carve them up into what we refer to as grid disks. If you took a 300GB SAS drive from our SAS option and created a griddisk on it that was 100GB, you&#8217;ll get the outermost 100GB of space from the platter. Now we have a lot of shorthand for doing this. Don&#8217;t believe for a moment that it&#8217;s a bunch of laborious scripting commands. We&#8217;re very good about that. If, for instance you wanted to&#8212;just in two commands&#8212;create cell disks on all disks, and then create a set of grid disks for something called &#8220;data&#8221;, you would just simply say, create celldisk all [garbled] initialize cell (and I&#8217;m blogging about that syntax, you&#8217;ll see that soon)&nbsp;.&nbsp;.&nbsp;.&nbsp; <code>initialize cell</code>, followed by <code>create celldisk all</code>, followed by <code>create griddisk --prefix=data --size=100G</code>, and at that point in time you&#8217;d be able to go over on the database hosts and ASM would be able to see those disks.</p>
<p><strong>CK:</strong> It&#8217;ll be like a candidate on the ASM level?</p>
<p><strong>KC:</strong> It would be a candidate. It would be ready to use. So at that point in time, you would have twelve ASM disks, each of 100GB.</p>
<p><strong>CK:</strong> Oh, okay! So, basically, a cell grid is like a group command, not a virtualization. In the end it will present it as twelve different ones, but it will present it as a group?</p>
<p><strong>KC:</strong> Actually, more to the point, a griddisk is an ASM-usable chunk of a celldisk. And if you create, let&#8217;s say, at one blast on a cell, you create twelve griddisks called &#8220;data&#8221; you would have griddisks named &#8220;data01&#8221; through &#8220;data12&#8221;, and they would each be 100GB. And when you go and add them to a diskgroup, you wouldn&#8217;t have to list out all those happenings, you could use wildcard characters to say /0/*data* &#8212; and now you have a diskgroup that consists of twelve 100GB slices of celldisk. [28:55]</p>
<p><strong>CK:</strong> In a way, you&#8217;re exporting the slices?</p>
<p><strong>KC:</strong> Yes.</p>
<p><strong>CK:</strong> Making the slices visible to ASM?</p>
<p><strong>KC:</strong> Yes. griddisk to a cell is a logical object. griddisk to ASM treats it just as if it was a physical disk.</p>
<p><strong>CK:</strong> And what about failure groups? Is this created automatically?</p>
<p><strong>KC:</strong> No, you still use the age-old ASM incantations to create your failure groups. If you just create&nbsp;.&nbsp;.&nbsp;.&nbsp; let&#8217;s say, for instance, you have the world&#8217;s smallest Exadata configuration, which is two cells. If you create a diskgroup of that, you would have the same normal redundancy, and it would create a failgroup out of that. It would all be mirrored.  No disks are mirrored within a cell &#8212; it&#8217;s smart enough between the cells.</p>
<p><strong>CK:</strong> The minimum number of cells is two?</p>
<p><strong>KC:</strong> Yes.  Because if you don&#8217;t have two cells, you have no redundancy.</p>
<p><strong>CK:</strong> Absolutely.</p>
<p><strong>KC:</strong> Okay?</p>
<p><strong>CK:</strong> Perfect! Very cool! Now we can envision how this works. Well, those are my questions for now. I&#8217;m sure there will be a lot more once you start posting again and giving us more details on how all this works. Some part of me wants to ask you, do you have all these [blogs] prepared in advance and you&#8217;re just pushing the &#8220;Publish&#8221; button on a pre-determined day. I&#8217;m hoping you&#8217;re answering some of them as people ask them.</p>
<p><strong>KC:</strong> Are you referring to my blog thread on Exadata Questions and Answers?</p>
<p><strong>CK:</strong> Yes.</p>
<p><strong>KC:</strong> Listeners will know where my blog is because when you post this, of course you&#8217;ll post up a URL to my blog. The way I&#8217;m approaching this is, you know, look &#8212; I&#8217;ve got a day-job. My night-job, at this point in time, is also very exciting because I&#8217;m taking it on myself to disseminate some information about Exadata. So what I do throughout the day, I&#8217;m just like all of you folks &#8212; connected and getting your RSS updates and what-have-you. I&#8217;ll see when people are speaking about Exadata, and if I see something that looks like it&#8217;s fodder for a blog Q&#038;A, I just cut the question and paste it into my working sheet, and when I get around to it I type in an answer, and once I&#8217;ve got something cooked, I post it up to the blog.</p>
<p><strong>CK:</strong> Filling up the gaps.</p>
<p><strong>KC:</strong> Filling in the gaps. You know, honestly &#8212; this is the twenty-first century and it&#8217;s Web 2.0. I think we would be way behind the times to not be handling some of our information dissemination the way we are.  Because if you wanted to collect even the information that we&#8217;ve disclosed in this conversation by going out and trolling through white papers and what-have-you, the odds that you would actually get all of the information are pretty slim.  So how many books and white papers do people want to troll through? I think blogging is a very effective way to get some timely information out, and I&#8217;m hoping I&#8217;ll be able to continue that.</p>
<p><strong>CK:</strong> Very cool. Alright, thank you so much for this interview, Kevin. We really appreciate your finding the time to talk to us. We&#8217;ll see you in the blogs, I guess.</p>
<p><strong>KC:</strong> Indeed. And I appreciate the opportunity to do this. Gee whiz, wouldn&#8217;t it be good if we could do this again some time!</p>
<p><strong>CK:</strong> Oh, absolutely! Let&#8217;s wait until the gap gets a bit bigger, and then we&#8217;ll get it in.</p>
<p><strong>KC:</strong> Okay, very good!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pythian.com/news/1267/interview-kevin-closson-on-the-oracle-exadata-storage-server/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
<enclosure url="http://www.pythian.com/blogs/wp-content/uploads/closson-interview.m3u" length="76" type="audio/x-mpegurl" />
		</item>
		<item>
		<title>Analysis of the Oracle Exadata Storage Server and Database Machine</title>
		<link>http://www.pythian.com/news/1262/analysis-of-the-oracle-exadata-storage-server-and-database-machine/</link>
		<comments>http://www.pythian.com/news/1262/analysis-of-the-oracle-exadata-storage-server-and-database-machine/#comments</comments>
		<pubDate>Thu, 25 Sep 2008 16:04:55 +0000</pubDate>
		<dc:creator>Christo Kutrovsky</dc:creator>
				<category><![CDATA[Oracle]]></category>
		<category><![CDATA[data warehouse]]></category>
		<category><![CDATA[database machine]]></category>
		<category><![CDATA[exadata]]></category>
		<category><![CDATA[storage server]]></category>

		<guid isPermaLink="false">http://www.pythian.com/blogs/1262/analysis-of-the-oracle-exadata-storage-server-and-database-machine</guid>
		<description><![CDATA[*Updated* see comments.
Exadata &#8212; the smart storage server. I am definitely excited about this product, but my point of view is a bit different.
It&#8217;s fast, and much faster than anything out there right now. But how many shops will actually need this? How many shops can spend 2.2 million dollars on hardware and equipment?
What are [...]]]></description>
			<content:encoded><![CDATA[<p><em>*Updated* see comments.</em><br />
Exadata &#8212; the smart storage server. I am definitely excited about this product, but my point of view is a bit different.</p>
<p>It&#8217;s fast, and much faster than anything out there right now. But how many shops will actually need this? How many shops can spend 2.2 million dollars on hardware and equipment?</p>
<p>What are the products, in a nutshell? The Oracle Exadata Storage Server <a href="http://www.oracle.com/technology/products/bi/db/exadata/pdf/exadata-datasheet.pdf">(Data Sheet, PDF)</a>:</p>
<ul>
<li>2U Storage &#8220;unit&#8221; with either 1 TB SAS or 3.3 TB SATA redundant capacity. There is a query processor in the box that can &#8220;offload&#8221; tasks from the main database server: primary filtering, decompression, joins, and backups.</li>
<li>Storage units are linked to database servers via dual <a href="http://www.infinibandta.org/" rel="nofollow">InfiniBand</a>, offering 20 Gbit/s (2.5 GBytes/sec) of bandwidth.</li>
</ul>
<p>The Database Machine (<a href="http://www.oracle.com/solutions/business_intelligence/docs/database-machine-datasheet.pdf">Data Sheet, PDF</a>):</p>
<ul>
<li>A standard 42U rack with 8 database servers and 12 Exadata storage servers.</li>
<li>Pre-installed Linux <strong><em>and</em></strong> Oracle. Pre-configured.</li>
<li>In 8 servers &#8212; a total of 256 GB RAM, 64 Intel cores @ 2.66 GHz, InfiniBand-ed and gigabit-switched.</li>
</ul>
<p>The cost for one Database Machine: $2.33M ($650,000 + $1,680,000 in software), <a href="http://www.oraclenerd.com/images/exadata_pricing.jpg">as grabbed from Larry&#8217;s keynote (thanks, Chet)</a>. I called the &#8220;call us now&#8221; phone number mentioned on the Oracle Exadata website to ask for pricing. They had no idea what I was asking about, and I&#8217;m still waiting on a salesperson to call me back. (Hint for Oracle: educate your sales staff about new products, just in case I decide to buy one the day after you announce it.)</p>
<p>You have to realize how &#8220;cheap&#8221; this is. It comes down to roughly $25,000 per core for Oracle EE, RAC, and Partitioning! On top of that, you get extra &#8220;free&#8221; CPUs for decompression, filtering, joins, and backups. That&#8217;s a good deal. Oh, did I mention you can interconnect several 42U racks?</p>
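<p>For reference, the per-core figure can be checked against the prices quoted above. This is only a back-of-the-envelope sketch: it assumes the 64 cores listed for the Database Machine and takes the keynote prices at face value; the column aliases are illustrative names, not anything from Oracle.</p>
<pre>
-- Prices as quoted above, spread over the 64 cores in the 8 database servers
select 1680000 / 64            as software_per_core,
       (650000 + 1680000) / 64 as total_per_core
from dual;
</pre>
<p>The software portion works out to $26,250 per core, in the same ballpark as the rounded figure above. With the remaining $650,000 included, it is about $36,400 per core.</p>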
<p>Back to the main question, what problems does this product solve?</p>
<p><span id="more-1262"></span></p>
<h3>Configuration</h3>
<p>That&#8217;s right, the number one problem this product solves is configuration. 90% of the problems I&#8217;ve seen are due to improperly configured systems. I am not talking about <code>init.ora</code> settings here, or design, or indexing, or any of that. I am talking about configuration mistakes all over the place. Starting from the bottom up, these are the most common mistakes:</p>
<ul>
<li>buying large disks without accounting for I/O bandwidth delivery</li>
<li>mis-configuring them in big meta-arrays (EMC style) with non-aligned stripe sizes. (See &#8220;turn-offs&#8221; in <a href="http://www.pythian.com/blogs/505/christo-kutrovsky-oracle-pinup">Christo Kutrovsky, Oracle Pinup</a>)</li>
<li>sharing the spindles for redo, datafiles, backup, and a bunch of databases (3par style), thus ensuring that I/O is never sequential</li>
<li>getting single-channel Fibre Channel connectors to the database server</li>
<li>not configuring direct I/O or asynchronous I/O, or the largest possible <code>db_file_multiblock_read_count</code></li>
<li>not using ASM. ASM reduces overhead and manages data in extents of 1 MB or larger (larger extents are an 11g feature), which means sequential data!</li>
<li>not using parallel query properly: relying on default values or, given all of the above, simply not getting the bandwidth to perform</li>
</ul>
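<p>The database-side items in this list can at least be inspected quickly. The query below is a minimal sketch, not a tuning recipe: the choice of these three parameters is an assumption about what is worth checking first, and the right values depend on your platform and storage.</p>
<pre>
-- Current I/O-related settings for this instance
-- (the parameter selection here is illustrative, not exhaustive)
select name, value
from v$parameter
where name in ('db_file_multiblock_read_count',
               'filesystemio_options',
               'disk_asynch_io')
order by name;
</pre>
<p>The storage-side mistakes (spindle sharing, stripe alignment, single-channel connectors) do not show up here at all, which is part of why they go unnoticed.</p>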
<p>Using Exadata <em>necessarily and immediately solves</em> all of these issues. You don&#8217;t have a choice: you get more I/O bandwidth when you buy extra space; there&#8217;s no other way.</p>
<p>No expensive consultants to install your system. From the DBA perspective it&#8217;s heaven &#8212; no arguing with storage people for dedicated spindles, no arguing with CIOs about big vs. small disks, no arguing with system administrators for ASM. No hiring of expensive consultants to &#8220;tune&#8221; the system or apply best practices.</p>
<p>You may laugh at all of the above issues, but many shops are exactly like that. Especially the big ones (the target market for Exadata), where everyone is too afraid to change anything in case they get blamed if it doesn&#8217;t work.  The &#8220;best practices&#8221; are the <em>only</em> practices with the Database Machine.</p>
<p>To maximize performance, you have to get <em>all</em> the pieces together. Then and only then will you get all the benefits. And this is quite difficult to achieve, especially in large shops where several entire departments are involved.</p>
<p>In all my experience at Pythian, there has been only one client who, thanks to a combination of good managers, trust, and desire for performance, would follow my recommendations exactly. And you know what? <a href="http://www.pythian.com/blogs/204/asm-multi-disk-performance">They are getting their 400 MB/sec</a>. The new server is reaching 800 MB/sec with dual 4 Gbit Fibre Channel.</p>
<p>Here are some interesting aspects of the Oracle Exadata Storage Server.</p>
<h3>Performance</h3>
<p>The data sheet presents two options: 1 TB on SAS with 1000 MB/sec of bandwidth, or 3.3 TB on SATA with 750 MB/sec. Compression comes as an &#8220;extra&#8221;: a typical data warehouse compresses 2-3 times, so your effective bandwidth will be 2000-3000 MB/sec from a <strong>single</strong> Exadata server.</p>
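<p>The effective-bandwidth arithmetic can be sketched as follows. It assumes the data-sheet rates quoted above and a 2x to 3x compression ratio; applying the same ratio to the SATA option is an extrapolation here, not a figure from the data sheet.</p>
<pre>
-- Raw scan rate (MB/sec) multiplied by the assumed compression ratio
select 1000 * 2 as sas_low,  1000 * 3 as sas_high,
       750 * 2  as sata_low, 750 * 3  as sata_high
from dual;
</pre>
<p>That gives 2000 to 3000 MB/sec of user data for SAS and, under the same assumption, 1500 to 2250 MB/sec for SATA.</p>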
<h3>Redundancy</h3>
<p>Mirroring is provided by ASM (either 2- or 3-way). It is also performed across Exadata storage servers (does that mean a minimum of 2?).</p>
<p>Disk failure does <em><strong>not</strong></em> abort queries or transactions.</p>
<p>Failure of an entire Exadata Storage Server does <em>abort</em> queries or transactions, but with no data loss. This is important to know when calculating risk.</p>
<h3>Manageability</h3>
<p>There&#8217;s a plug-in available for 10g Enterprise Manager, a GUI to manage all that.  Absolutely mandatory in my opinion.</p>
<h3>Conclusion</h3>
<p>Oracle has solved the communication issues for big shops, and the result is indeed extreme performance. Don&#8217;t get me wrong here, the Exadata is a brilliant idea that will solve some very difficult, specific problems for large data warehouse shops. But the Database Machine will do something much more real, that will help far more people: it will make it impossible for people to mis-configure their database systems.</p>
<p>Hats off to Oracle for releasing a product that solves a problem we are facing every day: convincing clients to get the right hardware setup for their database workload.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pythian.com/news/1262/analysis-of-the-oracle-exadata-storage-server-and-database-machine/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Oracle&#8217;s Secret New Feature: Educated Guesses</title>
		<link>http://www.pythian.com/news/1246/oracles-secret-new-feature-educated-guesses/</link>
		<comments>http://www.pythian.com/news/1246/oracles-secret-new-feature-educated-guesses/#comments</comments>
		<pubDate>Mon, 22 Sep 2008 20:52:32 +0000</pubDate>
		<dc:creator>Christo Kutrovsky</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Non-Tech Articles]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[11gR2]]></category>
		<category><![CDATA[ASM]]></category>
		<category><![CDATA[Cache Fusion]]></category>
		<category><![CDATA[Extreme Performance]]></category>
		<category><![CDATA[in-memory]]></category>
		<category><![CDATA[Kevin Closson]]></category>
		<category><![CDATA[Larry Ellison]]></category>
		<category><![CDATA[new feature]]></category>
		<category><![CDATA[OOW]]></category>
		<category><![CDATA[RAC]]></category>

		<guid isPermaLink="false">http://www.pythian.com/blogs/1246/oracles-secret-new-feature-educated-guesses</guid>
		<description><![CDATA[Larry Ellison is announcing a major new feature this Wednesday at Open World. For the first time in a while, his keynote is dedicated to the &#8220;database&#8221; as opposed to the usual high level ERP/Apps/Fusion. Even the title of his keynote is catchy &#8212;  &#8220;Extreme Performance&#8221;.
Oracle has been keeping the new feature a secret. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://74.125.95.104/search?q=cache:83JKHhLpiPIJ:www.oracle.com/technology/products/applications/events/oow-2008/oow08_sf_focus_ebs-appstech.pdf+larry+ellison+oow+%22extreme+performance%22&#038;hl=en&#038;ct=clnk&#038;cd=1&#038;client=firefox-a">Larry Ellison is announcing a major new feature this Wednesday at Open World</a>. For the first time in a while, his keynote is dedicated to the &#8220;database&#8221; as opposed to the usual high level ERP/Apps/Fusion. Even the title of his keynote is catchy &#8212;  &#8220;Extreme Performance&#8221;.</p>
<p>Oracle has been keeping the new feature a secret. Even the 11gR2 beta program had very few participants, to prevent information from leaking out. It&#8217;s, &#8220;Something&#8217;s coming, but I am not telling what.&#8221;</p>
<p>Okay, it worked on me, I&#8217;m excited about it. Let&#8217;s think about what it could be. What single database feature is so major that Larry himself will announce it during OpenWorld?</p>
<p>What do we know so far?</p>
<ul>
<li>Starting with the obvious, Larry&#8217;s keynote is &#8220;Extreme Performance&#8221;, so it&#8217;s related to performance.</li>
<li>We know Kevin Closson has worked on it &#8211; he had a blog entry saying &#8220;I am working on something big&#8221; that got pulled off the web.  (<a href="http://74.125.95.104/search?q=cache:yWT4AjFXbnoJ:kevinclosson.wordpress.com/2008/06/26/of-gag-orders-excitement-and-new-products/+site:kevinclosson.wordpress.com+larry&#038;hl=en&#038;ct=clnk&#038;cd=4&#038;gl=ca">Here&#8217;s Google&#8217;s cache.</a>)</li>
</ul>
<p>Given these two points, let&#8217;s think about it further. What do we know <a href="http://kevinclosson.wordpress.com/about/">about Kevin</a>?</p>
<ul>
<li>He worked for PolyServe &#8212; a company whose main product is a cluster file system.</li>
<li>He worked for Sequent on NUMA systems, which in today&#8217;s world are pretty close to cluster software with a very fast, low-latency interconnect.</li>
<li>He is an expert in storage systems and disk performance.</li>
<li>He joined Oracle recently, possibly to work on this secret project.</li>
<li>He must be really excited about it, to post <em>anything</em> on his blog under radio silence.</li>
</ul>
<p>I  think it&#8217;s something related to storage, something new and revolutionary about storage. But what?</p>
<p>We already know, from leaks on certain websites, that ASM will become a cluster filesystem which will allow storing OCR files, as well as user files, on the ASM disks.</p>
<p>But is this big enough? It&#8217;s definitely significant. Now you get a &#8220;free&#8221; reliable, cluster file system with Oracle. I don&#8217;t think it&#8217;s big enough though. Oracle already had OCFS and OCFS2. So it&#8217;s not something new to release a filesystem. And even if ASM becomes a true filesystem, that would not provide such a significant performance boost to warrant a keynote called &#8220;Extreme Performance&#8221;. An ASM filesystem would be a major manageability feature, not so much a performance feature.</p>
<p>That being ruled out, what could it be?</p>
<p>Recently, when setting up a new 11g database on a server with 128 GB of RAM, I was setting up hugepages as usual, and thinking about how big my cache would be. It struck me that the cache would be bigger than the database for quite a while. Why do we even need the SAN/datafiles?!</p>
<p>Then it hit me.</p>
<p><span id="more-1246"></span></p>
<p>We don&#8217;t! We don&#8217;t need them at all!</p>
<p>What are the main storage components of a production database?</p>
<ul>
<li>the redo logs &#8212; to guarantee crash recovery</li>
<li>the datafiles &#8212; primary storage</li>
<li>the backups &#8212; mandatory for a production system</li>
<li>the SGA &#8212; why is this part of storage? Well, because you can&#8217;t have a database without some fast in-memory storage, right?</li>
</ul>
<p>If you have sufficient SGA (RAM) to load your entire database (datafiles), why do you need the datafiles?<br />
I am sure you are immediately thinking: what if the database crashes?</p>
<p>Remember what the recent push at Oracle is: <a href="http://www.oracle.com/technologies/grid"><em>grid computing</em></a>.</p>
<p>Picture a RAC database: 8 nodes, 128 GB of RAM each, totaling 1 TB of storage. Add 3-way mirroring and you still get over 300 GB (about 500 GB with 2-way) of highly redundant, extremely fast storage. A true, native &#8220;in-memory&#8221; cluster database. A true &#8220;shared nothing&#8221; cluster database.</p>
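<p>The capacity arithmetic is simple enough to sketch. This assumes, unrealistically, that every byte of RAM on every node is available for data, so treat the results as upper bounds; the column aliases are illustrative.</p>
<pre>
-- 8 nodes x 128 GB each, divided by the number of mirrored copies
select 8 * 128            as total_gb,
       8 * 128 / 2        as usable_2way_gb,
       round(8 * 128 / 3) as usable_3way_gb
from dual;
</pre>
<p>That is 1024 GB in total, 512 GB usable with 2-way mirroring, and about 341 GB with 3-way mirroring.</p>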
<p>Even if you do not consider the performance increase, the redundancy level goes up. You no longer have a &#8220;central&#8221; SAN to rely on. Maybe you have two mirrored SANs in your enterprise to protect you against such failures. How about none?</p>
<p>Let&#8217;s keep moving with that idea. How can Oracle achieve it? What technologies would be needed?</p>
<p>I think Oracle already has all the required technologies to achieve this &#8220;extreme performance&#8221;. It&#8217;s just a matter of connecting them.</p>
<p>And the answer is <a href="http://en.wikipedia.org/wiki/Oracle_RAC#Cache_Fusion">Cache Fusion</a>. But how? Imagine this scenario. During database startup you would &#8220;restore&#8221; your database from your backups (compressed or not) directly into memory. Remember, that&#8217;s 8 nodes doing the decompression and reading, so starting up won&#8217;t really take much time.</p>
<p>Once the database is up, cache fusion will take care of the rest: sending blocks over the interconnect, keeping past images, keeping and managing multiple copies. Oracle already does this, just not for redundancy reasons. Look at my <a href="http://www.pythian.com/blogs/282/oracle-rac-cache-fusion-efficiency-a-buffer-cache-analysis-for-rac">Cluster Efficiency query</a>.</p>
<p>If a node (or two) goes down, who cares? All the data is already replicated 2- or 3-way. In the event all nodes go down, Oracle would still keep the online redo logs for archival purposes. Or maybe not? Replicated in-memory redo? Why not?</p>
<p>In fact, the only real changes are:</p>
<ul>
<li>backup will be restored into memory</li>
<li>no dbwriter &#8212; no datafiles to write to</li>
<li>cache fusion block replication for redundancy</li>
</ul>
<p><img src='http://www.pythian.com/blogs/wp-content/uploads/new-feature-graphs.png' alt='cluster db with no datafiles'  title='cluster db with no datafiles' /></p>
<p>The result? &#8220;Extreme Performance.&#8221; Now that&#8217;s definitely worthy of a keynote by Larry himself.</p>
<p>A major innovation indeed. For Oracle, at least. MySQL Cluster databases are already all in memory. In fact, that&#8217;s the only way they can run, and the community sees it as a limitation precisely because there is no alternative.</p>
<p>Oracle doesn&#8217;t need to make the feature exclusive for the entire database. This may be a tablespace level feature, or even a table/partition level one. Then you would really be in control of which areas of your database get &#8220;extreme performance&#8221;. Think of the possibilities.</p>
<p>We were brainstorming with Paul Vallée on what the new feature could be. Paul&#8217;s idea was slightly different than mine. He envisioned ASM to be the driving technology behind an all-in-memory database. ASM already has 2- and 3-way mirroring. The change would be minor &#8212; instead of creating disks out of LUNs, they would be created out of RAM. ASM would take care of the inter-node replication.</p>
<p>If Oracle had an all-in-memory database done with ASM, you would still have to &#8220;read&#8221; the data into the buffer cache, introducing double-buffering. This would be a step <em>back</em>, actually. In the PC world, Windows NT/2000 revolutionized caching from DOS/Windows95. The merging of the file system cache with the execution memory was a significant step forward to avoid double buffering.  And this would limit the granularity of what is &#8220;all-in-memory&#8221;.</p>
<p>This is how Paul&#8217;s idea looks:</p>
<p><img src='http://www.pythian.com/blogs/wp-content/uploads/pauls-vision.png' alt='In-memory database via ASM in-memory disks.'  title='In-memory database via ASM in-memory disks.'  /></p>
<p>We have our bets. What&#8217;s yours? Please throw in some wild guesses. The winner (the earliest correct guess) gets a Pythian Maestro shell shipped to him or her. (NOTE: I was going to write &#8220;does not apply to Oracle employees&#8221;, but I decided to give them a chance too. As long as you don&#8217;t know and you are guessing, you can try).</p>
<p>Here&#8217;s  Darrin Leboeuf, Pythian&#8217;s V.P. Client Services, modeling the Pythian Maestro shell.</p>
<p><img src='http://www.pythian.com/blogs/wp-content/uploads/shell.jpg' alt='Darrin Leboeuf models the Pythian Maestro shell.'  title='Darrin Leboeuf models the Pythian Maestro shell.' /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.pythian.com/news/1246/oracles-secret-new-feature-educated-guesses/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>Recent Spike Report from v$active_session_history (ASH)</title>
		<link>http://www.pythian.com/news/922/recent-spike-report-from-vactive_session_history-ash/</link>
		<comments>http://www.pythian.com/news/922/recent-spike-report-from-vactive_session_history-ash/#comments</comments>
		<pubDate>Tue, 15 Apr 2008 16:05:26 +0000</pubDate>
		<dc:creator>Christo Kutrovsky</dc:creator>
				<category><![CDATA[Oracle]]></category>
		<category><![CDATA[ash]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[RAC]]></category>

		<guid isPermaLink="false">http://www.pythian.com/blogs/922/recent-spike-report-from-vactive_session_history-ash</guid>
		<description><![CDATA[For the past few months I&#8217;ve been using a query that I refer to as &#8220;ash report &#8211; recent spike&#8221;. That&#8217;s the second thing I do when I get a call of the &#8220;the system is slow&#8221; type. The first thing I do is run &#8220;top&#8221; (or whichever alternative for the OS) and check the [...]]]></description>
			<content:encoded><![CDATA[<p>For the past few months I&#8217;ve been using a query that I refer to as &#8220;ash report &#8211; recent spike&#8221;. That&#8217;s the second thing I do when I get a call of the &#8220;the system is slow&#8221; type. The first thing I do is run &#8220;top&#8221; (or whichever alternative for the OS) and check the overall CPU usage.</p>
<p>The script is fully RAC-aware, and although it&#8217;s not 100% perfect, I use this imperfection to see if any particular node is doing something stupid. Although it is primarily targeted for OLTP systems, it can be useful for data warehouses as well, especially if they use the parallel option.</p>
<p>The query is to the database what &#8220;load&#8221; (uptime) is to a Linux/Unix machine, except that it has much more detail. It is basically a summarization query on the <code>v$active_session_history</code> view. NOTE: ASH is part of the Oracle Diagnostics Pack, so you need that license to use it. The query is not designed to be nicely aligned, or even read. It is best to leave it on just a few lines and concentrate on the results.</p>
<p>It has two &#8220;variables&#8221; that you can adjust: how far back to look (I use two hours), and how aggressively to look for problems (<code>having count(*) >= 2</code>).</p>
<p>An explanation of how to make sense of the results follows the query.</p>
<p><span id="more-922"></span></p>
<pre>
select round(avg(max(cnt_tot)) over (order by sample_time RANGE BETWEEN INTERVAL '5' minute PRECEDING AND current row)) as avg,max(cnt_tot) as tot,
SAMPLE_time,max(cnt) as cnt, event,substr(sq.sql_text,1), ash.sql_id, ash.sql_child_number chd,/* plsql_entry_object_id, plsql_entry_subprogram_id, plsql_object_id, plsql_subprogram_id, session_state, qc_session_id, qc_instance_id, blocking_session, blocking_session_status, blocking_session_serial#, event_id, event#, seq#, p1text, p1, p2text, p2, p3text, p3, wait_class, wait_class_id, wait_time, time_waited, xid, current_obj#, current_file#, current_block#, program, module, action, client_id*/to_char(round(sum(elapsed_time) / nullif(sum(executions), 0) / 1000000, 6), '9,999,999,990.999999') as "sec p",round(sum(disk_reads) / nullif(sum(executions), 0), 0) as "disk p", round(sum(buffer_gets) / nullif(sum(executions), 0), 0) as "gets p",round(sum(rows_processed) / nullif(sum(executions), 0), 0) as "rows p",round(sum(cpu_time) / 1000000 / nullif(sum(executions), 0), 3) as "cpu p", sum(executions) as exec, sum(users_opening) as open,sum(users_executing) as e
from ( select sum(count(*)) over (partition by sample_time) as cnt_tot, count(*) as cnt,SAMPLE_time, event,sql_id,sql_child_number from gv$active_session_history where 1=1
and sample_time > sysdate - interval '2' hour
group by event, sql_id,sql_child_number, SAMPLE_time having count(*) >= 2) ash, gv$sql sq
where  ash.sql_id = sq.sql_id(+) and ash.sql_child_number=sq.child_number (+)
group by event,sql_text,ash.sql_id,ash.sql_child_number, ash.SAMPLE_time order by sample_time desc
</pre>
<p>The query concentrates on the idea, &#8220;you shouldn&#8217;t have too many sessions doing the same thing at the same moment&#8221;. Basically, anytime you have multiple sessions running the same query and waiting on the same event, they are grouped together.</p>
<p>Here are the explanations of each column:</p>
<p><code>AVG</code> &#8212; the average &#8220;load&#8221; (active sessions) over a 5-minute interval. This should help you spot a problem when you scroll through the results.<br />
<code>TOT</code> &#8212; total &#8220;load&#8221; (active sessions) for that sample time. RAC users: each RAC node will have its own sample time, within 1 second of each other, but not exactly spot-on. So, even if you have sessions waiting on the same event, they will not be grouped together. I kind of like it this way, for now.<br />
<code>SAMPLE_TIME</code> &#8212; self-explanatory<br />
<code>CNT</code> &#8212; the number of active sessions waiting on the same event and query<br />
<code>EVENT</code> &#8212; the event being waited on<br />
<code>SQL_TEXT</code> &#8212; self-explanatory, except when empty, which means either not found in the shared pool or not available in ASH<br />
<code>SQL_ID</code> &#8212; if you need to find the SQL<br />
<code>CHD</code> &#8212; the child number being executed</p>
<p>NOTE: the following columns are as of &#8220;now&#8221; and not as of the <code>SAMPLE_TIME</code>.</p>
<p><code>sec p</code>, <code>disk p</code>, <code>gets p</code>, <code>rows p</code>, <code>cpu p</code> &#8212; these are average statistics for the query being executed. This should give you a quick overview of whether the query is a big query, a small query, a CPU-intensive or a disk I/O-intensive query. Be careful: these columns are averages over all executions since the query was loaded into the shared pool, and could therefore be misleading. They are all per-execution statistics, so <code>sec p</code> represents the number of seconds it took, on average, to execute the query.</p>
<p><code>OPEN</code> &#8212; comes from <code>v$sql</code>: the number of sessions that have the query open<br />
<code>E</code> &#8212; the number of sessions actively executing the query</p>
<p>I generally use <code>having count(*) >= 2</code> &#8212; the lowest reasonable setting to get an overview of what&#8217;s happening on the server. This usually doesn&#8217;t show many rows on my servers. It really depends on how busy your server is. You should play with this &#8220;filter&#8221; to see only what you are interested in.</p>
<p>If you are looking for a problem, you can raise that threshold to, say, 10 and remove the time restriction in order to find times when the database was particularly busy.</p>
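<p>As a rough illustration of that variation, here is a stripped-down sketch of just the inner grouping, with the threshold raised to 10 and no <code>sample_time</code> filter. It is not the full report: the join to <code>gv$sql</code> and the rolling average are left out to keep the idea visible.</p>
<pre>
-- Sample times at which 10 or more sessions were running the same
-- statement and waiting on the same event, across all retained ASH data
select sample_time, event, sql_id, sql_child_number, count(*) as cnt
from gv$active_session_history
group by sample_time, event, sql_id, sql_child_number
having count(*) >= 10
order by sample_time desc;
</pre>
<p>Keep in mind that ASH is an in-memory buffer, so &#8220;no time restriction&#8221; only reaches back as far as the samples that are still retained.</p>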
<p>Feel free to ask if you have any questions.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pythian.com/news/922/recent-spike-report-from-vactive_session_history-ash/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
