Stabilize Oracle 10G's Bind Peeking Behaviour

Written by Pythian Marketing | Mar 18, 2008 4:00:00 AM

I wrote this post because I feel there is a great need for it. The number of people struggling with unstable query plans due to bind peeking in Oracle 10G is enormous, to say the least. More than that, solutions like disabling bind variable peeking are driving us away from understanding the root cause of the problem and applying the right fix to it.

What are the causes of unstable plans due to bind variable peeking?

There are three things that might put you at risk of unstable plans due to bind variable peeking. Those are histograms, partitions, and range-based predicates. I'll cover last two in upcoming blog posts.

Histograms

Let me share with you my déja vu. When I see this:

SQL> select value
  2  from v$parameter
  3  where name='_optim_peek_user_binds';
 
 VALUE
 --------------------
 FALSE

I immediately expect this:

SQL> select sum(case when max_cnt > 2 then 1 else 0 end) histograms,
  2  sum(case when max_cnt <= 2 then 1 else 0 end) no_histograms
  3 from (
  4  select table_name, max(cnt) max_cnt
  5  from (
  6  select table_name, column_name, count(*) cnt
  7  from dba_tab_histograms
  8  group by table_name, column_name
  9  ) group by table_name
  10 );
 
 HISTOGRAMS NO_HISTOGRAMS
 ---------- -------------
  1169 2494

The above is an example from a real-world OLTP system running with bind peeking disabled. It is no surprise to me. An exception, you say? Here's another one . . .

SQL> select value
  2  from v$parameter
  3  where name='_optim_peek_user_binds';
 
 VALUE
 --------------------
 FALSE
 
 SQL> 
 SQL> select sum(case when max_cnt > 2 then 1 else 0 end) histograms,
  2  sum(case when max_cnt <= 2 then 1 else 0 end) no_histograms
  3 from (
  4  select table_name, max(cnt) max_cnt
  5  from (
  6  select table_name, column_name, count(*) cnt
  7  from dba_tab_histograms
  8  group by table_name, column_name
  9  ) group by table_name
  10 );
 
 HISTOGRAMS NO_HISTOGRAMS
 ---------- -------------
  304 521

Here comes my déja vu. If you tell me that you disabled bind peeking, my immediate response will be "do you have a lot of unnecessary histograms?" On an OLTP system, there is no way that you need histograms on a third of your tables (and I can hardly think of any DSS system where this amount of histograms can be justified).

How histograms and bind variable peeking can cause an unstable plan

I'll give you a simple example:

SQL> create table t as
  2  select case when level <= 9900 then 0 else level-9900 end n,
  3  rpad('*', 100, '*') v
  4  from dual
  5  connect by level <= 10000;
 
 Table created
 
 SQL> create index i_t_n on t (n);
 
 Index created
 
 SQL> exec dbms_stats.gather_table_stats(user, 't', method_opt => 'for columns n size 254', cascade => true);
 
 PL/SQL procedure successfully completed

In other words, we have a table T with 10000 rows, where 9900 rows have N=0 and 100 rows have N from 1 to 100. The histogram tells Oracle about this data distribution:

SQL> select column_name, endpoint_number, endpoint_value
  2  from user_tab_histograms
  3  where table_name='T';
 
 COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE
 ---------- --------------- --------------
 N 9900 0
 N 9901 1
 N 9902 2
 N 9903 3
 N 9904 4
 N 9905 5
 ...skipped for clarity...
 N 9997 97
 N 9998 98
 N 9999 99
 N 10000 100
 
 101 rows selected

The above is known as a skewed data distribution, and a histogram can help Oracle to choose the right plan, depending on the value:

SQL> set autot traceonly explain
 SQL> select * from t where n=0;
 
 Execution Plan
 ----------------------------------------------------------
 Plan hash value: 1601196873
 
 --------------------------------------------------------------------
 | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
 --------------------------------------------------------------------
 | 0 | SELECT STATEMENT | | 9900 | 531K| 42 (0)| 00:00:01 |
 |* 1 | TABLE ACCESS FULL| T | 9900 | 531K| 42 (0)| 00:00:01 |
 --------------------------------------------------------------------
 
 Predicate Information (identified by operation id):
 ---------------------------------------------------
 
  1 - filter("N"=0)
 
 SQL> select * from t where n=1;
 
 Execution Plan
 ----------------------------------------------------------
 Plan hash value: 2912310446
 
 --------------------------------------------------------------------
 | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
 --------------------------------------------------------------------
 | 0 | SELECT STATEMENT | | 1 | 55 | 2 (0)| 00:00:01 |
 | 1 | TABLE ACCESS BY INDEX ROWID| T | 1 | 55 | 2 (0)| 00:00:01 |
 |* 2 | INDEX RANGE SCAN | I_T_N | 1 | | 1 (0)| 00:00:01 |
 --------------------------------------------------------------------
 
 Predicate Information (identified by operation id):
 ---------------------------------------------------
 
  2 - access("N"=1)

That is, Oracle is able to choose an index range scan when N=1 (returns only one row) and do an FTS when N=0 (returns 9900 rows) -- perfect execution plans given the conditions. What does this have to do with bind peeking you ask? Well, imagine that you've submitted the following query:

select * from t where n=:n;

We used a bind variable in place of a literal. On a hard parse, Oracle will peek at the value you've used for :n, and will optimize the query as if you've submitted the same query with this literal instead. The problem is that, in 10G, bind variable peeking happens only on a hard parse, which means that all following executions will use the same plan, regardless of the bind variable value. This is easy enough to demonstrate:

SQL> variable n number;
 SQL> exec :n:=0;
 
 PL/SQL procedure successfully completed.
 
 SQL> set autot traceonly stat
 
 SQL> select * from t where n=:n;
 
 9900 rows selected.
 
 Statistics
 ----------------------------------------------------------
  982 recursive calls
  0 db block gets
  951 consistent gets
  0 physical reads
  0 redo size
  106664 bytes sent via SQL*Net to client
  7599 bytes received via SQL*Net from client
  661 SQL*Net roundtrips to/from client
  12 sorts (memory)
  0 sorts (disk)
  9900 rows processed
 
 SQL> exec :n:=1;
 
 PL/SQL procedure successfully completed.
 
 SQL> select * from t where n=:n;
 
 Statistics
 ----------------------------------------------------------
  0 recursive calls
  0 db block gets
  155 consistent gets
  0 physical reads
  0 redo size
  476 bytes sent via SQL*Net to client
  350 bytes received via SQL*Net from client
  2 SQL*Net roundtrips to/from client
  0 sorts (memory)
  0 sorts (disk)
  1 rows processed

You can see that both executions used the same cursor with a full table scan:

SQL> select sql_id, executions, child_number
  2 from v$sql
  3 where sql_text = 'select * from t where n=:n';
 
 SQL_ID EXECUTIONS CHILD_NUMBER
 ------------- ---------- ------------
 g2n32un6t1c55 2 0
 
 SQL> select * from table(dbms_xplan.display_cursor('g2n32un6t1c55'));
 
 PLAN_TABLE_OUTPUT
 --------------------------------------------------------------------
 SQL_ID g2n32un6t1c55, child number 0
 -------------------------------------
 select * from t where n=:n
 
 Plan hash value: 1601196873
 
 --------------------------------------------------------------------
 | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
 --------------------------------------------------------------------
 | 0 | SELECT STATEMENT | | | | 42 (100)| |
 |* 1 | TABLE ACCESS FULL| T | 9900 | 531K| 42 (0)| 00:00:01 |
 --------------------------------------------------------------------
 
 Predicate Information (identified by operation id):
 ---------------------------------------------------
 
  1 - filter("N"=:N)
 
 18 rows selected.

Both queries would use an index range scan if I had started my example with exec :n:=1 instead of exec :n:=0. Now, if 99% of your queries are not interested in querying the above table with :N:=0, then you want the plan with an index range scan because it will provide optimal performance most of the time, while resulting in suboptimal performance for only 1% of executions. With the histogram in place, one day you will be unlucky enough to have Oracle hard parse the query with a bind variable value of 0, which will force everyone else to use an FTS (as was demonstrated above), which in turn will result an abysmal performances for 99% of executions (until the next hard parse when you might get lucky again). And if you have a system where a third of the tables have histograms on them then -- I think you probably get the idea now.

What to do?

Well, just get rid of any histogram that does nothing but messes up your execution plans. That's easy enough:

SQL> exec dbms_stats.gather_table_stats(user, 't', method_opt => 'for columns n size 1', cascade => true);
 
 PL/SQL procedure successfully completed

Oracle will no longer have distribution information, and you'll get the same plan regardless of the value:

SQL> select * from t where n=0;
 
 Execution Plan
 ----------------------------------------------------------
 Plan hash value: 2912310446
 
 --------------------------------------------------------------------
 | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
 --------------------------------------------------------------------
 | 0 | SELECT STATEMENT | | 99 | 5445 | 3 (0)| 00:00:01 |
 | 1 | TABLE ACCESS BY INDEX ROWID| T | 99 | 5445 | 3 (0)| 00:00:01 |
 |* 2 | INDEX RANGE SCAN | I_T_N | 99 | | 1 (0)| 00:00:01 |
 --------------------------------------------------------------------
 
 Predicate Information (identified by operation id):
 ---------------------------------------------------
 
  2 - access("N"=0)
 
 SQL> select * from t where n=1;
 
 Execution Plan
 ----------------------------------------------------------
 Plan hash value: 2912310446
 
 --------------------------------------------------------------------
 | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
 --------------------------------------------------------------------
 | 0 | SELECT STATEMENT | | 99 | 5445 | 3 (0)| 00:00:01 |
 | 1 | TABLE ACCESS BY INDEX ROWID| T | 99 | 5445 | 3 (0)| 00:00:01 |
 |* 2 | INDEX RANGE SCAN | I_T_N | 99 | | 1 (0)| 00:00:01 |
 --------------------------------------------------------------------
 
 Predicate Information (identified by operation id):
 ---------------------------------------------------
 
  2 - access("N"=1)

No more surprises! This also means that you'll get the same plan regardless of the bind variable value during a hard parse. However, the real question is . . .

What is making all those histograms I didn't ask for?

It is a default behavior. With the declaration of RBO's obsolescence in 10G, we were also presented with a default gather stats job in every 10G database. This jobs runs with a whole bunch of AUTO parameters, but one parameter is of particular interest to us:

SQL> select dbms_stats.get_param('method_opt') method_opt from dual;
 
 METHOD_OPT
 --------------------------------------------------------------------
 FOR ALL COLUMNS SIZE AUTO

The SIZE ... part controls histograms collection. You can get the definition of AUTO in the Oracle documentation:

AUTO: Oracle determines the columns to collect histograms based on data distribution and the workload of the columns.

Well, as it says -- Oracle will decide on which columns to collect histograms. Not you. As a result, every day when this jobs runs, you might be presented with something new to make your life more interesting.

Why `SIZE AUTO` is doing such a bad job?

Because coming up with optimal parameters for statistic gathering involves many more variables than the DBMS_STATS package can ever have. As a result, AUTO tends to be a "lowest common denominator". For example, if you have an OLTP system, chances are you don't need histograms at all (apart from on a couple of tables, maybe). But how can DBMS_STATS know which type of a system do you run? OLTP, DWH, or mixed? Or maybe you're doing OLTP five days per week and DWH on a weekend? DBMS_STATS tries to use some heuristics to come up with an answer, but that's why two days of DWH on a weekend can completely screw up your OLTP activity for the rest of the week. DBMS_STATS just doesn't have enough information. We humans do.

What could we use instead of AUTO?

Because to the above, there is no single answer since it involves "know your data" and "apply your domain knowledge". However, there is one option that works particularly well for most environments. I'm talking about REPEAT.

REPEAT: Collects histograms only on the columns that already have histograms.

That is, Oracle will no longer make histograms you didn't ask for. This will be your first step in stabilizing your bind peeking behaviour:

SQL> exec dbms_stats.set_param('method_opt', 'FOR ALL COLUMNS SIZE REPEAT');
 
 PL/SQL procedure successfully completed

What to do with existing histograms?

Dealing with them depends on the situation. Chances are, however, you have many more histograms than you'll ever need. That means that starting from scratch to figure out when do you need histograms is usually a much simpler task compared to the clean-up of existing onces. If this is your case, then it might be a good idea to wipe out all histograms in a database (gather your stats with FOR ALL COLUMNS SIZE 1 clause), and manually add them when you decide that you really need one. The number of times I have had to go back to add histogram after a complete wipe out is surprisingly low -- much lower than number of surprises histograms were causing on these systems, and those systems never had any bind peeking surprises due to the excessive amount of unnecessary histograms.

Are there any other sources of unnecessary histograms in my system?

Absolutely. People are the next source of unnecessarily histograms. It constantly surprises me how many people treat histograms as a kind of a silver bullet. For example, someone recently communicated to me that select count(*) from table was running slow, and added that, "maybe we should collect histograms on that table." How on earth will a histogram help you to run this query faster? Many histograms on your system might be a result of a complete misunderstanding how histograms work, what they do, and, more important, what they do not.

I have a misbehaving query due to incorrect peeking caused by histogram, and I need to fix it right now. What do I do ?

Don't hurry to flush your shared pool. First, as a result of a complete brain cleaning, your instance will have to hard parse everything it had in the library cache, causing tons of redundant work. Second, and this is much more important, these hard parses might well result in an incorrect peeking for some other queries. So you might end up in a worse situation than you were in before. The right way is to get rid of a histogram by collecting stats on required table, with METHOD_OPT => 'FOR COLUMNS X SIZE 1' and NO_INVALIDATE => FALSE. This will cause all dependent cursors to be invalidated immediately after stats have been gathered. Sometimes, however, you don't have enough time to understand what caused a problem (or you simply don't have time to regather the stats) and, if probability theory is on your side (chances for a good peeking are much higher), all you have to do to invalidate dependent cursors is to create comment on a table:

SQL> select n from t where n=:n;
 
  N
 ----------
  1
 
 SQL> select executions, invalidations
  2 from v$sql
  3 where sql_text = 'select n from t where n=:n';
 
 EXECUTIONS INVALIDATIONS
 ---------- -------------
  2 0
 
 SQL> comment on table t is '';
 
 Comment created.
 
 SQL> select n from t where n=:n;
 
  N
 ----------
  1
 
 SQL> select executions, invalidations
  2 from v$sql
  3 where sql_text = 'select n from t where n=:n';
 
 EXECUTIONS INVALIDATIONS
 ---------- -------------
  1 1

The above will invalidate cursors that depend only on a specific table, thus significantly decreasing the risk of side effects.

Isn't that much more work compared to simply turning off bind variable peeking?

Actually it is not. If you are starting a new system, all you have to do is to modify the default parameter from AUTO to REPEAT and you are done. You'll have to create all required histograms manually but that's our intentional goal: to do histogram creation in a controllable and predictable manner. For existing systems that are plagued by gazillions of histograms, you'll have to figure out what to do. Wiping out histograms for entire database will do for queries with bind variables pretty much the same as turning off bind peeking. I still think, however, that disabling bind peeking in this situation is the wrong choice, since (a) you still run the risk of getting unpredictable results from queries with literals; and (b) you will be doing tons more work during statistics collection, since histogram computation is expensive.

Is there anything else wrong with disabling bind peeking?

Running Oracle Database with an underscore parameter makes you different from the rest of the world, and this is not how Oracle Database was tested and intended to be run in a first place (think bugs). While disabling bind peeking seems to be relatively safe, it also very easy to avoid doing so.

Get histograms under control

Making sure new histograms appear in a controllable and predictable manner will be your first step in building a predictable environment.

Oracle Database Consulting Services

Ready to optimize your Oracle Database for the future?

View full post