We all have encountered this situation many times before: the cost-based optimizer assumes no correlation between two columns (until 11g), and this has the effect of erroneously reducing the cardinality of a row source. Incorrect cardinality estimates are one of many root causes for SQL performance issues. Consider the following example.

This code creates a table and populates data with four columns:

create table t_vc as select mod(n, 100) n1, mod(n, 100) n2 , mod(n, 50) n3 , mod(n, 20) n4 from (select level n from dual connect by level <= 10001);

The first two columns, n1 and n2, have a strong correlation: n1 is always equals to n2.

Let’s collect statistics with histograms on all columns.

begin dbms_stats.gather_Table_stats( user, 'T_VC', estimate_percent => null, method_opt => 'for all columns size 254'); end; /

Let’s explain the plan for the query, specifying one predicate, exactly 100 rows with a value of 10.

explain plan for select count(*) from t_vc where n1=10;`----------------------------------------------------`

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |`----------------------------------------------------`

| 0 | SELECT STATEMENT | | 1 | 3 | 9 (0)| 00:00:01 | | 1 | SORT AGGREGATE | | 1 | 3 | | | |* 2 | TABLE ACCESS FULL| T_VC | 100 | 300 | 9 (0)| 00:00:01 |`----------------------------------------------------`

Yes, the optimizer is able to estimate the cardinality as 100 from row source `T_VC`

at step 2.

Now, let’s add another predicate, n2=10. Since there is a strong correlation between n1 and n2, adding a predicate such as n2=10 should not alter row source cardinality, right?

explain plan for select count(*) from t_vc where n1=10 and n2=10; select * from table(dbms_xplan.display);`----------------------------------------------------`

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |`----------------------------------------------------`

| 0 | SELECT STATEMENT | | 1 | 6 | 9 (0)| 00:00:01 | | 1 | SORT AGGREGATE | | 1 | 6 | | | |* 2 | TABLE ACCESS FULL| T_VC | 1 | 6 | 9 (0)| 00:00:01 |`----------------------------------------------------`

Nope, the optimizer cardinality is far off and 100 times lower than the correct cardinality.

Notice the rows column at step 2 above and the CBO estimate is just 1. Essentially, the CBO estimate boils down to

a simplified formula:

# of rows~= total # of rows * (1/NDV for n1) * (1/NDV for n2) = 10000 * (1/100) * (1/100) =1 row.

where selectivity is (1/NDV) in simple case, without considering histograms. It is far from the truth. There are at least 100 rows with n1=10 and n2=10. In complex SQL, this incorrect cardinality estimation will lead to an inefficient access plan.

### Extended Stats

Oracle 11g introduces extended stats to relieve some pain. In 11g, extended stats can be added between columns, enabling the CBO to consider correlation between these column values.

```
SELECT dbms_stats.create_extended_stats(
ownname=>user, tabname => 'T_VC',
extension => '(n1, n2)' ) AS n1_n2_correlation
FROM dual;
N1_n2_correlation
````---------------------------------------------------`

SYS_STUBZH0IHA7K$KEBJVXO5LOHAS

Letâ€™s collect stats again on this table and check the SQL plan.

exec dbms_stats.gather_Table_stats( user, 'T_VC', estimate_percent => null, method_opt => 'for all columns size 254'); explain plan for select count(*) from t_vc where n1=10 and n2=10;`----------------------------------------------------`

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |`----------------------------------------------------`

| 0 | SELECT STATEMENT | | 100 | 1200 | 9 (0)| 00:00:01 | |* 1 | TABLE ACCESS FULL| T_VC | 100 | 1200 | 9 (0)| 00:00:01`----------------------------------------------------`

At last, in 11g, the optimizer has truly understood the correlation.

### Under the Hood

1. Adding an extended stats adds a new virtual column to the table. Here is the line from sqltrace.

alter table "CBQT"."T_VC" add (SYS_STUBZH0IHA7K$KEBJVXO5LOHAS as (sys_op_combined_hash(n1, n2)) virtual BY USER for statistics);

The virtual column name is cryptic — it seems to have been derived from table_name, column_name combinations. This is why we reanalyzed the table.

2. A new deterministic hash function, `sys_op_combined_hash`

, is called by optimizer to populate this virtual column values. This deterministic function returns same value for unique combination of arguments passed.

col h1 format 99999999999999999999999999 select sys_op_combined_hash (1,1) h1 from dual; H1`----------------------------`

7026129190895635777 select sys_op_combined_hash (1,2) h1 from dual; H1`----------------------------`

298332787864732998

Collecting histograms on all columns collects histograms on this virtual column also.

### Trace Lines

Using the above histogram, the CBO is able to find that there is a strong correlation between these two columns. This is visible in the 10053 output.

SINGLE TABLE ACCESS PATH Single Table Cardinality Estimation for T_VC[T_VC] Column (#1): NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100 Column (#2): NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100 Column (#5): NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100 ColGroup (#1, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS Col#: 1 2CorStregth: 100.00ColGroup Usage:: PredCnt: 2 Matches Full: #0 Partial: Sel: 0.0100 Table: T_VC Alias: T_VC Card: Original: 10001.000000 Rounded: 100 Computed: 100.00 Non Adjusted: 100.00 Access Path: TableScan Cost: 9.11 Resp: 9.11 Degree: 0 Cost_io: 9.00 Cost_cpu: 2404620 Resp_io: 9.00 Resp_cpu: 2404620 Best:: AccessPath: TableScan Cost: 9.11 Degree: 1 Resp: 9.11 Card: 100.00 Bytes: 0

Notice the `ColGroup`

line and `CorStrength`

field above. It is set to 100. `CorStrength`

is calculated using histograms for the virtual column and final cardinality estimates are multiplied by `CorStrength`

.

# of rows~= total # of rows * (1/NDV for n1) * (1/NDV for n2)*corStrength = 10000 * (1/100) * (1/100)*100 =100 rows.

The cardinality estimates match reality, at last. In the next installment, I will discuss this further.

Okay, it’s time to introduce myself. I specialize in Oracle performance tuning, Oracle internals, and the E-Business Suite. I have over 15 years of experience as an Oracle DBA. I am regular presenter in major conferences such as HOTSOS, UKOUG, RMOUG, etc., and I am also an Oak Table member. Some of my papers can be found on my personal blog: my papers and presentations.

### 8 Responses to “Multi-Column Correlation and Extended Stats in Oracle 11g”

### Leave a Reply

[...] A new blogger made his first appearance here this week: Riyaj Shamsudeen, with his item on multi-column correlation and extended stats in Oracle 11g. [...]

Nice post!

I would like to add that in 10G you still can battle correlation problems using higher levels of dynamic sampling (to certain degree).

In 10.2.0.3 the correct cardinality was reported on level 4:

SQL> select /*+ dynamic_sampling(4) */ count(*) from t_vc where n1=10 and n2=10;

Execution Plan

———————————————————-

Plan hash value: 2872596891

—————————————————————————

| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |

—————————————————————————

| 0 | SELECT STATEMENT | | 1 | 6 | 8 (0)| 00:00:01 |

| 1 | SORT AGGREGATE | | 1 | 6 | | |

|* 2 | TABLE ACCESS FULL| T_VC | 100 | 600 | 8 (0)| 00:00:01 |

—————————————————————————

Predicate Information (identified by operation id):

—————————————————

2 – filter(“N1″=10 AND “N2″=10)

Note

—–

– dynamic sampling used for this statement

[...] database one of the problems I faced with 10g—one on those problems Riyaj described in his Multi-Column Correlation and Extended Stats in Oracle 11g post. And the fun part is that I wasn’t able to reproduce it. Yet just setting [...]

[...] tuning — orainternals @ 12:05 am I blogged about extended stats in my earlier blog, extended stats, and also documented that as an investigation in Investigations: extended stats and multi-column [...]

thanks , useful info presented succinctly

Nice Work

I’m not sure it takes CorStregth into account while calculating cardinality.

Or at least formula should be more complicated.

Let’s create a table and check a couple of queries.

create table t(id number, n number);

insert into t select mod(rownum,10), mod(rownum,5) from dual connect by level 'FOR ALL COLUMNS SIZE 254');

select count(*) from t where id = 1 and n = 1;

select count(*) from t where id = 1 and n = 2;

According to 10053 output, “CorStregth” remains the same in both cases and formula is something like “total # of rows * Sel”.

Col#: 1 2 CorStregth: 5.00

ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0500

Table: T Alias: T

Card: Original: 1000.000000 Rounded: 50 Computed: 50.00 Non Adjusted: 50.00

`Col#: 1 2 CorStregth: 5.00`

ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000

Table: T Alias: T

Card: Original: 1000.000000 Rounded: 100 Computed: 100.00 Non Adjusted: 100.00

It’s not clear how Sel is derived though.

For some reason code was corrupted.

create table t(id number, n number);

insert into t select mod(rownum,10), mod(rownum,5) from dual connect by level 'FOR ALL COLUMNS SIZE 254');