The Side Effects of Drop ‘Unused’ Index

Posted in: Oracle, Technical Track

In this blog post I’ll demonstrate a scenario where dropping a potentially ‘unused’ index can have a negative influence on the optimizer’s cardinality estimation. Having columns with logical correlation and skewed data can cause difficulties for the optimizer when calculating cardinality. This post will also address what we can do to help the optimizer with better cardinality estimates.

The inspiration for this post was derived from a recent index usage review. One of the requirements was to determine how to confirm which unused indexes qualify to be dropped. I decided to do some tests regarding extended statistics and the effect of potentially dropping an ‘unused’ index. You will observe what kind of result may be seen from the drop of an index which has not been used. It’s important to remember that it does not apply in all cases.  Occasionally, even if the index is used, it doesn’t mean that it’s needed.

This is more or less linked to columns with skewed data and which might have logical relationship.
Hopefully, it can help you answer some of the following questions:

  1. Is the optimizer using the indexes behind scenes?
  2. While there are methods to determine if an index has been used in an execution plan, can an index be dropped on this basis only?
  3. If we drop composite index (constructed from correlated columns), can we do anything to avoid performance degradation?

Before we start with the use case, let’s briefly review some concepts.

The basic formula for selectivity is 1/NDV. The cardinality (CDN) is calculated as selectivity * total number of rows.

The selectivity of a join is defined as the selectivity of the most selective join column adjusted by the proportion of not null values in each join column.

Join Selectivity:
Sel = 1/max[NDV(t1.c1),NDV(t2.c2)] *
           ( (Card t1 - # t1.c1 NULLs) / Card t1) *
           ( (Card t2 - # t2.c2 NULLs) / Card t2)

Join Cardinality:
Card(Pj) = Card(T1) * Card(T2) * Sel(Pj)

In Oracle’s Doc ID 68992.1 you can find a more detailed explanation about different selectivity calculations based on different predicates. For simplicity, I will use equality predicate.

This blog post is divided in three sections.

  1. Use case where we demonstrate how drop of an “unused” index can mess up optimizer cardinality calculation.
  2. How to help optimizer for better cardinality estimation with extended statistics.
  3. More explanation on column correlation (CorStregth).

Preparation

For the first and second part I’ll be using the following configuration.

DROP TABLE t1 purge;
DROP TABLE t2 purge;
DROP TABLE t3 purge;
 
CREATE TABLE t1
AS
SELECT rownum pk,
    MOD(level,  20) n1,
    MOD(level, 200) n2,
    MOD(level,1000) n3,
    lpad('x',42,'x') c1
FROM dual
CONNECT BY level <= 10000
/
 
CREATE TABLE t2
AS
SELECT rownum pk,
    MOD(level,  20) n1,
    MOD(level, 200) n2,
    MOD(level,1000) n3,
    lpad('x',42,'x') c1
FROM dual
CONNECT BY level <= 10 
/ 
 
CREATE TABLE t3
AS
SELECT rownum pk, lpad('x',rownum,'x') c1
FROM dual
CONNECT BY level <= 2000 
/ 
 
EXEC dbms_stats.gather_table_stats(USER,'T1',method_opt=>'for all columns size 1');
EXEC dbms_stats.gather_table_stats(USER,'T2',method_opt=>'for all columns size 1');
EXEC dbms_stats.gather_table_stats(USER,'T3',method_opt=>'for all columns size 1');
 
 
ALTER TABLE t1 ADD CONSTRAINT t1pk PRIMARY KEY(pk);
ALTER TABLE t2 ADD CONSTRAINT t2pk PRIMARY KEY(pk);
ALTER TABLE t3 ADD CONSTRAINT t3pk PRIMARY KEY(pk);
 
CREATE INDEX t1ix ON t1(n1,n2);
CREATE INDEX t2ix ON t2(n1,n2);

It creates two tables (T1 and T2) with composite index on N1 and N2 columns for both tables.
The statistics are as follows:

--Column statistics
TABLE_NAME	     COLUMN_NAME	  NUM_DISTINCT	  COL_SEL
-------------------- -------------------- ------------ ----------
T1		     C1 			     1		1
T1		     N1 			    20	      .05
T1		     N2 			   200	     .005
T1		     N3 			  1000	     .001
T2		     C1 			     1		1
T2		     N1 			    10	       .1
T2		     N2 			    10	       .1
T2		     N3 			    10	       .1
 
--Index statistics
TABLE_NAME	     INDEX_NAME 	    NUM_ROWS DISTINCT_KEYS    IND_SEL
-------------------- -------------------- ---------- ------------- ----------
T1		     T1IX		       10000	       200	 .005
T2		     T2IX			  10		10	   .1

Table T1 has 10000 rows and T2 only 10 rows. I’ve created two composite indexes for T1 and T2 since I am using both (N1,N2) columns in selection. As you can see (from the data dictionary) we have 200 distinct values for T1IX index and 10 for T2IX index.
e.g. Based on the formula we can easily derive selectivity as 1/NDV, for T1IX index we have 1/200 = 0.005.

Drop of an “unused” index can mess up optimizer’s cardinality estimation

The query I’m going to use in this case is:

SELECT t1.c1,t2.c1
FROM t1,t2
WHERE t1.n1 = t2.n1
AND t1.n2 = t2.n2
/

Let’s first check the amount of the cardinality (CDN) before we do any changes. The column we’re interested in is ROWS, that’s where cardinality estimation is shown.

Execution Plan
----------------------------------------------------------
Plan hash VALUE: 2959412835
 
---------------------------------------------------------------------------
| Id  | Operation	   | Name | ROWS  | Bytes | Cost (%CPU)| TIME	  |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	  |   500 | 49500 |    29   (0)| 00:00:01 |
|*  1 |  HASH JOIN	   |	  |   500 | 49500 |    29   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T2   |    10 |   490 |	3   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T1   | 10000 |   488K|    26   (0)| 00:00:01 |
---------------------------------------------------------------------------
 
Predicate Information (IDENTIFIED BY operation id):
---------------------------------------------------
 
   1 - access("T1"."N1"="T2"."N1" AND "T1"."N2"="T2"."N2")

Based on the formula provided, you may have expected to see cardinality (CDN) of 25. Since, we don’t have nulls we can do the calculation as 1/max(t1.n1,t2.n1) * 1/max(t1.n2,t2.n2) = 1/20 * 1/200 = 0.05 * 0.005 = 0.00025. We came up with selectivity of 0.00025.
Now, if we now do the join cardinality calculation 10 (T1) * 10000 (T2) * selectivity 0.00025 = 1000000 * 0.00025 = 25

But instead of 25, the optimizer calculated cardinality (CDN) of 500. In order to better understand the calculation, let’s review the 10053 event trace file.

  ColGroup (#1, Index) T1IX
    Col#: 1 2    CorStregth: 20.00
  ColGroup (#1, Index) T2IX
    Col#: 1 2    CorStregth: 10.00
Join ColGroups for T1[T1] and T2[T2] : (#1, #1)

What’s interesting here is the Join ColGroups analysis. Since, we have full match on the join with the indexed columns, the optimizer is considering Join ColGroups between T1 and T2. It will use ColGroups statistics to calculate join cardinality (CDN).

Col#: column position (self explanatory)
CorStregth: correlation strength. Will come back later on correlation strength (CorStregth).

Within the trace file in Nested Loop (NL) section (remember join cardinality is only listed in NL section) you should see something like:

Join selectivity using 1 ColGroups: 0.005000 (sel1 = 0.000000, sel2 = 0.000000)
Join Card:  500.000000 = outer (10.000000) * inner (10000.000000) * sel (0.005000)
Join Card - Rounded: 500 Computed: 500.000000

Based on the join selectivity formula, the optimizer chooses T1IX index (ColGroup) statistics for selectivity calculation, hence 1/NDV (T1IX index) = 1/200 = 0.005.
Afterwards, join cardinality is calculated as num_rows (T1) * num_rows (T2) * selectivity (0.005) or 10 * 10000 * 0.005 = 100000 * 0.005 = 500.

If you’re not aware of such cases, you may decide to drop an index based solely on the execution plan information. Even worse, the developers can easily decide to drop an index without knowing the possible effect.

Now, let’s observe the cardinality estimation after T1IX index is dropped.

SQL> DROP INDEX t1ix;
 
INDEX dropped.
 
...
 
Execution Plan
----------------------------------------------------------
Plan hash VALUE: 2959412835
 
---------------------------------------------------------------------------
| Id  | Operation	   | Name | ROWS  | Bytes | Cost (%CPU)| TIME	  |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	  |    25 |  2475 |    29   (0)| 00:00:01 |
|*  1 |  HASH JOIN	   |	  |    25 |  2475 |    29   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T2   |    10 |   490 |	3   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T1   | 10000 |   488K|    26   (0)| 00:00:01 |
---------------------------------------------------------------------------
 
Predicate Information (IDENTIFIED BY operation id):
---------------------------------------------------
 
   1 - access("T1"."N1"="T2"."N1" AND "T1"."N2"="T2"."N2")

After we dropped T1IX the cardinality dropped from 500 to 25! This can have big impact on the performance. If we add additional tables to the join the optimizer can possibly re-order tables, switch Hash Join to Nested Loops or Merge Joins.

So, why did the cardinality (CDN) change after “unused” index was dropped?

Even if the index was not used, it doesn’t mean that its statistics were also not used. That’s the case here. The optimizer used the NDV of the column group created by the T1IX index to calculate the join selectivity.

Lets again check the 10053 event trace file, now where the query has been executed without T1IX index.

ColGroup (#1, Index) T2IX
    Col#: 1 2    CorStregth: 10.00
Join ColGroups for T1[T1] and T2[T2] : Using cdn sanity check
 
...
 
ColGroup cardinality sanity check: ndv for  T1[T1] = 4000.000000  T2[T2] = 10.000000
Join selectivity using 1 ColGroups: 2.5000e-04 (sel1 = 0.000000, sel2 = 0.000000)
Join Card:  25.000000 = outer (10.000000) * inner (10000.000000) * sel (2.5000e-04)
Join Card - Rounded: 25 Computed: 25.000000

We don’t have full match for Join ColGroups (we’re missing the statistics from T1IX index). The optimizer decided to perform cardinality (CDN) sanity check where (after the sanity check) it ended up with a number of distinct values (NDV) of 4000.

How did the optimizer calculate 4000 ?

Since we don’t have column group, the optimizer did a basic calculation of N1 (sel) * N2 (sel) = 0.00025. Having the (N1,N2) selectivity, we can extract the NDV as 1/0.00025 = 4000.
The representation in the trace file is slightly different as 2.5000e-04 = 2.5000 * (1/10000) = 0.00025

Having incorrect cardinality estimate the optimizer can make wrong decision for join methods, as we can see in the following example when we join T3 table.

--query
SELECT t1.c1,t2.c1,t3.c1
FROM t1,t2,t3
WHERE t1.n1 = t2.n1
AND t1.n2 = t2.n2
AND t1.pk = t3.pk
/
 
Execution Plan
----------------------------------------------------------
Plan hash VALUE: 1184213596
 
----------------------------------------------------------------------------
| Id  | Operation	    | Name | ROWS  | Bytes | Cost (%CPU)| TIME	   |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |	   |   500 |   541K|   120   (0)| 00:00:01 |
|*  1 |  HASH JOIN	    |	   |   500 |   541K|   120   (0)| 00:00:01 |
|*  2 |   HASH JOIN	    |	   |   500 | 51500 |	30   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T2   |	10 |   490 |	 3   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| T1   | 10000 |   527K|	27   (0)| 00:00:01 |
|   5 |   TABLE ACCESS FULL | T3   |  2000 |  1964K|	90   (0)| 00:00:01 |
----------------------------------------------------------------------------
 
Predicate Information (IDENTIFIED BY operation id):
---------------------------------------------------
 
   1 - access("T1"."PK"="T3"."PK")
   2 - access("T1"."N1"="T2"."N1" AND "T1"."N2"="T2"."N2")
 
SQL> DROP INDEX t1ix; 
 
INDEX dropped.
 
Execution Plan
----------------------------------------------------------
 
Plan hash VALUE: 142045272
 
-------------------------------------------------------------------------------------
| Id  | Operation		     | Name | ROWS  | Bytes | Cost (%CPU)| TIME     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT	     |	    |	 25 | 27725 |	 35   (0)| 00:00:01 |
|   1 |  NESTED LOOPS		     |	    |	 25 | 27725 |	 35   (0)| 00:00:01 |
|   2 |   NESTED LOOPS		     |	    |	 25 | 27725 |	 35   (0)| 00:00:01 |
|*  3 |    HASH JOIN		     |	    |	 25 |  2575 |	 30   (0)| 00:00:01 |
|   4 |     TABLE ACCESS FULL	     | T2   |	 10 |	490 |	  3   (0)| 00:00:01 |
|   5 |     TABLE ACCESS FULL	     | T1   | 10000 |	527K|	 27   (0)| 00:00:01 |
|*  6 |    INDEX UNIQUE SCAN	     | T3PK |	  1 |	    |	  0   (0)| 00:00:01 |
|   7 |   TABLE ACCESS BY INDEX ROWID| T3   |	  1 |  1006 |	  1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------
 
Predicate Information (IDENTIFIED BY operation id):
---------------------------------------------------
 
   3 - access("T1"."N1"="T2"."N1" AND "T1"."N2"="T2"."N2")
   6 - access("T1"."PK"="T3"."PK")

Another example with different data set where there is a change in the row source used for building a hash table.

--query
SELECT t1.c1, t2.c1
FROM t1,t2,t3
WHERE t1.n1 = t2.n1
AND t1.n2 = t2.n2
AND t1.n1 = t3.pk
/
 
Execution Plan
----------------------------------------------------------
Plan hash VALUE: 2855555999
 
-------------------------------------------------------------------------------
| Id  | Operation	       | Name | ROWS  | Bytes | Cost (%CPU)| TIME     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |  5000K|  1525M|  2766	(1)| 00:00:01 |
|*  1 |  HASH JOIN	       |      |  5000K|  1525M|  2766	(1)| 00:00:01 |
|   2 |   TABLE ACCESS FULL    | T2   |  1000 |   251K|    14	(0)| 00:00:01 |
|*  3 |   HASH JOIN	       |      |  1000K|    59M|  2739	(1)| 00:00:01 |
|   4 |    INDEX FAST FULL SCAN| T3PK |  2000 |  8000 |     3	(0)| 00:00:01 |
|   5 |    TABLE ACCESS FULL   | T1   |  1000K|    55M|  2734	(1)| 00:00:01 |
-------------------------------------------------------------------------------
 
Predicate Information (IDENTIFIED BY operation id):
---------------------------------------------------
 
   1 - access("T1"."N1"="T2"."N1" AND "T1"."N2"="T2"."N2")
   3 - access("T1"."N1"="T3"."PK")
 
SQL> DROP INDEX t1ix; 
 
INDEX dropped.
 
...
 
Execution Plan
----------------------------------------------------------
Plan hash VALUE: 1993272941
 
------------------------------------------------------------------------------
| Id  | Operation	      | Name | ROWS  | Bytes | Cost (%CPU)| TIME     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |	 250K|	  76M|	2754   (1)| 00:00:01 |
|*  1 |  HASH JOIN	      |      |	 250K|	  76M|	2754   (1)| 00:00:01 |
|   2 |   INDEX FAST FULL SCAN| T3PK |	2000 |	8000 |	   3   (0)| 00:00:01 |
|*  3 |   HASH JOIN	      |      |	 250K|	  75M|	2750   (1)| 00:00:01 |
|   4 |    TABLE ACCESS FULL  | T2   |	1000 |	 251K|	  14   (0)| 00:00:01 |
|   5 |    TABLE ACCESS FULL  | T1   |	1000K|	  55M|	2734   (1)| 00:00:01 |
------------------------------------------------------------------------------
 
Predicate Information (IDENTIFIED BY operation id):
---------------------------------------------------
 
   1 - access("T1"."N1"="T3"."PK")
   3 - access("T1"."N1"="T2"."N1" AND "T1"."N2"="T2"."N2")

Although this will not apply in all cases, be careful before deciding to drop an index. Although it may seem like a good idea initially, it can have big impact with wrong cardinality estimation. This could then lead to incorrect execution plan decisions, poor performance and unhappy customers.

So, what is the solution?
We don’t want huge indexes to be (DML) maintained if not used, right?

This brings me to the second part, Extended Statistics.

Extended Statistics

Oracle introduced extended statistics with Oracle 11g. Its main functionality is to highlight column group relation. The extended statistics helps the optimizer become more familiar with logical relation between column groups, otherwise unknown information. The extended statistics are represented as virtual (hidden) columns on which statistics are gathered.

When we had T1IX index, the optimizer used its NDV to calculate correct cardinality. The index statistics were used for the column group of N1 and N2 columns. However, after we dropped the index (even though it was never used), we’ve made a mess of the optimizer with its cardinality estimates.

Extended statistics are maintained by DBMS_STATS package, function CREATE_EXTENDED_STATS. Let’s create extended statistics and see the cardinality estimation.

SQL> SELECT dbms_stats.create_extended_stats(NULL,'T1','(N1,N2)') FROM dual
  2  /
 
DBMS_STATS.CREATE_EXTENDED_STATS(NULL,'T1','(N1,N2)')
------------------------------------------------------------
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
 
SQL> SET autotrace trace EXP
SQL> 
SQL> SELECT t1.c1,t2.c1
  2  FROM t1,t2
  3  WHERE t1.n1 = t2.n1
  4  AND t1.n2 = t2.n2
  5  /
 
Execution Plan
----------------------------------------------------------
Plan hash VALUE: 2959412835
 
---------------------------------------------------------------------------
| Id  | Operation	   | Name | ROWS  | Bytes | Cost (%CPU)| TIME	  |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	  |    25 |  2475 |    29   (0)| 00:00:01 |
|*  1 |  HASH JOIN	   |	  |    25 |  2475 |    29   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T2   |    10 |   490 |	3   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T1   | 10000 |   488K|    26   (0)| 00:00:01 |
---------------------------------------------------------------------------
 
Predicate Information (IDENTIFIED BY operation id):
---------------------------------------------------
 
   1 - access("T1"."N1"="T2"."N1" AND "T1"."N2"="T2"."N2")

I know, you’re thinking “It didn’t help” and you’re right. The optimizer did not use the column group (extension). If you look a little deeper, USER_TAB_COL_STATISTICS you will see that the virtual column is not created, only the extension definition was created.

SQL> SELECT column_name, num_distinct, histogram
FROM   user_tab_col_statistics
WHERE  TABLE_NAME = 'T1'
/
 
COLUMN_NAME				 NUM_DISTINCT HISTOGRAM
---------------------------------------- ------------ ---------------
N1						   20 NONE
N2						  200 NONE
N3						 1000 NONE
C1						    1 NONE
 
SQL> col extension_name FOR a40
SQL> SELECT extension_name, extension
FROM   user_stat_extensions
WHERE  TABLE_NAME = 'T1'
/
 
EXTENSION_NAME				 EXTENSION
---------------------------------------- --------------------
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS		 ("N1","N2")

This behavior is a bit expected, because DBMS_STATS.CREATE_EXTENDED_STATS will only create an extension. It won’t gather statistics for the extension. If you check the 10053 event trace file you will see NO STATISTICS (using defaults).

Column (#5): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(NUMBER)  NO STATISTICS (using defaults)

In order to fix this, we need to re-gather statistics on the table.

SQL> EXEC dbms_stats.gather_table_stats(NULL,'T1',method_opt=>'for all columns size 1');
 
PL/SQL PROCEDURE successfully completed.
 
SQL> SELECT column_name, num_distinct, histogram
FROM   user_tab_col_statistics
WHERE  TABLE_NAME = 'T1'
/
 
COLUMN_NAME				 NUM_DISTINCT HISTOGRAM
---------------------------------------- ------------ ---------------
N1						   20 NONE
N2						  200 NONE
N3						 1000 NONE
C1						    1 NONE
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS			  200 NONE
 
 
SQL> SELECT t1.c1,t2.c1
FROM t1,t2
WHERE t1.n1 = t2.n1
AND t1.n2 = t2.n2
/
 
Execution Plan
----------------------------------------------------------
Plan hash VALUE: 2959412835
 
---------------------------------------------------------------------------
| Id  | Operation	   | Name | ROWS  | Bytes | Cost (%CPU)| TIME	  |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	  |   500 | 49500 |    29   (0)| 00:00:01 |
|*  1 |  HASH JOIN	   |	  |   500 | 49500 |    29   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T2   |    10 |   490 |	3   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T1   | 10000 |   488K|    26   (0)| 00:00:01 |
---------------------------------------------------------------------------
 
Predicate Information (IDENTIFIED BY operation id):
---------------------------------------------------
 
   1 - access("T1"."N1"="T2"."N1" AND "T1"."N2"="T2"."N2")

Now the optimizer calculated the correct cardinality. If you check the trace file more extensively, you should see the same join cardinality information comparable to when we had the index in place. The difference would only be that this time it’s using the extension statistics.

...
  ColGroup (#1, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2    CorStregth: 20.00
  ColGroup (#1, Index) T2IX
    Col#: 1 2    CorStregth: 10.00
 
...
 
Join selectivity using 1 ColGroups: 0.005000 (sel1 = 0.000000, sel2 = 0.000000)
Join Card:  500.000000 = outer (10.000000) * inner (10000.000000) * sel (0.005000)
Join Card - Rounded: 500 Computed: 500.000000

With extended statistics we were able to help the optimizer to calculate cardinality (CDN) of 500. It used extension NDV in the same way it used T1IX index statistics.

However, what if T1 table was a big table and you already gathered statistics before creating the extension (column group)? Re-gathering can be time-consuming and have negative impact on the performance. To answer this question, we have to employ the METHOD_OPT approach from DBMS_STATS.GATHER_TABLE_STATS.
Creating extension via METHOD_OPT will automatically gather statistics on the column group.

SQL>
SQL> SELECT column_name, num_distinct, histogram
FROM   user_tab_col_statistics
WHERE  TABLE_NAME = 'T1'
/  
 
COLUMN_NAME	     NUM_DISTINCT HISTOGRAM
-------------------- ------------ ---------------
C1				1 NONE
N3			     1000 NONE
N2			      200 NONE
N1			       20 NONE
 
SQL> EXEC dbms_stats.gather_table_stats(NULL,'T1',method_opt=>'for all columns size 1 for columns size 1 (N1,N2)');
 
PL/SQL PROCEDURE successfully completed.
 
SQL> col column_name FOR a40
SQL> SELECT column_name, num_distinct, histogram
FROM   user_tab_col_statistics
WHERE  TABLE_NAME = 'T1'
/
 
COLUMN_NAME				 NUM_DISTINCT HISTOGRAM
---------------------------------------- ------------ ---------------
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS			  200 NONE
C1						    1 NONE
N3						 1000 NONE
N2						  200 NONE
N1						   20 NONE

In this case an extension was automatically created and in the same time statistics were gathered.

If you find yourself in a situation were you used DBMS_STATS.CREATE_EXTENDED_STATS after gathering statistics, there are multiple ways to gather statistics on already existing extensions.

1) DBMS_STATS.GATHER_TABLE_STATS
(it will include all virtual columns)

2) METHOD_OPT=>’for columns size … (column group)’
(you can list the actual column group as you do for CREATE_EXTENDED_STATS)

3) METHOD_OPT=>’for columns size … extension_name’
(you can specify the extension name obtained from DBA|ALL|USER_STAT_EXTENSIONS dictionary view)

4) METHOD_OPT=>’for all hidden columns’
(hidden columns are by default included in default DBMS_STATS.GATHER_TABLE_STATS configuration)

What’s the difference between METHOD_OPT and DBMS_STATS.CREATE_EXTENDED_STATS?

I believe it behaves in that manner because METHOD_OPT is a part of a procedure that is designed for statistics gathering. Extended statistics (column groups) were introduced with 11g and are additional functionality to METHOD_OPT. The main functionality of it is to gather statistics, hence gathering statistics via METHOD_OPT creates the extension and gathers statistics at the same time.
Perhaps Oracle could have used different naming, something like CREATE_COLUMN_GROUP instead of CREATE_EXTENDED_STATS (which is misleading).

Column Correlation (CorStreght)

Column correlation represents how much the columns within a given group have mutual relation. It’s calculated as product of NDV of all columns in the group divided with the NDV of the column group. This is clearly explained in MultiColumn/Column Group Statistics – Additional Examples (Doc ID 872406.1).

Product of individual NDVs / NDV for the column group.

If the product of all columns NDV is smaller than CG NDV, the optimizer will ignore CG (extended) stats. Based on the tests I performed, the CorStregth is taken into consideration only when the optimizer has to decide between multiple column groups for which Full Match can be constructed. In the other cases, when Partial Match is used the optimizer it picked the column group with smaller NDV.

Here are two different cases; one where Full Match is constructed and another for Partial Match:

Column Group – Full Match

Let’s examine the first case where we can construct Full Match.

COLUMN_NAME		NUM_DISTINCT
---------------------------------------- ------------
N1						   20
N2						  200
N3						 1000
C1						    1
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS			  200 ---> (N1,N2)
SYS_STUXQEJ8WOFTK5R5EYM4DF_60V			 1000 ---> (N2,N3)
 
SELECT *
FROM t1
WHERE t1.n1 = 10
AND t1.n2 = 40
AND t1.n3 = 80
/
 
Execution Plan
----------------------------------------------------------
Plan hash VALUE: 3617692013
 
--------------------------------------------------------------------------
| Id  | Operation	  | Name | ROWS  | Bytes | Cost (%CPU)| TIME	 |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |	 |     5 |   270 |   228   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1	 |     5 |   270 |   228   (1)| 00:00:01 |
--------------------------------------------------------------------------
 
Trace file
...
Access path analysis FOR T1
***************************************
SINGLE TABLE ACCESS PATH
  Single TABLE Cardinality Estimation FOR T1[T1]
  SPD: RETURN code IN qosdDSDirSetup: NOCTX, estType = TABLE
  COLUMN (#1): N1(NUMBER)
    AvgLen: 3 NDV: 20 NULLS: 0 Density: 0.050000 MIN: 0.000000 MAX: 19.000000
  COLUMN (#2): N2(NUMBER)
    AvgLen: 4 NDV: 200 NULLS: 0 Density: 0.005000 MIN: 0.000000 MAX: 199.000000
  COLUMN (#3): N3(NUMBER)
    AvgLen: 4 NDV: 1000 NULLS: 0 Density: 0.001000 MIN: 0.000000 MAX: 999.000000
  COLUMN (#6): SYS_STUXQEJ8WOFTK5R5EYM4DF_60V(NUMBER)
    AvgLen: 12 NDV: 1000 NULLS: 0 Density: 0.001000
  COLUMN (#5): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(NUMBER)
    AvgLen: 12 NDV: 200 NULLS: 0 Density: 0.005000
  ColGroup (#1, VC) SYS_STUXQEJ8WOFTK5R5EYM4DF_60V
    Col#: 2 3    CorStregth: 200.00
  ColGroup (#2, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2    CorStregth: 20.00
  ColGroup Usage:: PredCnt: 3  Matches FULL: #1  Partial:  Sel: 0.001000

We have T1 table with two column groups on (N1,N2) and (N2,N3).
In this case the product of the individual columns N1,N2,N3 NDV is much higher than column groups NDV, hence the optimizer will calculate the cardinality (CDN) based on the extensions. Having a selection with (N1,N2,N3) the optimizer can perform Full Match for both of the column groups. In this scenario, the cardinality can be expressed as:

1) Using CG #1 (N2,N3): 1/CG(NDV) * 1/N1(NDV) = 1/1000 * 1/20 = 0.001 * 0.05 = 0.00005 * 100000 = 5

2) Using CG #2 (N1,N2): 1/CG(NDV) * 1/N3(NDV) = 1/200 * 1/1000 = 0.005 * 0.0001 = 0.000005 * 100000 = 0.5 (rounded to 1)

The following instance demonstrates where the optimizer decided to use CG (N2,N3) or SYS_STUXQEJ8WOFTK5R5EYM4DF_60V because it has higher CorStregth of 200 (vs. 20 for CG (N1,N2)).
To show how CorStregth can influence the optimizer, I’ve changed N1 distinct value to 210 to increase CG (N1,N2) CorStregth.
Here, case column group (N1,N2) was used in cardinality estimates.

EXEC dbms_stats.set_column_stats(NULL,'T1','N1',distcnt=>210);
 
Execution Plan
----------------------------------------------------------
 
Plan hash VALUE: 3724264953
 
---------------------------------------------------------------------------
| Id  | Operation	   | Name | ROWS  | Bytes | Cost (%CPU)| TIME	  |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |	  |	1 |    11 |   228   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE    |	  |	1 |    11 |	       |	  |
|*  2 |   TABLE ACCESS FULL| T1   |	1 |    11 |   228   (1)| 00:00:01 |
---------------------------------------------------------------------------
 
 
 
--Trace file
...
SINGLE TABLE ACCESS PATH
  Single TABLE Cardinality Estimation FOR T1[T1]
  SPD: RETURN code IN qosdDSDirSetup: NOCTX, estType = TABLE
  COLUMN (#1): N1(NUMBER)
    AvgLen: 3 NDV: 210 NULLS: 0 Density: 0.004762 MIN: 0.000000 MAX: 19.000000
  COLUMN (#2): N2(NUMBER)
    AvgLen: 4 NDV: 200 NULLS: 0 Density: 0.005000 MIN: 0.000000 MAX: 199.000000
  COLUMN (#3): N3(NUMBER)
    AvgLen: 4 NDV: 1000 NULLS: 0 Density: 0.001000 MIN: 0.000000 MAX: 999.000000
  COLUMN (#6): SYS_STUXQEJ8WOFTK5R5EYM4DF_60V(NUMBER)
    AvgLen: 12 NDV: 1000 NULLS: 0 Density: 0.001000
  COLUMN (#5): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(NUMBER)
    AvgLen: 12 NDV: 200 NULLS: 0 Density: 0.005000
  ColGroup (#2, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2    CorStregth: 210.00
  ColGroup (#1, VC) SYS_STUXQEJ8WOFTK5R5EYM4DF_60V
    Col#: 2 3    CorStregth: 200.00
  ColGroup Usage:: PredCnt: 3  Matches FULL: #2  Partial:  Sel: 0.005000

We can see that the optimizer chose CG #2 based on CorStregth.

Column Group – Partial Match

This second case shows what happens when Full Match can’t be constructed and the optimizer chooses the column group with smaller NDV. I’ve built the same T1 table, but now with Column Groups (N1,N2,N3) and (N1,N2,C1).

COLUMN_NAME		 NUM_DISTINCT
---------------------------------------- ------------
N1						   20
N2						  200
N3						 1000
C1						    1
SYS_STU2NGW2O#$ZX2CDZYOMNMFP64			  200
SYS_STUOYQUEIAZ7FI9DV53VLN$$$0			 1000
 
 
--Execution Plan
----------------------------------------------------------
Plan hash VALUE: 3617692013
 
--------------------------------------------------------------------------
| Id  | Operation	  | Name | ROWS  | Bytes | Cost (%CPU)| TIME	 |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |	 |   500 | 27000 |   228   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1	 |   500 | 27000 |   228   (1)| 00:00:01 |
--------------------------------------------------------------------------
 
 
 
--Trace file
...
Access path analysis FOR T1
***************************************
SINGLE TABLE ACCESS PATH
  Single TABLE Cardinality Estimation FOR T1[T1]
  SPD: RETURN code IN qosdDSDirSetup: NOCTX, estType = TABLE
  COLUMN (#1): N1(NUMBER)
    AvgLen: 3 NDV: 20 NULLS: 0 Density: 0.050000 MIN: 0.000000 MAX: 19.000000
  COLUMN (#2): N2(NUMBER)
    AvgLen: 4 NDV: 200 NULLS: 0 Density: 0.005000 MIN: 0.000000 MAX: 199.000000
  COLUMN (#6): SYS_STUOYQUEIAZ7FI9DV53VLN$$$0(NUMBER)
    AvgLen: 12 NDV: 1000 NULLS: 0 Density: 0.001000
  COLUMN (#5): SYS_STU2NGW2O#$ZX2CDZYOMNMFP64(NUMBER)
    AvgLen: 12 NDV: 200 NULLS: 0 Density: 0.005000
  COLUMN (#3): N3(NUMBER)
    AvgLen: 4 NDV: 1000 NULLS: 0 Density: 0.001000 MIN: 0.000000 MAX: 999.000000
  COLUMN (#4): C1(VARCHAR2)
    AvgLen: 43 NDV: 1 NULLS: 0 Density: 1.000000
  ColGroup (#1, VC) SYS_STUOYQUEIAZ7FI9DV53VLN$$$0
    Col#: 1 2 3    CorStregth: 4000.00
  ColGroup (#2, VC) SYS_STU2NGW2O#$ZX2CDZYOMNMFP64
    Col#: 1 2 4    CorStregth: 20.00
  ColGroup Usage:: PredCnt: 2  Matches FULL:  Partial: #2 (1 2 )  Sel: 0.005000
  TABLE: T1  Alias: T1

As CG #1 has way higher CorStregth of 4000 (vs. 20) the optimizer used CG #2. Let’s decrease CG #2 NDV to 100 to see whether it will influence the optimizer to use CG #1.

Execution Plan
----------------------------------------------------------
Plan hash VALUE: 3617692013
 
--------------------------------------------------------------------------
| Id  | Operation	  | Name | ROWS  | Bytes | Cost (%CPU)| TIME	 |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |	 |  1000 | 50000 |   228   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1	 |  1000 | 50000 |   228   (1)| 00:00:01 |
--------------------------------------------------------------------------
 
 
 
--Trace file
...
SINGLE TABLE ACCESS PATH
  Single TABLE Cardinality Estimation FOR T1[T1]
  SPD: RETURN code IN qosdDSDirSetup: NOCTX, estType = TABLE
  COLUMN (#1): N1(NUMBER)
    AvgLen: 3 NDV: 20 NULLS: 0 Density: 0.050000 MIN: 0.000000 MAX: 19.000000
  COLUMN (#2): N2(NUMBER)
    AvgLen: 4 NDV: 200 NULLS: 0 Density: 0.005000 MIN: 0.000000 MAX: 199.000000
  COLUMN (#6): SYS_STUOYQUEIAZ7FI9DV53VLN$$$0(NUMBER)
    AvgLen: 12 NDV: 100 NULLS: 0 Density: 0.010000
  COLUMN (#5): SYS_STU2NGW2O#$ZX2CDZYOMNMFP64(NUMBER)
    AvgLen: 12 NDV: 200 NULLS: 0 Density: 0.005000
  COLUMN (#3): N3(NUMBER)
    AvgLen: 4 NDV: 1000 NULLS: 0 Density: 0.001000 MIN: 0.000000 MAX: 999.000000
  COLUMN (#4): C1(VARCHAR2)
    AvgLen: 43 NDV: 1 NULLS: 0 Density: 1.000000
  ColGroup (#1, VC) SYS_STUOYQUEIAZ7FI9DV53VLN$$$0
    Col#: 1 2 3    CorStregth: 40000.00
  ColGroup (#2, VC) SYS_STU2NGW2O#$ZX2CDZYOMNMFP64
    Col#: 1 2 4    CorStregth: 20.00
  ColGroup Usage:: PredCnt: 2  Matches FULL:  Partial: #1 (1 2 ) Sel: 0.010000
  TABLE: T1  Alias: T1

Since CG #1 has smaller NDV (100) (vs. #2, 200), it calculated the cardinality (CDN) based on CG #1 as:
1/NDV (CG) * Original Cardinality = 1/100 * 100000 = 0.01 * 100000 = 1000

Summary

There are some situations where dropping a potentially ‘unused’ index can have a negative influence on the optimizer’s cardinality estimation. Consider using extended statistics to help the optimizer in order to become more familiar with columns logical relationship. The extended statistics are represented with virtual (hidden) column. Regarding statistics gathering, note that DBMS_STATS.CREATE_EXTENDED_STATS won’t gather statistics automatically, use METHOD_OPT instead. If you have skewed data, you may want to consider creating proper histograms. When histograms are on base columns, you’ll also need to create histograms on the column group in order to be taken into consideration. This is because Oracle gives priority to histograms. You can use column groups extended statistics to preserve cardinality stability as consequence from dropping an index. While it might not be appropriate for all cases, it’s a viable option for the optimizers help in good cardinality estimation.

Lastly, you can use DBMS_STATS.SEED_COL_USAGE procedure to monitor columns relation in filter predicates, join predicates and group by clauses. A report can be generated with DBMS_STATS.REPORT_COL_USAGE where its information can be used for appropriate column groups identification. I look forward to exploring more on that in a future post.

email

Interested in working with Ivica? Schedule a tech call.

About the Author

Senior Database Consultant
Ivica is an Oracle Certified Master 12c and 11g, and a recognized member of the Oracle ACE Program as an Oracle ACE Associate. He is a blogger and active contributor to the Oracle community and presents at many technology conferences. Known for his deep Oracle expertise and ability to troubleshoot quickly and efficiently, Ivica has the skills to solve problems quickly regardless of size and complexity. He is passionate about database performance and stability.

1 Comment. Leave new

Thanks for this post.

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *