Interval Partitioning and Parallel Query Limit Access Paths

Jul 4, 2012 / By Christo Kutrovsky

Tags: , ,

Interval partitioning – this ability to create partitions on the fly was introduced in 11g. When the feature came out, there were several nasty bugs. Some of them were related to parsing queries and determining “partition equivalence”; others were performance bugs, when a lot of partitions were created as a result of loading data in a single operation.

But bugs are bugs, and eventually they get fixed. There are still a few bugs in regards to Parallel merge with partition based loading. However, hidden feature limiting specifics are annoying and can be nasty.

One such particular “limitation” has to do with parallel group by on the partition key. If you want to see just that part, skip towards the end, but I think reading the whole post will offer some insight on how Oracle Parallel Query works.

Consider this simple, interval based table that ends up creating 12 partitions:

create table mytest (id not null, filler not null) partition by range(id) interval (1000) (partition P1 values less than (0)) storage (initial 1M) as
select rownum, rpad('x',100,'x') from dual connect by level <=10000;

Now let’s review the query plan for:

select id, count(*) from mytest group by id;

-------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |  10794 |   137K|    58   (2)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE ALL        |        |  10794 |   137K|    58   (2)| 00:00:01 |     1 |1048575|
|   2 |   HASH GROUP BY             |        |  10794 |   137K|    58   (2)| 00:00:01 |       |       |
|   3 |    TABLE ACCESS STORAGE FULL| MYTEST |  10794 |   137K|    57   (0)| 00:00:01 |     1 |1048575|
-------------------------------------------------------------------------------------------------------

Notice that the HASH GROUP BY is at the partition level, meaning that each partition will run a “HASH GROUP BY” on the data of that partition alone, as opposed to the entire table.

Example of “whole table” group by:

select round(id), count(*) from mytest group by round(id);

-------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |  10794 |   137K|    58   (2)| 00:00:01 |       |       |
|   1 |  HASH GROUP BY              |        |  10794 |   137K|    58   (2)| 00:00:01 |       |       |
|   2 |   PARTITION RANGE ALL       |        |  10794 |   137K|    57   (0)| 00:00:01 |     1 |1048575|
|   3 |    TABLE ACCESS STORAGE FULL| MYTEST |  10794 |   137K|    57   (0)| 00:00:01 |     1 |1048575|
-------------------------------------------------------------------------------------------------------

In the first case, you need only the amount of RAM for the LARGEST partition in order to run fully in RAM, while in the second you need the size of THE ENTIRE RESULT SET. Also in the first case, the query would produce results as soon as the first partition is scanned, while in the second data, results will be produced once all partitions are scanned. There’s a big difference between the two, and the difference gets “bigger” with the total number of partitions.

Now let’s review the parallel case:

select /*+ parallel(4) */ round(id), count(*) from mytest group by round(id);

-------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name     | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR                  |          |        |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10001 |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,01 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY                 |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                   |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   5 |      PX SEND HASH                | :TQ10000 |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,00 | P->P | HASH       |
|   6 |       HASH GROUP BY              |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|   7 |        PX BLOCK ITERATOR         |          |  10794 |   137K|    16   (0)| 00:00:01 |     1 |1048575|  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS STORAGE FULL| MYTEST   |  10794 |   137K|    16   (0)| 00:00:01 |     1 |1048575|  Q1,00 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------------------

Notice the operation is a two-step process. First, a number of slaves read sections of the table (BLOCKS) and perform a localized “group by” (steps 8 to 6). Then, they send the data to another set of PQ slaves, based on the group by key to do the global grouping (steps 5 to 3). Note that step 5 runs in the “producers”, while step 4 runs in the receivers. Finally, the result is streamed to the coordinator (steps 2 to 1), which sends it back to the client (step 0).
If running on one server, the amount of memory required is the size of your entire result set + some memory to perform the localized group by’s before sending data (step 6). On RAC, the slaves would be distributed on each node, based on load. So the memory usage would vary, but usually it would be equal on each node.

Now you can control whether the localized group happens or not via the NO_GBY_PUSHDOWN hint. The plan without it would look like this:

select /*+ parallel(4) NO_GBY_PUSHDOWN */ round(id), count(*) from mytest group by round(id);

------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR                 |          |        |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10001 |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,01 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY                |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                  |          |  10794 |   137K|    16   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   5 |      PX SEND HASH               | :TQ10000 |  10794 |   137K|    16   (0)| 00:00:01 |       |       |  Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR         |          |  10794 |   137K|    16   (0)| 00:00:01 |     1 |1048575|  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS STORAGE FULL| MYTEST   |  10794 |   137K|    16   (0)| 00:00:01 |     1 |1048575|  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------------

As you can see, the old Step 6 HASH GROUP BY is gone, but everything else is the same. In my particular case, it makes no sense to do the localized group by, as my data is mostly unique. Thus, there would be no reduction of the amount of data sent, and it would end up doing a lot of “double work”.

Now moving into the specifics of how INTERVAL partitioning affects all this.

If you were reading carefully, you may be asking yourself: Why is Oracle not leveraging the fact that the table is partitioned on the group by key, like in the serial plan? And that’s where the problem starts.

Oracle has a hidden parameter called _px_partition_scan_threshold which defaults to 64, and the description is “least number of partitions per slave to start partition-based scan”. So in our case, with parallel 4, Oracle would want a table with at least 4 * 64 = 256 partitions before considering a partition based scan. This appears to be introduced to ensure that there would be enough work for each slave, and they won’t stay idle. However, that doesn’t take into consideration the optimizations that can be achieved by doing a partition scan for group by operations.

For this, we need to hint this parameter via opt_param to 1 in order to force a partition based scan. Note that you still need to have at least as many partitions as slaves, or Oracle would revert to BLOCK iterator. So let’s give this a try:

select /*+ parallel(4) opt_param('_px_partition_scan_threshold',1)  */ id, count(*) from mytest group by id;

-------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name     | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR                  |          |        |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10001 |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,01 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY                 |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                   |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   5 |      PX SEND HASH                | :TQ10000 |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,00 | P->P | HASH       |
|   6 |       HASH GROUP BY              |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|   7 |        PX BLOCK ITERATOR         |          |  10794 |   137K|    16   (0)| 00:00:01 |     1 |1048575|  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS STORAGE FULL| MYTEST   |  10794 |   137K|    16   (0)| 00:00:01 |     1 |1048575|  Q1,00 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------------------

And nothing happened. The plan is exactly the same as the original one. I’ve used such execution paths in the past, and when Oracle refused to use it I was puzzled at first. It took me a while to figure out.

We need to convert the range-interval table to a pure range table. Fortunately, you can do this on the fly, but it requires a quick lock on the entire table. This also moves the “anchor” partition to the highest in the table, which allows you to drop the lowest partitions before the anchor partition. But it also prevents new partitions to be “inserted” if you had gaps in your range.

alter table mytest set interval ();

select /*+ parallel(4) opt_param('_px_partition_scan_threshold',1)  */ id, count(*) from mytest group by id;

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR               |          |        |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)         | :TQ10000 |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX PARTITION RANGE ALL     |          |  10794 |   137K|    17   (6)| 00:00:01 |     1 |    12 |  Q1,00 | PCWC |            |
|   4 |     HASH GROUP BY             |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|   5 |      TABLE ACCESS STORAGE FULL| MYTEST   |  10794 |   137K|    16   (0)| 00:00:01 |     1 |    12 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------------------

Note how drastically different the plan is now. Step 3 “PX PARTITION RANGE ALL” is the key here, as well as the lack of send/receive except for the QC (query coordinator). The way this will be executed is that 4 slaves (as opposed 8 in previous cases) would be created and each will scan a single partition (step 5), perform a group by operation (step 4) , and send the results to the QC.

If running on the same server, the amount of RAM required to perform this operation entirely in memory would be that of a single partition (assuming uniform distribution) times the parallelism level. Note that this will be the same amount of RAM regardless of the size of the entire table. Even if you were to process a table of 4 TB, you would do so in pieces of your partition size. Compare that to the original approach of processing the entire result set in one shot. Note that this is also significantly more CPU efficient, as the data would be sent only once from the slaves to the QC, as opposed to twice in the original case. Once from producer slaves to consumer slaves, and a second time from consumer slaves to the QC. The fact that it would be double-processed due to the GBY pushdown operation is a separate CPU efficiency consideration.

Use cases

The above query is typical for performing de-duplication procedures on large amounts of data. The execution plan above is the most efficient possible way and will scale very well across multiple RAC servers. The original approach, however, would instead shuffle large amounts of data across multiple nodes as each slave processes a PX BLOCK RANGE.

One more “quirk” worth mentioning is that the ANALYTICS functions such as “rank() over ()” and “row_number() over ()” also cannot perform partition-based operations. For example, there are two types of queries you could use to find duplicated data:

select id, count(*) from mytest group by id having count(*) > 1;

--------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |  10794 |   137K|    58   (2)| 00:00:01 |       |       |
|*  1 |  FILTER                      |        |        |       |            |          |       |       |
|   2 |   PARTITION RANGE ALL        |        |  10794 |   137K|    58   (2)| 00:00:01 |     1 |    12 |
|   3 |    HASH GROUP BY             |        |  10794 |   137K|    58   (2)| 00:00:01 |       |       |
|   4 |     TABLE ACCESS STORAGE FULL| MYTEST |  10794 |   137K|    57   (0)| 00:00:01 |     1 |    12 |
--------------------------------------------------------------------------------------------------------

and

select * from (select t.*, row_number() over (partition by id order by null ) r from mytest t) where r>1;

--------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |  10794 |   822K|    58   (2)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE ALL         |        |  10794 |   822K|    58   (2)| 00:00:01 |     1 |    12 |
|*  2 |   VIEW                       |        |  10794 |   822K|    58   (2)| 00:00:01 |       |       |
|   3 |    WINDOW SORT               |        |  10794 |   685K|    58   (2)| 00:00:01 |       |       |
|   4 |     TABLE ACCESS STORAGE FULL| MYTEST |  10794 |   685K|    57   (0)| 00:00:01 |     1 |    12 |
--------------------------------------------------------------------------------------------------------

These would perform in a similar way, although one does HASH aggregation while the other does a SORT aggregation. The overall memory usage and CPU should be similar in the generic case.

However, after a switch to parallel, the row_number() loses badly:

select /*+ parallel(4)  opt_param('_px_partition_scan_threshold',1)  */ id, count(*) from mytest group by id having count(*) > 1;
-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR                |          |        |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)          | :TQ10000 |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,00 | P->S | QC (RAND)  |
|*  3 |    FILTER                      |          |        |       |            |          |       |       |  Q1,00 | PCWC |            |
|   4 |     PX PARTITION RANGE ALL     |          |  10794 |   137K|    17   (6)| 00:00:01 |     1 |    12 |  Q1,00 | PCWC |            |
|   5 |      HASH GROUP BY             |          |  10794 |   137K|    17   (6)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|   6 |       TABLE ACCESS STORAGE FULL| MYTEST   |  10794 |   137K|    16   (0)| 00:00:01 |     1 |    12 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------------------

vs.

select /*+ parallel(4)  opt_param('_px_partition_scan_threshold',1)  */ * from (select t.*, row_number() over (partition by id order by null ) r from mytest t) where r>1;

-------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name     | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |          |  10794 |   822K|    17   (6)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR                  |          |        |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10001 |  10794 |   822K|    17   (6)| 00:00:01 |       |       |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    VIEW                          |          |  10794 |   822K|    17   (6)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   4 |     WINDOW SORT                  |          |  10794 |   685K|    17   (6)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE                  |          |  10794 |   685K|    16   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   6 |       PX SEND HASH               | :TQ10000 |  10794 |   685K|    16   (0)| 00:00:01 |       |       |  Q1,00 | P->P | HASH       |
|   7 |        PX PARTITION RANGE ALL    |          |  10794 |   685K|    16   (0)| 00:00:01 |     1 |    12 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS STORAGE FULL| MYTEST   |  10794 |   685K|    16   (0)| 00:00:01 |     1 |    12 |  Q1,00 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------------------

The parallel analytic row_number does not leverage the fact that the table is partitioned on the group by (in this case partition by) key. The row number would require as much memory as the entire result set, unlike the partition approach of group by.

4 Responses to “Interval Partitioning and Parallel Query Limit Access Paths”

  • GregG says:

    Good to know. I’ve learned about reading parallel plans .
    Thanks You.
    GregG

  • Vyacheslav Rasskazov says:

    Thanks for the post.
    But I didn’t see any influence of _px_partition_scan_threshold parameter. All tests wholly repeatable without opt_param(‘_px_partition_scan_threshold’,1)hint at least in 11.2.0.2.

  • Hi Christo,

    that’s an interesting blog post.

    It is clearly a pity that at present Oracle doesn’t support controlling the parallel distribution of non-join operations like GROUP BY, ORDER BY, Analytic Functions etc. like it does for joins / loads (PQ_DISTRIBUTE hint).

    The outlines for your example queries do look exactly the same, so this cannot be enforced via explicit hints (although it might be possible indirectly, I haven’t tried much in that regard).

    Note that Analytic Functions in principle do allow a Push-Down into Parallel Slaves. If you for example change your example query for the ROW_NUMBER() function into “r<=1" (which is of course not what you want in this case) then you'll see a WINDOW CHILD PUSHED RANK operation on partition level. Although the plan still does some unnecessary post-processing it shows that in principle the operation is supported, but obviously not in every case.

    Randolf

  • [...] as long as old and known bugs are fixed. So when 12c becomes available he will first check if the limitations for parallel query on interval partitions are [...]

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>