Oracle Database

Oracle parallel query hints. A reference series part 2: PQ_REPLICATE

5 min read

Apr 18, 2018 12:00:00 AM

The Mechanics of PQ_REPLICATE in Oracle 12.2

In part 1 of this series, we talked about the basic mechanics of the PQ_DISTRIBUTE hint. I did mention the PQ_REPLICATE, and here are all of the details. Oracle Database 12.2, has added a new feature that allows PQ processes to "read" a table instead of relying on the BROADCAST from other PQ processes. The idea is that in some cases, especially RAC, the commonly accessed tables are already in local cache (SGA) and can be read more quickly from there than having it broadcast-ed by PQ processes. It also reduces the number of star/stops of PQ processes work flows, which for very small tables can have significant overhead. The PQ_REPLICATE controls this feature. It basically says: When you join the table referenced in the hint, and it is a HASH join, don't broadcast the joined table OR RESULTSET, instead read it directly. Note that this is in REVERSE on what you would expect. So you have to think as the PQ_REPLICATE controlling a JOIN and not as an instruction on which table to "replicate". The "read" can result in a buffer cache read, a direct path read, an in-memory table read or a combination. It's just as if each PQ process full scans the table for themselves. Here's an example with the TSMALL table from the Part 1 of the series:

Comparing Distribution Methods: Replicate vs. Broadcast

WITHOUT PQ_REPLICATE-ion

[sql] select /*+PARALLEL(8) NO_PQ_REPLICATE(t1) */* from tlarge t1 join tsmall ts on ts.id = t1.id ; | Id | Operation | Name | E-Rows |E-Bytes| Cost (%CPU)| E-Time | TQ |IN-OUT| PQ Distrib | -------------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 10000 | 10M| 352 (1)| 00:00:01 | | | | | 1 | PX COORDINATOR | | | | | | | | | | 2 | PX SEND QC (RANDOM) | :TQ10001 | 10000 | 10M| 352 (1)| 00:00:01 | Q1,01 | P->S | QC (RAND) | |* 3 | HASH JOIN | | 10000 | 10M| 352 (1)| 00:00:01 | Q1,01 | PCWP | | | 4 | PX RECEIVE | | 10000 | 5390K| 32 (0)| 00:00:01 | Q1,01 | PCWP | | | 5 | PX SEND BROADCAST | :TQ10000 | 10000 | 5390K| 32 (0)| 00:00:01 | Q1,00 | P->P | BROADCAST | | 6 | PX BLOCK ITERATOR | | 10000 | 5390K| 32 (0)| 00:00:01 | Q1,00 | PCWC | | | 7 | TABLE ACCESS STORAGE FULL| TSMALL | 10000 | 5390K| 32 (0)| 00:00:01 | Q1,00 | PCWP | | | 8 | PX BLOCK ITERATOR | | 100K| 52M| 319 (0)| 00:00:01 | Q1,01 | PCWC | | | 9 | TABLE ACCESS STORAGE FULL | TLARGE | 100K| 52M| 319 (0)| 00:00:01 | Q1,01 | PCWP | | -------------------------------------------------------------------------------------------------------------------------- [/sql]

With PQ_REPLICATE-ion

[sql] select /*+PARALLEL(8) PQ_REPLICATE(t1) */* from tlarge t1 join tsmall ts on ts.id = t1.id ; ------------------------------------------------------------------------------------------------------------------------ | Id | Operation | Name | E-Rows |E-Bytes| Cost (%CPU)| E-Time | TQ |IN-OUT| PQ Distrib | ------------------------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 10000 | 10M| 352 (1)| 00:00:01 | | | | | 1 | PX COORDINATOR | | | | | | | | | | 2 | PX SEND QC (RANDOM) | :TQ10000 | 10000 | 10M| 352 (1)| 00:00:01 | Q1,00 | P->S | QC (RAND) | |* 3 | HASH JOIN | | 10000 | 10M| 352 (1)| 00:00:01 | Q1,00 | PCWP | | | 4 | TABLE ACCESS STORAGE FULL | TSMALL | 10000 | 5390K| 32 (0)| 00:00:01 | Q1,00 | PCWP | | | 5 | PX BLOCK ITERATOR | | 100K| 52M| 319 (0)| 00:00:01 | Q1,00 | PCWC | | | 6 | TABLE ACCESS STORAGE FULL| TLARGE | 100K| 52M| 319 (0)| 00:00:01 | Q1,00 | PCWP | | ------------------------------------------------------------------------------------------------------------------------ [/sql]

Critical Performance Trade-offs

The plan looks much simpler and it is possible to miss the fact that it is executed in parallel. I would like to highlight a few key points: 1. In the NO_PQ_REPLICATE case, the table is read with parallelism - meaning with "speed up" factor of 8x (or whatever the parallelism is) vs. the PQ_REPLICATE factor - where it is read single threaded - 8 times (or whatever parallelism is). For small tables - for which this feature is intended - obviously Oracle thought it makes sense to introduce the feature. But for larger tables ... this may actually be slower. If it takes 60 seconds to read the table single threaded - then with PQ_REPLICATE - it would still take 60 seconds (assuming hardware can handle 8x threads to do that at the same time) vs. the NO_PQ_REPLICATE which would take 60 / 8 = 7.5 seconds + some time to send over PQ BUFFERS. 2. On the same topic, in the case of EXADATA and a large enough table to cause a SMART SCAN to occur - in the BROADCAST case (NO_PQ_REPLICATE) - the data from the table would go over the infiniband fabric twice. Once to be read from the storage, and once to be sent over to other nodes over the Infiband. Now not exactly twice, depends on the number of RAC nodes - as local broadcasts would not be sent over the Infiniband network. For example with 4 nodes - it would be 1x to read and 0.75x to sent - a total of 1.75x vs. the PQ_REPLICATE approach - in which case the data from the table would go over as many times as the parallelism, in our case 8x. 3. In the BROADCAST case, the PQ processes reading the table would apply whatever filtering conditions that are possible, including filtering out unused columns. Vs in the PQ_REPLICATE case - all data would be re-read as many times as needed. So obviously there are trade-offs - as it is with every new feature.

Analyzing Single PQ Set Execution

Now if we take the final example of our query in Part 1 and remove the NO_PQ_REPLICATE hint, then the query becomes a single PQ set execution and the plan looks very interesting

[sql] select /*+PARALLEL(8) NO_PX_JOIN_FILTER(T2) LEADING(TS, T1, T2) PQ_DISTRIBUTE(T1 NONE BROADCAST ) SWAP_JOIN_INPUTS(T1) PQ_DISTRIBUTE(T2 NONE BROADCAST ) SWAP_JOIN_INPUTS(T2) */* from tlarge t1 join tlarge_two t2 on t2.id=t1.id join tsmall ts on ts.id = t1.id where t1.mod5_id=1 ; --------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | E-Rows |E-Bytes| Cost (%CPU)| E-Time | TQ |IN-OUT| PQ Distrib | --------------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 10000 | 15M| 671 (1)| 00:00:01 | | | | | 1 | PX COORDINATOR | | | | | | | | | | 2 | PX SEND QC (RANDOM) | :TQ10000 | 10000 | 15M| 671 (1)| 00:00:01 | Q1,00 | P->S | QC (RAND) | |* 3 | HASH JOIN | | 10000 | 15M| 671 (1)| 00:00:01 | Q1,00 | PCWP | | | 4 | TABLE ACCESS STORAGE FULL | TLARGE_TWO | 100K| 52M| 319 (0)| 00:00:01 | Q1,00 | PCWP | | |* 5 | HASH JOIN | | 10000 | 10M| 352 (1)| 00:00:01 | Q1,00 | PCWP | | |* 6 | TABLE ACCESS STORAGE FULL | TLARGE | 20000 | 10M| 319 (0)| 00:00:01 | Q1,00 | PCWP | | | 7 | PX BLOCK ITERATOR | | 10000 | 5390K| 32 (0)| 00:00:01 | Q1,00 | PCWC | | | 8 | TABLE ACCESS STORAGE FULL| TSMALL | 10000 | 5390K| 32 (0)| 00:00:01 | Q1,00 | PCWP | | --------------------------------------------------------------------------------------------------------------------------- [/sql]

The plan looks exceptionally simple. A look at what happens here: 1. PQ Processes set "A" read the T2 (TLARGE_TWO) table and create an in-memory hash table - this happens 8 times (parallelism level) all at the same time 2. Same PQ Processes set "A" read the T1 (TLARGE) table and create an in-memory hash table - this happens 8 times (parallelism level) all at the same time 3. Same PQ Processes set "A" read the TS (TSMALL) table in chunks (Plan Step 7: PX BLOCK ITERATOR) and join T1 and T2 tables via the in-memory hash tables. Reading of TSMALL table is "sped up" 8x (parallelism level) due to chunk-reading. In conclusion: The PQ_REPLICATE hint and feature is an optimisation introduced for very specific use cases, but can speed up and "simplify" an execution plan. It also reduces the number of start/stop cycles for PQ processes, which can have significant overhead, especially with large degrees of parallelism. More on the subject by Jonathan Lewis here: https://jonathanlewis.wordpress.com/2014/03/05/12c-pq_replicate/ and here https://jonathanlewis.wordpress.com/2013/12/08/parallel-execution-3/