Tuning Latch Contention: Cache-buffers-chain latches

Recently, I had an opportunity to tune latch contention for cache buffers chain (CBC) latches. The problem was high CPU-usage combined with poor application performance. A quick review of the statspack report for 15 minutes showed a latch-free wait as the top event, consuming approximately 3600 seconds in an 8-CPU server. CPU usage was quite high, which is a typical symptom of latch contention, due to the spinning involved. v$session_wait
showed that hundreds of sessions were waiting for latch free event.
SQL> @waits10g SID PID EVENT P1_P2_P3_TEXT ------ ------- ------------ -------------------------------------- 294 17189 latch free address 15873156640-number 127-tries 0 628 17187 latch free address 15873156640-number 127-tries 0 .... 343 17191 latch free address 15873156640-number 127-tries 0 599 17199 latch: cache address 17748373096-number 122-tries 0 buffers chains 337 17214 latch: cache address 17748373096-number 122-tries 0 buffers chains ..... 695 17228 latch: cache address 17748373096-number 122-tries 0 buffers chains .... 276 15153 latch: cache address 19878655176-number 122-tries 1 buffers chains
I will use a two-pronged approach to find the root cause scientifically. First, I’ll find the SQL suffering from latch contention and objects associated with the access plan for that SQL. Next,I will find the buffers involved in latch contention, and map that back to objects. Finally, I will match these two techniques to pinpoint the root cause.
Before I go any further, let’s do a quick summary of the internals of latch operations.
A Brief Introduction to CBC Latches (and The Not-So-Brief Reason Why This is a Complicated Topic to Discuss Briefly)
Latches are internal memory structures that coordinate access to shared resources. Locks (also known as enqueues) are different from latches, the key difference being that enqueues, as the name suggests, provide a FIFO queueing mechanism, while latches do not. On the other hand, latches are held very briefly and locks are usually held longer.
In Oracle SGA, the buffer cache is the memory area into which data blocks are read. (If Automatic Shared Memory Management [ASMM] is in use, part of the shared pool can be tagged as KGH:NO ALLOC and remapped to the buffer cache area too.)
Each buffer in the buffer cache has an associated element in the buffer header array, externalized as x$bh
. Buffer headers keep track of various attributes and state of buffers in the buffer cache. This buffer header array is allocated in the shared pool. The buffer headers are chained together in a doubly-linked list and linked to a hash bucket. There are many hash buckets, and their number is derived and governed by the _db_block_hash_buckets
parameter). Access to these hash chains (both to inspect and change) is protected by cache-buffers-chains latches.
Furthermore, buffer headers can be linked and delinked from hash buckets dynamically.
Here is a simple algorithm to access a buffer (I had to deliberately cut out so as not to deviate too much from our primary discussion):
- Hash the data block address (DBAs: a combination of tablespace, file_id and block_id) to find hash bucket.
- Get the latch protecting the hash bucket.
- If success, walk the hash chain, reading buffer headers to see if a specific version of the block is already in the chain.If found, access the buffer in buffer cache, with protection of buffer pin/unpin actions.
If not found, then find a free buffer in buffer cache, unlink the buffer header for that buffer from its current chain, link that buffer header with this hash chain, release the latch and read block in to that free buffer in buffer cache with buffer header pinned.
- If not success, spin for
spin_count
times and go to step 2. If that latch was not got with spinning, then sleep (with exponentially increasing sleep time with an upper bound), wakeup, and go to step 2.
Obviously, latches are playing crucial role in controlling access to critical resources such as the hash chain. My point is that repeated access to a few buffers can increase latch activity.
There are many CBC latch children. The parameter _db_block_hash_latches
controls the number of latches and is derived from the buffer cache size. Furthermore, in Oracle 10g, shareable latches are used; and inspecting a hash chain necessitates acquiring latches in shared mode, which is compatible with other shared-mode operations. Note that these undocumented parameters are usually sufficient, and changes to these parameters must get approval from Oracle support.
Back to Our Problem . . .
Let’s revisit the problem at hand. The wait graph printed above shows that this latch contention is caused by two types of latches. Latch #127 is a simulator lru latch, and #122 is a cache buffers chains latch.
select latch#, name from v$latch where latch# in (127, 122);
The problem with a “simulator lru” latch is simple: there is a bug with db_cache_advice
(bug number 5918642). If db_cache_advice
is set to ON
, latch contention due to simulator lru latches can be observed for large buffer caches. This issue was fixed quickly by setting db_cache_advice
to OFF
.
After resolving the “simulator lru” latch issue, I saw some improvement in performance — but not much.
Querying v$session
to see what SQL statement(s) is/are causing latch contention. The state column below indicates that processes are not currently waiting for latches, but have waited in the past. 24 sessions are executing the same SQL statement, the last wait in the past is a ‘latch free’ event for these sessions, and yes, these are active sessions. If latch contention is prevalent, querying v$session
as below will show the SQL statements to focus on.
select event, sql_hash_value,state, count(*) from v$session w where event='latch free' and status='ACTIVE' group by sql_hash_value, state , event SQL> / EVENT SQL_HASH_VALUE STATE COUNT(*) ------------- -------------- ------------------- ---------- latch free 3629331128 WAITED KNOWN TIME 24 latch free 673277007 WAITED KNOWN TIME 1 latch free 1378683334 WAITED SHORT TIME 1 latch free 3629331128 WAITED SHORT TIME 5 latch free 2920275581 WAITED SHORT TIME 3
We can find the SQL statement querying v$sql_text
with above hash value 3629331128
. The SQL suffering from latch contention is printed below. Of course, for security reasons, I have changed the actual object names.
select * from v1 WHERE ( col1 IN ( 3, 20, 21, 44, 45, 47, 48, 49, 50, 51, 57, 58, 59, 67, 68, 69, 76, 78, 79, 80, 81, 82, 84,85, 106, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 476, 478, 500, 501, 502) OR col2 IN (3, 20, 21, 44, 45, 47, 48, 49, 50, 51, 57, 58, 59, 67, 68, 69, 76, 78, 79, 80, 81, 82, 84, 85, 106, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 476, 478, 500, 501, 502)) AND UPPER(col3) LIKE :1 and rownum < 200
The explain plan for the above shows that multiple tables are accessed in this view. But at this point, I don’t know which step in this plan is causing latch contention. If you have to guess which of the following tables is causing the issue, check your guess with correct answer later (and become a BAAG member immediately). (I removed a few columns from the plan output to improve readability.)
------------------------------------------------------------------ | Id | Operation | Name |Rows | Cost | ------------------------------------------------------------------ | 0 | SELECT STATEMENT | |17778 | 223| |* 1 | COUNT STOPKEY | | | | | 2 | CONCATENATION | | | | |* 3 | FILTER | | | | |* 4 | HASH JOIN | |17777 | 216| |* 5 | TABLE ACCESS FULL | ORG | 4 | 2| | 6 | NESTED LOOPS | | 197 | 213| | 7 | NESTED LOOPS | | 195 | 18| | 8 | TABLE ACCESS FULL | ORG | 1 | 2| |* 9 | TABLE ACCESS FULL | ORDER | 195 | 16| |* 10 | TABLE ACCESS BY INDEX ROWID| TRADE | 1 | 1| |* 11 | INDEX UNIQUE SCAN | TRADE_PK| 1 | 0| |* 12 | FILTER | | | | | 13 | NESTED LOOPS | | 1 | 7| | 14 | NESTED LOOPS | | 1 | 6| | 15 | NESTED LOOPS | | 1 | 4| |* 16 | TABLE ACCESS FULL | ORG | 1 | 2| |* 17 | TABLE ACCESS FULL | ORG | 1 | 2| |* 18 | TABLE ACCESS FULL | ORDER | 3 | 2| |* 19 | TABLE ACCESS BY INDEX ROWID | TRADE | 1 | 1| |* 20 | INDEX UNIQUE SCAN | TRADE_PK| 1 | 0| ------------------------------------------------------------------
Researching Further
Re-querying v$session_wait
, I see that a couple of latches are hot. I will consider one latch children with latch address 19875043200
as an example, and drill down further.
SID PID EVENT P1_P2_P3_TEXT ------ ---------- ----------- ---------------------------------------- 578 17220 latch:CBC address 19875043200-number 122-tries 0 664 17226 latch:CBC address 19875043200-number 122-tries 0 695 17228 latch:CBC address 19875043200-number 122-tries 0 701 23987 latch:CBC address 19875043200-number 122-tries 0 ...
Converting this latch address — 19875043200
— from decimal to hex yields 4A0A51780
. But the latch address is 16 bytes long, so prefix it with zeros and querying v$latch_children
to see activity against that latch children.
select addr, latch#, child#, level#, gets from v$latch_children where addr='00000004A0A51780' SQL> / ADDR LATCH# CHILD# LEVEL# GETS ---------------- ---------- ---------- ---------- ---------- 00000004A0A51780 122 10437 1 23672075 SQL> / ADDR LATCH# CHILD# LEVEL# GETS ---------------- ---------- ---------- ---------- ---------- 00000004A0A51780 122 10437 1 23672209
I repeated the execution of above SQL almost immediately, which showed an increase of 134 gets in sub-seconds. The above step also helps to validate the latch address. Comparing with the latch type, I see that this latch address is indeed the Cache buffers chains latch.
Hang Those Buffers!
Next, I needed to find buffers protected by these latch children, and then find the buffers causing latch contention. Many such hash buckets (and so, numerous buffers) are protected by a latch children. Fortunately, column tch
can be used effectively to identify hot block(s). Almost every access to a buffer increments the tch
value for that buffer header. The idea here is to find buffers protected by that latch and identify buffers with higher touch counts. Those buffers are probable candidates for further analysis.
The x$bh
table and v$latch_children
can be joined to find those buffer attributes. (By the way, the following SQL can easily be rewritten to print buffers protected by the top latch, say, by sleeps.)
select hladdr, file#, dbablk, decode(state,1,'cur ',3,'CR',state) ST, tch from x$bh where hladdr in (select addr from (select addr from v$latch_children where addr='00000004A0A51780' order by sleeps, misses,immediate_misses desc )where rownum <2) HLADDR FILE# DBABLK ST TCH ---------------- ---------- ---------- ---- ------ 00000004A0A51780 1 52351 cur 3 00000004A0A51780 16 701009 cur 24 00000004A0A51780 16 23959 cur 182 00000004A0A51780 15 16215 cur 2855 <-- 00000004A0A51780 26 693 cur 9 00000004A0A51780 9 52872 cur 2935 <-- 00000004A0A51780 8 45128 cur 1831 <-- 00000004A0A51780 16 635473 cur 560 00000004A0A51780 25 233403 cur 51 00000004A0A51780 25 97993 cur 110 00000004A0A51780 4 97273 cur 43 00000004A0A51780 25 268340 cur 0 12 rows selected.
There are at least three blocks with higher activity since their tch
value is much higher. A note of caution: buffer cache activity is quite dynamic, and this analysis needs to be performed during the period of latch contention. Performing this analysis a few hours after latch contention will lead to an incorrect diagnosis.
We need to associate these three hot blocks with object names. With following script, the file number and DBAblk
can be used to find the object.
accept h_file_id prompt ' Enter file_id ==>' accept h_block_id prompt ' Enter block_id==>' set verify off column owner format A10 column segment_name format A20 column segment_type format A10 column hdrfile format 9999 column curfile format 9999 column curblk format 99999999 column hdrblock format 99999999 select owner, segment_name,partition_name, segment_type, file_id,block_id from dba_extents where file_id = &&h_file_id and block_id &&h_block_id; set verify on
For example: to find the object_name
for buffer with tch
of 2855
, supply 15
for the file_id
and 16215
for the block_id
.
I don’t favour the above script, since its performance is not optimal. It is much easier to dump the blocks and convert them to object ids. Let’s dump these three blocks.
alter system dump datafile 15 block min 16215 block max 16215; alter system dump datafile 9 block min 52872 block max 52872; alter system dump datafile 8 block min 45128 block max 45128;
Reading the trace files, I see three different segments, and one line per block is printed below from the trace file.
seg/obj: 0xd756 csc: 0x00.17b5fe4f itc: 2 flg: E typ: 1 - DATA seg/obj: 0x1801a csc: 0x00.1cb9f0ab itc: 2 flg: E typ: 1 - DATA seg/obj: 0x181c5 csc: 0x00.1bef7a59 itc: 169 flg: E typ: 2 - INDEX
The seg/obj
field is the object_id
printed in hex. Converting the hex numbers d756
, 1801a
, and 181c5
to decimal equivalents results in 55126
, 98330
, and 98757
.
Now I can query dba_objects
to find the object_names
.
select owner, object_id, object_name, data_object_id from dba_objects where object_id in (55126, 98330,98757) or data_object_id in (55126, 98330,98757) SQL> / OWNER OBJECT_ID OBJECT_NAME DATA_OBJECT_ID ------------ ---------- ----------------- -------------- SOME_USER 55126 ORDER 55126 SOME_USER 98330 GBLOCK 98330 SOME_USER 98757 ALLOCATION_OID 98757
Comparing explain plans and object_names printed above, we can see that ORDER table is a common object between these two techniques.
| 7 | NESTED LOOPS | | 195 | 14040 | 18 | | 8 | TABLE ACCESS FULL | ORG | 1 | 34 | 2 | |* 9 | TABLE ACCESS FULL | ORDER | 195 | 7410 | 16 | ... | 14 | NESTED LOOPS | | 1 | 106 | 6 | | 15 | NESTED LOOPS | | 1 | 68 | 4 | |* 16 | TABLE ACCESS FULL | ORG | 1 | 34 | 2 | |* 17 | TABLE ACCESS FULL | ORG | 1 | 34 | 2 | |* 18 | TABLE ACCESS FULL | ORDER | 3 | 114 | 2 | 9 - filter(UPPER(UPPER("GO"."ORD_ID")) LIKE :1 AND "GO"."IS_MOD"='F') 18 - filter(UPPER(UPPER("GO"."ORD_ID")) LIKE :1 AND "GO"."IS_MOD"='F')
A Quick Summary
Let me summarize what I’ve done so far.
- I found one latch children address, located all buffers protected by that latch, found buffers with high
tch
, and queried to find object names for those buffers. - Through
v$session_wait
I found the SQL hash value, found SQL suffering from latch contention, and generated the explain plan.
From these two different techniques, I can find objects common to both steps 1 and 2 and those objects are probable candidates to focus on. I see that the ORDER
table is common to both techniques. From the plan above, the ORDER
table is accessed in a tight nested-loop join. This will increase buffer access to ORDER
table, which in turn results in higher latching activity.
SQL Tuning: That Was Easy
From here on, the solution is straightforward — I need to avoid tight nested-loop joins. Specifically, the if inner tables in the nested-loop join are accessed with Full Table Scan access, that can cause increased latching activity. For every row from the outer row source, the inner row source is queried, in a nested-loops join. A hash join might be a better access method. In a hash join, tables are scanned once and hashed, reducing latching activity.
In this specific case, ORDER
is a small table. Further analysis reveals that the CBO chose nested-loop joins, since rownum
triggers first_rows optimizer_mode
. As a test, let’s remove the rownum
clause to see what plan we get.
explain plan for select * from v1 WHERE ( col1 IN ( 3, 20, 21, 44, 45, 47, 48, 49, 50, 51, 57, 58, 59, 67, 68, 69, 76, 78, 79, 80, 81, 82, 84, 85, 106, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 476, 478, 500, 501, 502) OR col2 IN (3, 20, 21, 44, 45, 47, 48, 49, 50, 51, 57, 58, 59, 67, 68, 69, 76, 78, 79, 80, 81, 82, 84, 85, 106, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 476, 478, 500, 501, 502)) AND UPPER(col3) LIKE :1 --and rownum select * from table(dbms_xplan.display); --------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | ---------------------------------------------------------------- | 0 | SELECT STATEMENT | | 4902 | 818K| 4014| |* 1 | HASH JOIN | | 4902 | 818K| 4014| | 2 | TABLE ACCESS FULL | ORG | 370 | 12950 | 5| |* 3 | HASH JOIN | | 17513 | 2325K| 4008| | 4 | TABLE ACCESS FULL | ORG | 370 | 12950 | 5| |* 5 | HASH JOIN | | 17479 | 1724K| 4002| |* 6 | TABLE ACCESS FULL| ORDER | 17331 | 643K| 1320| | 7 | TABLE ACCESS FULL| TRADE | 353K| 21M| 2680| ---------------------------------------------------------------
After commenting the rownum
clause, the CBO chose hash join to join that table. Comparing execution times between these two versions, running both of these SQL statements in parallel, the original query consumed almost 70 CPU seconds per execution, while the query with rownum clause commented out, consumed just 0.5 seconds of CPU time. Essentially, the all_rows
optimizer_mode
should be used for this SQL, even if the rownum
predicate is used. Fortunately, this query is accessing a view (and only similar queries are accessing that view), so adding an all_rows
hint to that view resolved the latch contention.
We used a two-pronged approach: find objects causing latch contention, and match those objects with execution plan of any SQL statements suffering latch contention; then, resolve the issue with a minor change to the view. (The environment: Oracle version 10.2.0.3 on a Linux server.)
Of course, this can be read as PDF file from resolving-latch-contention-cbc.