Batched table access

Timur Akhmadeev

November 29, 2013

Tags: Dba Lounge, Technical Track, Technical Blog, Sysadmin

When I first saw the suffix BATCHED in an Oracle execution plan from 12c, I was curious to see what is hidden behind and how it works. I had some spare time for testing and wanted to share my findings with you here.

Intro

Here is what I'm talking about: [sourcecode lang="sql" highlight="5"] ------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1000 | 990K| 148 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED| T1 | 1000 | 990K| 148 (0)| 00:00:01 | |* 2 | INDEX RANGE SCAN | T1_Y_INDX | 1000 | | 5 (0)| 00:00:01 | ------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("Y"=TO_NUMBER(:1)) [/sourcecode] Line 1 of this very simple execution plan shows how Oracle 12c added a suffix BATCHED to the table access by a B*Tree index rowsource. I was thinking about the reasons behind this change and how it could be implemented before starting my tests. Why Oracle would want to "batch" table access Usually large index range/full/skip scans with subsequent table access running serially cause lots of single block reads of a table. Depending on the clustering of the data, the number of table block reads could be as high as the number of ROWIDs fetched from index leaf blocks up to the next rowsource. In case of a serial execution plan it means that query performance depends on how fast single random table read is. Say you need to read 1000 random table blocks located far away from each other and average read of 1 block takes 5ms, then you need about 5 seconds to execute such query. But if the storage subsystem can handle concurrent IO requests well enough, and you were able to ask it for 1000 blocks someway concurrently or in parallel, transparent for the end user session, then it could take less wall clock time for a user while putting more pressure on the OS, storage and connectivity to the storage. How Oracle optimizes IO already As far as I know, Oracle can and does a few cunning things with IO even in pre-12.1 releases. Here is a(n incomplete, most likely) list with example optimizations you may see:

Within NESTED LOOP joins there are couple of strategies Oracle uses: NL-join batching and moving TABLE ACCESS out of a join (I've no idea how it is called exactly).
"Prefetching" with 'db file parallel read' - as described by Tanel Poder here (it gives you a very nice idea of what 'db file parallel read's are)
In case of a "cold" buffer cache Oracle may choose to read ahead, and instead of reading just a single block when you think it is enough, Oracle may opt to reading multiple physically adjacent on disk blocks to the cache (aka 'db file scattered read'). Sometimes it could hurt (a lot) the application performance, sometimes it doesn't matter, but the thing is: it's a "normal" thing to experience multi-block buffered reads on what should probably be single block reads.

Test Case

Based on my understanding of what Oracle can possibly do I have created a test scenario which could be used to find out more things behind BATCHED table access. Here is the setup: [sourcecode lang="sql"] drop table t1 cascade constraints purge; create table t1 ( id integer, x integer, y integer, pad varchar2(4000) ); insert /*+ append */ into t1 select rownum, mod(rownum, 1000), floor((rownum-1)/1000), lpad('x', 1000, 'x') from all_source a1, all_source a2 where rownum <= 1e6; create index t1_x_indx on t1(x); create index t1_y_indx on t1(y); exec dbms_stats.gather_table_stats(user, 't1', method_opt=>'for all columns size 1', cascade=>true, no_invalidate=>false); [/sourcecode] Very easy. I have created a sufficiently wide table holding 1 million rows with two integer columns following a very bad (T1.X) and very good (T1.Y) clustering of data. Usually it is also important where are you creating this table. Initially I created it in a standard USERS tablespace (i.e., ASSM, non-uniform extent size), but then switched to a MSSM tablespace with uniform extents of 1MB. Looking ahead, it does not make a difference to the test results (at least I could not identify it.) The test itself: [sourcecode lang="sql"] set linesize 180 pagesize 100 define ^ arraysize 100 col plan_table_output format a180 explain plan for select /*+ index(t1(x)) */ * from t1 where x = :1; select * from table(dbms_xplan.display); explain plan for select /*+ index(t1(y)) */ * from t1 where y = :1; select * from table(dbms_xplan.display); col spid new_value spid col curr_date new_value curr_date select p.spid,to_char(sysdate, 'YYYYMMDDHH24MI') curr_date from v$session s, v$process p where s.paddr = p.addr and s.sid = userenv('sid'); col tracefile new_value tracefile select value tracefile from v$diag_info where name='Default Trace File'; alter system flush buffer_cache; !sleep 1 alter system flush buffer_cache; select object_id, data_object_id, object_name from dba_objects where owner = user and object_name like 'T1%'; set termout off exec dbms_session.session_trace_enable(waits=>true, binds=>false) !strace -tt -p ^spid -o trc_^spid..txt & spool batched_^curr_date..txt select /*+ index(t1(x)) */ * from t1 where x = 1; select /*+ index(t1(y)) */ * from t1 where y = 2; spool off set termout on !orasrp -t --sort=fchela --sys=no ^tracefile orasrp_^spid..txt !cat orasrp_^spid..txt | grep -A 165 fvkg1sp2b73x prompt trace: orasrp_^spid..txt prompt strace: trc_^spid..txt prompt tracefile: ^tracefile exit [/sourcecode] So the test is also really easy, except for some diagnostic & preparation steps. Basically I'm tracing two statements, which are accessing T1 by two indexes respectively, both at OS and Oracle levels, and then parse Oracle trace file with OraSRP. You may want to use tkprof. I also used it initially but OraSRP has one feature which helps to see the waits with breakdown by object, like this: [sourcecode] --------- Time Per Call -------- Object/Event % Time Seconds Calls Avg Min Max -------------------------------------------- -------- ------------ --------- ---------- ---------- ---------- TABLE T1 [88650] db file parallel read 68.9% 4.8404s 26 0.1862s 0.0883s 0.2614s db file sequential read 29.9% 2.1023s 147 0.0143s 0.0014s 0.0636s INDEX T1_X_INDX [88651] db file sequential read 1.1% 0.0746s 5 0.0149s 0.0025s 0.0278s Disk file operations I/O 0.0% 0.0034s 1 0.0034s 0.0034s 0.0034s [/sourcecode]

Testing

I was using VirtualBox with 64-bit OEL 6.4 and Oracle 11.2.0.4 & 12.1.0.1. I also did (partial) tests on 11.2.0.3 running in OVM on a faster storage, and the results were similar to what I've observed with 11.2.0.4. Both instances were running with a pfile, with following parameters specified: [sourcecode] -- 11.2.0.4 *._db_cache_pre_warm=FALSE *.compatible='11.2.0.4.0' *.control_files='/u01/app/oracle/oradata/ora11204/control01.ctl','/u01/app/oracle/fast_recovery_area/ora11204/control02.ctl' *.db_block_size=8192 *.db_cache_size=300000000 *.db_domain='' *.db_name='ora11204' *.diagnostic_dest='/u01/app/oracle' *.dispatchers='(PROTOCOL=TCP) (SERVICE=ora11204XDB)' *.filesystemio_options=setall *.pga_aggregate_target=100000000 *.open_cursors=300 *.processes=100 *.remote_login_passwordfile='EXCLUSIVE' *.shared_pool_size=420430400 *.undo_tablespace='UNDOTBS1' -- 12.1.0.1 *._db_cache_pre_warm=FALSE *.compatible='12.1.0.0.0' *.control_files='/u01/app/oracle/oradata/ora121/control01.ctl','/u01/app/oracle/oradata/ora121/control02.ctl' *.db_block_size=8192 *.db_cache_size=300000000 *.db_domain='' *.db_name='ora121' *.diagnostic_dest='/u01/app/oracle' *.dispatchers='(PROTOCOL=TCP) (SERVICE=ora121XDB)' *.enable_pluggable_database=true *.filesystemio_options='SETALL' *.pga_aggregate_target=100000000 *.open_cursors=300 *.processes=100 *.remote_login_passwordfile='EXCLUSIVE' *.shared_pool_size=420430400 *.undo_tablespace='UNDOTBS1' [/sourcecode]

Test Process & Observations

Initially I started testing with a default database config, and filesystemio_options set to DIRECTIO. After some random tests, I realized that this cache warm up thing is not what I'm interested in right now and turned it off with a hidden parameter. Overall I think that the test results could be explained in the following:

Test results are inconsistent. This is the most irritating thing. However, after I ran the test multiple times in a row, I get a pretty stable outcome. So I consider the results after multiple consecutive runs of the same test. Usually it is just 2 runs, but sometimes more, especially after an instance restart. I've no understanding why it happens and what's behind the scene of the decisions. Maybe it has something to do with CKPT as Tanel mentions in his post on oracle-l, but I did not check (and honestly don't want to :))
Both 11g and 12c show that for a table access of scattered data (by T1_X_INDX index) Oracle may batch table access IO using db file parallel reads; on the OS level it is using io_submit/io_getevents calls to run IO with async API if it's turned on of course; in case of just DIRECTIO in place it uses a bunch of single block reads using pread
Both 11g and 12c can use multi-block access of clustered data (by T1_Y_INDX index) for index range scans (and most likely, full/skip scans too). This is one of the most amusing things: even though I turned off cache warm up, Oracle still can identify that the data I am accessing is well placed altogether, and it decides to read multiple adjacent table blocks at once. 12c, However, behaves differently and by default does not use buffered multi-block table reads
The size of multi-block IO (db file parallel read) is different between 11g and 12c: in 11g it is usually 39, sometimes 19. With 12c, by default the number of requests depends on the client's fetch size: it is equal to the minimum of fetch size and 127
Looks like the parameter _db_file_noncontig_mblock_read_count does not control the actual number of blocks read with db file parallel read; any value greater than 1 turns this feature on and the size of read requests stays the same (I have tested only setting it to 1, 2, 3, 5)
The word BATCHED appeared in execution plans of 12c is controlled with a new hidden parameter _optimizer_batch_table_access_by_rowid. By default the parameter is set to TRUE, so plans tend to include BATCHED in table access rowsource. In the run-time this setting acts very much similar to 11g behavior, so it reads scattered table data with db file parallel reads, except for the number of IO requests which is min(fetch_size, 127). If _optimizer_batch_table_access_by_rowid is set to FALSE on a session level, for example, then the plans generated by Oracle do not include BATCHED suffix in table access rowsource, but in run-time Oracle still uses multi-block IO in the same way as 11g does, i.e. 39 or 19 IO requests per one call and scattered reads of clustered table data are there as well!

Summary

In 12c Oracle changed some internal code path which deals with the batched table access. But important thing is that the batched table access is not new, so even if you disable it either explicitly with _optimizer_batch_table_access_by_rowid or implicitly with optimizer_features_enable, Oracle will still be able to utilize a similar approach as it was in 11g. One important thing, of course, is that by default the size of vector IO now depends on the client fetch size. And I can imagine a situation in which this change could make an impact on the application performance after an upgrade. I have uploaded test script & trace files from 11.2.0.4 and 12.1.0.1 here so if you would like to repeat my tests and compare results - feel free to do that.

Insight and analysis of technology and business strategy