Options for Tracing Oracle dbms_stats
A Baffling Case of ORA-00942Recently we had a good reason to find out what was going on behind the scenes with dbms_stats.The automatic stats collection job in a client's database was frequently failing due to ORA-00942 table not found errors. There were a number of red herrings involved in finding a solution to this issue.
Red HerringsOne of the red herrings involved the number of tables that were being created and dropped on a daily basis. At the time this was occurring, up to 200,000 tables may have been created and dropped daily. Running a few tests though caused us to discard this theory. Here is what happens if you delete a table while stats are being gathered on the table:
- T1: Session 1: create table xx as select * from sometable with 7m rows
- T2: Session 2: exec dbms_stats.gather_table_stats('JKSTILL','XX',estimate_percent => 100)
- T3: Session 1: drop table xx;
- T4: Session 2: after 20 seconds or so: ORA-08103: object no longer exists
Enable TracingThe next obvious step seemed to be setting an event in the instance that would cause a trace file to be generated when an ORA-00942 was encountered. Following is the command used: [code lang="sql"] alter system set events '942 trace name errorstack level 3’; [/code] We set the parameter, and that evening the ORA-00942 did appear in the alert log. We checked the generated trace file and found nothing of value. The only information in the trace file was the Oracle banner and the session information. A search was made of all trace files, and nothing was found. As this was a three node RAC system, all nodes were searched. This was a little perplexing, as testing this event on another 11.2 RAC system provided a trace file. There were a couple of differences: the production RAC was Solaris, while the testing was performed on a Linux system. What may be more likely, however, is that jobs run via the autotask process may have been ignoring the events directive - more on that later. (At this point I need to mention that we may have neglected to run the ‘alter system set events’ on all instances. However, that may have been a case of serendipity as you shall see.)
DBMS_STATS TracingAfter some diligent searching, I found that dbms_stats has its own built in tracing routines. Tracing is enabled by calling dbms_stats.set_global_prefs(‘trace’,<trace flags>) Following are the possible values for the trace flags: 1 = use dbms_output.put_line instead of writing into trace file 2 = enable dbms_stat trace only at session level 4 = trace table stats 8 = trace index stats 16 = trace column stats 32 = trace auto stats - logs to sys.stats_target$_log 64 = trace scaling 128 = dump backtrace on error 256 = dubious stats detection 512 = auto stats job 1024 = parallel execution tracing 2048 = print query before execution 4096 = partition prune tracing 8192 = trace stat differences 16384 = trace extended column stats gathering 32768 = trace approximate NDV (number distinct values) gathering Use a combination of flags to combine different tracing levels. In our case, we wanted to trace the auto stats job, and dump a trace on error, so the following options were used: [code lang="sql"]exec dbms_stats.set_global_prefs('trace',to_char(512+128))[/code] The next day, however, the story was a familiar one: ORA-00942 appeared in the alert log as the reason the stats job failed, but no useful information appeared in the trace files.
Even More TracingTime to get more aggressive. I would replace the default stats gathering job with a wrapper that would enable both the dbms_stats specific tracing, and 10046 tracing, all in the same session that was running the stats job. The following steps were taken to finally get a useful trace of the stats job, and find out just what statement was causing the ORA-00942 error. First, a wrapper was created: [code] create table last_sql_tab ( sql_date timestamp with local time zone not null, sql_error number not null, sql_id varchar2(13) not null, sql_fulltext clob not null ) tablespace sysaux / -- db_stats_wrapper.sql -- use a wrapper for dbms_stats.gather_database_stats_job_proc -- so that tracing can be set if needed create or replace procedure sys.db_stats_wrapper as error_code number; v_sql_fulltext clob; v_curr_sqlid varchar2(13); v_prev_sqlid varchar2(13); begin -- 2 = enable dbms_stat trace only at session level -- 128 = dump backtrace on error -- 512 = auto stats job -- 2048 = print query before execution dbms_application_info.set_module( module_name => 'sys.db_stats_wrapper', action_name => 'auto optimizer stats collection' ); dbms_stats.set_param('trace',to_char(2 + 2128 + 512 + 2048)); execute immediate q'[alter session set events '10046 trace name context forever, level 4']'; dbms_stats.gather_database_stats_job_proc; exception when others then error_code := sqlcode; select sql_id, prev_sql_id into v_curr_sqlid, v_prev_sqlid from v$session where sid = sys_context('userenv','sid'); -- hopefully the sql_id and prev_sql_id are still available in v$session -- and v$sqlstats insert into last_sql_tab(sql_date, sql_error, sql_id, sql_fulltext) select systimestamp , error_code, sql_id, sql_fulltext from v$sqlstats where sql_id in (v_curr_sqlid,v_prev_sqlid); commit; -- reset to the default gather stats program once we have a failure with trace file DBMS_SCHEDULER.SET_ATTRIBUTE ( name => 'GATHER_STATS_PROG', attribute => 'program_action', value => 'dbms_stats.gather_database_stats_job_proc' ); end; / [/code] The next step was to set the new wrapper as the procedure for the auto stats job to run: [code] BEGIN DBMS_SCHEDULER.SET_ATTRIBUTE ( name => 'GATHER_STATS_PROG', attribute => 'program_action', value => 'sys.db_stats_wrapper' ); END; / [/code] The next day, success! While the stats job failed as expected with ORA-00942, this time we had useful trace files. While a lot of interesting information could be found in the trace files created by the dbms_stats tracing mechanism, ultimately the issue was found in the 10046 output, near the end of the trace file: [code] ERROR #18446740225354926104:err=942 tim=5273897072222[/code] Looking for the first occurrence of the cursor address, the failing statement was located. I won’t reproduce all of it here as it is quite large (16k). There is no doubt that the query is used to gather stats: [code] PARSING IN CURSOR #18446740225354926104 len=16707 dep=2 uid=0 oct=189 lid=0 tim=5270675200777 hv=3012852732 ad='1991c6b148' sqlid='63n9pwutt8yzw' MERGE /*+ dynamic_sampling(ST 4) dynamic_sampling_est_cdn(ST) */ INTO STATS_TARGET$ ST USING (SELECT STALENESS, OSIZE, OBJ#,… [/code] So the culprit has been found, now what? There was nothing obviously amiss with the SQL statement, so a new SR was opened with Oracle Support. After a couple days of discussions with Support, we were no closer to a solution. It seems this was something they had not seen until now.
The Value of NetworkingWhile discussing this topic with one of my fellow DBAs another topic came up. During the discussion, my co-worker mentioned that he had experienced issues caused by one of Oracle’s hidden parameters being set to a non-default value. The parameter in question: _optimizer_push_pred_cost_based Excusing myself, I immediately logged on to the database and checked the value for that parameter. [code lang="sql"] 1 select 2 a.KSPPINM NAME 3 , a.KSPPDESC DESCRIPTION 4 , b.KSPFTCTXVL VALUE 5 , decode(b.KSPFTCTXDF,'TRUE','Y','N') ISDEFAULT 6 from X$KSPPI a, x$ksppcv2 b 7 where b.kspftctxpn = a.indx+1 8 and a.KSPPINM like '_optimizer_push_pred_cost_based' 9* order by name 14:43:31 oravm01.jks.com - sys@oravm1 SQL> / DEF PARAMETER NAME DESCRIPTION VALUE VAL? ----------------------------------- -------------------------------- -------- ---- _optimizer_push_pred_cost_based use cost-based query transformat FALSE N ion for push pred optimization [/code] The parameter was set to a non-default value. Might this be the cause of the problem?
The Culprit is FoundA search of Oracle Support revealed exactly what I was looking for: Bug 14341002 : WRONG RESULT ON "UNION-ALL PARTITION" "_OPTIMIZER_PUSH_PRED_COST_BASED=FALSE" This note contains test cases to show what can happen when _optimizer_push_pred_cost_based = FALSE and there are UNION ALL set operators present in the SQL. Going back to the trace file, I checked to see if SQL in question contained any UNION ALL operators. The answer is yes, it did, as 11 of them were found in the SQL. It looked like the solution had been found.
The Final TestOnce again a wrapper was employed to test the effects of changes on the auto stats job. This time the purpose was not to trace, but to reset the value of _optimizer_push_pred_cost_based back to the default value of TRUE for the session that would gather database statistics. [code lang="sql"] -- db_stats_wrapper_2.sql -- use a wrapper for dbms_stats.gather_database_stats_job_proc -- setting parameters that will allow stats to (usually) complete without ORA-00942 create or replace procedure sys.db_stats_wrapper_2 as error_code number; v_sql_fulltext clob; v_curr_sqlid varchar2(13); v_prev_sqlid varchar2(13); n_table_not_found number := -942; table_not_found exception; pragma exception_init(table_not_found,-942); begin dbms_application_info.set_module( module_name => 'sys.db_stats_wrapper_2', action_name => 'auto optimizer stats collection' ); -- default value for _optimizer_push_pred_cost_based is true - database is set to false -- setting to false is known to cause incorrect results with UNION ALL in 22.214.171.124 -- the SQL in dbms_stats that gets ORA-00942 has several UNION ALL execute immediate q'[alter session set "_optimizer_push_pred_cost_based" = true]'; dbms_stats.gather_database_stats_job_proc; exception when others then error_code := sqlcode; select sql_id, prev_sql_id into v_curr_sqlid, v_prev_sqlid from v$session where sid = sys_context('userenv','sid'); -- hopefully the sql_id and prev_sql_id are still available in v$session -- and v$sqlstats insert into last_sql_tab(sql_date, sql_error, sql_id, sql_fulltext) select systimestamp , error_code, sql_id, sql_fulltext from v$sqlstats where sql_id in (v_curr_sqlid,v_prev_sqlid); commit; -- make another attempt to gather stats if error_code = n_table_not_found then dbms_stats.gather_database_stats_job_proc; end if; end; / -- set the new procedure as the stats gather job BEGIN DBMS_SCHEDULER.SET_ATTRIBUTE ( name => 'GATHER_STATS_PROG', attribute => 'program_action', value => 'sys.db_stats_wrapper_2' ); END; / [/code] The next day was a happy one as the statistics ran without error and continued to do so each day. Soon it was determined that the _optimizer_push_pred_cost_based parameter could be permanently set back to its default value of FALSE. So ends the story. But, what about that DMBS_STATS tracing?
DBMS_STATS Tracing – Take 2Ultimately this tracing did not help us in troubleshooting this issue. Nonetheless this is quite an interesting undocumented feature of DBMS_STATS. I have not yet spent too much time working with it, but a cursory examination of trace file recently created to look at how Oracle goes about calculating the number of distinct values (NDV) for columns shows that it is quite verbose. Following is a partial example DBMS_STATS: Start gather table stats -- tabname: MYTABLE DBMS_STATS: Started table MYUSER.MYTABLE.SYS_P100050 at 16-OCT-13 06.36.51.805782000 PM -07:00. Execution phase: 1 stattype: DATA DBMS_STATS: Specified granularity = ALL, Fixed granularity = ALL DBMS_STATS: parameters ** pfix: granularity: Z gIdxGranularity: executionPhase: 1 pname: SYS_P100050 stime: 10-16-2013 18:36:51 method_opt: FOR ALL COLUMNS SIZE AUTO DBMS_STATS: Start construct analyze using SQL .. Execution Phase: 1 granularity: ALL global_requested: pfix: DBMS_STATS: gather stats on subpartition SYS_SUBP202384:stats are stale DBMS_STATS: Start gather_stats.. pfix: ownname: MYUSERtabname: MYTABLE pname: SYS_P202408 spname: SYS_SUBP202384 execution phase: 1 DBMS_STATS: Using approximate NDV pct=0 DBMS_STATS: NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME DBMS_STATS: Y Y Y Y Y Y Y Y CLIENT_ID DBMS_STATS: Y Y Y Y Y Y GROUP_ID DBMS_STATS: Y Y Y Y Y Y Y CONTACT_ID DBMS_STATS: Y Y Y Y Y Y Y MYDATE DBMS_STATS: Approximate NDV Options DBMS_STATS: NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL … DBMS_STATS: Building Histogram for CLIENT_ID DBMS_STATS: bktnum=-1, nnv=13643, snnv=13643, sndv=1, est_ndv=1, mnb=254 DBMS_STATS: Trying frequency histogram DBMS_STATS: Building Histogram for GROUP_ID DBMS_STATS: bktnum=-1, nnv=13643, snnv=13643, sndv=1, est_ndv=1, mnb=254 DBMS_STATS: Trying frequency histogram DBMS_STATS: Building Histogram for CONTACT_ID DBMS_STATS: bktnum=-1, nnv=13643, snnv=13643, sndv=13643, est_ndv=13643, mnb=254 DBMS_STATS: Trying height balanced histogram DBMS_STATS: Building Histogram for SEGMENT_DATE_JOINED DBMS_STATS: bktnum=-1, nnv=13643, snnv=13643, sndv=10749, est_ndv=10749, mnb=254 DBMS_STATS: Trying height balanced histogram DBMS_STATS: ==================================================================================================== DBMS_STATS: Statistics from clist: DBMS_STATS: ==================================================================================================== DBMS_STATS: Number of rows in the table = 13643 DBMS_STATS: COL_NAME ACL NV SNNVDV NDV SNDV DENS CCNT DBMS_STATS: ---------------------------------------------------------------------------------------------------- DBMS_STATS: CLIENT_ID 4 0 13643 1 1 .00003664 1 DBMS_STATS: min: Typ=2 Len=3: c2,18,63 DBMS_STATS: max: Typ=2 Len=3: c2,18,63 DBMS_STATS: Histograms: DBMS_STATS: ------------------------------------------------------------------------------------------- DBMS_STATS: BVAL EAVAL EDVAL DBMS_STATS: ------------------------------------------------------------------------------------------- DBMS_STATS: # Typ=2 Len=3: c2,18,63 Anyone investigating how DBMS_STATS works under the covers would probably find this quite useful. Here are a couple of other bits that will be of use should you wish to experiment with tracing DBMS_STATS This bit of SQL will create the table needed should you choose to use trace level 32. [code lang="sql"] create table sys.stats_target$_log as select t.*, rpad('X',30,'X') session_id, 1 state from sys.stats_target$ t where 1=0; [/code] The following SQL will provide some history of the options used. The SPARE1 column indicates if the option has ever been enabled. [code lang="sql"] set linesize 200 COLUMN sval1 ON FORMAT 999999 COLUMN sval2 ON FORMAT a36 COLUMN spare1 ON FORMAT 999999999999999999 head 'TRACE|ENABLED?' COLUMN spare2 ON FORMAT a6 COLUMN spare3 ON FORMAT a6 COLUMN spare4 ON FORMAT a40 COLUMN spare5 ON FORMAT a6 COLUMN spare6 ON FORMAT a6 select sname, sval1, sval2 ,spare1 --, spare2 --, spare3 , spare4 --, spare5 --, spare6 from SYS.OPTSTAT_HIST_CONTROL$ order by sname / [/code]
SummaryThere are a few lessons to be learned here:
- Setting an instance-wide parameter to deal with a problem can cause other issues that may be quite difficult to troubleshoot.
- Never underestimate the value of networking with your peers.
- Don’t think that by opening an SR you no longer have to work the issue – keep digging. It could be that no one else has seen this issue.
- Never give up. If one method of troubleshooting does not work, look for others. Not only will this help solve the problem, but you may just learn something new in the process.
- Solaris 11
- Oracle 11.2.03 in a three node RAC configuration