Upgraded to 11GR2? Congrats! You Are in Direct Reads Trouble!

Yury Velikanov

November 14, 2012

I made this blog title intentionally provocative. However, it isn’t far from the truth. Don’t believe me? Continue reading. :)

Emergency call

I came up with the graph below to demonstrate the problem. Yesterday I was called in on an emergency to help a desperate client bring the performance of their Oracle E-Business Suite system under control. After several hours of research, we found the root cause of the problem. The graph shows direct reads as a percentage of all physical reads. You might say that looking at aggregated data and percentages is bad style and a waste of time. Well, I will leave it to you to decide. I think the graph demonstrates the problem very well, and this is all that counts.

You do not need to be a scientist to tell when the system was upgraded to version 11.2.0.3. It is kind of obvious from the graph, right? Well, it took us a bit of time to find where the problem was. To save you some time, I put together a SQL script that you can use to build a graph just like this one.

Did you migrate your Oracle database to 11.2.0.2 or a higher version recently? “Are you ready to see how deep the rabbit hole goes?” (c) Matrix. Here you go. :)
direct_reads_11g_problem.sql
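The script itself is linked rather than reproduced here. For readers who want to see the idea in query form, the following is a minimal sketch of my own, not the author's direct_reads_11g_problem.sql. It assumes a single-instance database with no restart inside the AWR retention window (a restart resets the cumulative counters), and it relies on the 'physical reads' and 'physical reads direct' statistics stored in DBA_HIST_SYSSTAT.

```sql
-- Sketch only: NOT the author's direct_reads_11g_problem.sql.
-- AWR views require the Diagnostics Pack licence.
-- Assumptions: single instance, no instance restart in the AWR window.
WITH stats AS (
  -- Pivot the two cumulative counters onto one row per snapshot
  SELECT sn.snap_id,
         sn.end_interval_time,
         MAX(CASE WHEN ss.stat_name = 'physical reads'
                  THEN ss.value END) AS phys_reads,
         MAX(CASE WHEN ss.stat_name = 'physical reads direct'
                  THEN ss.value END) AS direct_reads
  FROM   dba_hist_snapshot sn
  JOIN   dba_hist_sysstat  ss
         ON  ss.snap_id         = sn.snap_id
         AND ss.dbid            = sn.dbid
         AND ss.instance_number = sn.instance_number
  WHERE  ss.stat_name IN ('physical reads', 'physical reads direct')
  GROUP  BY sn.snap_id, sn.end_interval_time
),
deltas AS (
  -- Counters are cumulative, so subtract the previous snapshot's value
  SELECT snap_id,
         end_interval_time,
         phys_reads   - LAG(phys_reads)   OVER (ORDER BY snap_id) AS phys_reads_delta,
         direct_reads - LAG(direct_reads) OVER (ORDER BY snap_id) AS direct_reads_delta
  FROM   stats
)
SELECT end_interval_time,
       phys_reads_delta,
       direct_reads_delta,
       ROUND(100 * direct_reads_delta / phys_reads_delta, 1) AS direct_reads_pct
FROM   deltas
WHERE  phys_reads_delta > 0   -- skips the first snapshot and any counter reset
ORDER  BY snap_id;
```

Plotting direct_reads_pct against end_interval_time should give a curve comparable to the first graph. On RAC, the query would need to be partitioned by instance_number as well.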

If you see a significant change in the pattern after an upgrade to 11g, you may be interested in the next set of statistics.

The blue line represents the Physical Reads statistic. The red line is the Direct Reads. Did you notice that the physical read volume increased from 100MB/s to 400MB/s? Well, my client didn’t just notice it. An expensive storage array’s resources were saturated, and the whole user community was paralyzed for several days. For a good portion of the first day, the client worked with the storage vendor, as all other Oracle I/O (e.g. log file sync, db file sequential read, db file scattered read) had become very slow. Many hours of work with Oracle Support followed. After running out of ideas, the client called us. If we had had the SQL handy, we would have avoided a day of troubleshooting and focused on how to fix the issue. Keep the script handy. Who knows, it may save you some troubleshooting effort.

Another recently migrated client

Following up on the emergency call’s results, out of curiosity, I decided to check the environment of another client who had recently migrated to version 11.2.0.3. Have a look. I don’t think I need to say much.

And the Direct Reads contribution to total Physical Reads graph for the same database looks like the following:

Quick fix

We are still discussing how to resolve the issue for good. The following two actions have been implemented as a quick fix to give a bit of relief to the system’s user community.

Disable the new serial direct reads functionality introduced in version 11.2.0.2. Google the event number and you will find several related blog posts, including Tanel’s.

SQL> alter system set events '10949 trace name context forever';

[see my comment below] ~~Switch off~~ Automatic Memory Management. IMHO, it should be switched off for any serious Oracle database. It introduces more problems than it adds value.

High 'direct path read' waits in 11g [ID 793845.1]
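For completeness, the event from the first action can be switched back off, or carried across instance restarts, along the following lines. This is a sketch of the usual event syntax, not a procedure tested on the client's system. The explicit `level 1` and the `scope = spfile` form are my assumptions. Note also that setting the `event` parameter replaces whatever events are already stored in the spfile, so check the current value first.

```sql
-- Revert the quick fix in the running instance.
alter system set events '10949 trace name context off';

-- "alter system set events" is held in memory only and is lost at restart.
-- Assumption: to keep the event across restarts, store it in the spfile
-- (takes effect at the next startup; overwrites any existing EVENT setting).
alter system set event = '10949 trace name context forever, level 1' scope = spfile;
```

Both forms are best tried on a test system before production, since the event changes how full scans are executed.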

NOTES:

Be aware that the script is AWR based. Therefore, a database must have a Diagnostic licence in order to use it.
If you don’t have the Diagnostic licence, just use the idea and convert it to STATSPACK. (Don’t forget to share it with me and others!) :)
Please do not hesitate to drop me a message if you spot any mistakes (I know I do those all the time) or have related comments.

Let me know if the script I have introduced to you in this blog post helped you to spot something interesting in your system.

======================================================

Added by Yury on 2012.11.24. As a result of internal discussion with Pythian folks (special thanks to Marc Billette):

Additional Comments related to this blog post

The percentage of direct path reads (DPR) doesn’t mean anything by itself. A high number could be a good or a bad thing, depending on the system. The answer, as always: It depends. :)
However, in my clients’ case, the DPR increase was accompanied by SIGNIFICANT physical IO increase (see my second graph).
As the change in behavior happened right after the 11g upgrade, it was enough to point us to the temporary fix.
A single serial DPR is more efficient (cheaper in terms of resources) than a traditional Oracle physical read (db file sequential read or db file scattered read).
Given that a process MUST read X physical blocks and none of the table’s blocks are cached at the moment, the DPR will be faster and use fewer resources on the server.
However, the DPR doesn’t cache anything. Therefore, none of the other processes can share the work the process did.

On Exadata

Exadata is designed to perform a lot of DPRs so that it can offload data processing to the cell hosts.
In fact, I think the new serial direct reads decision mechanism was implemented because of, and for, Exadata (just guessing here).
High DPR is just an indication that an environment doesn’t use buffer cache. That’s all. Is it good or bad? As always, it depends. :)

Just an idea

Here is what I would do to verify whether DPRs are good for your environment. Look for the smallest tables that processes read via DPR. If a table isn’t too big (e.g. ~5% of your buffer cache) but is very important to the application (e.g. CLIENTS_CHARGE_PLANS), have a look at what percentage of the system’s total physical IO that table generates. If you discover that this table generates 98% of physical IO (or even 30-50%), you may cut that IO by ensuring that data from this table is cached in the buffer pool (keep pool, cache attribute, etc.) and that foreground processes don’t make DPRs against that table (_%small% threshold parameter tuning).
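A rough way to start that analysis from AWR is sketched below. It is an illustration under assumptions, not a finished diagnostic: the seven-day window and the top-20 cut-off are arbitrary, `app_owner` is a hypothetical schema name, and AWR only records statistics for the busiest segments in each snapshot, so the percentage is a share of captured segment reads rather than of all physical IO in the system.

```sql
-- Sketch only. AWR views require the Diagnostics Pack licence.
-- Lists the segments with the most direct reads over the last 7 days.
SELECT *
FROM  (
  SELECT o.owner,
         o.object_name,
         o.object_type,
         SUM(s.physical_reads_delta)        AS phys_reads,
         SUM(s.physical_reads_direct_delta) AS direct_reads,
         -- share of the physical reads captured at segment level
         ROUND(100 * RATIO_TO_REPORT(SUM(s.physical_reads_delta)) OVER (), 1)
                                            AS pct_of_captured_reads
  FROM   dba_hist_seg_stat     s
  JOIN   dba_hist_seg_stat_obj o
         ON  o.dbid     = s.dbid
         AND o.ts#      = s.ts#
         AND o.obj#     = s.obj#
         AND o.dataobj# = s.dataobj#
  JOIN   dba_hist_snapshot     sn
         ON  sn.snap_id         = s.snap_id
         AND sn.dbid            = s.dbid
         AND sn.instance_number = s.instance_number
  WHERE  sn.begin_interval_time >= SYSDATE - 7   -- arbitrary window
  GROUP  BY o.owner, o.object_name, o.object_type
  ORDER  BY direct_reads DESC NULLS LAST
)
WHERE  ROWNUM <= 20;

-- If a small, hot table stands out, one of the caching options mentioned
-- above is the keep pool. app_owner is a hypothetical schema, and
-- db_keep_cache_size must be sized for the table beforehand.
ALTER TABLE app_owner.clients_charge_plans STORAGE (BUFFER_POOL KEEP);
```

Assigning a table to the keep pool does not by itself stop serial direct reads against it, so re-run the first query after the change to confirm that the direct read figures for that table have actually dropped.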

P.S. If you are interested, I can help with this analysis (as I am interested myself). ;)
