Upgraded to 11GR2? Congrats! You Are in Direct Reads Trouble!

Nov 14, 2012 / By Yury Velikanov

Tags: , , ,

I made this blog title intentionally provocative. However, it isn’t far from the truth. Don’t believe me? Continue reading. :)

Emergency call

I came up with the graph bellow to demonstrate the problem. I was called on an emergency to help a desperate client to take their Oracle E-Business Suite system’s performance under control yesterday. After researching for several hours, we found the root cause of the problem. The graph reflects the correlation between direct reads percentage and all physical reads. You would say that looking at aggregated data and percentages is a bad style and a waste of time. Well, I will leave it to you to decide. I think the graph demonstrates the problem very well, and this is all that counts.

You do not need to be a scientist to tell when the system has been upgraded to the 11.2.0.3 version. It was kind of obvious looking at the graph, right? Well, it took us a bit of time to find where the problem was. To save you some time, I put together a SQL that you can use to build a graph just like this one.

Did you migrate your Oracle database to 11.2.0.2 or a higher version recently? “Are you ready see how deep the rabbit hole goes?” (c) Matrix. Here you go. :)
direct_reads_11g_problem.sql

If you see a significant change in the pattern after an upgrade to 11G, you may be interested in the next set of statistics.

The blue line represents the Physical Reads statistic. The red line is the Direct Reads. Did you notice that Physical read volume increased from 100MB/s to 400MB/s? Well my client didn’t just notice it. An expensive storage array’s resources got saturated and the whole user community was paralyzed for several days. A good portion of the first day, the client worked with a storage vendor as all other Oracle IOs (e.g. log file sync, db file scattered read, db file scattered read, etc.) became very slow.  Many hours of work with Oracle Support followed. After running out of ideas, the client called us. If we would have had the SQL  handy, we would have avoided 1 day of troubleshooting efforts and focused our efforts on how to fix the issue. Keep the script handy. Who knows, it may save you some troubleshooting efforts.

Other recently migrated client

Following up on the emergency call’s results, out of curiosity, I decided to check another client’s environment that recently migrated to 11.2.0.3 version. Have a look. I don’t think I need to say much.

And the Direct Reads contribution to total Physical Reads graph for the same database looks like the following:

Quick fix

We are still discussing how to resolve the issue for good. The following two actions have been implemented as a quick fix to give a bit of relief to the system’s user community.

    • Disable a new functionality related to Direct reads enhancement introduced in the 11.2.0.2 version. Google the event number and you will find several related blog posts including Tanel’s blog post here.
SQL> alter system set events '10949 trace name context forever';
    • [see my comment bellow] Switch off Automatic Memory Management. IMHO, it should be switched off for any serious Oracle database. It introduces more problems than it adds value
High 'direct path read' waits in 11g [ID 793845.1]

NOTES:

  • Be aware that the script is AWR based. Therefore, a database should have a Diagnostic licence in order to use it
  • If you don’t have the Diagnostic licence, just use the idea and convert it to STATSPACK. (Don’t forget to share it with me and others!) :)
  • Please do not hesitate to drop me a message if you spot any mistakes (I know I do those all the time) or have related comments.

Let me know if the script I have introduced to you in this blog post helped you to spot something interesting in your system.

======================================================

Added by Yury on 2012.11.24. As a result of internal discussion with Pythian folks (special thanks to Marc Billette):

Additional Comments related to this blog post

  • % of DPR doesn’t mean anything. A high number could be a good or bad thing for different systems. The answer, as always: It depends. :)
  • However, in my clients’ case, the DPR increase was accompanied by SIGNIFICANT physical IO increase (see my second graph).
  • As the change in behavior happened right after the 11G upgrade, it was enough to point us to the temporary fix.
  • A single serial DPR is more efficient (cheap from resources point of view) than traditional Oracle physical Read (db file sequential read or db file scattered read).
  • If it is given that a process MUST read X physical blocks and none of table blocks are cached at the moment, the DPR will be faster and utilize less resources on the server.
  • However, the DPR doesn’t cache anything. Therefore, none of the other processes can share the work the process did.

On Exadata

  • The Exadata is designed to make a lot of DPR to leverage data processing offloading to cells hosts.
  • In fact, I think the new serial direct reads decision mechanism was implemented  because and for the Exadata (just guessing here).
  • High DPR is just an indication that an environment doesn’t use buffer cache. That’s all. Is it good or bad? As always, it depends. :)

Just an idea

Just an idea of what I would do to verify if DPRs are good in your environment. Look for smallest tables processes read via DPR. If the size of the table isn’t too big (e.g. ~5% of your buffer cache), but way too important for the application (e.g. CLIENTS_CHARGE_PLANS), have a look at what the total % of physical IO the table generates compared to the total amount of physical IO in the system. If you discover that this table generates 98% of physical IO (or even 30-50%), you may cut it by ensuring that data from this table is cached in buffer pool (keep pool, cache attribute, etc.) and that foreground processes don’t make DPR against that table (_%small% threshold parameter tuning).
P.S. If you are interested, I can help with this analysis (as I am interested myself). ;)
View Yury Velikanov's profile on LinkedIn

23 Responses to “Upgraded to 11GR2? Congrats! You Are in Direct Reads Trouble!”

  • Gwen Shapira says:

    Also great example to the benefits of visualizing your performance metrics.

  • Noons says:

    +1 to disabling automatic memory management. Other than very small, inconsequential, test databases, I have yet to see one instance where it is worth even considering!
    The thing that has to be thought out is this: automating-this-or-that does not immediately imply that the target will be ideally tuned!
    “Automatic management” is *NOT* a synonym for “automatically tuned”! Far from it.

    Automation takes some time to recognize a change in conditions that warrants a change in parameters. In turn, that change of parameters may involve extensive re organization of the memory and its consequent use by other sections. That consequent change in turn may trigger other boundaries that cause further automation changes, with the consequence that the system may easily be chasing its tail trying to self-tune for a situation that *might* have been relevant hours ago!
    Add to that an application that might work on daily cycles like e-business suite, and you got a test case for wild oscillations.

    In electronics, this situation is called a self-oscillation: a condition where control feedback in a closed system reaches and affects the input processing waaaaay after it should have done so. And the system promptly enters into potentially destructive oscillation. I say potentially destructive because that might be the intention: oscillators and wave generators work exactly on such – self-oscillation through in-phase feedback, also called positive feedback.
    As opposed to out-of-phase feedback which acts to control a system and not let it oscillate outside of a small range.
    Unfortunately, most current “automated” software setups of late seem to suffer from a total lack of knowledge of their developers of these basic engineering principles. ANY closed system is susceptible to oscillation through feedback, and software is certainly no exception.
    I have yet to see ANY of my 11.2.0.3 databases have ANY I/O problems, quite the contrary in fact. But then again, there is not a single one of them that is using automatic memory management.

  • As rightly mentioned by Martin Berger (@martinberx) on tweeter http://bit.ly/RT5f1l the AMM wasn’t really disabled.

    We just increased minimal limits for shared_pool_size=5G and db_cache_size=38G as the sga_target was still set to the higher value AMM was still enabled.

    On the other hand it looks we could implement one of the changes to fix the issue as
    – setting the event disables the Direct Reads related feature introduced in 11.2.0.2
    – increasing db_cache_size impacts on the default default threshold value (2%) for the small tables

    Well in this case the team didn’t want to take any changes and therefore both changes found their way in prod.

    Yury

    PS Now we are advising/planning to disable AMM in this envrionment for good in few weeks time :)

  • Ajith Narayanan says:

    Hi Yury,

    This post was informative great troubleshooting, just few additions

    ASMM Enabled in large Oracle Apps environment indicates that ERP runs on a vanilla installation
    above which there would be tonnes of application customizations that will make the ASMM confused
    or mislead it with its ever confusing workload, In Prod never recommended, Sholud take the memory
    advisories suggestion to set the SGA manually, and non-zero value on sga_target will ignore the
    manual AMM settings, So sga_target=0 will be good :)

    Whereever there are such vanilla instances running the below is one common mistake that goes unnoticed
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    And, I most cases there would be GATHER_STATS_JOB(default with 10g & above) enabled instead of
    FND_STATS, If its running on a vanilla installation the performance degrades again.

    If the above case is true, then it should be disabled using below command
    exec DBMS_SCHEDULER.DISABLE(‘GATHER_STATS_JOB’);

    Below should be the concurrent program that should be enabled in that case.

    Name of the Concurrent / DBMS job: Gather Schema Statistics
    Executable Name/short Name: FNDGSCST
    Schedule / Submit Interval: , From start of prior run.
    Parameter Reference:
    Schema Name=ALL
    Estimate Percent=10
    Degree=3
    Backup Flag=NoBackup
    History Mode=None
    Gather Option=Gather
    Keep other parameters as defaulted.

  • lyxx says:

    Hi, thanks for sharing. can you please share how did you create these graphs out of the sql you shared ?

  • Nice article , thank you for your effort

  • Noons says:

    Maybe not necessary to disable AMM, Yury. Pegging the minimal amount of db cache and minimal amount of shared memory will ensure things don’t oscillate violently between too wide values. You’re simply forcing the AMM to not go bersek and cause other issues. “Putting on the brakes”, so to speak.
    I think you guys went the right way with these changes, although likely I wouldn’t have entirely disabled the direct reads – they might serve a good purpose in some cases. Putting the “brakes” on AMM should be enough to keep things under control. Definitely keep us posted on the results once you turn AMM off completely and any observations after that.

    • Andjelko Miovcic says:

      I’m agree, “not necessary to disable AMM”, setting minimum value for key memory pools can help, and always “set up” KEEP or RECACLE pool to appropriate size, they will not be adjusted by AMM and not all memory pools are automatically resized by AMM.
      What actually I hate with AMM I can’t lock the SGA to prevent “paging out” to disk, in rare situation it could be handy. Honestly I never use this in production, but I like to try in QA for instance
      I like to enable AMM, the potential performance gain from AMM are significant, but everything depending on “workload”. I recently disable AMM on one instance, simply “workload” contains a huge number of sort & read intensive operations and causing shifting memory between SGA and PGA very frequently, and then single session is taking a huge amount of memory, I’m fed up with this and disable AMM.
      In my opinion, it is not bad idea to enable AMM, the potential performance gain from AMM is significant, but monitoring and “setting minimum size for certain memory area” is extremely important.
      By the way compatibility could be issue like AMM and LINUX &HugePages, …
      So lot of plus and minus, nothing is perfect, but I’m not agreeing that AMM is useless.

      Andjelko Miovcic

  • It’s a feature that I’m unconvinced about.

    I understand that benefit that it’s meant to deliver but in reality I’ve seen it cause more problems that it resolves (but that’s because we get called about the problems, right?).

    +1 for turning off AMM regardless. It probably works well on perfect systems but who’d know?

    +1 for increasing the size of the buffer cache where possible. That’s my advice.

    The last significant upgrade to 11gR2 that I was involved involved multiple testing cycles eventually led to a 4x increase in the BC – not just because of direct path reads – but that increase meant direct path was generally fine when used.

    Should really be picked up by proper testing cycles in advance of migration.

    Similar story from Coskan:
    http://coskan.wordpress.com/2011/08/15/what-happens-when-you-did-not-do-a-load-test-before-release-upgrades/

    Of course, the other factor of the feature is the “adaptive” part of “adaptive serial direct reads” where the behaviour can adapt to what is actually cached.

    I blogged observations on this in a prod system where the plan didn’t change, the buffer cache size didn’t change, just the amount of data cached (probably) changed:
    http://orastory.wordpress.com/2012/06/19/flippin-witch-hunt-adaptive-direct-path-read/

  • Paul Aitken says:

    I’ve seen this happen in two large databases that have been upgraded to 11GR2.

    What seems to happen is that memory is very heavily allocated to the PGA and you end up with a very small SGA as a result. Hence you end up with ever increasing Direct Reads.

    You can set SGA_TARGET and PGA_TARGET (which are minimum values) so that you leave only a small portion of your MEMORY_TARGET for the database to automatically manage.

    I have also seen a suggestion (which I have not tested yet) that setting Dead Connection Detection will mitigate this problem. The reasoning being that orphaned sessions in the database that consumed a lot of PGA memory are not being cleaned up, so you end up with more memory being allocated to the PGA over time unnecessarily .

    • Hey Paul. As I am working with EBS environments and those tends to be relatively important to the clients (read gets a bit more DBA time than others) I don’t use MEMORY_TARGET at all and probable will not use in the nearest future. One of the disadvantages of MEMORY_TARGET is the fact that you can’t use Huge Pages in such configuration. IMHO: For SGAs over 5-10GB it starting to impact performance significantly.

  • patrick says:

    I never saw anything with reads, however I did find that oracle took way more memory that it was assigned in 11.2+, even when memory target is set to around 500M, the processes for the database would spill well over that value.

    Turning AMM off didn’t fix the issue.

    This was on AIX 6.

  • Ramanathan.V says:

    In our customers place the oracle version is 11.2.0.2. Frequently the error ‘program unit not found’appears when our APIs are invoked. Observed that some objects are uncompiled and when compiled it gets compiled automatically. There is no change in user objects done. What could be the reason? Please help.

  • sudershan virdi says:

    The issue with KEEP pool leading to direct I/O was killing the performance for us.

    Due to this bug, tables with size >10% of cache size, were being treated as ‘large tables’ for their reads and this resulted in execution of a new SERIAL_DIRECT_READ path in 11g.

    KEEP BUFFER POOL Does Not Work for Large Objects on 11g [ID 1081553.1]

  • I. Gorbatovsky says:

    Hi Yury,
    Thanks for your article! I was curious to try your case on my fresh migrated (11.2.0.3) environment. I saw completely opposite picture. Direct path read felt 3 times. All memory parameters tuned manually, though. That could be the case.

  • Darryl Griffiths says:

    This makes for an interesting read.
    I’m wondering if the effects of the “direct path” reads is more profound on systems where the O/S file system mount point options have been “tuned” to minimise O/S file system cache usage.
    That would provide another penalty in performance. Not only is the work not shared at the DB level cache, but it wouldn’t be shared at the O/S file system cache either.
    Maybe Oracle whitepapers should be rewritten to *not* suggest adjusting the O/S file system mount point settings with relation to file system cache if Oracle are going to use more “direct path” reads.

  • Vsevolod Afanassiev says:

    Hi Yury,

    We also encountered this issue. An application has a medium-size table (800 MB) that gets accesses through full table scan approx 1,000 times per day (application design issue). With db_cache_size set to 7 GB this full table scan performs zero disk reads and takes 0.5 – 0.6 second. As records are added to the table it grows and at some point it exceeded 5 * _small_table_threshold and switched to direct reads. So overnight performance deteriorated from 0.5 second to 30 seconds per execution. This feels even worse then post-upgrade problem: at least one expects problems after upgrade, but who expects problems as a result of table growing from 800 MB to 810 MB? If Oracle feels the need to “promote” direct reads it should have been done gradually, without changing default behaviour.

  • Vsevolod Afanassiev says:

    Regarding disabling AMM: there are some reasons for keeping AMM enabled. Consider Oracle 10.2 that has ASMM. It is enabled when SGA_TARGET > 0.
    Reasons for keeping ASMM enabled:
    - Ability to add memory to SGA dynamically. One some platforms like AIX if SGA_TARGET < SGA_MAX_SIZE Oracle allocates memory according to SGA_TARGET.
    Then if memory is dynamically added to LPAR we could dynamically add memory to SGA.
    - Flexibility: a DBA may configure SGA for a given workload, but workloads tend to change. In a perfect world DBA will have opportunity to re-configure SGA, but it doesn't always happen. For example, we have databases with large number of connections (5,000). These connections go through shared server and use Large Pool. Sometimes there is a spike in the number of connections, with ASMM enabled Large Pool grows accordingly.
    - Memory leaks inside SGA: Oracle has bugs that cause memory leaks inside SGA, typically in shared pool (in one sub-pool of the shared pool). For example, 9.2 had a leak that was triggered by TRUNCATE or DROP operations and caused growth of 'miscellaneous' sub-heap of the pool. 11.1 has a leak that causes growth of CCursor. 11.2 has a leak that causes growth of PRTMV sub-heap.
    By enabling automatic re-configuration of SGA we give ourselves more time for dealing with these leaks.

  • Phani Pendurthi says:

    Did anyone come across “direct path write” after the DB upgrade to 11.2.0.4 ? This is killing our database performance , huge I/O waits,our response times were increased a lot .

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>