THE WORLD DISCUSSES #PYTHIAN ON TWITTER. HAVE A QUESTION? USE OUR HASHTAG AND ASK AWAY.

Upgrading to Fedora 12? You might need more /boot space!

Today, I had a spare Fedora 11 machine sitting next to me, so I thought I’d try the upgrade to the newly-released Fedora 12, aka “Constantine.” Fedora support cycles are rather short compared to Ubuntu, so Fedora 11 will likely be de-supported in 6 to 7 months. Normally I’d wait a little longer into the Fedora 12 cycle for others to find the fun upgrade bugs and have them fixed for me, but I didn’t mind having to re-install from scratch on this machine if I needed to.

Following the Fedora documentation, I decided to use the “preupgrade” tool. Everything was going smoothly until the machine restarted to begin installation of the new packages. I got a message that there wasn’t enough space in my /boot partition. Specifically, the message claimed that there was insufficient disk space in /mnt/sysimage/boot. I found this rather odd and troubling, since I had let the Fedora installer determine the /boot partition size when I originally installed Fedora 11.

Turns out that this is a known problem with the preupgrade tool. A kind soul in #fedora on IRC directed me to the list of common Fedora 12 bugs, in particular the preupgrade free space check. I installed the updated preupgrade package as directed, but again got the error. That’s when I followed the next link for additional tips to free up space in /boot. The first was to remove obsolete kernels, which I had already done. The next was to run tune2fs on /boot filesystem to free up reserved blocks, which aren’t needed for /boot. I strongly suggest you visit the links provided for helpful screenshots and commands to follow.

After making these changes, the upgrade worked and am I’m the proud owner of a Fedora 12 Constantine laptop, with a slightly brighter shade of blue desktop than that crusty old Fedora 11. ;)

HOWTO: Oracle Cross-Platform Migration with Minimal Downtime

I recently performed a migration from Oracle 10gR2 on Solaris to the same version on Linux, immediately followed by an upgrade to 11g. Both platforms were x86-64. Migrating to Linux also included migrating to ASM, whereas we had been using ZFS to hold the datafiles on Solaris. Restoring files into ASM meant we would have to use RMAN (which we would probably choose to use anyway).

As with many databases, the client wanted minimal downtime. It was obvious to us that the most time-consuming operation would be the restore and recovery into the new instance. We were basically doing a restore and recovery from production backups and archived redo logs. It quickly dawned on me that we could start this operation well before the scheduled cutover time and downtime window, chopping at least six hours from the downtime window. The client would only need to keep the new instance in mount mode after the initial restore/recovery finished, periodically re-catalog the source instance’s FRA (which was mounted via NFS), and then re-run the recover database command in RMAN. Once the time comes to cutover, simply archivelog current the original instance and shutdown immediate. Then open the new instance with the RESETLOGS option, and voila! Migration complete!

I’ll try to recreate a simple example here. Read the rest of this entry . . .

Moving Oracle Datafiles to a ZFS Filesystem with the Correct Recordsize

Full credit for this tale should go to my colleague Christo Kutrovsky for the inspiration and basic plan involved.

We recently migrated a large database from Solaris SPARC to Solaris x86-64. All seemed to go well with the migration, but in the next few weeks, we noticed some I/O issues cropping up. Some research led us to find that the ZFS filesystem used to hold the datafiles was killing us on I/O. The default “recordsize” setting for ZFS was 128k.

$ /usr/sbin/zfs get recordsize zfs-data
NAME      PROPERTY    VALUE     SOURCE
zfs-data  recordsize  128K      default

An Oracle database typically uses 8k for the block size, but in this case it was 16k. We saw basically the same thing that Neelakanth Nadgir described in his blog post, Databases and ZFS:

With ZFS, not only was the throughput much lower, but we used more [than] twice the amount of CPU per transaction, and we are doing 2x times the IO. The disks are also more heavily utilized. We noticed that we were not only reading in more data, but we were also doing more IO operations [than] what is needed.

The fix is to set the ZFS recordsize for a datafile filesystem to match the Oracle instance’s db_block_size. We also read in the ZFS Best Practices Guide that redo logs should be in a separate filesystem with the default ZFS recordsize of 128k. We already had them separate, so we just needed to get our datafiles on a ZFS filesystem with a 16k recordsize.

The first step is to change the ZFS recordsize for the current filesystem. Easy enough, but the change in recordsize will only apply to new files written there. It doesn’t do anything for our existing terabyte-plus of datafiles. This means we need to move the files to a new ZFS filesystem with the correct recordsize already set. That’s where the magic happens.

Read the rest of this entry . . .

Setting up Network ACLs in Oracle 11g… For Dummies

Having recently performed a test upgrade for a client from Oracle RDBMS 10g to 11g, I can tell you that one of the big changes that will likely require action on your part as DBA is the new fine-grained access control for the packages UTL_SMTP, UTL_TCP, UTL_MAIL, UTL_HTTP and UTL_INADDR. Part of the Oracle 11g pre-upgrade tool will notify you of users that will require new privileges.

Of course, Oracle’s post-upgrade network ACL setup documentation is much more confusing than it needs to be, at least for simple minds like me. A colleague stepped forward with a simple set of commands for a basic setup that even the tired and stressed can understand.

I’ll share that here, with some basic explanation:

Read the rest of this entry . . .

Turn Off db_cache_advice To Avoid Latch Contention Bugs

A couple of weeks ago, we noticed some timeouts in some of our standard Oracle RDBMS health check scripts on a new instance. I had just migrated this instance to bigger, better, badder hardware and so it had been given more SGA to use, namely a bigger buffer cache. The software version was still Oracle 10.2.0.2, as we wanted to introduce as few variables as possible (we were already moving to a new platform with an endian change).

At first the timeouts were infrequent, but over the course of a week started to grow in frequencey until the point where none of the checks were finishing in the allowed timeframe. We ran an AWR report, and tucked far down in the “Latch Activity” section, a colleague noticed this:

                                           Pct    Avg   Wait                 Pct
                                    Get    Get   Slps   Time       NoWait NoWait
Latch Name                     Requests   Miss  /Miss    (s)     Requests   Miss
------------------------ -------------- ------ ------ ------ ------------ ------
...
simulator lru latch          10,032,617    3.3    0.7  44950      336,837    0.3
...
Latch Activity                             DB/Inst: FOO/foo  Snaps: 156-157
-> "Get Requests", "Pct Get Miss" and "Avg Slps/Miss" are statistics for
   willing-to-wait latch get requests
-> "NoWait Requests", "Pct NoWait Miss" are for no-wait latch get requests
-> "Pct Misses" for both should be very close to 0.0

                                           Pct    Avg   Wait                 Pct
                                    Get    Get   Slps   Time       NoWait NoWait
Latch Name                     Requests   Miss  /Miss    (s)     Requests   Miss
------------------------ -------------- ------ ------ ------ ------------ ------
transaction branch alloc        112,412    0.0    0.0      0            0    N/A
undo global data                466,321    0.0    0.0      0            0    N/A
user lock                         7,440    0.8    0.4      1            0    N/A
          -------------------------------------------------------------

The “simulator lru latch” event brought us to MetaLink note 5918642.8 and bug 5918642. Affecting 10g and 11g prior to 10.2.0.4 and 11.1.0.7, respectively. The bug is with the database buffer cache advisor, controlled by the parameter db_cache_advice, which defaults to ON (depending on statistics_level). The note simply states:

High simulator lru latch contention can occur when db_cache_advice is set to ON if there is a large buffer cache.

We simply set db_cache_advice to OFF (thankfully it is a dynamic parameter), and pretty quickly our checks were running just fine.

My suggestion is to simply turn this off unless you are actively using the cache advisor to tune an instance. Once you are done tuning, and are no longer using the advisor, turn it off.

NOTE: As Mladen Gogola pointed out in the comments, turning this off will cause problems if you are using automatic memory management (i.e. sga_target > 0). Re-pasting his post here:

The problem with that advice is that it will prevent automatic memory management from resizing the buffer cache and the instance will end up with a huge, mostly empty, shared pool and default buffer cache. Automatic memory management is biased toward shared pool even with the cache dvice turned on, without it, buffer cache will be reduced to the minimum size, usually only 64MB. If you disable cache advice, I would also recommend disabling the automatic memory management and configuring SGA manually.

Sending Timezone-Aware Email with UTL_SMTP

I’m back again with another in what I hope will be a long line of “Quick Tips for Newbies” series.

At The Pythian Group, we have employees all over the globe, from our headquarters in Ottawa to regional offices in Boston, Prague, India and Sydney, and a few scattered remote workers in Seattle, Paris, Kiev, Brazil, South Africa and Wisconsin, among other places. In other words, we are spread across multiple timezones, and since it wasn’t too long ago that everyone was in Ottawa, this is something that still presents little quirks.

One such quirk involved email generated by one of our internal Oracle instances—via a stored procedure that used UTL_SMTP to send the messages—did not have timezone information in the “Date” email header. As a result, they would be stamped with the hour in Eastern timezone (Ottawa time), but the mail clients would think that hour was local. Depending on where you are relative to Ottawa, this could be many hours in the past or future. Of course, this wouldn’t be noticed if you were in Ottawa or even Boston, both in Eastern. For the rest, it was at the very least, an annoyance—but one that is easily fixed.

Read the rest of this entry . . .

ORA-16069? You May Need A New Standby Controlfile

On a recent Monday, I had to perform an emergency Oracle standby switchover for a client whose primary instance host had mysteriously rebooted itself over the previous day. Confidence in that host was, understandably, shaken.

The Oracle Data Guard configuration is a 3-instance setup using Data Guard Broker: one primary, we’ll call it OraA, feeding two standby instances, OraB and OraC. In this particular configuration, we perform switchovers between OraA and OraC. Caught in the middle is OraB, which is on a 60-minute standby delay.

After this particular switchover, OraB started complaining with this message in the alert log:

Read the rest of this entry . . .

RMAN Redundancy is not a Viable Retention Policy

The story you are about to read is based on actual events. Names and paths have been changed to protect the innocent. I call this scenario “The Perfect Storm” because it took just the right combination of events and configurations. Sadly, this doesn’t make it an unlikely occurrence, so I’m posting it here in hopes that you’ll be able to save yourselves before it’s too late.

I have always had a preternatural dislike for using REDUNDANCY as a retention policy for Oracle RMAN, greatly preferring RECOVERY WINDOW instead, simply because REDUNDANCY doesn’t really guarantee anything valuable to me, whereas RECOVERY WINDOW guarantees that I’ll be able to do a point-in-time recovery to anytime within the past x days. Plus, I had already been burned once by a different client using REDUNDANCY. With the story I’m about to tell, this dislike has turned into violent hatred. I’m going to be light on the technical details, but I hope you’ll still feel the full pain.

Read the rest of this entry . . .

GNU basename in PL/SQL

In the process of scripting a database migration, I was in need of something akin to the GNU basename utility that I know and love on Linux. basename is most famous for taking a full file path string and stripping away the leading path component, returning just the name of the file. This can be emulated in PL/SQL with calls to SUBSTR and INSTR, like this:

substr(dirname,instr(dirname,'/',-1)+1)

(Thanks to Ian Cary, who shared this logic on oracle-l)

As you can see, this simply finds the last occurence of /, which is our directory separator on *nix and Solaris operating systems. On Windows, it would be \. It then returns a substring beginning one character after that last separator until the end of the string. Voila, a basic basename routine!

Upon reading the basename man page again, I found that basename also takes an optional parameter, a suffix string. If this suffix string is provided, basename will also truncate that string from the end. For example:

Read the rest of this entry . . .

Ubiquity and Tahiti: Together At Last!

Almost everyone and their mum has been twittering about (or from) Mozilla Labs’ newest creation, Ubiquity. You can get a quick introduction of this firefox extension via their blog post, which has a video tour.

Well, after playing around with the simple stock commands (email, wikipedia search, twittering), I decided it was time for a handy Oracle search function. I emerged from my cave with a pretty basic Ubiquity command that will search the tahiti documentation, and optionally search the documentation for a specific version of Oracle from 9iR2 onward. Right now it will simply take you to the Oracle search results page. If I find that Oracle is providing an API to the tahiti search engine, I may enhance it to include results in the Ubiquity preview pane.
Read the rest of this entry . . .

Start NowWith Pythian - database design, management and emergency handling capabilities...

Live Updates

pythian: RT @sheeri: #confoo talk "Bending Queries to your Will with EXPLAIN" slides http://bit.ly/explainslides & handout
more



Testimonials

  • Serge Racine

    DBA, Brookfield Energy

    We are very satisfied by the service given to us by Andre and Shakir in support of our recent data quality and reorganization initiative.... more