Pythian Blog

Which Risks Are You Protected From?

By: Alex Gorbachev

Kevin recently mentioned a very nice blog. I was going through some posts there, and this entry reminded me of a story. I’m sure many of you can recall similar cases.

I worked on one site for a while, and in 2.5 years it didn’t face a single media corruption of Oracle datafiles. Not that it’s a low-profile site - quite the opposite, and the storage infrastructure was set up very well there - no SPOF, mirroring inside SAN boxes and between boxes, redundant switches, HBAs, controllers, you name it - a DBA’s dream. Even change management procedures were followed thoroughly.

But one day, a fellow DBA (who is usually extremely cautious and reviews his actions at least twice) overwrote a controlfile with some crap. Even the fact that the controlfiles were on raw devices didn’t prevent this disaster. It was a trivial error, as we found out later - the DBA mistakenly swapped the arguments of a tar command (“tar cvf * file.tar” instead of “tar cvf file.tar *”), and tar happily used the controlfile as a tape device. :) End result - a 10-minute outage while I figured out what had happened, dd’ed a controlfile image from another mirror, and started the instance. By the way, it was a RAC database and, of course, RAC didn’t help - to the surprise of some managers.
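To see why the swapped arguments are so dangerous, here is a minimal sketch of what the shell hands to tar. It runs in a scratch directory with stand-in files; the names control01.ctl and notes.txt are hypothetical, and nothing here touches a real controlfile or actually runs tar. The point is only that the glob is expanded before tar starts, so the first matching file lands in the archive slot after “f”.

```shell
#!/bin/sh
# Scratch directory with stand-in files (hypothetical names, not real controlfiles).
set -e
scratch=$(mktemp -d)
cd "$scratch"
printf 'control data\n' > control01.ctl
printf 'other data\n' > notes.txt

# The shell expands * in sorted order before tar ever sees the command line.
set -- *
archive_target=$1

# Print the command tar would actually receive, without running it.
echo "tar cvf * file.tar  expands to:  tar cvf $* file.tar"
echo "archive would be written to: $archive_target"
```

With the arguments in the right order (“tar cvf file.tar *”), the slot after “f” is the literal name file.tar, and the expanded file list is only read, never written.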

So they were kind of protected by multiplexed controlfiles, even though recovery wasn’t transparent (wouldn’t it be nice if Oracle could survive the loss of a minority of multiplexed controlfiles - just like CRS with voting disks?). Interestingly, the online redo logs were not multiplexed, and recovery could have been a bit trickier had the current redo log been overwritten. The reason was that they already had quadruple mirroring, and people were blindly ignoring the human factor and Mr. Murphy - “it must be enough if we already mirrored it 4 times”.

What do we see? Well-implemented protection against one class of problems, while obvious threats from another side are ignored. Perhaps that is because all kinds of vendors make a fuss about their technology and its importance, while nobody draws attention to the areas that require little investment but are just as important, or even more so.

In my experience, human-factor risk is one area that is heavily underestimated most of the time.

So what are your stories?
