THE WORLD DISCUSSES #PYTHIAN ON TWITTER. HAVE A QUESTION? USE OUR HASHTAG AND ASK AWAY.

Production Advice for Developers (3 short pieces)

I’ve spent over a decade managing various production systems. After spending so much time with systems that are mostly not working as they should, one develops certain outlook on life. Like the deep belief that the only way to keep civilization functioning is by taking backups and testing them.

Recently I had few discussions with developers, and it turned out that ideas that I consider trivial can be viewed as deeply insightful by someone with different experience. Call that ideas arbitrage – goods that are common in one environment are valuable in another. Good hummus costs around 2$ in Tel-Aviv, while in Palo Alto I’ll be happy to pay many times that price if anyone would sell it!

Problem:
I’m tired of spending time chasing pseudo-bugs that are actually problems in configuration. Things keep changing in QA, staging and production servers and I have to keep figuring out why my code no longer works

Solution:
Figure out what your program needs in order to run properly. Memory, permissions, settings in files, etc. Then write a script that verifies all that.

This has 3 benefits:

  1. Sysadmins can check that the environment is correct before installing the software and before calling you for help.
  2. If something goes wrong, you can ask the sysadmin to run your script and send you the results – assisting in faster resolution.
  3. If the script doesn’t catch an issue and you have to spend hours debugging a broken configuration, you can later modify the script to catch the new issue and you’ll never have to spend those hours again.

Problem:
I’m designing a new search system. Should I use Oracle or Voldermort?

Solution:
Since you are asking me for advice, my guess is that neither of these solutions have a single compelling feature or limitation that make the decision clear-cut.
Therefore, go with the technology you understand better. Imagine yourself, a year in the future and the system just returned a wrong result. Which system you’ll find easier to solve the problem? Obviously the one you understand better, and the one that has troubleshooting tools you are more comfortable with.

Go with that one.

Problem:

My users complain about a performance problem. My system is memory bounded and I believe that adding more memory will solve the issue. I opened a ticket for our sysadmins to add memory, but they have been ignoring my ticket and nothing was done. Now I’m blamed for not solving the performance issue!

Solution:
Here are several possible solutions, ordered from recommended to highly risky:

  1. Update the ticket and ask for ETA.
  2. Use whatever internal process you have to escallate the ticket or make it higher priority.
  3. Make friends with a sysadmin and ask your friend to check the ticket and help you.
  4. Complain to your manager
  5. Complain to the manager of the sysadmin
  6. Go to the sysadmin cube and ask him nicely when the ticket will be ready
  7. Go to the sysadmin cube and yell
  8. Ask for a meeting or conference call involving more than 2 managers.

Replication is the New Durability (Thoughts About Dynamo)

“Dynamo: Amazon’s Highly Available Key-value Store” is a high level description of a data store, written by Amazon to solve the problem of a system where updates must never ever fail and must take less than a specific amout of time in 99.99% of the cases. No matter what happens to the servers or the network, updates to the system must continue as usual, and they emphasize that they deal with hardware and network failures nearly constantly.

The paper has one of the best descriptions on the trade-offs involved in eventual consistency, and when it makes sense. But even more interesting is the implicit decision that disks, commits and synchronous writes to redo logs are not really needed for durability.

To allow for simple design, robustness and high performance, Amazon limited the reporting capabilities – data is accessed by primary key only. To implement the high availability requirement, Dynamo replicates each key-value pair between multiple machines in different data stores. And to make sure that replication doesn’t impact performance, Amazon decided they can live without consistency – requests from the system can sometimes return old data or several contradicting versions of data – for Amazon’s requirements, this is much better than failure or delays.
Read the rest of this entry . . .

Environment Variables in Grid Control User Defined Metrics

Emerson wrote: “Foolish consistency is the hobgoblin of small minds”. I love this quote, because it allows me to announce a presentation titled “7 Sins of Concurrency” and then show up with only 5. There are places where consistency is indeed foolish, while other times I wish for more consistency.

Here is a nice story that illustrates both types of consistency, or lack of.

This customer Grid Control installed in their environment. We were asked to configure all kinds of metrics and monitors for several databases, and we decided to use the Grid Control for this. One of the things we decided to monitor is the success of the backup jobs.
Read the rest of this entry . . .

It’s the End of the World As We Know It (NoSQL Edition)

Everyone knows that seminal papers need a simple title and descriptive title. “A Relational Model for Large Shared Data Banks” for example. I think Michael Stonebraker overshot the target In a 2007 paper titled, “The End of an Architectural Era”.

Why is this The End? According to Michael Stonebraker “current RDBMS code lines, while attempting to be ‘one size fits all’ solution, in face, excel at nothing. Hence, they are 25 years old legacy code lines that should be retired in favor of a collection of ‘from scratch’ specialized engined”.

He makes his point by stating that traditional RDBM design is already being replaced for a variety of specialized solutions: Data-warehouses, streams processing, text and scientific databases. The only uses left for RDBMS is OLTP and hybrid systems.
Read the rest of this entry . . .

Notes on Learning MySQL (as an Oracle DBA)

I spent some time last month getting up to speed on MySQL. One of the nice perks of working at Pythian is the ability to study during the workday. They could have easily said “You are an Oracle DBA, you don’t need to know MySQL. We have enough REAL MySQL experts”, but they didn’t, and I appreciate.

So how does an Oracle DBA goes about learning MySQL?
Obviously you start by reading the docs. Specifically, I looked for the MySQL equivalent of the famous Oracle “Concepts Guide”.
Unfortunately, it doesn’t exist. I couldn’t find any similar overview of the architecture and the ideas behind the database. The first chapter of “High Performance MySQL” had a high level architecture review, which was useful but being just one chapter in a book, it lacked many of the details I wanted to learn. Peter Zaitsev’s “InnoDB Architecture” presentation had the kind of information I needed – but covered just InnoDB.

Thats really too bad because I definitely feel the lack – which I can easily tell you what Oracle does when you connect to a database, run a select, an update, commit or rollback – I can’t say the same about MySQL. So far I managed without this knowledge, but I have a constant worry that this will come back and bite me later.
Read the rest of this entry . . .

Cloning Oracle Home from RAC to Stand-Alone

This should have been the easiest task on my todo list: Install Oracle 10.2.0.3 EE standalone on a new Linux RHEL 5 server, later to be used as a standby for a production RAC system. This means 2 lines of “runinstall -silent …”, less than 5 minutes of DBA work and maybe 20 minutes of waiting. I did not expect to spend over 5 hours doing this.

Problems started when I discovered that I don’t have the 10.2.0.3 patchset and another patch that exists on production and should be installed on the standby. I had to wait for my Metalink credentials to be approved for this customer CSI before I could download the patches for them.

“Why don’t you just clone the software from production?” asked a helpful colleague.

Read the rest of this entry . . .

BAAG, Best Practices and Multiple Choice Exams

I’ve been following the discussion in various MySQL blogs regarding the sort_buffer_size parameters. As an Oracle DBA, I don’t have an opinion on the subject, but the discussion did remind me of many discussions I’ve been involved in. What’s the best size for SDU? What is the right value for OPEN_CURSORS? How big should the shared pool be?

All are good questions. Many DBAs ask them hoping for a clear cut answer – Do this, don’t do that! Some experts recognize the need for a clear cut answer, and if they are responsible experts, they will give the answer that does the least harm.

Often the harmless answer is “Don’t touch anything, because if you have to ask this question you don’t have the experience to make the correct decision”. As Sheeri noted, it is a rather patronizing answer and it is stands in the way of those who truly want to learn and become experts.
Read the rest of this entry . . .

On the Difficulty of Data Migrations (Especially to NoSQL Databases)

I’ve been reading a lot of NoSQL blogs recently, and one thing that bothers me is that many of the leading NoSQL bloggers seem to have very different experience in operations that I’ve had.

Here’s an example:
Over at the O’Reilly community blogs, Andy Oram interviewed two MongoDB experts about migrating from a relational databases to MongoDB.

Here’s what the experts said:

” 1. Get to know MongoDB. Download it, read the tutorials, try some toy projects.
2. Think about how to represent your model in its document store.
3. Migrate the data from the database to MongoDB, probably simply by writing a bunch of SELECT * FROM statements against the database and then loading the data into your MongoDB model using the language of your choice.
4. Rewrite your application code to query MongoDB through statements such as insert() or find().

OK, so which step do you think takes the longest? And the answer is…step 2. Design is critical, and there are trade-offs that provide no simple answers but require a careful understanding of your application. Migrating the data and rewriting the application are straightforward by comparison. “

I’ve never migrated anything to MongoDB, but I was involved in the migration of a large application from SQLServer to Oracle. Both are relational databases so there was almost no need to rethink the data model. The rewrite and the migration took over two years, with significant bugs discovered and fixed up to the last week. The majority of the time spent on migration. None of it was done by “simply by writing a bunch of SELECT * FROM statements against the database”.

We did not lack expertise – we had plenty SQLServer and Oracle developers and DBAs with 10+ years of experience. Note that no one has 10 years of MongoDB experience.

I don’t doubt that modeling is critical and the trade-offs are always difficult, but I’ve yet to see a modeling phase that took more than rewrite + migration of large applications with big data. Note that large applications and big data are the target customers of NoSQL databases, so I’m not inventing irrelevant issues here.

I’ve experienced two major difficulties with migrations:
The first one is that you normally have large number of users, and you may be reluctant to migrate everyone to a new system at once. No matter how good your load testing skills are, you will still not be 100% certain your new system will have perfect performance under peak load. So you do phased migration. Start by moving 5% of the users, then another 15%, then another 30%, and then if everything goes well, you may migrate the rest.

Why is this a difficulty? First, the users may share data with users that have not yet migrated. There could be dependencies. You’ll need to figure these out and write temporary code to solve those that will be used only during the migration phase. But before that, you need to find a way to migrate specific parts of your data. This requires figuring out how to tear things apart carefully within and across tables. A mini modeling project in its own right. This complicates the “bunch of SELECT * FROM statements” quite a bit.

Oh, and the migration may fail. Spectacularly. At 3am. You now need to migrate all the users back. With the new data they inserted into the new DB. I hope you prepared a script in advance to do that.

And that is just the first difficulty. The second major problem is that you may have large amounts of data arriving at high rates. You could declare 3 days downtime to move all the data, but I can see some reasons not to do that.

The alternative is to move the data in increments. First select and copy all the data inserted until today at 8am. Once this is done, select and copy all the data inserted between 8am and now. Then all the data between the previous now and the now-now. All in ever shrinking deltas of data that will eventually converge to a point where you can switch the users over. This requires that all large tables will have timestamps, preferably indexed, hopefully partitioned. Even with timestamps it is not a trivial application to write, and it has to take care of dependencies – you can’t migrate comments on a document without migrating the document itself.

During the incremental migration and the data streaming phase, you have to support two systems with the same one operational group. The same operational group that now have to learn to support a new database and a lot of new code rewritten for it. Not impossible, but far from “straightforward”.

I always thought that the biggest misconception developers have about operations is the “just add a bunch of servers to solve the performance issue” myth. I can add “migration to a new system is straighforward” as another dangerous myth.

I’m not blaming them, they are architects and developers. Solving difficult operational problems is not their job. The “migration is straightforward” attitude is a problem only when you ask your developers to support your operations. Something that seems depressingly common when NoSQL databases arrive to operations. Operations have no NoSQL experience and management asks the developers to help out until the ops teams learn to support the new beast. Problem is that NoSQL developers without operations experience are likely to cause just as much damage as operations without NoSQL experience.

Log Buffer #184, a Carnival of the Vanities for DBAs

This is the 184th edition of Log Buffer, the weekly review of database blogs. I’ve edited a couple of Log Buffers before, but this is the first time I get to post directly to the Pythian blog. Just one of the many perks of being a Pythian employee ;)

On the Oracle front:

It is always good to start the day with a pop quiz to get the brain into gear: Charles Hooper posted a 3-part series with seemingly innocent True/False questions. He covers sorting, SQL tuning and wait events.

If you enjoyed the performance tuning quiz, you will also enjoy reading about 5 dangerous myths of SQL Performance that Iggy Fernandez reviewed in “So Many Oracle Manuals” blog.
Read the rest of this entry . . .

Start NowWith Pythian - database design, management and emergency handling capabilities...

Live Updates

pythian: RT @alexgorbachev: Oracle OpenWorld 2010 - Bloggers Meetup announced! http://bit.ly/92Qg76
more



Testimonials

  • Serge Racine

    DBA, Brookfield Energy

    We are very satisfied by the service given to us by Andre and Shakir in support of our recent data quality and reorganization initiative.... more