Oracle Database or Hadoop?

Some decisions sound easy, but it’s also easy to get them wrong. Today I had a choice between hanging around New York City and working on my big data presentation for RMOUG. Sounds easy, and yet I spent the day working on that presentation.

Whenever I tell an experienced Oracle DBA about Hadoop and what companies are doing with it, the immediate response is “But I can do this in Oracle”.

It is true. Oracle is a marvelous generic database and can be used for many things.
You can process your web logs and figure out which paths customers are likely to take through your online store and which are more likely to lead to a sale. You can build a recommendation engine with Oracle. You can definitely do ETL in Oracle. I’m hard pressed to think of a use case that is simply impossible with Oracle. I can’t say for certain that Google and Facebook could replace their entire data centers with Oracle, but perhaps it’s possible.

Just because it is possible to do something doesn’t mean you should. There are good reasons to use Oracle as your default solution, especially when you are an experienced Oracle DBA.

But do you really want to use Oracle to store millions of emails and scanned documents? I have a few customers who do it, and I think it causes more problems than it solves. After you have stored them, do you really want to use your network and storage bandwidth so that the application servers can keep reading the data from the database? Big data is… big. It is best not to move it around too much, and to run the processing on the servers that store the data. After all, the code takes fewer packets than the data. But Oracle makes cores very expensive. Are you sure you want to use them to run processing-intensive data mining algorithms?

Then there’s the issue of actually programming the processing code. If your big data is in Oracle and you want to process it efficiently, PL/SQL is pretty much the only option. PL/SQL is a nice language, but it lags behind Java and Python in the availability of tools, in ease of use, and most of all in its standard library. I don’t think anyone seriously considers writing their data mining programs in PL/SQL (except maybe Gerwin Hendriksen). Once you write your code in Java or Python to access mostly unstructured data in Oracle, you go through many layers of abstraction that result in uglier and slower code. I fail to see the benefit of forcing both PL/SQL and Oracle to do something unnatural.

Christo says that this means that Big Data is actually a license issue. It is partially a license issue – Oracle Database is expensive and MySQL isn’t good at data warehouse stuff. It is partially a storage and network issue: as data volumes scale, locality of data becomes more critical. But I see it mostly as using the right tool for the job – just because Oracle can do something doesn’t make it the best way to do it. Sometimes you just need Python, a good file system and a nice distributed framework.
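To make the "run the processing where the data lives" idea concrete, here is a minimal sketch in plain Python of the web log example from earlier in the post. Everything in it is an assumption for illustration: the "<session_id> <path>" log format, the SALE_PATH marker, the function names, and the simplification that all lines of a session sit in the same partition. It imitates the map and reduce steps of a distributed framework in a single process; it is not the API of Hadoop or any other real framework.

```python
from collections import Counter
from functools import reduce

# Hypothetical log format: "<session_id> <path>", one request per line.
# A session counts as a sale if it ever requests SALE_PATH (assumed marker).
SALE_PATH = "/checkout/complete"


def map_partition(lines):
    """Map step: meant to run on the node that stores this log partition.

    Returns a small Counter keyed by (path, led_to_sale), so only this
    summary has to cross the network, not the raw log lines.
    Assumes every line of a given session is in the same partition.
    """
    sessions = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines
        session_id, path = parts
        sessions.setdefault(session_id, []).append(path)

    counts = Counter()
    for paths in sessions.values():
        sold = SALE_PATH in paths
        # Count each distinct path once per session, excluding the marker.
        for path in set(paths) - {SALE_PATH}:
            counts[(path, sold)] += 1
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-partition summaries into one Counter."""
    return reduce(lambda a, b: a + b, partials, Counter())


def sale_rate(counts, path):
    """Fraction of sessions that visited `path` and ended in a sale."""
    sold = counts[(path, True)]
    total = sold + counts[(path, False)]
    return sold / total if total else 0.0


if __name__ == "__main__":
    # Two made-up partitions standing in for logs stored on two nodes.
    node_a = ["s1 /home", "s1 /shoes", "s1 /checkout/complete", "s2 /home"]
    node_b = ["s3 /shoes", "s3 /checkout/complete", "s4 /shoes"]
    totals = reduce_counts([map_partition(node_a), map_partition(node_b)])
    for page in ("/home", "/shoes"):
        print(page, sale_rate(totals, page))
```

The point of the split is that map_partition only ever sees local lines and returns a few counters, which is the "code takes fewer packets than the data" argument in miniature.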


3 Responses

  1. Fahd Mirza says:

    With Exadata, the issue of network and storage scalability has been addressed to a great extent, and Oracle is improving that too.

    So does that leave just the license issue?

  2. Gwen Shapira says:

    Exadata helps a lot with storage/network speeds, but it’s not the same as running code on the same box with direct attached storage (Exadata does a bit of that, but not a lot).

    The “Coding in PL/SQL” part is still a big issue. With Exadata, you either need to find someone to write data mining code in PL/SQL, or move the data over a non-IB network for processing in the app server.

    Exalytics and Exalogic may help with the code issue.

    After all that, we are faced with the license cost issue.

  3. Yury Velikanov says:

    Hey Gwen! Thx for sharing your thoughts on the subject! I think many DBAs/architects are facing the same question over and over again. I hope to see more discussion on this topic.

    From my experience, one of the main technical disadvantages of high-volume/big data processing within the Oracle database is the lack of an option that would allow us to avoid generating UNDO and the related REDO streams.

    I admire Oracle DB’s read consistency and recoverability features. However, for big data it should be left to developers to decide whether those features (and the related overhead) should be on or off.

    I think people would have fewer technical arguments about leaving binary data in the DB if Oracle provided this option (I predict that we will see it in 12 or 13 …..).
    Device => ASM => foreground process => ASM => Device would become a faster option than any FS could provide (I mention ASM just to make the point about the configuration: data is sent directly to a device from the foreground process).

    Just my 0.02$,
    Yury
