Oracle Database or Hadoop?

Jan 16, 2012 / By Gwen Shapira


Some decisions sound easy, but it’s also easy to get them wrong. Today I had a choice of hanging around New York City or working on my big data presentation for RMOUG. Sounds easy, and yet I spent the day working on that presentation.

Whenever I tell an experienced Oracle DBA about Hadoop and what companies are doing with it, the immediate response is “But I can do this in Oracle”.

It is true. Oracle is a marvelous generic database and can be used for many things.
You can process your web logs and figure out which paths customers are likely to take through your online store and which are more likely to lead to a sale. You can build a recommendation engine with Oracle. You can definitely do ETL in Oracle. I’m hard pressed to think of a use case that is simply impossible with Oracle. I can’t say for certain that Google and Facebook could replace their entire data centers with Oracle, but perhaps it’s possible.

Just because it is possible to do something doesn’t mean you should. There are good reasons to use Oracle as your default solution, especially when you are an experienced Oracle DBA.

But do you really want to use Oracle to store millions of emails and scanned documents? I have a few customers who do, and I think it causes more problems than it solves. After you’ve stored them, do you really want to use your network and storage bandwidth so the application servers can keep reading the data back out of the database? Big data is… big. It is best not to move it around too much, and instead run the processing on the servers that store the data. After all, the code takes fewer packets than the data. But Oracle makes cores very expensive. Are you sure you want to use them to run processing-intensive data mining algorithms?

Then there’s the issue of actually programming the processing code. If your big data is in Oracle and you want to process it efficiently, PL/SQL is pretty much the only option. PL/SQL is a nice language, but it lags behind Java and Python in the availability of tools, in ease of use, and especially in its standard library. I don’t think anyone seriously considers writing their data mining programs in PL/SQL (except maybe Gerwin Hendriksen). And once you write your code in Java or Python, accessing mostly unstructured data in Oracle means going through many layers of abstraction that result in uglier and slower code. I fail to see the benefit in forcing both PL/SQL and Oracle to do something unnatural.
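To make the standard-library point concrete, here is a minimal sketch (the log format and function name are my own, purely illustrative) of the kind of web-log path analysis mentioned above, done in plain Python with just `re` and `collections` – no database layers in between:

```python
import re
from collections import Counter, defaultdict

# Assumed log format (illustrative only): "<client_ip> <timestamp> <url_path>"
LINE_RE = re.compile(r"^(\S+) (\S+) (\S+)$")

def count_paths(lines):
    """Group hits by client, then count how often each click path occurs."""
    sessions = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ip, _ts, url = m.groups()
            sessions[ip].append(url)
    # A "path" here is simply the ordered sequence of URLs one client visited.
    return Counter(tuple(urls) for urls in sessions.values())

sample = [
    "10.0.0.1 12:00:01 /home",
    "10.0.0.1 12:00:09 /product/42",
    "10.0.0.2 12:00:11 /home",
    "10.0.0.1 12:00:20 /checkout",
]
print(count_paths(sample))
```

The same logic in PL/SQL would mean loading the raw text into the database first, and regular-expression and collection handling there is considerably more verbose.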

Christo says that this means Big Data is actually a license issue. It is partially a license issue – Oracle Database is expensive, and MySQL isn’t good at data warehouse workloads. It is partially a storage and network issue – as data volumes grow, data locality becomes more critical. But I see it mostly as using the right tool for the job – just because Oracle can do something doesn’t make it the best way to do it. Sometimes you just need Python, a good file system and a nice distributed framework.
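That last sentence can be sketched, too. In a Hadoop Streaming-style job, the “distributed framework” contract boils down to a mapper that emits key/value pairs and a reducer that sees them grouped by key; the framework handles the sorting and distribution. The sketch below (the per-URL hit-count job and all names are my own illustration, not anyone’s production code) mimics the map → sort → reduce flow locally:

```python
from itertools import groupby

# Hypothetical job: count hits per URL from "<ip> <timestamp> <url>" log lines.
def mapper(line):
    # In Hadoop Streaming the mapper would write key/value pairs to stdout;
    # here we simply yield them.
    parts = line.split()
    if len(parts) == 3:
        yield parts[2], 1

def reducer(key, values):
    # The framework sorts mapper output by key before the reducer sees it.
    yield key, sum(values)

def run_locally(lines):
    """Mimic map -> shuffle/sort -> reduce on a single machine."""
    mapped = sorted(kv for line in lines for kv in mapper(line))
    result = {}
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        for k, total in reducer(key, (v for _, v in group)):
            result[k] = total
    return result

sample = [
    "10.0.0.1 12:00:01 /home",
    "10.0.0.2 12:00:05 /home",
    "10.0.0.1 12:00:09 /checkout",
]
print(run_locally(sample))  # -> {'/checkout': 1, '/home': 2}
```

On a real cluster the same mapper and reducer run on the nodes that hold the data blocks, which is exactly the data-locality argument made above.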

6 Responses to “Oracle Database or Hadoop?”

  • Fahd Mirza says:

With Exadata, the issue of network and storage scalability has been addressed to a great extent, and Oracle keeps improving it.

    So does that leave just the license issue?

  • Gwen Shapira says:

Exadata helps a lot with storage/network speeds, but it’s not the same as running code on the same box with direct attached storage (Exadata does a bit of that, but not a lot).

The “coding in PL/SQL” problem is still a big issue. Even with Exadata, either you need to find someone to write data mining code in PL/SQL, or you move the data over a non-InfiniBand network for processing in the app server.

    Exalytics and Exalogic may help with the code issue.

    After all that, we are faced with the license cost issue.

  • Yury Velikanov says:

Hey Gwen! Thanks for sharing your thoughts on the subject! I think many DBAs/architects are facing the same question over and over again. I hope to see more discussion on this topic.

From my experience, one of the main technical disadvantages of high-volume/big-data processing within the Oracle database is the lack of an option that would let us avoid generating UNDO and the related REDO streams.

I admire Oracle DB’s read consistency and recoverability features. However, for big data it should be left to developers to decide whether those features (and the related overhead) should be on or off.

I think people would have fewer technical arguments about leaving binary data in the DB if Oracle provided this option (I predict that we will see it in 12 or 13 …..).
Device => ASM => foreground process => ASM => Device would then become a faster option than any file system could provide. (I mention ASM just to make the point about data being sent directly to a device from the foreground process.)

Just my $0.02,

  • Joe Friday says:

The main thing to me in considering any ACID-compliant RDBMS vs. Hadoop/NoSQL solutions is the quality and reliability of the results. What often seems overlooked in this discussion is that result sets in Hadoop/NoSQL are allowed to be “lossy,” meaning that not all “rows” which may satisfy a given query are guaranteed to be returned in the result set.

In certain applications, such as a web search engine or statistical analysis, a result set with “most” of the rows is acceptable. However, if used for accounting, inventory, logistics, medical, or many other important planning and operations applications, all rows are required, in which case Hadoop/NoSQL cannot be used since all rows are not guaranteed to be returned.

Prices change, licensing models change, and Oracle RDBMS supports Java in addition to PL/SQL in the database engine, so many of the typical arguments for Hadoop/NoSQL and against Oracle RDBMS in the “big data” dialogue are either transient or inaccurate. The need for accuracy and consistency of data is not.

    • Gwen Shapira says:

      Hey Joe,

      Been a year since I wrote this post. Cool to see that old stuff is still getting read!

      I obviously didn’t make myself clear enough – Hadoop and NoSQL are two different things with different use-cases, benefits and trade-offs.

Most important, Hadoop is a file system and job scheduling framework. It’s not a database. Nothing prevents someone from writing an ACID-compliant system on top of Hadoop – just like Oracle runs on top of Linux while Linux itself is not ACID.

That said, while ACID is critical for many applications, some businesses can decide that consistency (different from accuracy!) can be compromised in some cases. Even traditional systems like MS SQL Server allow dirty reads when requested – letting developers trade consistency and accuracy for speed when they decide it makes sense.

IMO, having more options is always a good thing.

      That said, I’d love to hear which arguments pro Hadoop or pro NoSQL are inaccurate or transient.

