What Should We Store on Hadoop?

Oct 3, 2012 / By Gwen Shapira

Last year at Oracle OpenWorld, the most frequent questions about Hadoop and Big Data were either “What is it?” or “Will Hadoop replace Oracle?”.
The consistent message, both from Oracle and from independent data architects, has been: “Hadoop will not replace Oracle. Each system has its strengths, and they can be used side by side to offer a wider range of data storing and processing possibilities”.

It seems that most professionals understood and agreed with this message, since the most frequent question this year is: “Which kinds of data should we store in Hadoop, and which in Oracle?”

I can’t claim to have the definitive answer, but I can offer some pointers and start the discussion.

Let’s start with size. Hadoop typically doesn’t make much sense on clusters of fewer than six nodes; that is roughly the point where the maintenance overhead and the headache of adding another system start to pay off. Most production clusters have 20-30 nodes. Each server is likely to have at least 1 TB of storage, and data is replicated three times. So I’d say that if you have less than one terabyte of data, Hadoop is likely to be more of a problem than a solution. We may not know how big data must be to count as Big Data, but 1 TB seems like a lower bound.
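To make the sizing arithmetic concrete, here is a minimal sketch of how replication shrinks raw disk into usable capacity. The function name is made up for illustration, and the calculation ignores space needed for the operating system, temporary data, and headroom, so treat the results as upper bounds.

```python
def usable_capacity_tb(nodes, raw_tb_per_node, replication=3):
    """Raw cluster storage divided by the replication factor.

    Simplifying assumption: every byte of raw disk is available to HDFS.
    """
    return nodes * raw_tb_per_node / replication


# The smallest cluster discussed above: 6 nodes with 1 TB each, 3 replicas.
small = usable_capacity_tb(nodes=6, raw_tb_per_node=1)

# The low end of a typical production cluster: 20 nodes with 1 TB each.
typical = usable_capacity_tb(nodes=20, raw_tb_per_node=1)

print(f"6 nodes:  {small:.1f} TB usable")
print(f"20 nodes: {typical:.1f} TB usable")
```

Even the minimal six-node cluster holds about twice the 1 TB lower bound once replication is accounted for.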

If you have, and plan on actually processing, images, videos, sound files, or any other non-text unstructured data, Hadoop is definitely the solution. Storing 1 TB of image files in a database or on a storage device and copying them over the network for processing on application servers is far less efficient than using Hadoop to process the data where it is stored.

If you store images and videos without processing them, Hadoop is still a fairly cost-effective solution, but in that case you need to calculate the cost per terabyte and see which data store makes the most sense.

If you store text files that are truly unstructured, such as blog posts, and plan on processing them using natural language processing tools, Hadoop is a good solution. It allows you to store unstructured text and process it at the point of storage.

If you just want to search your text files and don’t plan on processing them, a text-index solution such as Solr makes more sense.

Semi-structured text such as log files or XML is the most difficult to place. Since it has structure, it can be stored in a relational database. The decision is whether you want to structure the data when you save it or impose the structure later, when you retrieve the data.
If at all possible, it is preferable to structure the data when you first store it. This way, you only structure it once instead of multiple times by each application retrieving the data. You also reduce the possibility of errors due to undocumented structure. Typically, it also reduces disk space usage.
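The two options can be sketched in a few lines. This is only an illustration: the log format ("timestamp, level, message"), the sample lines, and the function names are assumptions made up for the example, not anything Hadoop or Oracle prescribes.

```python
import re
from datetime import datetime

# Hypothetical log format, assumed for illustration only:
#   "<ISO timestamp> <LEVEL> <free-text message>"
LOG_LINE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def parse_line(line):
    """Impose structure on one raw log line; return None if it does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    return {
        "ts": datetime.fromisoformat(m.group("ts")),
        "level": m.group("level"),
        "msg": m.group("msg"),
    }


raw = [
    "2012-10-03T09:15:00 ERROR disk full on node7",
    "2012-10-03T09:16:30 INFO cleanup finished",
]

# Structure on write: parse once at load time and store the resulting rows.
rows = [parse_line(line) for line in raw]


# Structure on read: store `raw` untouched; every consumer has to parse it
# again, and has to agree with every other consumer on the format.
def count_errors(lines):
    return sum(
        1 for line in lines
        if (parse_line(line) or {}).get("level") == "ERROR"
    )


print(rows[0]["level"], count_errors(raw))
```

With structure on write, the regular expression lives in one place. With structure on read, each application carries its own copy of that knowledge, which is where undocumented-structure errors tend to creep in.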

However, as you recall, database experts caution against using the database as a “bucket of bits”. The performance of a relational database depends on having the right data model. If you don’t know how the data will be used and don’t have a good data model, it is better to store the data unstructured on Hadoop until you know how it will be used and can create a good relational data model for it.

Finally, it can also make financial sense to use Hadoop to store large amounts of structured data that is queried very infrequently. If the business demands storing 15 years of data but only ever queries the last two years, Hadoop could be cost-efficient storage for the other 13 years, with the added bonus that you can still query the data if needed. Again, calculate the cost per terabyte to see if this is a good fit for you.
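A rough version of that calculation might look like the sketch below. Every number in it (data volume per year and both per-terabyte prices) is a placeholder invented for the example; substitute your own figures, and make sure the Hadoop price is per usable terabyte, after replication.

```python
def split_storage_cost(total_years, hot_years, tb_per_year,
                       db_cost_per_tb, hadoop_cost_per_tb):
    """Cost of keeping the recent years in the database and the rest on Hadoop."""
    hot_tb = hot_years * tb_per_year
    cold_tb = (total_years - hot_years) * tb_per_year
    return hot_tb * db_cost_per_tb + cold_tb * hadoop_cost_per_tb


# Placeholder inputs, NOT real prices or volumes.
TB_PER_YEAR = 2
DB_COST_PER_TB = 10_000
HADOOP_COST_PER_TB = 1_000

# Option 1: all 15 years in the database (no years moved to Hadoop).
all_in_db = split_storage_cost(15, 15, TB_PER_YEAR,
                               DB_COST_PER_TB, HADOOP_COST_PER_TB)

# Option 2: last 2 years in the database, the other 13 on Hadoop.
split = split_storage_cost(15, 2, TB_PER_YEAR,
                           DB_COST_PER_TB, HADOOP_COST_PER_TB)

print(f"all in database: {all_in_db}")
print(f"2 years in database, 13 on Hadoop: {split}")
```

The comparison only covers storage. The cost of running and maintaining a second system is not in it and has to be weighed separately.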

Note that I was only discussing the data stored. There is also the question: “What types of data processing should be done with Hadoop?”
I will try to answer this question at a later point. For now, if you have determined that a certain report is best done on Hadoop, you should store all the data used by the report in Hadoop.

I hope this is helpful. More suggestions and corrections are always welcome.

9 Responses to “What Should We Store on Hadoop?”

  • Deepak Sharma says:

    Hi,

    I attended your session “Building an Integrated Data Warehouse with Oracle Database and Hadoop”. Where can I find the slides?

    Thanks,
    Deepak

  • Giorgio Chiappone says:

    Hi,
    I would like to know the benefits of Hadoop compared with an appliance, and about technology benchmarks for Hadoop.

    thanks

  • manjush says:

    hi,
    I would like to know the difficulties in following hadoop. Hadoop is not always a better option. It cannot visualize the data and give you some interactive solutions to the problems faced by the companies. It has back up problems. Since only one name node knows all the data distributions, it will take long time to load a new application in to it. While this process is going on, other programs, that work under hadoop do not work. The hype created on hadoop is making it look bigger. But i think that it has its own limitations. So focus on the limitations and how these limitations can be solved by working on different environments like cloudera.

  • Nilesh says:

    I would like to know is there any way to see the location blocks of data?

    I means where exactly the data is stored in slaves
    In which machine which data is stored?

    How to check their location??

    Kindly tell me something about Data Storage in HADOOP..

    Thanks…

    • Gwen Shapira says:

      To find the locations of blocks of a file, use FSCK. You can also use it on entire directories.

      For example /user/cloudera/passwd has one block on 127.0.0.1:

      [cloudera@localhost target]$ hdfs fsck /user/cloudera/passwd -files -blocks -locations
      Connecting to namenode via http://localhost.localdomain:50070
      FSCK started by cloudera (auth:SIMPLE) from /127.0.0.1 for path /user/cloudera/passwd at Mon Apr 15 19:45:23 EDT 2013
      /user/cloudera/passwd 2132 bytes, 1 block(s): OK
      0. BP-1809851170-127.0.0.1-1360017167006:blk_8640724210642370732_1012 len=2132 repl=1 [127.0.0.1:50010]

      Status: HEALTHY

      • Nilesh says:

        actually I want to know the locations of data that is stored on the cluster of HADOOP
        for eg I had a cluster setup of say 4 machines
        1 Master Node & other 3 as SLAVE Nodes
        suppose I uploaded some file on master by running command ./hadoop fs -copyFromLocal /home/data/ /inp/
        It’ll copy my all files present in data folder to inp folder in HADOOP Cluster
        Now my point is where this data is stored?
        I have set replication factor as 3
        where is the file present 3 times?
        Where the blocks of files are stored on my slaves?
        Is there any method to see it

        or
        atleast to show it..

        Please reply

        Thanks for the last reply
        It also helps me a little…

        Thanks

  • Rashed Mustafa says:

    Thanks for the discussion about data storing in Hadoop. I am doing research on content based video processing using Hadoop. Is it possible to read video contents using Hadoop? I can split the video according to frames or time but how can I store videos according to specific contents?

  • Grace says:

    Thanks for the above, it was really helpful. I am currently doing research on modeling and simulation in big data. One of the first things I need is for my model to take unstructured data from different sources as input, be it image, video, text, etc., and structure it (the output of the model) using a tool. Is there any Hadoop tool, or any other tool, that I can use to achieve this? Thank you
