What Should We Store on Hadoop?
Last year at Oracle OpenWorld, the most frequent questions about Hadoop and Big Data were either “What is it?” or “Will Hadoop replace Oracle?”
The consistent message, both from Oracle and from independent data architects, has been: “Hadoop will not replace Oracle. Each system has its strengths, and they can be used side by side to offer a wider range of data storage and processing options.”
It seems that most professionals understood and agreed with this message, since the most frequent question this year is: “Which kind of data should we store in Hadoop and in Oracle?”
I can’t claim to have the definitive answer, but I can offer some pointers and start the discussion.
Let’s start with size. Hadoop typically doesn’t make much sense on clusters of fewer than 6 nodes. That is roughly the point where the maintenance overhead and the headache of adding another system start to pay off. Most production clusters have 20-30 nodes. Each server is likely to have at least 1 TB of storage, and data is replicated 3 times. So I’d say that if you have less than one terabyte of data, Hadoop is likely to be more of a problem than a solution. We may not know how big data should be to count as Big Data, but 1 TB seems like a lower bound.
If you have images, videos, sound files, or any other non-text unstructured data, and you actually plan on processing it, Hadoop is definitely the solution. Storing 1 TB of image files in a database or on a storage device and copying them over the network for processing on application servers is far less efficient than using Hadoop to process the data where it is stored.
If you store images and videos without processing them, Hadoop is still a fairly cost-effective solution, but at this point you need to calculate the cost per terabyte and see which data store makes the most sense.
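To make the sizing arithmetic concrete, here is a minimal sketch in Python. The function name and parameters are my own, and the calculation is deliberately simplified: it only divides raw disk capacity by the replication factor and ignores space needed for the operating system, temporary files, and headroom.

```python
def usable_capacity_tb(nodes: int, storage_per_node_tb: float, replication: int = 3) -> float:
    """Rough usable capacity: raw capacity divided by the replication factor.

    Simplifying assumption: no space is reserved for the OS, temp files, or growth.
    """
    if nodes <= 0 or replication <= 0 or storage_per_node_tb < 0:
        raise ValueError("nodes and replication must be positive, storage non-negative")
    return nodes * storage_per_node_tb / replication


# A minimal 6-node cluster with 1 TB per server and 3 copies of every block.
print(usable_capacity_tb(6, 1.0))             # 2.0
# A 20-node cluster with the same servers.
print(round(usable_capacity_tb(20, 1.0), 1))  # 6.7
```

Even the smallest sensible cluster holds about two terabytes of replicated data under these assumptions, which is why data sets under one terabyte rarely justify the overhead.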
If you store text files that are truly unstructured, such as blog posts, and plan on processing them using natural language processing tools, Hadoop is a good solution. It allows you to store unstructured text and process it at the point of storage.
If you just want to search your text files and don’t plan on processing them, a text-index solution such as SOLR makes more sense.
Semi-structured text such as log files or XML is the most difficult to place. Since it has structure, it can be stored in a relational database. The decision is whether you want to structure the data when you save it or impose the structure later on, when you retrieve the data.
If at all possible, it is preferable to structure the data when you first store it. This way, you only structure it once instead of multiple times by each application retrieving the data. You also reduce the possibility of errors due to undocumented structure. Typically, it also reduces disk space usage.
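The difference between the two approaches can be sketched in a few lines of Python. The log format, the sample lines, and all names below are made up for illustration; real logs will need their own parsing rules.

```python
import re

# Hypothetical log format, assumed for this sketch: "<date> <time> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(\S+) (\S+) (\w+) (.*)$")


def parse_line(line: str) -> dict:
    """Impose structure on one raw log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    date, time, level, message = match.groups()
    return {"date": date, "time": time, "level": level, "message": message}


# Made-up sample data.
raw_lines = [
    "2012-10-01 12:00:01 ERROR disk full",
    "2012-10-01 12:00:02 INFO job started",
]

# Structure on write: parse once when saving, then every reader gets records.
structured_store = [parse_line(line) for line in raw_lines]
errors_from_structured = [r for r in structured_store if r["level"] == "ERROR"]

# Structure on read: save the raw text, and every reader has to parse it again.
raw_store = list(raw_lines)
errors_from_raw = [r for r in map(parse_line, raw_store) if r["level"] == "ERROR"]

print(errors_from_structured == errors_from_raw)  # True
```

Both paths give the same answer, but in the second one the parsing logic has to live in, and stay consistent across, every application that reads the data.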
However, as you recall, database experts caution against using the database as a “bucket of bits”. The performance of a relational database depends on having the right data model. If you don’t know how the data will be used and don’t have a good data model, it is better to store the data unstructured on Hadoop until you know how the data will be used and can create a good relational data model for it.
Finally, it can also make financial sense to use Hadoop to store large amounts of structured data that is queried very infrequently. If the business demands storing 15 years of data but only ever queries the last two years, Hadoop could be cost-efficient storage for the other 13 years, with the added bonus of still being able to query the data if needed. Again, calculate the cost per terabyte to see if this is a good fit for you.
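The pointers so far can be summarized as a small decision helper. This is only a sketch of my rules of thumb, not a definitive procedure: the class, the field names, and the category labels are all invented for the example, and it covers only the three kinds of data discussed up to this point.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    size_tb: float
    kind: str                     # "binary", "unstructured_text" or "semi_structured"
    will_process: bool            # True if the data will be processed, not just stored or searched
    has_data_model: bool = False  # only meaningful for semi-structured data


def suggest_store(d: Dataset) -> str:
    """Return a rough suggestion based on the rules of thumb above."""
    if d.size_tb < 1:
        return "existing systems (Hadoop is likely more problem than solution)"
    if d.kind == "binary":
        return "Hadoop" if d.will_process else "compare cost per terabyte"
    if d.kind == "unstructured_text":
        return "Hadoop" if d.will_process else "text index such as SOLR"
    if d.kind == "semi_structured":
        return "relational database" if d.has_data_model else "Hadoop until a data model exists"
    raise ValueError(f"kind not covered by this sketch: {d.kind!r}")


print(suggest_store(Dataset(size_tb=5, kind="binary", will_process=True)))
print(suggest_store(Dataset(size_tb=5, kind="unstructured_text", will_process=False)))
print(suggest_store(Dataset(size_tb=5, kind="semi_structured", will_process=True)))
```

Treat the output as a starting point for discussion; real decisions also depend on cost, skills, and existing infrastructure.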
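The cost-per-terabyte comparison is simple arithmetic, and a short Python sketch shows its shape. The function names are my own, and every figure below is a placeholder chosen to keep the math easy, not a real price for any product; plug in your own hardware, license, and operations costs.

```python
def cost_per_tb(total_cost: float, usable_tb: float) -> float:
    """Total cost of a data store divided by its usable capacity in terabytes."""
    if usable_tb <= 0:
        raise ValueError("usable_tb must be positive")
    return total_cost / usable_tb


def cheapest_store(costs_per_tb: dict) -> str:
    """Name of the store with the lowest cost per terabyte."""
    return min(costs_per_tb, key=costs_per_tb.get)


# Placeholder figures only: two hypothetical options, each holding 20 TB.
costs = {
    "option_a": cost_per_tb(total_cost=60_000, usable_tb=20),
    "option_b": cost_per_tb(total_cost=200_000, usable_tb=20),
}
print(costs["option_a"])      # 3000.0
print(cheapest_store(costs))  # option_a

# With 15 years retained and only the last 2 queried regularly,
# 13 of 15 years are candidates for the cheaper store.
archive_fraction = (15 - 2) / 15
print(round(archive_fraction, 2))  # 0.87
```

Remember to use usable capacity, after replication, as the denominator; dividing by raw disk capacity would understate the cost of a replicated store.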
Note that I was only discussing the data stored. There is also the question: “What types of data processing should be done with Hadoop?”
I will try to answer this question at a later point. For now, if you have determined that a certain report is best done on Hadoop, you should store all the data used by that report in Hadoop.
I hope this is helpful. More suggestions and corrections are always welcome.