Hadoop FAQ – Getting Started Version
Jan 11, 2013 / By Gwen Shapira
Following my “Building Integrated DWH with Oracle and Hadoop” webinar for the IOUG Big Data SIG on Tuesday, I got a bunch of excellent follow-up questions. The two most frequently asked were: “What is the minimum I need to do to get started with Hadoop?” and “How do I load data into Hadoop?”
Since so many people are interested in the same questions, it makes more sense to answer on the blog than to copy-paste the answers to everyone individually. Also, in the grand open source tradition – there is more than one way to do it. There are many ways to get started with Hadoop, and many ways to get data into Hadoop.
Let's go over a few options for getting started with Hadoop.
First, Hadoop can run in a few different modes:
- Local mode – where you run your map-reduce code over files in your local filesystem. In this mode there is no cluster and no HDFS. It's used mostly to test whether your fancy map-reduce jar files will run at all.
- Pseudo-distributed mode – In this mode, you run all the Hadoop processes you would in a real cluster, but they all run on a single server. My test “cluster” is a pseudo-distributed Hadoop running in a VM, and for many tests this is enough. For example, I ran all the Hadoop-Oracle integration scenarios in this setup.
- Fully-distributed mode – Here the sky is the limit, because all production clusters run in this mode, and they can be huge and complex. But for starters, we are looking at just two or three machines – either VMs or cloud instances – running just one of each of the basic processes. For some tests you need a fully distributed cluster; for example, there is a blog post coming up about configuring HA HDFS, and I couldn't do that with just a single node.
Getting started in Local mode is probably the easiest. You download a release of Apache Hadoop and unpack it. Then configure conf/hadoop-env.sh with your JAVA_HOME (you need Java 1.6 to run Hadoop). At this point you can run bin/hadoop jar <your job> and presto! You are running Hadoop!
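Put together, the Local mode setup looks roughly like this. This is a hedged sketch: the version number, tarball name and JAVA_HOME path are illustrative assumptions – adjust them for your system.

```shell
# Unpack an Apache Hadoop 1.x release (version and filename are assumptions)
tar -xzf hadoop-1.0.4.tar.gz
cd hadoop-1.0.4

# Point Hadoop at your JDK, either in the environment or in conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

# Run the bundled wordcount example over a local directory --
# no cluster and no HDFS involved, input and output are plain local paths
bin/hadoop jar hadoop-examples-1.0.4.jar wordcount ./input ./output
```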
This may be good enough for developers, but DBAs usually don't feel they have really experimented with a new data store unless they have some place to store the data, and the local filesystem isn't what we are looking for. So we usually get started with Pseudo-distributed mode.
In Pseudo-distributed mode you will be running all basic Hadoop processes:
- HDFS name node – master process for HDFS. There is always just one of these, and it is responsible for keeping track of the filesystem metadata, such as file names and directories.
- HDFS data node – slave process for HDFS. In a real cluster there are many of these. They are responsible for communicating with clients, storing the data and replicating it.
- Map-Reduce Job Tracker – master process for Map-Reduce. This process keeps track of job progress, failures and scheduling.
- Map-Reduce Task-Tracker – slave process for Map-Reduce. Responsible for running specific map and reduce processes and reporting their progress.
- (Optional) Secondary name-node – offloads checkpointing activity from the name-node. Despite the name, it is not a standby name-node.
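On a package-based pseudo-distributed install you can sanity-check that all these daemons are actually up. This is a hedged sketch assuming CDH3-style init.d service names, which vary by distribution version:

```shell
# Start all Hadoop daemons installed as services (CDH3-style names assumed)
for service in /etc/init.d/hadoop-0.20-*; do
  sudo "$service" start
done

# jps (ships with the JDK) lists one JVM per daemon; on a healthy
# pseudo-distributed node you should see NameNode, DataNode,
# SecondaryNameNode, JobTracker and TaskTracker
sudo jps
```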
To get started with Pseudo-Distributed mode, you have two easy options:
- Install Cloudera's distribution from a package built for your flavor of Linux. It's relatively easy if you are familiar with your version of Linux; instructions are on the Cloudera website.
- Download Cloudera’s demo VM and run it. Can’t get any easier than that.
I used the first method when I wanted to run Hadoop on a system that already had Oracle installed – it's easier to install Hadoop on an Oracle server than vice versa. The rest of the time I'm using the VM for my tests.
Fully-distributed mode is required when you run a more serious POC, or when you want to test some of the HA features.
There is no really easy way to get started with a full cluster; both options below require a somewhat deeper understanding of Hadoop. But neither is very difficult:
- Take the VM from the Pseudo-distributed step and run two copies of it. Make sure they can communicate with each other. Stop all Hadoop services on both VMs. Configure /etc/hadoop/conf/core-site.xml, /etc/hadoop/conf/hdfs-site.xml and /etc/hadoop/conf/mapred-site.xml with the appropriate IP addresses so the services will be able to find each other. Start the name-node and job-tracker on one node, and the data-node and task-tracker on the other.
- As an alternative, you can create the cluster in the Amazon cloud with Whirr. Note that this can get costly if you forget to take the cluster down, like I did last week (ouch!). If you are familiar with Amazon AWS services, you can follow the instructions on the Cloudera website; if you are a cloud newbie, this blog post offers more handholding.
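For the two-VM option, the minimal cross-node settings look roughly like this. A hedged sketch: 192.168.1.10 stands in for your master's address, and the files are written to a local scratch directory here purely for illustration – on a real node they live in /etc/hadoop/conf.

```shell
# Scratch directory standing in for /etc/hadoop/conf on each node
CONF_DIR=./hadoop-conf-sketch
mkdir -p "$CONF_DIR"

# fs.default.name tells every node where the HDFS name node is
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.10:8020</value>
  </property>
</configuration>
EOF

# mapred.job.tracker tells every task-tracker where the job tracker is
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.1.10:8021</value>
  </property>
</configuration>
EOF
```

With both files pointing at the master's address, the data-node and task-tracker started on the second VM will find the name-node and job-tracker on the first.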
OK, in one way or another, you now have Hadoop. Now let's see how we get data into HDFS. I'm going to assume a Pseudo-distributed cluster here, since this is what I mostly use.
There are literally endless ways to get data into Hadoop, so let's review some of my favorites. Note that none of these methods require any Java or even programming knowledge – they can be used by any DBA:
- Copy a file: hadoop fs -put localfile /user/hadoop/hadoopfile
- If you have a large number of files (very common in my experience), a shell script that runs multiple “put” commands in parallel will greatly speed up the process. File copying is easy to parallelize without any need to write fancy MR code.
- You can also have a cron job scan a directory for new files and “put” them in HDFS as they show up.
- Mount HDFS as a file system and simply copy files or write files there. The instructions for mounting are in my presentation.
- Use Sqoop to get data from a database into Hadoop. This was also covered in my presentation.
- Use Flume to continuously load data from logs into Hadoop. There are some catches: you want relatively fresh data, so you don't want Flume to do too much buffering, but you also don't want many small files, because that's bad for HDFS. I'll probably blog about this in the future.
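To make the Flume buffering trade-off concrete, here is a hedged sketch of a Flume NG agent configuration; all names and paths are illustrative assumptions. The roll settings on the HDFS sink are the knobs that trade data freshness against the many-small-files problem.

```properties
# One agent: tail a log file, buffer in memory, write to HDFS
agent.sources = tail1
agent.channels = mem1
agent.sinks = hdfs1

agent.sources.tail1.type = exec
agent.sources.tail1.command = tail -F /var/log/app/app.log
agent.sources.tail1.channels = mem1

agent.channels.mem1.type = memory

agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.channel = mem1
agent.sinks.hdfs1.hdfs.path = /flume/app
# Roll a new file every 5 minutes or 64MB, whichever comes first:
# shorter intervals mean fresher data but more, smaller files
agent.sinks.hdfs1.hdfs.rollInterval = 300
agent.sinks.hdfs1.hdfs.rollSize = 67108864
agent.sinks.hdfs1.hdfs.rollCount = 0
```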
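The parallel “put” trick from the list above can be as simple as a find/xargs pipeline. A hedged sketch: the HADOOP_CMD indirection is a hypothetical convenience so the function can be dry-run with a stub command; in real use it is just the hadoop binary.

```shell
# Command used to talk to the cluster; override for dry runs
HADOOP_CMD=${HADOOP_CMD:-hadoop}

# Upload every regular file in a local directory to HDFS, running up to
# $workers "hadoop fs -put" commands concurrently (one file per command)
parallel_put() {
  local src_dir=$1 dest_dir=$2 workers=${3:-4}
  find "$src_dir" -maxdepth 1 -type f -print0 |
    xargs -0 -P "$workers" -I {} "$HADOOP_CMD" fs -put {} "$dest_dir"
}

# Example (requires a running cluster):
# parallel_put /data/incoming /user/hadoop/incoming 8
```

The same function dropped into a cron job, pointed at a landing directory, covers the “scan and put as files show up” pattern as well.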
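For the Sqoop route, a basic import looks like this (Sqoop 1 syntax). The connection string, username and table name are made-up placeholders; -P prompts for the password so it doesn't end up in your shell history.

```shell
# Pull an Oracle table into HDFS with 4 parallel map tasks
# (host, service, user and table below are illustrative)
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
  --username scott -P \
  --table EMPLOYEES \
  --target-dir /user/hadoop/employees \
  --num-mappers 4
```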
In my presentation, I mentioned Perl as a method of ETLing data from MySQL to Hadoop. It was mentioned because I use Perl whenever possible, not necessarily because it is recommended. Most of the time, using Sqoop will be a better idea; pre-processing the data in the DB can also be a good idea; and if you need custom code (like I did) and can program in Java (which I hate doing), you should probably write proper MR code.
In any case, I used an interface called Hadoop Streaming, which allows running any shell command that reads and writes key-value pairs as mappers and reducers. I wrote my own Perl scripts to get data from the DB and do some pre-processing before dumping it into Hadoop.
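A Streaming job submission looks roughly like this. A hedged sketch: the jar path assumes a Hadoop 1.x layout (it varies by install), and mapper.pl / reducer.pl stand in for your own scripts, which read lines on stdin and emit tab-separated key/value pairs on stdout.

```shell
# Run arbitrary scripts as mapper and reducer via Hadoop Streaming;
# -file ships the scripts out to the cluster nodes with the job
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /user/hadoop/raw \
  -output /user/hadoop/processed \
  -mapper mapper.pl \
  -reducer reducer.pl \
  -file mapper.pl \
  -file reducer.pl
```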
If you have other tips on getting started with Hadoop or for loading data – the comments are all yours :)