Liveblogging: Leveraging Hadoop to Augment MySQL Deployments by Sarah Sproehnle, Cloudera, Hadoop Instructor


Sarah Sproehnle is an excellent speaker, so I was excited to see her speak about Hadoop, which I know very little about… so here’s the liveblogging:

Hadoop is an “open source framework for storing and processing data on a cluster of computers.”  Processing is key — analysis of data.

Built-in HA and scalability.  Designed for batch processing and optimized for streaming reads.  Not meant for queries that return in milliseconds; it’s meant for queries that run in minutes, but in those minutes it processes terabytes of data.

Why consider Hadoop?  LOTS of data (multi-terabytes) – Facebook has roughly 15 petabytes in their Hadoop cluster.  SANs can hold a lot of data, but not process it all.

Traditional ETL separates storage from processing — it moves data in order to process it, because processing is a bottleneck.

Yahoo Mail does spam detection using Hadoop.

Facebook and LinkedIn suggest friends/connections using Hadoop.  eHarmony uses Hadoop to help match people.

How is Hadoop different from MySQL?  It’s not a substitute.

Data capacity:  Hadoop petabytes; MySQL terabytes.

Data per query:  Hadoop petabytes; MySQL gigabytes.

Read/write:  Hadoop sequential scans, append-only; MySQL random read/write.

Query language:  Hadoop Java MapReduce, scripting languages, HiveQL; MySQL SQL and UDFs.

(Sarah mentioned that Java MapReduce is not the easiest way to process data, hence the creation of HiveQL.)

Transactions:  Hadoop no; MySQL yes.

Indexes:  Hadoop no (data is always scanned in parallel); MySQL yes.

Latency:  Hadoop minutes to hours (if that’s too long, add nodes); MySQL sub-second (hopefully).

Data structure:  Hadoop can be structured or unstructured; MySQL is structured.

How Hadoop works:

Hadoop spreads data onto tens, hundreds, or thousands of machines using the Hadoop Distributed File System (HDFS).  You just put a file into Hadoop and it decides where to store it.  Built-in redundancy (replication) gives fault-tolerance; by default data is replicated 3x.

MapReduce is based on “map” and “reduce” operations: basically divide and conquer, splitting tasks and running them in parallel.  Hadoop is an open-source implementation based on Google’s MapReduce whitepaper; Google’s own implementation is proprietary, not open source.
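To make the divide-and-conquer idea concrete, here is a minimal word-count sketch in Python. It is only an illustration of the map/shuffle/reduce phases, not Hadoop's actual Java API; the function names and sample input are made up:

```python
from collections import defaultdict

def map_phase(lines):
    # "Map": emit a (key, value) pair for every word in every input line.
    # In Hadoop, many mappers would each do this on their own slice of the data.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # The "shuffle" groups pairs by key; "reduce" then sums each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(map_phase(lines))
print(counts["hadoop"])  # prints 2
```

On a real cluster each phase runs in parallel across many machines, which is why the same pattern scales from two lines of text to terabytes.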

Problem with MapReduce:

The developer has to worry about job setup, InputFormat and OutputFormat, and may have to write custom key/value classes, not just the analysis and processing logic.

Because the data is schema-less it can be hard to process; the developer needs to know what the data looks like.

Often requires several MapReduce passes to accomplish a final result.

Facebook as an example:

TB of data per day, coming from all over: MySQL, Oracle, Apache web logs.  Using Hadoop for log processing, text mining, doc indexing, BI/analytics.  They wanted a command-line interface, a schema for the data, and an easier dev environment that supported ad hoc queries (not Java classes for everything).

Out of Facebook’s desires came Hive, often called “a data warehouse for Hadoop”.  It provides HiveQL, an SQL-like language for queries, and a metastore that stores schema information (typically on a MySQL server).  Hive translates HiveQL into MapReduce code.  As of about a year ago, Facebook was running over 7,500 Hive queries per day and scanning more than 100 TB per day.

Differences between HiveQL and SQL:

SQL: (more or less) the SQL-92 standard; transactions and indexes.

HiveQL: a subset of SQL-92 plus Hive-specific extensions; no transactions, no indexes.


Get started now, it’s easy!

1) Set up Hadoop (NameNode, DataNodes, JobTracker, TaskTrackers).  Or go to Cloudera’s website and download a free VM to play with it.

2) Create Hive tables, similar to MySQL’s CREATE TABLE except that Hive uses STRING instead of TEXT.

3) Load data in:  LOAD DATA INPATH '/path-in-hdfs' INTO TABLE t;



Under the hood, each part of a query gets converted to a map part and a reduce part.

WHERE acts as a filter, so it becomes a map task; the map scans the data.

GROUP/ORDER BY — reduce task

JOIN — map or reduce depending on optimizer

EXPLAIN shows the MapReduce plans.  Very different from MySQL EXPLAIN plan.
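A rough Python sketch of that decomposition (not Hive's actual generated code; the table, column names, and predicate are invented for illustration) for a query like SELECT country, SUM(clicks) … WHERE clicks > 2 GROUP BY country:

```python
from collections import defaultdict

rows = [
    {"country": "us", "clicks": 3},
    {"country": "ca", "clicks": 5},
    {"country": "us", "clicks": 2},
]

# WHERE clicks > 2  ->  a map task: each mapper scans its slice and
# filters rows, emitting (key, value) pairs for the survivors.
mapped = [(r["country"], r["clicks"]) for r in rows if r["clicks"] > 2]

# GROUP BY country with SUM(clicks)  ->  a reduce task: the shuffle
# routes each key to one reducer, which aggregates its group.
totals = defaultdict(int)
for country, clicks in mapped:
    totals[country] += clicks

print(dict(totals))  # prints {'us': 3, 'ca': 5}
```

The same two-phase shape is why the WHERE clause parallelizes trivially across the cluster while the GROUP BY requires moving data between machines.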

Extra features:

Partitioning (horizontally)

UDFs/UDAFs (written in Java)

Support for sampling: instead of going through all the data, pick out a random sample of, say, 10 MB.

JDBC and ODBC interfaces, so BI vendor support can follow very soon.

Integration with HBase (Hadoop NoSQL solution)


It gets even easier with Sqoop (pronounced “scoop”), which is short for SQL-to-Hadoop.  It’s an open-source tool from Cloudera that does parallel import of data from many databases into Hadoop, and parallel export of data from Hadoop back to databases.


sqoop import --connect jdbc:mysql://hostname/dbname --table employees --hive-import --fields-terminated-by '\t' --lines-terminated-by '\n'

You can choose rows with --where, and can also choose which columns to import.  Incremental loads can key off a TIMESTAMP or AUTO_INCREMENT column, etc.

