Deploying Cloudera Impala on EC2 with Example Live Demo

A little while ago I blogged about (and open-sourced) an Impala-powered soccer visualization demo, designed to demonstrate just how responsive Impala queries can be. Since not everyone has the time or resources to run the project themselves, we’ve decided to host it on an EC2 instance. You can try the visualization; we’ve also opened up the Impala web interface, where you can see query profiles and performance numbers, and Hue (username and password are both ‘test’), where you can run your own queries on the dataset. Note: The demo is now offline; contact us if you’re interested in seeing it.

Deploying Impala on EC2

While there are many tools to deploy a Hadoop cluster on EC2 – like Apache Whirr, or even Cloudera Manager – I only wanted to use a single instance for the entire cluster. Starting from the base Ubuntu (Precise) image, I added Cloudera’s apt repos and installed the single-node configuration. Impala doesn’t support using Derby for the Hive metastore, so I installed MySQL and configured Hive to use it instead. Then I installed Impala using Cloudera’s instructions. Impala and all of the Hadoop daemons are running comfortably on one M3 2XLarge EC2 instance. Given our modest demands, this may actually be overkill; I over-specced the server while trying to track down a (now-obvious) performance problem involving short-circuit reads.

Short-Circuit Reads

On the Pythian cluster, we could consistently return a query in around half a second. On EC2, queries took closer to 5 seconds. A bit of investigation showed that in getting the server up and running, I had disabled short-circuit reads, which slows down Impala considerably. While Impala isn’t supposed to start without short-circuit reads, it only throws an error if you have short-circuit reads enabled but misconfigured. If short-circuit reads are turned off in the hdfs-site configuration, it will happily start and run very slowly. With the default DEB install on Ubuntu, the libhadoop library isn’t installed in a directory on the dynamic linker’s search path, which prevents short-circuit reads from working. The easiest solution was to create symlinks for libhadoop in /usr/lib/, then run ldconfig:

ln -s /usr/lib/hadoop/lib/native/ /usr/lib/
ln -s /usr/lib/hadoop/lib/native/ /usr/lib/
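The exact library file names depend on the installed package version, so a more general form of the symlink step might look like the sketch below. It assumes the shared objects match the pattern libhadoop.so* (an assumption, not something stated in the post), and link_libhadoop is a hypothetical helper name.

```shell
# Hypothetical helper: link every libhadoop shared object found in a native
# library directory into a directory on the dynamic linker's search path.
# Assumes the files are named libhadoop.so* (not confirmed by the post).
link_libhadoop() {
  src_dir="$1"    # e.g. /usr/lib/hadoop/lib/native
  dest_dir="$2"   # e.g. /usr/lib
  for lib in "$src_dir"/libhadoop.so*; do
    [ -e "$lib" ] || continue     # nothing matched the pattern; skip
    ln -sf "$lib" "$dest_dir/"    # create (or replace) the symlink
  done
}

# Usage on the instance, as root, followed by a linker cache refresh:
#   link_libhadoop /usr/lib/hadoop/lib/native /usr/lib
#   ldconfig
```

Taking both directories as arguments keeps the helper easy to try against a scratch directory before touching /usr/lib.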

To confirm whether your cluster has short-circuit reads enabled, you can visit the Impala web interface (by default, port 25000 on any system running impalad) and click on the ‘/varz’ tab. Search for the short-circuit read setting – it should be set to ‘true’.


With libhadoop installed and short-circuit reads enabled, the next-largest performance improvement came from partitioning the table on the sensor id. Since all of our web interface queries filter by sensor id, Impala can perform some serious partition elimination: looking at the query profiles, partitioning the table reduced the amount of data read from HDFS from 4GB to 50MB, and the query time from 2.6s to 130ms. The README on GitHub has instructions on how to use dynamic partitioning in Hive to quickly partition the soccer data; these steps can be generalized to any dataset.
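As a rough illustration of dynamic partitioning (not the README's actual steps), the sketch below prints a HiveQL script that copies a flat table into one partitioned by a chosen column. The helper name partition_hql and all table and column names are hypothetical. It assumes the destination table already exists and was created with a matching PARTITIONED BY clause; Hive expects the dynamic partition column to come last in the SELECT list.

```shell
# Hypothetical helper: print a HiveQL script that repartitions a table.
# All table and column names passed in are illustrative assumptions.
partition_hql() {
  src_table="$1"    # existing unpartitioned table
  dest_table="$2"   # existing table, created with PARTITIONED BY (<part_col> ...)
  part_col="$3"     # partition column, e.g. the sensor id
  data_cols="$4"    # comma-separated non-partition columns
  cat <<EOF
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE ${dest_table} PARTITION (${part_col})
SELECT ${data_cols}, ${part_col} FROM ${src_table};
EOF
}

# Usage (names are placeholders):
#   partition_hql soccer soccer_by_sensor sid "ts, x, y" > partition.hql
#   hive -f partition.hql
```

Generating the script first makes it easy to review the statements before handing them to Hive.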

4 Comments

Maris Elsins
April 5, 2013 3:08 am

That’s a nice Visualisation.
What was the score in the end?
From what I see the ball1 and ball2 spent a lot of time in the center of the field :)

Alex Gorbachev
April 12, 2013 4:38 pm

It’s heatmap.js. App sources are on GitHub. The balls were not in play all the time. Some balls were used only for a bit, so I assume they were in the middle of the field before the game restarted after a goal (so they stayed there longer). Change the time sliders and you’ll see. Also, the original data source has videos from above the field, plus a record of what happened when, like the score, ball possession, etc. See the DEBS challenge source. :)

Latest data Industry news round up, Log Buffer #314
April 5, 2013 7:27 am

[…] Gardner is deploying Cloudera Impala on EC2 with Example Live […]

For a Limited Time: Live Impala Demo on EC2
April 6, 2013 10:52 am

[…] Alan Gardner from Pythian has deployed the app for a limited time on Amazon EC2. We republish his original post […]

