Last week I was at the 2014 Hadoop Summit in San Jose, trying to keep abreast of the ever-changing Apache landscape: what projects are up-and-coming, what projects are ready for production, and most importantly which projects can solve problems for our clients. It was also a great chance to hear about real, production deployments – both in the halls and in some of the presentations. And Doug Cutting and Arun Murthy had an excellent panel about the future of the platform, both from a business and technology perspective.
Hadoop Security was an incredibly popular topic this year, reflecting the fact that Hadoop deployments are growing up and fulfilling their stated purpose: to consolidate organizations’ data and make it visible to everyone. “Visible to everyone” is a noble goal, but in reality PII and other sensitive information needs to be guarded, and access needs to be audited and limited. Apache Knox makes it possible to audit all user interactions with the cluster, tying user access to an existing identity management system. Cloudera Sentry provides fine-grained user permissions for Hive, HBase and Search, similar to existing RDBMSes. During the conference Cloudera also announced their acquisition of Gazzang, who make a platform for key management and data encryption at rest in Hadoop (similar to Intel’s Project Rhino).
Booz Allen Hamilton also put on an excellent presentation about a real client system storing sensitive data on EMR using ephemeral storage – I strongly recommend looking at this as an example of what’s possible now, and also how difficult it is to implement cell or row-level security policies in Hadoop.
YARN is the new “data operating system” responsible for all computations running on your clutster. It handles container placement and resource allocation to allow multiple frameworks like MapReduce, Tez and Spark to co0exist on the same nodes without competing for resources. Applications can also be written to run directly on YARN, opening up the Hadoop cluster to support more general purpose tasks (Yahoo is apparently encoding video on YARN with low latency, although details were scarce), and making it easier for developers to provide distributed, fault-tolerant applications. Early adopters have been using YARN in production for a while, but now every major vendor is including it in their distribution, and features like the High-Availability ResourceManager (or “HARMful YARN”) are available.
Many talks from different sources (Twitter, Yahoo, HortonWorks) focused on different aspects of YARN: new features, production deployment hints, and the general architecture.
I thought with Spark becoming so popular and widely supported – in every major distribution – Spark Streaming would supplant Storm as the leading complex event processing engine. Visiting Hadoop Summit, however, it seems like Storm has plenty of momentum. It’s been ported to YARN to work seamlessly within your cluster, and multiple presentations demonstrated real-world systems running on Storm right now, as well as integrations with other technologies like R and Pig. Spark overall had nearly as many presentations, but these were more technical and theoretical: it might be another year before we see many presentations about Spark and Spark Streaming applications being deployed at scale.
Apache Falcon had two talks this summit, and it’s been incubating since last year. It caught my attention as an open-source project which is aiming to supplant existing proprietary tools. Falcon allows you to declaratively define ETL flows in terms of sources, sinks and transformations, and schedule them on a regular basis. Flows are monitored and idempotent, and late data can be handled according to user-defined rules. Right now the emphasis is on power: an XML config coordinates Hive, Pig, Oozie and distcp, but more user-friendly features like libraries of transformations and a web UI for visualizing flows will bring Falcon closer to the feature set of commerical ETL tools.
SQL on Hadoop
This space has settled down a lot since last year, when Stinger and Impala seemed to invade every track and time slot. Yahoo still put on a number of Hive-on-Tez architecture and performance reviews, and less established projects like Apache Tajo (incubating), BlinkDB, Actian’s Vortex and Facebook’s Presto made appearances. Even though performance has been increasing year over year, SQL-on-Hadoop engines are still wildly variable in their features and performance, and there aren’t any clear winners right now – new entrants still have a chance to make their mark. If you’re curious about choosing a SQL-on-Hadoop engine, check out my presentation this year surveying the landscape.
More to Watch
There were so many great presentations, it was hard to choose for every time slot. Once the videos are released I also recommend watching:
- Ted Dunning’s stunningly simple anomaly detection
- Jagane Sundar (of WanDisco) explaining Paxos
- Koji Noguchi (of Yahoo) with tips about stabilising your Hadoop clusters
Were you at Hadoop Summit? What were your favourite presentations and what trends did you notice?
Leave a Reply