When I heard that Intel had announced their own Hadoop distribution, my first thought was “Why would they do that?” This blog post is an attempt to explore why anyone would need their own Hadoop distribution, what Intel can gain by having their own, and who is likely to adopt Intel’s distribution.
Why does anyone need a Hadoop distribution? Hadoop is open source, and it would make sense for Red Hat and Canonical to package Hadoop and add it to their own distributions – just as they do with MySQL and other open source applications. Instead, we have Cloudera, Hortonworks, MapR, EMC, Intel and probably many more, each with their own Hadoop distribution.
When you try to pick a Hadoop distribution, the first thing you’ll notice is that each one has a slightly different set of components. Cloudera includes Flume and Sqoop, which Hortonworks doesn’t. Hortonworks includes Ambari and a platform by Talend. Having a distribution gives companies a chance to define Hadoop. This matters a lot to new adopters and especially to larger companies – they look at the distribution as an indication of which components are safe to use, and are reluctant to add components outside their distribution. As an example, Oozie and Azkaban are similar tools that perform the same task of managing jobs in Hadoop. In my experience, Oozie is far more popular, not because it’s a superior tool, but because it is part of the popular Cloudera distribution.
There’s a reason Hadoop users prefer to use a distribution as a whole rather than mix and match toolchains: with so many components in a production Hadoop system, matching versions so that all the tools work well together is a challenging task. Companies that release their own distribution pick compatible versions, test a lot, and furiously patch to make sure all the components work as a whole. This is somewhat similar to the way Oracle will announce that 11g is supported on RHEL5 but not RHEL6, except much more so. Of course, Red Hat could do the same, as they do for all software in their Linux distribution, but as you can see, they don’t.
When users choose a well-known distribution they don’t just get a well-chosen and tested mix of components. They also get the option of purchasing support for this distribution. That’s the main benefit for companies selling their own Hadoop distribution: you go through all the trouble of picking components and testing them, so that you are well positioned to provide support for them. Other companies can of course sell support for the same distribution – Pythian will happily support any Hadoop distribution you choose. But the owner of the distribution has some advantage, since it is much more difficult for third-party support providers to offer bug fixes in Hadoop code.
Of course, all this doesn’t apply to Intel, who shows no intention of selling support.
So why would Intel need their own distribution?
Let’s start from basics: Intel sells CPUs. That’s their main line of business, but they also write software. For example, Intel’s C compiler is first rate. I used to love working with it. Intel wrote their own compiler so that executables generated with it will always use the best Intel features. This means that popular software runs faster on Intel CPUs, because their performance features are used even when developers don’t know about them (the Oracle optimizer attempts to do the same, but with less success).
How does this apply to Hadoop? Clearly Intel noticed that Hadoop clusters tend to have lots of CPUs, and they are interested in making sure that these CPUs are always Intel, possibly by making sure that Hadoop runs faster on Intel CPUs.
Let’s look at Intel’s blog post on the topic: http://blogs.intel.com/
“The Intel Distribution for Apache Hadoop software is a 100% open source software product that delivers Hardware enhanced performance and security (via features like Intel® AES-NI™ and SSE to accelerate encryption, decryption, and compression operation by up to 14 times).”
“With this distribution Intel is contributing to a number of open source projects relevant to big data such as enabling Hadoop and HDFS to fully utilize the advanced features of the Xeon™ processor, Intel SSD, and Intel 10GbE networking.”
“Intel is contributing enhancements to enable granular access control and demand driven replication in Apache HBase to enhance security and scalability, optimizations to Apache Hive to enable federated queries and reduce latency. ”
Intel is doing for Hadoop the same thing it did for C compilers – make sure they use the best hardware enhancements available in the CPUs and other hardware components available from Intel. The nice thing is that the enhancements are available as open source – Intel doesn’t care that the software is free, since they are selling the hardware!
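To see whether the nodes in a cluster actually expose the CPU features mentioned in the quote above, you can look at the flags the Linux kernel reports. The following is a minimal sketch, not anything taken from Intel’s distribution: it assumes an x86 Linux host where /proc/cpuinfo has a "flags" line, and the helper names parse_cpu_flags and report are made up for illustration. On Linux, AES-NI shows up as the "aes" flag.

```python
from pathlib import Path

# Flag names as they appear in the "flags" line of /proc/cpuinfo on x86 Linux.
# "aes" is how the kernel reports AES-NI; the SSE variants are listed separately.
FLAGS_OF_INTEREST = ("aes", "sse2", "sse4_1", "sse4_2")


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No flags line found (for example, not an x86 Linux host).
    return set()


def report(cpuinfo_text: str) -> dict[str, bool]:
    """Map each flag of interest to whether the CPU advertises it."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in FLAGS_OF_INTEREST}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # assumption: Linux only
    text = path.read_text() if path.exists() else ""
    for name, present in report(text).items():
        print(f"{name}: {'yes' if present else 'no'}")
```

A flag being present only means the hardware supports the instruction set; whether a given Hadoop build uses it depends on the software, which is exactly the gap Intel says it wants to close.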
And since it’s open source, we can take a peek at Intel’s GitHub repository: https://github.
What can we find there? We have Project Rhino (https://github.com/intel-
Improved Hadoop security is on the top of the list of things the enterprise needs from Hadoop (http://tdwi.org/Blogs/
None of those has been officially released yet, and I didn’t try to compile and run the code, so I can’t say much about what is actually delivered. Perhaps someone did and can comment. But I did notice another interesting detail. The Project Rhino README lists all the Hadoop components that Intel intends to include in its unified and integrated security model:
- Core: A set of shared libraries
- HDFS: The Hadoop filesystem
- MapReduce: Parallel computation framework
- ZooKeeper: Configuration management and coordination
- HBase: Column-oriented database on HDFS
- Hive: Data warehouse on HDFS with SQL-like access
- Pig: Higher-level programming language for Hadoop computations
- Oozie: Orchestration and workflow management
- Mahout: A library of machine learning and data mining algorithms
- Flume: Collection and import of log and event data
- Sqoop: Imports data from relational databases
Does this look familiar? That’s because it’s more or less identical to Cloudera’s Hadoop distribution. Why did Intel choose to use CDH? Possibly because of its focus on the enterprise toolchain – those are the tools you’ll need to build an ETL pipeline and a data-science practice on Hadoop. If Intel’s unified solution doesn’t include these tools, getting the enterprise adoption they are looking for will be a much bigger challenge. However, it does raise new questions: Will Intel offer support for the distribution, or will they leave it to Cloudera, who already supports all the components? And can you have a “unified security solution” that leaves Hortonworks and MapR completely out of the plan?
It’s far too early to tell where this will all go, but so far Intel has made interesting decisions that make me look forward to the day when they have more to download than just a PDF. If you have any thoughts on where this is all going, I’d love to read your comments too.