Essential Hadoop Concepts for Systems Administrators

Jul 2, 2014 / By Kevin Pedersen


Of course, everyone knows Hadoop as the solution to Big Data. What’s the problem with Big Data? Well, mostly it’s just that Big Data is too big to access and process in a timely fashion on a conventional enterprise system. Even a really large, optimally tuned, enterprise-class database system has conventional limits in terms of its maximum I/O, and there is a scale of data that outstrips this model and requires parallelism at a system level to make it accessible. While Hadoop is associated in many ways with advanced transaction processing pipelines, analytics and data science, these applications sit on top of a much simpler paradigm: we can spread our data across a cluster and provision I/O and processing capacity in a tunable ratio along with it. This tunability is a direct function of the hardware specification of the cluster nodes, since each node contributes processing, I/O and storage in a fixed ratio. At this level, we don’t need Java software architects and data scientists to take advantage of Hadoop. We’re solving a fundamental infrastructure engineering issue, which is “how can we scale our I/O and processing capability along with our storage capacity?” In other words, how can we access our data?

The Hadoop ecosystem at its core is simply a set of cooperating Java processes communicating over a network. The base system services, such as the data node (HDFS) and task tracker (MapReduce), run on each node in the cluster, register with an associated service master, and execute assigned tasks in parallel that would normally be localized on a single system (such as reading some data from disk and piping it to an application or script). The result of this approach is a loosely coupled system that scales in a very linear fashion. In real life, the service masters (famously, the NameNode and JobTracker) are single points of failure and potential performance bottlenecks at very large scales, but much has been done to address these shortcomings. In principle, Hadoop uses the MapReduce algorithm to extend parallel execution from a single computer to an effectively unlimited number of networked computers.

MapReduce is conceptually a very simple system. Here’s how it works. Given a large data set (usually serialized), broken into blocks (as for any filesystem) and spread among the HDFS cluster nodes, feed each record in a block to STDIN of a local script, command or application, and collect the records from STDOUT that are emitted. This is the “map” in MapReduce. Next, sort each record by key (usually just the first field in a tab-delimited record emitted by the mapper, but compound keys are easily specified). This is accomplished by fetching records matching each specific key over the network to a specific cluster node, and accounts for the majority of network I/O during a MapReduce job. Finally, process the sorted record sets by feeding the ordered records to STDIN of a second script, command or application, collecting the result from STDOUT and writing them back to HDFS. This is the “reduce” in MapReduce. The reduce phase is optional, and usually takes care of any aggregations such as sums, averages and record counts. We can just as easily pipe our sorted map output straight to HDFS.
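The whole map → sort → reduce flow can be sketched on a single machine with an ordinary pipeline. Here's a toy word count in that shape; the mapper and reducer are hypothetical one-liners for illustration, not anything shipped with Hadoop:

```shell
#!/bin/sh
# Toy word count, mimicking the MapReduce phases with local pipes.
# Map: emit one "word<TAB>1" record per input word on STDOUT.
# Sort: group records by key, which is what the shuffle phase does over the network.
# Reduce: sum the counts for each key.
printf 'the quick fox\nthe lazy dog\n' |
  tr -s ' ' '\n' |                     # map: split into one word per line
  awk '{ print $0 "\t1" }' |           # map: attach a count of 1 to each word
  sort |                               # shuffle/sort: bring identical keys together
  awk -F'\t' '
    $1 != prev { if (prev != "") print prev "\t" sum; prev = $1; sum = 0 }
    { sum += $2 }
    END { if (prev != "") print prev "\t" sum }'   # reduce: sum per key
```

On a cluster, the two `awk` stages would run as the mapper and reducer on whichever nodes hold the data blocks, and `sort` becomes the framework's shuffle.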

Any Linux or Unix systems administrator will immediately recognize that using STDIO to pass data means that we can plug any piece of code into the stream that reads and writes to STDIO… which is pretty much everything! To be clear on this point, Java development experience is not required. We can take advantage of Linux pipelines to operate on very large amounts of data. We can use ‘grep’ as a mapper. We can use the same pipelines and commands that we would use on a single system to filter and process data that we’ve stored across the cluster. For example,

grep -i "${RECORD_FILTER}" | cut -f2 | cut -d'=' -f2 | tr '[:upper:]' '[:lower:]'
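To see what a mapper like that emits, run it over a sample record locally. The record layout here is invented for illustration: tab-separated fields with a `key=VALUE` pair in the second column.

```shell
#!/bin/sh
# Hypothetical log record: timestamp<TAB>status=WARN<TAB>message
RECORD_FILTER='status'
printf '2014-07-02T10:15:00\tstatus=WARN\tdisk almost full\n' |
  grep -i "${RECORD_FILTER}" |       # keep only records matching the filter
  cut -f2 |                          # take the second tab-delimited field: status=WARN
  cut -d'=' -f2 |                    # take the value after '=': WARN
  tr '[:upper:]' '[:lower:]'         # normalize case, emitting: warn
```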

We can use Python, Perl and any other languages with support configured on the task tracker systems in the cluster, as long as our scripts and applications read and write to STDIO. To execute these types of jobs, we use the Hadoop Streaming jar to wrap the script and submit it to the cluster for processing.
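Submitting such a pipeline to the cluster looks roughly like this. The jar location and the HDFS paths are placeholders and vary by distribution; treat this as a sketch, not a copy-paste command:

```shell
# Sketch of a Hadoop Streaming job submission; all paths are placeholders.
# -input and -output are HDFS paths; the output directory must not already exist.
# Any command that reads STDIN and writes STDOUT can serve as mapper or reducer.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /data/logs/2014-07-01 \
  -output /data/results/filtered \
  -mapper 'grep -i error' \
  -reducer 'uniq -c'
```

For a map-only job (piping sorted map output straight back to HDFS, as described above), the reducer can be disabled with `-numReduceTasks 0`.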

What does this mean for us enterprise sysadmins? Let’s look at a simple, high level example. I have centralized my company’s log data by writing it to a conventional enterprise storage system. There’s lots of data and lots of people want access to it, including operations teams, engineering teams, business intelligence and marketing analysts, developers and others. These folks need to search, filter, transform and remodel the data to shake out the information they’re looking for. Conventionally, I can scale up from here by copying my environment and managing two storage systems. Then 4. Then 8. We must share and mount the storage on each system that requires access, organize the data across multiple appliances and manage access controls on multiple systems. There are many business uses for the data, and so we have many people with simultaneous access requirements, and they’re probably using up each appliance’s limited I/O with read requests. In addition, we don’t have the processor available… we’re just serving the data at this point, and business departments are often providing their own processing platforms from a limited budget.

Hadoop solves this problem of scale above the limits of conventional enterprise storage beautifully. It’s a single system to manage that scales in a practical way to extraordinary capacities. But the real value is not the raw storage capacity or advanced algorithms and data analytics available for the platform… it’s about scaling our I/O and processing capabilities to provide accessibility to the data we’re storing, thereby increasing our ability to leverage it for the benefit of our business. The details of how we exploit our data are what we often leave for the data scientists to figure out, but every administrator should know that the basic framework and inherent advantages of Hadoop can be leveraged with the commands and scripting tools that we’re already familiar with.
