The Hadoop and the Hare

May 12, 2014 / By Alan Gardner

Tags: ,

I speak to a lot of people who are terribly concerned with “real-time”. Can data get into the warehouse in real-time? Can we record everything the user does on the site in real-time? Real-time is a magic phrase, because it really means “so fast that nobody cares”. In a business sense, it usually means that nobody has really thought about the latency requirements, so they’re going to push for the lowest latency possible. This is a dangerous habit, and you need to fight against it at every turn. Some of the best, cheapest, most scalable solutions will only be available when you accept that you can wait for an hour (or even hours!) to see some results.

Case in point, a client asking for a NoSQL store to hold user event data. It would have to handle millions of concurrent users, all generating dozens of events within minutes. They were (rightly) concerned about write performance and scaling this NoSQL store horizontally to handle the volume of writes. When asked why they needed the events, they didn’t know: they wanted to do exploratory analytics and see if there were interesting metrics to be mined. I proposed dumping the events on a queue – Kafka, Kinesis – or just writing logs for Flume to pick up. Rather than asking a data store to make all of these messages available for querying immediately, do things asynchronously, run some batch jobs to process data in Hadoop, and see if interesting data comes out. The Hadoop cluster and the ingestion pipeline is an order of magnitude less expensive and risky than the cool, real-time solution which confers no immediate benefits.

Maybe this client will decide that batch analytics are good enough, and they’ll stick with using Hive or Impala for querying. Maybe they discover a useful metric, or a machine learning technique they want to feed back into their user-facing application. Many specific metrics can be stored as simple counters per user, in a store like Redis, once they’ve been identified as valuable. Machine learning techniques are interesting because (besides cold start situations) they should be stable: the output shouldn’t change dramatically for small amounts of new data, so new information only needs to be considered periodically. In both cases, the requirements should be narrowed down as far as possible before deciding to invest in “real-time”.

In another case, a client presented the classic data warehousing problem: store large volumes of events with fine grain, and slice and dice them in every imaginable way. Once again, the concept of reducing latency came up. How quickly can we get events into the warehouse so people can start querying them? The answer is that we can make anything happen, but it will be needlessly expensive and malperformant. The main goal of the project was to provide historical reporting: there was no indication that clients would want to do the same kind of pivots and filtering on data only from the past minute. The low latency application would be cool and interesting to develop, but it would be much simpler to find which reports  users want to be low-latency, then precompute them and store them in a more appropriate way.

A more appropriate way is Dynamo, Cassandra, or your preferred NoSQL key-value store. Precompute aggregates you know the user wants with very low latency, and store them keyed on time with a very fine grain: you have the benefit of high write throughput here, but at the cost of little query complexity. Once the data is no longer interesting – it’s visible in the warehouse with much more grain along the various dimensions – then drop it from NoSQL.

Starting with a relatively slow, batch platform gives very high flexibility at a low cost, and with little development effort. Once your users – internal or clients – have explored the data set and started complaining about latency, then is the time to build specialized pipelines for those specific use cases. Since Kafka, Kinesis and Flume all support streaming computations as well as batch, these will serve as good connection points for your “real-time” pipeline.

One Response to “The Hadoop and the Hare”

  • Mohsin says:

    Talking of real time processing of data in the same context as NoSQL store and batch processing of Hadoop seems to be the norm requirement by some clients – until the exponential increase of cost in deploying and managing such an infrastructure is presented to them.
    Nice and interesting article. Thanks.

Leave a Reply

  • (will not be published)

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>