Big Data

TechTalk v5.0 – The Age of Big Data with Alex Morrise

Who: Hosted by Blackbird, with a speaking session by Alex Morrise, Chief Data Scientist at Pythian.
What: TechTalk presentation, beer, wine, snacks and Q&A
Where: Blackbird HQ – 712 Tehama Street (corner of 8th and Tehama), San Francisco, CA
When: Thursday, July 31, 2014, from 6:00-8:00 PM
How: RSVP here!
TechTalk v5.0 welcomes to the…

Logging for Slackers

When I’m not working on Big Data infrastructure for clients, I develop a few internal web applications and side projects. It’s very satisfying to write a Django app in an afternoon and throw it on Heroku, but there comes a time when people actually start to use it. They find bugs, they complain about downtime,…

Small Files on MapR-FS

One of the well-known best practices for HDFS is to store data in a few large files rather than in a large number of small ones. There are a few problems related to using many small files, but the ultimate HDFS killer is that the memory consumption on the NameNode is proportional to the number of…

Cloudera Challenge 2014

Yesterday, Cloudera released the score reports for their Data Science Challenge 2014 and I was really ecstatic when I received mine with a “PASS” score! This was a real challenge for me and I had to put a LOT of effort into it, but it paid off in the end! Note: I won’t bother you…

Essential Hadoop Concepts for Systems Administrators

Of course, everyone knows Hadoop as the solution to Big Data. What’s the problem with Big Data? Well, mostly it’s just that Big Data is too big to access and process in a timely fashion on a conventional enterprise system. Even a really large, optimally tuned, enterprise-class database system has conventional limits in terms of…

Ambari Blueprints and One-Touch Hadoop Clusters

For those who aren’t familiar, Apache Ambari is the best open source solution for managing your Hadoop cluster: it’s capable of adding nodes, assigning roles, managing configuration and monitoring cluster health. Ambari is Hortonworks’ counterpart to Cloudera Manager and MapR’s Warden, and it has been steadily improving with every release. As of version 1.5.1, Ambari supports a declarative configuration (called a Blueprint), which makes it easy to automatically create repeatable clusters in the cloud. I’ll give an example of how to use Ambari Blueprints, and compare them with existing one-touch deployment methods for other distributions.
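To give a feel for what "declarative" means here, the following is a minimal sketch in Python that assembles a Blueprint-style document and a matching cluster template as plain dictionaries. The field names (`Blueprints`, `stack_name`, `stack_version`, `host_groups`, `cardinality`, `components`, `blueprint`, `fqdn`) follow the Blueprint JSON layout as I understand it from the Ambari documentation; the helper functions, group names, stack version and host names are hypothetical and only illustrate the shape. Check the documentation for your Ambari release before relying on any of it.

```python
import json


def make_blueprint(stack_name, stack_version, host_groups):
    """Build a Blueprint-style dict.

    host_groups maps a group name to (cardinality, [component names]).
    This helper is hypothetical, not part of Ambari.
    """
    return {
        "Blueprints": {
            "stack_name": stack_name,
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": group,
                "cardinality": str(cardinality),
                "components": [{"name": c} for c in components],
            }
            for group, (cardinality, components) in host_groups.items()
        ],
    }


def make_cluster_template(blueprint_name, hosts_by_group):
    """Map concrete hosts onto the host groups of a named blueprint."""
    return {
        "blueprint": blueprint_name,
        "host_groups": [
            {"name": group, "hosts": [{"fqdn": h} for h in hosts]}
            for group, hosts in hosts_by_group.items()
        ],
    }


# Example values are assumptions chosen for illustration only.
blueprint = make_blueprint(
    "HDP",
    "2.1",
    {
        "master": (1, ["NAMENODE", "RESOURCEMANAGER", "ZOOKEEPER_SERVER"]),
        "worker": (2, ["DATANODE", "NODEMANAGER"]),
    },
)
cluster = make_cluster_template(
    "demo-blueprint",
    {
        "master": ["master1.example.com"],
        "worker": ["worker1.example.com", "worker2.example.com"],
    },
)

print(json.dumps(blueprint, indent=2))
print(json.dumps(cluster, indent=2))
```

The point of the split is that the blueprint describes the roles once, while the cluster template binds real hosts to those roles, so the same blueprint can be reused for every new cluster. As far as I know, both documents are submitted to the Ambari server over its REST API, with the blueprint name supplied in the request URL rather than in the body.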

2014 Hadoop Summit Summary

Last week I was at the 2014 Hadoop Summit in San Jose, trying to keep abreast of the ever-changing Apache landscape: what projects are up-and-coming, what projects are ready for production, and most importantly which projects can solve problems for our clients. It was also a great chance to hear about real, production deployments –…

The Hadoop and the Hare

I speak to a lot of people who are terribly concerned with “real-time”. Can data get into the warehouse in real-time? Can we record everything the user does on the site in real-time? Real-time is a magic phrase, because it really means “so fast that nobody cares”. In a business sense, it usually means that…
