I attended the Strata conference in NYC last week. It’s been many years since I last attended a conference without presenting at it. On one hand, attending only makes for a far more relaxed experience. On the other hand, I missed having random people come up to me and talk about my presentation. I decided to attend because Strata is considered the foremost data science conference, and I was very much interested in what those data scientists are up to.
Good data scientists combine the abilities of business analysts, statisticians and software engineers. They have the skills, the tools and the mandate to mine and analyze all the data the organization collects to deliver valuable insights to the business and data-based features to the customers of the business. In addition, data scientist is considered the hottest job around. Of course, it is data scientists who mined job postings and job moves to come up with this conclusion, so maybe take it with a grain of salt.
Data scientists normally work with very large amounts of data, both structured (the enterprise data warehouse) and unstructured (web server logs, blog posts). Since I’m a big fan of big data, I was very curious to see what those data scientists care about.
So, in no particular order – stuff data scientists like:
Cleaning up data:
Maybe data scientists like cleaning up their data, or maybe they hate it. In any case, it is clear that they do a lot of it.
It turns out that most data in the world is incredibly messy. I’m not even talking about outliers that can influence the analysis, but may or may not be valuable. I’m talking about customers who misspell their names, employees who spell the customer name differently each time they enter the data into the database, several different ways to type the same credit card number (spaces, dashes, or just one long string of digits?), invalid addresses and phone numbers. The list is endless.
So, how do they deal with it? It didn’t sound like there’s a standardized toolkit for data scrubbing in the big data world. Every presenter who dealt with the topic used a bunch of heuristics to figure out if two customer names are similar enough to be the same, and they all used home-grown software to do this. One of the presenters demonstrated the use of Google Refine, which currently looks like the best solution available. However, it sounded like no automated software is capable of doing a good-enough job, and large amounts of human intervention and judgment are mandatory.
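To make the kind of heuristics the presenters described a bit more concrete, here is a minimal sketch in Python. It is not what any presenter actually showed: the function names and the 0.85 similarity threshold are my own assumptions, and it uses only the standard library's `difflib` rather than a real scrubbing tool.

```python
import re
from difflib import SequenceMatcher


def normalize_card_number(raw: str) -> str:
    """Strip spaces and dashes so every spelling of a card number compares equal."""
    return re.sub(r"[\s-]", "", raw)


def normalize_name(raw: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", raw.lower())
    return " ".join(cleaned.split())


def names_probably_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Heuristic only: call two names the same if their similarity ratio is high.

    The threshold is an assumption for illustration; real projects tune it
    against their own data and still need a human to review borderline cases.
    """
    ratio = SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
    return ratio >= threshold
```

With this sketch, `normalize_card_number("4111 1111-1111 1111")` and `normalize_card_number("4111111111111111")` come out identical, and `names_probably_match("Jon Smith", "John Smith")` is true. It will also happily match names that belong to different people, which is exactly why the human judgment mentioned above is still needed.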
What is important to remember is that dirty data is the number one difficulty in most data analysis projects, and the number one cause of failure. So it’s important to tackle this early in the game.
Entity resolution:
The really ugly relative of data scrubbing.
In the last year, I registered for two O’Reilly conferences – I registered for the MySQL conference as Gwen Shapira, and for Strata as Chen Shapira. How can O’Reilly figure out that both records refer to the same real-life person? They both have the same phone number and email address. LinkedIn has the same problem – some of our employees have “The Pythian Group” as their company, others simply say “Pythian”. LinkedIn really want to know whether we all work for the same company or not, and they try to figure it out by comparing our networks.
Entity resolution is usually defined as mapping your data records to canonical real-world entities. As you can see, it isn’t easy, and requires a lot of heuristics. But, the more data companies have, the easier it becomes to figure it out. LinkedIn and Google have an easier time resolving their records to real-world entities because they can apply massive amounts of data from many different collection methods to the problem. O’Reilly may never figure out that Chen and Gwen are the same person.
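As a rough illustration of the "same phone number and email address" idea, the sketch below groups records that share either identifier. The record layout, the `resolve_entities` name and the phone and email values are all made up for the example; real entity resolution systems combine many more signals and fuzzier matching than this.

```python
from collections import defaultdict


def resolve_entities(records):
    """Group record ids that share a phone number or an email address.

    Illustrative only: exact matching on two fields, using union-find to
    merge records that are linked through any shared value.
    """
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # (field, normalized value) -> first record id that had it
    for r in records:
        for field in ("phone", "email"):
            value = r.get(field)
            if not value:
                continue
            key = (field, value.strip().lower())
            if key in first_seen:
                union(r["id"], first_seen[key])
            else:
                first_seen[key] = r["id"]

    clusters = defaultdict(list)
    for r in records:
        clusters[find(r["id"])].append(r["id"])
    return sorted(sorted(ids) for ids in clusters.values())


# Placeholder data: the phone numbers and addresses are invented.
records = [
    {"id": 1, "name": "Gwen Shapira", "phone": "555-0100", "email": "shapira@example.com"},
    {"id": 2, "name": "Chen Shapira", "phone": "555-0100", "email": "Shapira@example.com"},
    {"id": 3, "name": "Someone Else", "phone": "555-0199", "email": "else@example.com"},
]
clusters = resolve_entities(records)
```

Here records 1 and 2 end up in one cluster and record 3 in another, even though the names differ. The catch is that a shared office phone number would merge unrelated people just as easily, which is where the extra data that LinkedIn and Google have becomes valuable.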
MapReduce is hard:
Hadoop, the #1 big-data analysis platform, uses MapReduce as the main paradigm for data analysis. MapReduce lets you write data analysis code that is easy to parallelize over large clusters. This is all cool, but it turns out that writing MapReduce code is not as easy as writing SQL. And the businesses want their SQL back, or at least they want their BI tools to work. There are three popular solutions to this issue:
- Use a big-data tool that isn’t Hadoop.
- Feed data from Hadoop into your favorite RDBMS.
- Get tools that run on top of Hadoop and make it somewhat more friendly.
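For a sense of why MapReduce feels heavier than SQL, here is a toy, single-process sketch of the paradigm in plain Python. It is not Hadoop's API: the function names and the log format are assumptions for illustration, and a real job would run the map and reduce steps on many machines. The whole thing computes what `SELECT url, COUNT(*) ... GROUP BY url` would express in one line of SQL.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (url, 1) for one web server log line.

    Assumed log format for this example: "<ip> <url> <status>".
    """
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on a list of log lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


log_lines = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /about.html 200",
    "10.0.0.3 /index.html 404",
]
hits = run_job(log_lines)
```

The map and reduce functions are independent per line and per key, which is what makes the work easy to spread over a cluster. The price is that even a simple count has to be broken into these explicit phases by hand.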
I’ve seen LexisNexis and Splunk taking the first approach. Splunk is very popular in the application-log analysis market, and the company is trying to leverage this success to become a general unstructured-data crunching solution. LexisNexis are a legal research firm who are open-sourcing and selling the database they developed for their internal use. Both have impressive products, but I think it’s time to admit – Hadoop is here to stay.
The second and third approaches are more popular and sometimes even combined. EMC integrated their Hadoop machine with their Greenplum RDBMS and they added analytics tools integration on top. Meanwhile, Ben Werther, formerly head of Greenplum, is now at Platfora. Platfora are building a BI and analytics system that will support Hadoop and are worth watching. Hadapt, Daniel Abadi’s company, is building a better Hadoop by adding, among many other improvements, SQL support. Karmasphere and Datameer are also in the business of providing analytics on top of Hadoop, and Informatica added Hadoop to the list of technologies supported by its ETL and MDM solutions.
Datastax are attacking the whole problem from a new angle and integrating Hadoop with Cassandra, their NoSQL engine. Integrating Hadoop with NoSQL is a good move, and is done internally in some companies where the results of Hadoop’s ETL are simple key-value pairs that need to be served quickly and with high concurrency. Cassandra, on the other hand, is not a simple key-value store. It has one of the more complex data models around. So we’ll see how well this will work.
A lot of these integrations are aimed at solving the oldest problem in data modeling – how to run large complex queries and small real-time queries on the same data and still see reasonable scalability and performance. And they come up with the same old conclusion – use the best tool for the job, a fast network and lots of ETL.
Notably absent are Cognos and Business Objects. They didn’t notice the demand for Hadoop BI? They don’t care because they believe that Oracle-Hadoop integration will solve all their problems? Who knows. Another notable absentee is NetApp. Why aren’t they pushing the Hadooplers harder?
The most surprising new player in the let’s-integrate-RDBMS-with-Hadoop game is Oracle. Absent from Strata, they are getting ready to announce their big data offering (Exa-Hadoop as the tweeps call it) at OOW next week. I think it’s significant that EMC announced their amazing Hadoop appliance at Strata while Oracle are doing it at OpenWorld. The solutions will probably be similar, but the focus and target market seem slightly different.
This is not something data scientists like. This is the elephant in the room.
Companies want to mine data for value, but the data doesn’t necessarily belong to them. And not everything a company can do with data will be appreciated by the public from whom the data was collected. Cory Doctorow’s book “Little Brother” has a scene where cops detain people for suspicious behavior – actions that were determined statistically to be different from the norm, such as driving somewhere other than back home after work. What about companies using data to discriminate against employees who are statistically determined to possibly develop health issues in the future? Those are difficult and disturbing questions. Data science, like any science, can be used for good or evil.
Data scientists, like all scientists through history, cheerfully ignore the moral dilemmas. Most DBAs were trained that they may have access to sensitive data, but it is not their data and it should only be accessed in ways that are compatible with privacy and security policies. Data scientists often seem to think that they own the data and can use it to impress their friends with how much they know about them. This can make conversations with a data scientist from Twitter a somewhat awkward experience – how much does he know about me?
Combine this with the fact that Hadoop has no concept of user privileges, auditing and restricted data access, and therefore no way to support things like SOX and HIPAA, and you have a field that is just waiting for regulatory trouble.
Driving insights back to the business:
At the end of the day, this is the goal of the big-data game. We are not analysing massive amounts of data and investing millions in the infrastructure to manage it because we enjoy producing pretty charts (although we do!). The idea is that somehow, all this data will bring value.
For some companies, having more data than anyone else is their strategic advantage. That’s how Google rules the online advertising business.
For other companies, big data is seen as a way to fine-tune their business. If you are a large enterprise like EMC or HP, there are profit margins to be found by data mining your supply chain, your vendor contracts, your sales cycle, your marketing campaigns.
Other companies use big data analytics to bring new features to their users and drive more business their way – LinkedIn’s “people you may know” and Amazon’s “Customers who bought this item also bought” features are definitely in this category.
A very hot topic is how to show your results in pretty pictures.
Everyone knows that distilling data and business insights to a single pretty chart is critical in getting anyone to listen to what you have to say. However, data scientists typically don’t come from a background that allows them to effortlessly come up with stunning visual designs. Some have the privilege of working with professional designers (and even professional data visualizers – I met a few of those), others try to educate themselves on design concepts and use the right tools to make life easier.
The visualization trend starts at simple and effective charts of the type a data scientist can do in R, and ends with very complex infographics. The consensus is that infographics are generally not very effective in conveying information to the audience, but are stellar in getting their attention. There is also general agreement that your visualization should either tell a story (or make a business case), drawing attention to the data that supports the story, or allow for exploration of the data in multiple dimensions.
Professional designers will use design tools like Photoshop and Illustrator to make their stunning designs. Data scientists will use R, D3.js and Processing. Rich data scientists will use Tableau, which is a seriously impressive tool that lets the nerdiest statistician create incredible eye-candy that actually means something. The presentations that gave visualization and story-telling advice were packed.
So, a busy two days. Lots of insights and inspiration. Somewhat less technical meat-and-potatoes than I hoped for. Maybe next year, when the field standardizes a bit and moves away from the data-hero mode, we’ll get people to share how they did what they did and not just the awesome results.