Analytics Google Cloud

Data is everything, and everything is data

3 min read

Apr 19, 2016 12:00:00 AM

Data is everything, and everything is data | Official Pythian® Blog

4:18

Exploring the phenomenon of “datafication”

In the year 2000, only a quarter of the world’s stored information was digital; the rest was on paper, film, and other analog media. Today, less than two percent of all stored information is nondigital. (1)

This is largely the result of “datafication”, a process that turns all aspects of life—preferences, opinions, telephone calls and sensor-driven information—into data.

Datafication is the driving force behind Big Data. It’s also causing a threefold shift in how we look for meaning in the information available to us: away from traditional sampling approaches, toward greater tolerance of messy, unstructured data, and into the search for correlations rather than absolute, singular causes to explain trends and events. These changes are already having major impacts in every area of our lives—from scientific research to business and finance, to healthcare, education and transportation.

From sampling to knowing

Representative sampling is based on the idea that, within a certain margin of error, we can make inferences about a total population from a small, randomized subset. This works well for simple questions like, “Which of our customers generate the most revenue?” but lacks the detail to effectively answer queries like, “Which customers are most profitable?” or, “Which customers are considering to leave us for another vendor?”

Inexpensive computer memory, powerful processors and sophisticated algorithms now allow us to analyze vast amounts of data rather than small samples. Using Big Data in this way has the considerable advantage of predictive capability—it can identify patterns and trends that aren’t detectable in a small sample, giving an unprecedented view of future behavior.

From clean to messy

What’s new about Big Data isn’t just that there’s lots of it. Because it comes from many different sources in many different formats, it’s not tidy like traditional datasets. Tolerating some inaccuracy may require data analysts to shift their outlooks a little, but when you’re trying to answer big, complex questions, the gain in data scope is a good trade-off against using smaller amounts of very exact data. Here’s an example.

In 2009, Google showed it’s possible to predict locations of seasonal outbreaks of the flu using nothing more than archived records of Google searches. The sheer size of the data set (think a billion searches a day in the U.S. alone) more than compensated for its messiness. After running nearly half a billion calculations against the data, Google identified 45 terms—words such as “headache” and “runny nose”—that had a strong correlation with the CDC’s data on flu outbreaks.

From cause to correlation

The Google example points to a third change brought about by datafication and Big Data: abandoning the search for certainty. Instead of looking for causes, innovative data users are looking for correlations. For example, automotive and aviation engineers are collecting and analyzing massive quantities of information on engines that have failed, looking for patterns that will help them predict when other engines might be at risk of failing in the future. They’re not seeking a single cause for a single event; they’re mapping correlations between huge numbers of events to recognize patterns that can be put to practical, preventative use.

The correlation approach has been used to spot infections in premature babies before overt symptoms appear and to predict everything from manhole cover failures to consumer purchasing habits.

Big Data insights require Big Thinking

Harnessing the powerful, often unpredictable, insights available from Big Data requires three things: as complete a dataset as possible, people with the skills required to collect, manage and analyze that data, and people who know how to ask unexpected, even visionary questions. It’s not just a matter of the right technologies—it’s about a fundamental shift in how we relate to data and what can be done with it.