The Top Four Reasons Cloud Data Platforms and Machine Learning Were Meant For Each Other and How to Make Them Work Well Together
Machine learning (ML) workloads have unique properties and a level of complexity that separate them from less advanced analytics. Developing an ML model requires data scientists to understand what kind of data the model will need and to run a number of experiments to choose the most suitable algorithm. They then train the model, using a training data set to adjust model parameters over multiple iterations until model accuracy is within acceptable limits. Once the model is trained, they validate it to ensure it produces good results on data other than the training data set. Finally, the model can start serving its end users by accepting new data, applying calculations and producing results. In many organizations, this is a complicated process with many manual steps. Making the machine learning process efficient requires four key ingredients: lots of data, a scalable computation environment, the ability to use a variety of tools, and integrated experimentation and collaboration. A well-designed cloud data platform makes all of these possible and cost-effective.
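The train-then-validate loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the "model" is a one-variable linear fit trained by gradient descent, and the names `make_data`, `train` and `target_mse` are hypothetical rather than part of any platform's API.

```python
import random

def make_data(n, seed):
    """Generate synthetic (x, y) pairs near the line y = 3x + 2 (illustrative data only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        data.append((x, 3.0 * x + 2.0 + rng.gauss(0.0, 0.05)))
    return data

def mean_squared_error(w, b, data):
    """Average squared difference between predictions w*x + b and observed y."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(data, learning_rate=0.5, target_mse=0.01, max_iterations=5000):
    """Adjust parameters iteratively until the training error is within the target."""
    w, b = 0.0, 0.0
    iterations = 0
    while iterations < max_iterations and mean_squared_error(w, b, data) > target_mse:
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / len(data)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        iterations += 1
    return w, b, iterations

training_set = make_data(200, seed=1)
validation_set = make_data(50, seed=2)  # data the model never saw during training

w, b, iterations = train(training_set)
training_mse = mean_squared_error(w, b, training_set)
validation_mse = mean_squared_error(w, b, validation_set)
print(f"iterations={iterations} training_mse={training_mse:.4f} validation_mse={validation_mse:.4f}")
```

The validation error is computed on a separate data set, which is what tells you whether the model generalizes beyond the data it was trained on.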
1. A cloud data platform can cost-effectively hold ALL THE DATA
While many cloud data platforms start out as a way to give the business broader access to data via reports and dashboards, a properly designed platform can store all the data available to your organization. This includes archives of historical data, raw data ingested from the source systems as-is, and preprocessed data that has been cleaned up according to organizational standards. Access to both raw and preprocessed data is of huge value to data scientists who want to feed their models. And it's not just access to data; cloud vendors also offer a variety of tools that make it easy to split data into training and validation data sets.
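To make the idea of splitting data into training and validation sets concrete, here is a minimal sketch in plain Python. It does not represent any vendor's tooling; `split_train_validation`, `validation_fraction` and `seed` are hypothetical names chosen for the example.

```python
import random

def split_train_validation(rows, validation_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and hold out a fraction of them for validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

rows = list(range(100))  # stand-in for records read from the platform
train_rows, validation_rows = split_train_validation(rows)
print(len(train_rows), len(validation_rows))
```

Fixing the seed keeps the split reproducible, so repeated experiments are compared against the same held-out data.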
2. A cloud data platform can make unlimited compute available when needed
Machine learning models need access to significant compute capacity to run the training process, which can involve rerunning training steps hundreds or thousands of times. The data processing layer of the cloud data platform offers scalable compute that data scientists can use to train models on much larger data sets than would be possible on a personal computer. Today, the major cloud vendors offer access to VMs with powerful Graphics Processing Units (GPUs), which can significantly speed up the ML model training process.
3. A cloud data platform provides multiple ways to access data
A cloud data platform also provides multiple ways to access the data: SQL, Apache Spark, direct file access, and so on. This matters because the ML tools and libraries you want to use may have different requirements for working with data. More choice means higher productivity for data scientists.
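As a small illustration of reaching the same data through different access paths, the sketch below computes one aggregate twice: once by reading a file directly and once through SQL. It is a local stand-in under stated assumptions: an in-memory CSV plays the role of a file in platform storage, Python's built-in `sqlite3` plays the role of the platform's SQL engine, and the `orders` table and its columns are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a file stored on the platform.
CSV_TEXT = """customer_id,region,amount
1,east,10.0
2,west,20.0
3,east,30.0
4,west,5.5
"""

def totals_from_file(csv_text):
    """Direct file access: parse the CSV and aggregate in application code."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["region"]] = totals.get(row["region"], 0.0) + float(row["amount"])
    return totals

def totals_from_sql(csv_text):
    """SQL access: load the same rows into a table and let the engine aggregate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, region TEXT, amount REAL)")
    rows = [
        (int(r["customer_id"]), r["region"], float(r["amount"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    result = dict(conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"))
    conn.close()
    return result

file_totals = totals_from_file(CSV_TEXT)
sql_totals = totals_from_sql(CSV_TEXT)
print(file_totals, sql_totals)
```

Both paths produce the same totals, so each tool or library can use whichever interface suits it without the results diverging.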
4. A cloud data platform brings integrated experimentation and collaboration tools
Communication and collaboration between team members during the model development process is very important. If each data science team member experiments with data on their own computer, it can be challenging to reconcile or share the results with the rest of the team. Cloud vendors have recognized the importance of seamless collaboration in the model development process and offer a number of tools that allow data scientists to run, share and discuss the results of their experiments with other team members or stakeholders.
Implementing machine learning on your cloud data platform
These capabilities make a cloud data platform an ideal place for machine learning workloads. Here’s how this might look in practice. A typical, if overly simplified, machine learning lifecycle has the following steps:
- Ingest and prepare data sets
- Train / validate model loop
- Deploy the model to production to serve results to the end users
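The three steps above can be sketched end to end in plain Python. This is a deliberately tiny illustration, not a production design: the model is a one-feature threshold classifier that assumes class 1 has the larger values, "deployment" is simulated by serializing the parameters to JSON and loading them into a serving function, and all names (`ingest_and_prepare`, `deploy`, `MIN_ACCURACY`, the `value` and `label` fields) are hypothetical.

```python
import json
import statistics

def ingest_and_prepare(raw_records):
    """Step 1: drop records with missing fields and convert the rest to numbers."""
    prepared = []
    for rec in raw_records:
        if rec.get("value") in (None, "") or rec.get("label") in (None, ""):
            continue
        prepared.append((float(rec["value"]), int(rec["label"])))
    return prepared

def train(examples):
    """Step 2a: place a threshold midway between the two class means."""
    mean0 = statistics.mean(v for v, y in examples if y == 0)
    mean1 = statistics.mean(v for v, y in examples if y == 1)
    return {"threshold": (mean0 + mean1) / 2.0}

def predict(model, value):
    """Assumes class 1 corresponds to larger values (an assumption of this sketch)."""
    return 1 if value >= model["threshold"] else 0

def validate(model, examples):
    """Step 2b: accuracy on examples that were not used for training."""
    return sum(predict(model, v) == y for v, y in examples) / len(examples)

def deploy(model):
    """Step 3: persist the parameters, then serve predictions from the stored copy."""
    artifact = json.dumps(model)   # stand-in for writing to a model store
    loaded = json.loads(artifact)  # stand-in for the serving side loading it
    return lambda value: predict(loaded, value)

MIN_ACCURACY = 0.9  # hypothetical acceptance threshold

raw = [
    {"value": "1.0", "label": "0"},
    {"value": "1.5", "label": "0"},
    {"value": "", "label": "0"},  # incomplete record, removed during preparation
    {"value": "4.0", "label": "1"},
    {"value": "4.5", "label": "1"},
]
held_out = [(1.2, 0), (4.2, 1)]

examples = ingest_and_prepare(raw)
model = train(examples)
accuracy = validate(model, held_out)
serve = deploy(model) if accuracy >= MIN_ACCURACY else None
print(model, accuracy)
```

Gating deployment on the validation result mirrors the train / validate loop: a model that misses the accuracy target goes back for another iteration instead of reaching end users.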