An overview of best practices for implementing ML systems - Part 1

Aug 12, 2019

In this series of blog posts, we recommend some best practices drawn from our own failures and successes in implementing machine learning (ML) systems. We won't discuss ML techniques here; instead, we provide a high-level overview of how we design and develop ML products. Google offers a guide that elaborates on this topic: https://developers.google.com/machine-learning/guides/rules-of-ml/ . The series consists of three posts. The first covers the initial steps and the first pipeline. The second elaborates on evolving the initial pipeline. The third presents troubleshooting the pipeline and final adjustments.

Terminology

Some terms are a bit confusing, so let's clarify the definitions and vocabulary.

Instance: The subject to be labeled. For example, if you’re implementing a computer vision application that classifies images as cat / no cat, the picture is the instance. I prefer calling it the subject.

Feature: The transformed variable used as a prediction input. For example, when representing images, the pixel values may be the features of the image. These representations usually take the form of a vector, called a feature vector. Feature vectors enable analytical processing of the data by the ML algorithm that performs the learning task.

Label: The ground-truth answer provided in the training dataset. It is the expected outcome of the ML system. For example, the label for a next-day weather forecast system can be sunny, cloudy, or rainy.

Example: Represents an instance. It is a set of features and a label.

Model: A mathematical representation of a prediction method. It can be a simple analytic function, rule-based logic, a statistical method, or a deep neural network.

Objective: The mathematical function or statistic that your training algorithm is trying to optimize.

Pipeline: The infrastructure and data flow on which the ML model training and prediction algorithms are based. It includes data ingestion, feature transformation, validation, model building / restoring, training, model assessment, and model deployment in production. Here is the pipeline proposed by the TensorFlow team, which has those modules: https://www.tensorflow.org/tfx/guide . (You can expect a future blog post with details about the TensorFlow pipeline.)

Pipeline module: Each modularised and composable unit used to build the pipeline. These modules are the “bricks” that compose the pipeline. The most basic pipeline is composed of three modules: data ingestion, feature transformation, and model building / restoring.
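To make the vocabulary concrete, here is a minimal Python sketch of an example (features plus a label) and of the three-module basic pipeline described above. Everything in it is an assumption made for illustration: the names `Example`, `ingest`, `transform`, `build_model` and `run_pipeline`, the "humidity,pressure,label" row format, and the majority-label "model" are hypothetical and are not taken from TensorFlow or any other library.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class Example:
    """An instance represented as a feature vector plus its label."""
    features: List[float]
    label: str


def ingest(raw_rows: Sequence[str]) -> List[List[str]]:
    # Data ingestion: split hypothetical "humidity,pressure,label" rows.
    return [row.split(",") for row in raw_rows]


def transform(parsed_rows: Sequence[Sequence[str]]) -> List[Example]:
    # Feature transformation: turn raw strings into numeric feature vectors.
    return [
        Example(features=[float(humidity), float(pressure)], label=label)
        for humidity, pressure, label in parsed_rows
    ]


def build_model(examples: Sequence[Example]) -> Callable[[List[float]], str]:
    # Model building: a deliberately trivial model that always predicts
    # the most frequent label seen in the training examples.
    majority_label = Counter(e.label for e in examples).most_common(1)[0][0]
    return lambda features: majority_label


def run_pipeline(data: Any, modules: Sequence[Callable[[Any], Any]]) -> Any:
    # A pipeline is an ordered composition of modules.
    for module in modules:
        data = module(data)
    return data


RAW_ROWS = ["0.9,1002,rainy", "0.3,1020,sunny", "0.8,1005,rainy"]
model = run_pipeline(RAW_ROWS, [ingest, transform, build_model])
prediction = model([0.5, 1010.0])
print(prediction)
```

Because each module only receives the output of the previous one, a module can be replaced or a new one inserted (for instance, a validation step) without touching the others.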

Before Machine Learning

No data? No ML. Using ML seems awesome, but without enough data to express the observed event, no ML technique will be able to detect patterns that explain the event. However, if a heuristic technique can explain the phenomenon, don’t hesitate to use it until you have enough data to implement an ML algorithm. In other words, something today is better than nothing at all.

Ensure you have enough statistics and information about the problem to be solved. Choose a simple, observable, and accountable statistic as the primary goal. The goal of ML should be easy to measure and should be an indicator of the “real” goal. Often, there is no “real” objective; in that case, implement a business logic layer to translate the prediction result into the business result.
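The idea of starting with a heuristic and tracking one simple statistic can be sketched as follows. This is only an illustration under stated assumptions: the humidity rule, the `MIN_EXAMPLES` threshold, and the function names are hypothetical, and accuracy is used as the "simple, observable" statistic purely as an example.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical threshold: below this many labeled examples we do not train yet.
MIN_EXAMPLES = 1000

LabeledData = List[Tuple[float, str]]
Predictor = Callable[[float], str]


def heuristic_forecast(humidity: float) -> str:
    # Hand-written rule used while there is not enough data for ML.
    return "rainy" if humidity > 0.7 else "sunny"


def accuracy(predict: Predictor, examples: LabeledData) -> float:
    # The simple, observable statistic: share of correct predictions.
    correct = sum(1 for value, label in examples if predict(value) == label)
    return correct / len(examples)


def choose_predictor(n_labeled: int, ml_predict: Optional[Predictor] = None) -> Predictor:
    # Fall back to the heuristic until a trained model and enough data exist.
    if ml_predict is None or n_labeled < MIN_EXAMPLES:
        return heuristic_forecast
    return ml_predict


labeled = [(0.9, "rainy"), (0.3, "sunny"), (0.8, "sunny"), (0.2, "sunny")]
predictor = choose_predictor(len(labeled))
baseline_accuracy = accuracy(predictor, labeled)
print(baseline_accuracy)
```

The heuristic's score also gives a baseline: a later ML model is worth deploying only if it beats this number on the same statistic.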

First Pipeline Implementation

Most of the problems you will face are data and software engineering problems. The first victory is to build a robust ML pipeline. Subsequent victories come from the quality and selection of the features. Don’t focus too much on the ML algorithm at the beginning. Start with a reasonable but simple goal that answers the business problem. Implement a way to assess the model predictions in a controlled situation.

Regarding infrastructure, make the most of your time by using serverless infrastructure such as Lambda functions, managed services, and APIs provided by cloud vendors like Google Cloud Platform (GCP) and Amazon Web Services (AWS). Adopt scalable products, which promote robustness in the solutions being deployed.

Test the pipeline code to guarantee that every step performs as expected. Generate statistics about the data before and after the most important transformations. In general, apply DevOps practices to automate and monitor your pipeline and model. Starting with the first deployment, keep a log of all modules and a log of the data that flows through them. Use the same modules for training and prediction. Most importantly, apply the same feature transformations to the examples for training and for prediction. Lastly, generate statistics that assess your model using testing examples.

Be prepared to fail. Determining the impact of risks on the first pipeline is essential. Start by answering questions like: How much does performance degrade if your model is a day, a week, or a month old? What should be done if there is not enough data available to generate model predictions? How do bad predictions affect the overall performance? What can be done if the quality of a feature deteriorates? What happens if the transformation fails due to schema inconsistency?

Finally, invest in data cleansing and transformation, and add features that have a causal relation to the label. Evolve the pipeline by improving existing modules and adding new ones. Keep it solid.
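Two of the recommendations above, reusing the same feature transformation for training and prediction and generating statistics before and after a transformation, can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the `MinMaxScaler` class, `summarize`, and `transform_with_stats` are hypothetical names written for this post, not part of any specific framework.

```python
import logging
import statistics
from typing import Dict, List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def summarize(values: List[float]) -> Dict[str, float]:
    # Basic data statistics to log around important transformations.
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


class MinMaxScaler:
    """Scales values to [0, 1] using bounds learned from the training data."""

    def fit(self, values: List[float]) -> "MinMaxScaler":
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values: List[float]) -> List[float]:
        span = (self.high - self.low) or 1.0  # avoid division by zero
        return [(v - self.low) / span for v in values]


def transform_with_stats(scaler: MinMaxScaler, values: List[float], stage: str) -> List[float]:
    # The single transformation module shared by training and prediction.
    log.info("%s before: %s", stage, summarize(values))
    transformed = scaler.transform(values)
    log.info("%s after: %s", stage, summarize(transformed))
    return transformed


train_raw = [10.0, 20.0, 30.0]
scaler = MinMaxScaler().fit(train_raw)  # fitted once, on training data only

train_features = transform_with_stats(scaler, train_raw, "training")
serve_features = transform_with_stats(scaler, [25.0], "prediction")
```

Because the fitted `scaler` object is reused at prediction time, the serving path cannot silently drift from the training path, and the logged statistics make a deteriorating feature visible as soon as its range or mean shifts.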