How to Implement Airflow Best Practices From a Data Scientist's Perspective

Nov 17, 2020 12:00:00 AM

Editor’s Note: Because our bloggers have lots of useful tips, every now and then we update and bring forward a popular post from the past. Today’s post was originally published on August 8, 2019.

This blog post is a compilation of best-practice suggestions drawn from my personal experience as a data scientist building Airflow DAGs (directed acyclic graphs) and installing and maintaining Airflow.

Let's begin by explaining what Airflow is and what it is not. According to the official documentation (https://airflow.readthedocs.io/en/stable/index.html), Airflow is a platform to programmatically author, schedule and monitor workflows. The documentation recommends using Airflow to build DAGs of tasks. The solution includes workers, a scheduler, web servers, a metadata store and a queueing service. In my own words, Airflow is used to schedule tasks and is responsible for triggering other services and applications. The workers should not perform any complex operations themselves; they should coordinate and distribute operations to other services. That way, workers don't need to use too many resources.

On the other hand, according to the official documentation, Airflow is not a data streaming or data flow solution. Data must not flow between steps of the DAG. I'll add more:

Airflow is not a data pipeline tool. Avoid building pipelines that use a secondary service such as an object store (S3 or GCS) to hold intermediate state for the next task to use.

Airflow is not an interactive and dynamic DAG building solution. Avoid changing the DAG frequently. Workflows are expected to be mostly static or slow-changing.

But wait a second ... this is exactly the opposite of how I see data engineers and data scientists using Airflow. Indeed, perhaps you use Airflow in exactly the ways warned against above. However, after the first month of working with deployed DAGs, you start getting stressed.
Every time you change the code, you need to change something in the DAG, and you risk breaking it and having to wait for the next available window in which no DAG that could be affected by the change is running. Believe me, I've waited a couple of hours to finalize a five-minute fix in the code.

Another recommendation is to keep the code elegant and pythonic, and to program defensively (https://www.pluralsight.com/guides/defensive-programming-in-python). Take the opportunity to use Jinja templating and write pythonic code. And please, implement custom exceptions and logging in the piece of code that each step of the DAG is going to run. The Airflow Web UI is going to print every single logging message. Also, don't forget to encode all your assumptions in the code by using assert, and check data boundaries.

Next, be careful with the operators you're using. Don't assume they keep up with all the updates in the available third-party services. For example, imagine how frequently the Google Cloud SDK and the AWS SDK evolve. Do you really think Airflow operators are evolving as fast as they are? Probably not. Therefore, test the operators you rely on and implement your own versions of them.

The last experience I would like to share in this first part of the series is about time and time zones. Airflow core uses UTC by default. Specify your default time zone in airflow.cfg. Following that, use the pendulum package from PyPI to define the time zone in which your DAGs should be scheduled. Here is an example: https://airflow.readthedocs.io/en/stable/timezone.html

Some final advice: the date and time you see in the header of the Airflow Web UI is not the one being used by the system. The Airflow Web UI naively uses a static version of jQuery Clock (https://github.com/JohnRDOrazio/jQuery-Clock-Plugin) to print UTC time. Holy cow, I spent half an hour working on this until I realized this flaw.

That's all I have to start with.
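
To make the defensive programming advice above a bit more concrete, here is a minimal sketch of a custom exception, logging and an assert working together in the kind of function a DAG step might call. The names (`DataBoundaryError`, `validate_row_count`, the logger name and the default limits) are hypothetical and only for illustration; they are not part of Airflow.

```python
import logging

# Hypothetical logger name; messages logged while a task runs
# end up in that task's log.
logger = logging.getLogger("etl.load_orders")


class DataBoundaryError(ValueError):
    """Raised when an input value falls outside the range the task expects."""


def validate_row_count(row_count, minimum=1, maximum=1_000_000):
    """Check a (hypothetical) row count before the step does any real work."""
    # Encode the assumption about the input type explicitly.
    assert isinstance(row_count, int), "row_count must be an integer"

    # Check the data boundaries and fail loudly with a custom exception.
    if not minimum <= row_count <= maximum:
        logger.error(
            "row_count=%d is outside [%d, %d]", row_count, minimum, maximum
        )
        raise DataBoundaryError(
            f"row_count {row_count} is outside [{minimum}, {maximum}]"
        )

    logger.info("row_count=%d accepted", row_count)
    return row_count


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    validate_row_count(42)
```

One design note: Python skips assert statements when it runs with the -O flag, so in this sketch the boundary check itself raises a custom exception and the assert only documents an assumption.
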
In the following posts, I'm going to go over more specific best practices for scheduling machine learning pipelines.