Shopify’s Practical Guidelines from Running Airflow for ML and Data Workflows at Scale

Shopify engineering shared lessons on how to scale and optimize Apache Airflow for running ML and data workflows in a post on the company's blog. The post describes practical solutions to the challenges the team faced, such as slow file access, insufficient control over DAG (directed acyclic graph) authoring, uneven traffic levels, and resource contention among workloads.

Apache Airflow (Airflow for short) is a platform to author, schedule, and monitor workflows. Workflows are defined as DAGs of tasks, and Airflow executes the tasks on workers while respecting the dependencies among them. It is one of the most popular orchestration platforms in enterprises for data management, service management, and DevOps. According to the blog post written by Shopify engineers, Airflow orchestrates a variety of applications such as machine-learning model training and data pipelines. They were running Airflow 2.2 on Kubernetes, using the Celery executor and MySQL 8. The following diagram shows the Airflow architecture and deployment at Shopify.

Shopify’s Airflow Architecture
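
For readers less familiar with Airflow's programming model, the sketch below shows what a minimal DAG of tasks looks like in Airflow 2.x. It is a generic illustration with made-up DAG and task names, not code from Shopify's environment.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    # Placeholder for a real transformation step, e.g. feature engineering.
    print("transforming data")


with DAG(
    dag_id="example_ml_pipeline",      # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load = BashOperator(task_id="load", bash_command="echo load")

    # The scheduler runs tasks on workers in dependency order:
    # extract -> transform -> load.
    extract >> transform_task >> load
```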

Shopify uses Airflow to handle a large number of workflows. The blog post describes the scale of Airflow usage at Shopify:

Shopify’s usage of Airflow has scaled dramatically over the past two years. In our largest environment, we run over 10,000 DAGs representing a large variety of workloads. This environment averages over 400 tasks running at a given moment and over 150,000 runs executed per day. 

Running Airflow at this scale requires tuning and customization for an optimal experience. Here are some of the key guidelines shared by Shopify's engineers:

Slow File Access When Using Cloud Storage

Airflow keeps its DAG representations up to date by continuously scanning all files in the DAG directory, and these files must be consistent across all schedulers and workers in the Airflow platform. Shopify initially stored the files on GCS (Google Cloud Storage), which became a bottleneck as usage scaled: reading from and writing to cloud storage added significant delay for the workers. To solve this issue, they introduced NFS (Network File System) as a caching layer within the Kubernetes cluster.
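
The general pattern is to sync DAG files from object storage into a shared, locally mounted volume so that schedulers and workers read from fast storage instead of hitting GCS on every scan. The following is a minimal sketch of that idea; the bucket name, prefix, and mount path are hypothetical, and the post does not detail Shopify's exact sync mechanism.

```python
import os

from google.cloud import storage  # pip install google-cloud-storage


def sync_dags(bucket_name: str, local_dir: str) -> None:
    """Copy DAG files from a GCS bucket into a locally mounted (e.g. NFS-backed)
    DAG folder so Airflow parses them from fast local storage."""
    os.makedirs(local_dir, exist_ok=True)
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix="dags/"):
        if not blob.name.endswith(".py"):
            continue
        destination = os.path.join(local_dir, os.path.basename(blob.name))
        blob.download_to_filename(destination)


if __name__ == "__main__":
    # Hypothetical bucket and mount point; run periodically, e.g. from a sidecar.
    sync_dags("my-airflow-dags-bucket", "/mnt/nfs/airflow/dags")
```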

Airflow Degradation Due to Increasing Metadata Volume

Airflow operations produce large amounts of metadata such as DagRuns, TaskInstances, Logs, and TaskRetries. As the number of workloads grows, so does the metadata, which degrades performance at scale. To solve this problem, Shopify's engineers settled on a metadata retention policy of 28 days. This improves operational performance while keeping enough metadata for debugging, monitoring, and other operational work. They shared the code for this retention and cleanup policy on GitHub.
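
A simplified version of such a cleanup job can be expressed as a maintenance DAG that periodically deletes metadata rows older than the retention window. The sketch below illustrates the idea rather than reproducing the code Shopify published; the table and column choices are assumptions, and a production job would also handle related tables and foreign-key constraints.

```python
from datetime import datetime, timedelta

from airflow import DAG, settings
from airflow.models.dagrun import DagRun
from airflow.models.log import Log
from airflow.models.taskinstance import TaskInstance
from airflow.operators.python import PythonOperator

RETENTION = timedelta(days=28)  # retention window mentioned in the article


def purge_old_metadata():
    """Delete metadata rows older than the retention window."""
    cutoff = datetime.utcnow() - RETENTION
    session = settings.Session()
    try:
        session.query(Log).filter(Log.dttm < cutoff).delete(synchronize_session=False)
        session.query(TaskInstance).filter(
            TaskInstance.start_date < cutoff
        ).delete(synchronize_session=False)
        session.query(DagRun).filter(
            DagRun.execution_date < cutoff
        ).delete(synchronize_session=False)
        session.commit()
    finally:
        session.close()


with DAG(
    dag_id="metadata_cleanup",          # hypothetical maintenance DAG
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="purge", python_callable=purge_old_metadata)
```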

Difficulty Associating DAGs with Users and Teams

Different engineers can create DAGs from different environments, which makes it hard to trace a DAG back to the team or project responsible for it. To solve this issue, they introduced a registry of Airflow namespaces, which they refer to as the Airflow environment's manifest file. The manifest contains information such as the owner, source repositories, and restrictions. They shared a sample manifest file on GitHub.
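
The exact schema of the manifest is not spelled out here, but conceptually it maps an Airflow namespace to its ownership and constraints. The snippet below loads a hypothetical manifest in Python; every field name is illustrative, not Shopify's actual format.

```python
import yaml  # pip install pyyaml

# Hypothetical manifest for one Airflow namespace; the fields are illustrative.
EXAMPLE_MANIFEST = """
namespace: recommendations-team
owner: recommendations@example.com
source_repositories:
  - github.com/example/recommendations-dags
constraints:
  pools: ["recommendations"]
  queues: ["ml-workers"]
"""

manifest = yaml.safe_load(EXAMPLE_MANIFEST)
print(manifest["owner"], manifest["constraints"]["pools"])
```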

Limiting the Power of DAG Authors

To control the level of authority DAG authors have, they implemented an Airflow DAG policy that validates each DAG against its namespace's manifest, making sure it does not violate the namespace's restrictions or conflict with other namespaces.
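
Airflow supports cluster policies: Python hooks such as dag_policy, defined in airflow_local_settings.py, that inspect every DAG at parse time and can reject those that break the rules. The sketch below shows how such a policy could enforce a namespace's pool restrictions; the namespace lookup and allowed-pool mapping are hypothetical, not Shopify's actual policy code.

```python
# airflow_local_settings.py -- Airflow picks up cluster policies from this module.
from airflow.exceptions import AirflowClusterPolicyViolation
from airflow.models.dag import DAG

# In practice this mapping would be built from the namespace manifests
# described above; the contents here are made up.
ALLOWED_POOLS_BY_NAMESPACE = {
    "recommendations-team": {"recommendations"},
}


def dag_policy(dag: DAG) -> None:
    """Reject DAGs whose tasks use pools outside their namespace's allowance."""
    namespace = dag.tags[0] if dag.tags else "default"  # illustrative convention
    allowed_pools = ALLOWED_POOLS_BY_NAMESPACE.get(namespace, set())
    for task in dag.tasks:
        if task.pool != "default_pool" and task.pool not in allowed_pools:
            raise AirflowClusterPolicyViolation(
                f"DAG {dag.dag_id}: task {task.task_id} uses pool '{task.pool}', "
                f"which is not allowed for namespace '{namespace}'."
            )
```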

Uniform Load Distribution

When running DAGs at scale, it is hard to ensure that the load is evenly distributed over time. One feasible solution is to randomize the schedule intervals of automatically generated DAGs (those with no restriction on start time) so that runs are spread out rather than clustered. The engineers shared a sample file for this randomized scheduling.
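
One way to implement this is to derive the schedule deterministically from the DAG id, so that the offset looks random yet stays stable across deployments. The sketch below illustrates the idea; it is not the sample file the engineers shared.

```python
import hashlib
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator


def spread_daily_schedule(dag_id: str) -> str:
    """Derive a deterministic, pseudo-random daily cron schedule from the DAG id
    so that thousands of generated DAGs do not all start at midnight."""
    digest = int(hashlib.md5(dag_id.encode()).hexdigest(), 16)
    minute = digest % 60
    hour = (digest // 60) % 24
    return f"{minute} {hour} * * *"


dag_id = "generated_dag_042"  # hypothetical auto-generated DAG
with DAG(
    dag_id=dag_id,
    start_date=datetime(2022, 1, 1),
    schedule_interval=spread_daily_schedule(dag_id),
    catchup=False,
) as dag:
    BashOperator(task_id="run", bash_command="echo run")
```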

Airflow has a large technical community with regular meetups that also shares industry best practices with its users.

Automating workflow operations is an important part of DevOps. For further reading and investigation, there are other open-source solutions such as Apache NiFi and Apache Oozie (for the Hadoop ecosystem), as well as cloud services like AWS Step Functions.
