Google Cloud Platform (GCP) recently announced the beta launch of Cloud AI Platform Pipelines, a new product for automating and managing machine learning (ML) workflows, which leverages the open-source technologies TensorFlow Extended (TFX) and Kubeflow Pipelines (KFP).
In a recent blog post, product manager Anusha Ramesh and developer advocate Amy Unruh gave an overview of the offering and its features. Cloud AI Platform Pipelines addresses the problem of managing end-to-end ML workflows, which span the lifecycle from ingesting raw data, through model training and evaluation, to serving model inference in production. The new product provides tools for building workflows and for tracking workflow artifacts and lineage, along with an "enterprise-ready" workflow execution infrastructure that integrates with other GCP services such as BigQuery and Dataflow. According to Ramesh and Unruh,
Cloud AI Platform Pipelines provides a way to deploy robust, repeatable machine learning pipelines along with monitoring, auditing, version tracking, and reproducibility, and delivers an enterprise-ready, easy to install, secure execution environment for your ML workflows.
Cloud AI Platform Pipelines is a managed implementation of TensorFlow Extended (TFX) and Kubeflow Pipelines (KFP) that runs on a Google Kubernetes Engine (GKE) cluster. TFX is an abstraction layer whose core concept is the pipeline: a series of data transformation steps (pipeline components) that must be coordinated or orchestrated; the data objects passed between components are called artifacts. KFP is the orchestrator, executing each component in the pipeline on a pod in the GKE cluster. TFX also defines a datastore for ML metadata (MLMD), which tracks the history and versions of a pipeline as well as the artifacts it produces. Cloud AI Platform Pipelines supports two SDKs: the higher-level TFX SDK and the lower-level KFP SDK; however, Google plans to merge the two into a single TFX SDK.
Source: https://cloud.google.com/blog/products/ai-machine-learning/introducing-cloud-ai-platform-pipelines
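To make the higher-level TFX SDK concrete, the following is a minimal sketch of a pipeline that could run on Cloud AI Platform Pipelines. The bucket paths, module file, and pipeline name are placeholders, and the exact component arguments may differ slightly across TFX releases; a real trainer module and evaluation/serving components would also be needed.

```python
# Minimal TFX pipeline sketch (placeholder paths and names).
# Each component below becomes a step that KFP schedules on a GKE pod;
# the data passed between components (examples, statistics, schema, model)
# are the artifacts tracked in ML Metadata (MLMD).
from tfx.components import CsvExampleGen, StatisticsGen, SchemaGen, Trainer
from tfx.orchestration import pipeline
from tfx.orchestration.kubeflow import kubeflow_dag_runner
from tfx.proto import trainer_pb2
from tfx.utils.dsl_utils import external_input

_data_root = 'gs://my-bucket/data'          # placeholder: CSV training data
_module_file = 'gs://my-bucket/trainer.py'  # placeholder: model-building code
_pipeline_root = 'gs://my-bucket/pipeline_root'

example_gen = CsvExampleGen(input=external_input(_data_root))
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
trainer = Trainer(
    module_file=_module_file,
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=500))

tfx_pipeline = pipeline.Pipeline(
    pipeline_name='my_pipeline',
    pipeline_root=_pipeline_root,
    components=[example_gen, statistics_gen, schema_gen, trainer],
    enable_cache=True)

# Compile the pipeline into a package that can be uploaded to the
# Kubeflow Pipelines instance running on the GKE cluster.
kubeflow_dag_runner.KubeflowDagRunner().run(tfx_pipeline)
```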
TFX was first described by Google in a paper presented at KDD 2017, documenting Google's effort to build an end-to-end ML platform covering all phases of the ML process: data analysis and transformation, model training and evaluation, and inference in production. The original execution infrastructure was Apache Beam, which was itself based on Google's Flume and now powers Google Cloud Dataflow. TFX still uses Beam to define data-parallel operations, but now also supports Kubeflow and Apache Airflow as orchestration engines, as sketched below. Airflow is the technology behind another GCP product, Cloud Composer.
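In the TFX SDK, the orchestration engine is chosen by the runner used to execute the pipeline. The sketch below reuses the hypothetical tfx_pipeline object defined above and runs it on Apache Beam instead of Kubeflow; only the runner changes, the pipeline definition stays the same.

```python
# Run the same TFX pipeline definition on Apache Beam instead of Kubeflow.
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

BeamDagRunner().run(tfx_pipeline)
```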
Airflow and Cloud Composer are general-purpose workflow orchestration technologies that Google has recommended in the past for managing ML workflows. In 2018, Google open-sourced Kubeflow, an ML-specific platform targeting Kubernetes; Spotify recently adopted it as their standard ML platform and open-sourced their Terraform templates for creating clusters. The new Cloud AI Platform Pipelines offering abstracts away even more of the work by managing the GKE cluster. In a discussion on Hacker News, one user noted:
The battle of ML pipeline ecosystems is the engine - not so much the API. It had been Beam vs [Apache] Spark. Now, Google are changing tack and saying it is TensorFlow on Kubernetes with distributed processing vs Spark-based ML pipelines.
The source code for both TensorFlow Extended and Kubeflow Pipelines is available on GitHub.