Cloudflare's blog described its MLOps platform and best practices for running Artificial Intelligence (AI) deployments at scale. Cloudflare's products, including WAF attack scoring, bot management, and global threat identification, rely on constantly evolving Machine Learning (ML) models. These models are pivotal in enhancing customer protection and augmenting support services. The company says it delivers ML at scale across its network, which makes robust ML training methodologies essential.
Cloudflare's MLOps team collaborates with data scientists to implement best practices. Jupyter Notebooks, deployed on Kubernetes via JupyterHub, provide scalable and collaborative environments for data exploration and model experimentation. GitOps is a cornerstone of Cloudflare's MLOps strategy, using Git as the single source of truth for managing infrastructure and deployment processes. ArgoCD is employed for declarative GitOps, automating the deployment and management of applications and infrastructure.
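As an illustration of the declarative GitOps approach, the sketch below builds an ArgoCD `Application` manifest that points a cluster at a path in a Git repository and enables automated sync. It is a minimal example under stated assumptions, not Cloudflare's configuration: the repository URL, path, application name, and namespace are hypothetical, and the manifest is emitted as JSON (which Kubernetes accepts alongside YAML) so the example needs only the standard library.

```python
import json


def argocd_application(name, repo_url, path, namespace, revision="main"):
    """Return an ArgoCD Application manifest as a plain dict.

    All argument values used below are hypothetical placeholders.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            # Git is the source of truth: ArgoCD renders whatever lives at this path.
            "source": {"repoURL": repo_url, "targetRevision": revision, "path": path},
            # Deploy into the same cluster that ArgoCD itself runs in.
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            # Automated sync: remove resources deleted from Git and revert manual drift.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


manifest = argocd_application(
    name="jupyterhub",  # hypothetical application name
    repo_url="https://git.example.com/mlops/platform.git",  # hypothetical repository
    path="apps/jupyterhub",  # hypothetical path inside the repository
    namespace="jupyterhub",
)
print(json.dumps(manifest, indent=2))
```

With a manifest like this committed to Git, changing the deployment means changing the repository; ArgoCD then reconciles the cluster toward the committed state.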
The future roadmap includes migrating platforms such as JupyterHub to Kubeflow, a machine learning workflow platform on Kubernetes that recently became a CNCF incubation project. This move is facilitated by the deployKF project, which offers distributed configuration management for Kubeflow components.
To help data scientists start projects confidently, efficiently, and with the right tools, the Cloudflare MLOps team provides model templates that serve as production-ready repositories with example models. These templates are currently internal, but Cloudflare plans to open-source them. The use cases covered by these templates are:
- Training Template: Configured for ETL processes, experiment tracking, and DAG-based orchestration.
- Batch Inference Template: Optimized for efficient processing through scheduled models.
- Stream Inference Template: Tailored for real-time inference using FastAPI on Kubernetes.
- Explainability Template: Generates dashboards for model insights using tools like Streamlit and Bokeh.
Another crucial task of the MLOps platform is orchestrating ML workflows efficiently. Cloudflare embraces various orchestration tools based on team preferences and use cases:
- Apache Airflow: A standard DAG composer with extensive community support.
- Argo Workflows: Kubernetes-native orchestration for microservices-based workflows.
- Kubeflow Pipelines: Tailored for ML workflows, emphasizing collaboration and versioning.
- Temporal: Specializing in stateful workflows for event-driven applications.
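The tools above differ in scope, but the DAG-based ones share a core idea: tasks declare their upstream dependencies, and the orchestrator runs them in an order that respects those dependencies. The sketch below shows that idea with Python's standard-library `graphlib` rather than any of the listed tools. The task names and the toy "model" are hypothetical and only stand in for real ETL and training steps.

```python
from graphlib import TopologicalSorter


# Hypothetical pipeline steps; each reads from and writes to a shared context dict.
def extract(ctx):
    ctx["rows"] = [1.0, 2.0, 3.0, 4.0]


def transform(ctx):
    peak = max(ctx["rows"])
    ctx["features"] = [row / peak for row in ctx["rows"]]


def train(ctx):
    # Toy stand-in for training: the "model" is just the mean feature value.
    ctx["model"] = sum(ctx["features"]) / len(ctx["features"])


def evaluate(ctx):
    ctx["metric"] = abs(ctx["model"] - 0.5)


TASKS = {"extract": extract, "transform": transform, "train": train, "evaluate": evaluate}

# Each task maps to the set of tasks that must finish before it can start.
DEPENDENCIES = {
    "transform": {"extract"},
    "train": {"transform"},
    "evaluate": {"train"},
}


def run_pipeline():
    """Run every task in a dependency-respecting order; return the order and context."""
    ctx = {}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](ctx)
    return order, ctx


order, ctx = run_pipeline()
print(order)
print(ctx["metric"])
```

Production orchestrators add what this sketch omits, such as scheduling, retries, parallel execution of independent tasks, and persisted run history.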
Optimal performance involves understanding workloads and tailoring hardware accordingly. Cloudflare emphasizes GPU utilization for core data center workloads and edge inference, leveraging metrics from Prometheus for observability and optimization. Successful adoption at Cloudflare involves streamlining ML processes, standardizing pipelines, and introducing projects to teams lacking data science expertise.
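To make the observability point concrete, the sketch below parses a scrape in the Prometheus text exposition format and flags GPUs whose utilization falls under a threshold. It is an illustration under assumptions: the metric name follows the convention of NVIDIA's DCGM exporter, which the source does not mention, and the node names, sample values, and 30% threshold are invented for the example.

```python
import re

# Hypothetical scrape output in the Prometheus text exposition format.
# The metric name is an assumption borrowed from NVIDIA's DCGM exporter.
SAMPLE_SCRAPE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",node="train-01"} 91
DCGM_FI_DEV_GPU_UTIL{gpu="1",node="train-01"} 12
DCGM_FI_DEV_GPU_UTIL{gpu="0",node="infer-07"} 48
"""

# Matches lines of the form: metric_name{label="value",...} number
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_gauge(text, metric):
    """Return (labels, value) pairs for every sample of the given metric."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match and match.group("name") == metric:
            labels = dict(LABEL_PAIR.findall(match.group("labels")))
            samples.append((labels, float(match.group("value"))))
    return samples


def underutilized(samples, threshold=30.0):
    """Return the label sets of GPUs whose utilization is below the threshold."""
    return [labels for labels, value in samples if value < threshold]


samples = parse_gauge(SAMPLE_SCRAPE, "DCGM_FI_DEV_GPU_UTIL")
idle_gpus = underutilized(samples)
print(idle_gpus)
```

In practice a check like this would usually be expressed as a query or alert rule inside Prometheus itself; the parsing here only shows what the underlying data looks like.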
The company's vision is a future in which data science plays a crucial role in business. This is why Cloudflare invests in its AI infrastructure and collaborates with other companies such as Meta, for example by making Llama 2 globally available on its platform.