Pinterest, the visual discovery platform, has revealed details about its journey to modernise its machine learning infrastructure using Ray, an open-source distributed computing framework. In a recent blog post, the company shared insights into the challenges faced and solutions implemented as they integrated Ray into their large-scale production environment.
The project was driven by a need to enhance Pinterest's machine-learning capabilities in solving essential business problems.
Pinterest faced several unique challenges in building its Ray infrastructure. They chose to run Ray on PinCompute, their general-purpose federated Kubernetes cluster, which restricted the installation of necessary operators like KubeRay and its Custom Resource Definitions. This limitation required creative solutions to implement Ray effectively.
Other challenges included the need for persistent logging and metrics, integration with Pinterest's proprietary time series database and visualisation tools, and adherence to company-wide AAA (Authentication, Authorisation, and Accounting) guidelines.
To address these challenges, Pinterest developed a custom solution consisting of an API Gateway, a Ray Cluster Controller, a Ray Job Controller, and a MySQL database for external state management. This approach provides an abstraction layer between users and Kubernetes, simplifying the provisioning and management of Ray clusters.
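The division of labour between a gateway that records requests and a controller that acts on them can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the table layout, the state names, the function names, and the `FakeKubernetes` stand-in are hypothetical, and `sqlite3` replaces MySQL so the example runs without a database server. It is not Pinterest's implementation.

```python
import sqlite3

# Hypothetical schema. desired_state is 'RUNNING' or 'TERMINATED';
# actual_state is 'PENDING', 'RUNNING' or 'TERMINATED'.
SCHEMA = """
CREATE TABLE ray_clusters (
    name TEXT PRIMARY KEY,
    desired_state TEXT NOT NULL,
    actual_state TEXT NOT NULL
)
"""


class FakeKubernetes:
    """Stand-in for the Kubernetes API: records calls instead of creating pods."""

    def __init__(self):
        self.calls = []

    def create_cluster(self, name):
        self.calls.append(("create", name))

    def delete_cluster(self, name):
        self.calls.append(("delete", name))


def new_database():
    """Create an in-memory database holding the external cluster state."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    return db


def submit_cluster_request(db, name):
    """Gateway side: record the user's intent without touching Kubernetes."""
    db.execute(
        "INSERT INTO ray_clusters (name, desired_state, actual_state) "
        "VALUES (?, 'RUNNING', 'PENDING')",
        (name,),
    )
    db.commit()


def request_termination(db, name):
    """Gateway side: mark a cluster for teardown."""
    db.execute(
        "UPDATE ray_clusters SET desired_state = 'TERMINATED' WHERE name = ?",
        (name,),
    )
    db.commit()


def reconcile_once(db, k8s):
    """Controller side: move each out-of-sync cluster toward its desired state.

    Returns the number of clusters acted on in this pass.
    """
    rows = db.execute(
        "SELECT name, desired_state FROM ray_clusters "
        "WHERE desired_state != actual_state ORDER BY name"
    ).fetchall()
    for name, desired in rows:
        if desired == "RUNNING":
            k8s.create_cluster(name)
        else:
            k8s.delete_cluster(name)
        db.execute(
            "UPDATE ray_clusters SET actual_state = ? WHERE name = ?",
            (desired, name),
        )
    db.commit()
    return len(rows)
```

Because the desired state lives in the database rather than in Kubernetes objects, a design along these lines does not depend on installing an operator or Custom Resource Definitions in the cluster, which is the constraint the article describes.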
The company also created a dedicated user interface for persistent logging and metrics. This UI allows for log analysis without needing an active Ray cluster, helping to mitigate costs associated with idle resources such as GPUs. For observability, Pinterest integrated Ray's metrics with its in-house time series database, Goku, which has APIs that are compliant with OpenTSDB. They also followed Ray's recommendation of persisting logs to AWS S3.
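To give a sense of what OpenTSDB compliance involves, the sketch below builds the JSON body that OpenTSDB's HTTP `/api/put` endpoint accepts: a list of data points, each with `metric`, `timestamp`, `value`, and `tags` fields. The metric name, tag values, and helper names are illustrative assumptions, and whether Goku accepts exactly this payload is not stated in the source.

```python
import json


def to_opentsdb_datapoint(metric, value, timestamp, tags):
    """Build one data point in the shape used by OpenTSDB's /api/put endpoint."""
    if not tags:
        # OpenTSDB rejects data points that carry no tags.
        raise ValueError("OpenTSDB requires at least one tag per data point")
    return {
        "metric": metric,
        "timestamp": int(timestamp),  # Unix epoch seconds
        "value": value,
        "tags": dict(tags),
    }


def encode_batch(samples):
    """Serialise several samples into a single JSON request body."""
    points = [to_opentsdb_datapoint(**sample) for sample in samples]
    return json.dumps(points, sort_keys=True)


# Example sample; the metric name and tags are hypothetical.
example_body = encode_batch(
    [
        {
            "metric": "ray_node_cpu_utilization",
            "value": 42.5,
            "timestamp": 1700000000,
            "tags": {"cluster": "training-a", "node": "head"},
        }
    ]
)
```

The resulting string would be sent as the body of an HTTP POST to the time series database's put endpoint; the network call is left out so that the sketch stays self-contained.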
Pinterest used network isolation with full authentication to ensure proper security. They deployed the Ray Dashboard behind Envoy, their service mesh in the Kubernetes environment, and used TLS with custom modifications for gRPC communications.
The blog post emphasises the importance of incremental improvement, leveraging existing infrastructure, and regular meetings with internal customers to gather feedback. Pinterest reports that the adoption of Ray has led to increased speed in bringing machine learning ideas to production, with this now taking days rather than weeks.
In a presentation at QCon Plus in 2023, Zhe Zang from Anyscale provided additional context on how companies like Pinterest use Ray for large-scale machine-learning tasks. Zang highlighted that Ray's flexibility and ease of use make it particularly attractive for organisations looking to modernise their ML infrastructure. He described how Ray allows data loading and preprocessing to scale seamlessly beyond a single instance, which is crucial for companies with massive datasets. This capability addresses a common bottleneck in ML workflows, where data processing cannot keep up with GPU computation.
Zang also pointed out that Ray's ability to efficiently support heterogeneous compute environments, combining CPU and GPU resources, is a significant advantage for companies like Pinterest that need to optimise their hardware utilisation. He also discussed how Ray's ecosystem, including libraries like Ray Serve, enables rapid prototyping and deployment of ML models, which aligns with Pinterest's reported improvements in development velocity.
DoorDash, another tech company operating at a similar scale to Pinterest, has also undergone a journey to modernise its machine learning infrastructure. Its approach resembles Pinterest's in outline but differs in several details. In a presentation at Ray Summit 2023, Siddarth Kodwani and Kornel Csernai from DoorDash discussed their Ray use case.
Like Pinterest, DoorDash faced challenges with its existing ML-serving platform. Sibyl, a Kotlin microservice deployed in Kubernetes, was optimised for high-throughput, low-latency use cases but lacked flexibility for newer ML paradigms and libraries. DoorDash's solution, Argil, shares some similarities with Pinterest's approach. Both companies built custom controllers to manage Ray clusters and jobs and integrated them with their existing Kubernetes infrastructure.
However, DoorDash strongly felt they should create a self-serve platform for their data scientists and ML engineers. They developed a client library in Kotlin to make it easier for their predominantly Kotlin-based services to interact with the Ray infrastructure. DoorDash also faced unique challenges with GPU accessibility in their Kubernetes environment, which required close collaboration with Nvidia to resolve driver compatibility issues.
One notable difference between Pinterest's and DoorDash's rollouts is the deployment strategy. While Pinterest focused on building a custom solution from the ground up, DoorDash leveraged existing tools like Helm charts provided by the Ray team and Argo CD for deployment management.
DoorDash reported benefits similar to those of Pinterest, such as increased velocity and flexibility. They reduced the time to take an ML idea to production from weeks to a couple of days. They also saw significant performance gains, with 10-20x improvements for some use cases migrated from their previous system to Ray. Both companies emphasised the importance of observability and monitoring. While Pinterest integrated with their custom Goku time series database, DoorDash mentioned straightforward integration with Prometheus for metrics collection.
In conclusion, while Pinterest and DoorDash took slightly different paths in adopting Ray, both companies have reported significant improvements in their ML infrastructure flexibility, development velocity, and performance.