Instacart created the next generation of its machine-learning platform based on experience operating the original Griffin platform. The company wanted to improve the user experience and help manage all ML workloads. The revamped platform leverages the latest developments in MLOps and introduces new capabilities for current and future applications.
Instacart introduced its original Griffin platform in 2022 to support its journey toward leveraging machine learning for product development. The unified platform helped triple the number of ML applications within a year, as it provided key capabilities including containerized environments, workflow management, a feature marketplace, and real-time inference. Despite these benefits, the first incarnation of the ML platform proved cumbersome for machine learning engineers (MLEs) to use due to relatively complex tooling, a fragmented user experience, a lack of standardization and metadata management, and insufficient scalability.
The company wanted to address the deficiencies identified in the first-generation platform and focus on providing a unified and scalable platform with a great user experience that would offer new and emerging capabilities such as distributed training and fine-tuning of Large Language Models (LLMs).
Griffin 1.0 Architecture Overview (Source: Instacart Technology Blog)
The second version of the Griffin platform replaced CLI and Git-based tooling with a service-oriented architecture exposing REST APIs. These APIs are consumed by the web UI, which provides a seamless experience for ML engineers, and by the Griffin SDK, which enables integrating other tools with Griffin, for instance, BentoLM, Instacart’s in-house cloud-based ML notebook development environment.
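The article does not show the API itself, but a REST-based platform like this is typically driven through resource-style endpoints shared by the UI and the SDK. Below is a minimal Python sketch of such a client call; the host, endpoint path, payload fields, and `submit_training_job` helper are illustrative assumptions, not Instacart's actual API.

```python
import requests

# Hypothetical Griffin-style REST client; endpoint paths and payload
# fields are illustrative assumptions, not Instacart's actual API.
GRIFFIN_API = "https://griffin.internal.example.com/api/v1"

def submit_training_job(model_name: str, runtime: str, config: dict) -> str:
    """Submit a training job and return its job ID."""
    response = requests.post(
        f"{GRIFFIN_API}/training-jobs",
        json={"model_name": model_name, "runtime": runtime, "config": config},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["job_id"]

job_id = submit_training_job(
    model_name="item-availability",
    runtime="lightgbm",
    config={"num_leaves": 64, "learning_rate": 0.05},
)
print(f"Submitted training job {job_id}")
```

A shared API layer of this shape means the web UI and the SDK exercise exactly the same code paths, so validation and metadata handling stay consistent regardless of how an engineer interacts with the platform.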
The platform backend comprises three major subsystems. The machine learning training platform (MLTP) leverages Ray to offer a horizontally scalable computing environment that supports distributed training of ML models. MLTP unifies various training backends on Kubernetes and provides configuration-based runtimes for TensorFlow and LightGBM.
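To show the pattern that makes Ray a good fit for horizontally scalable training, here is a minimal, generic Ray example of fanning training tasks out across a cluster; the `train_fold` function and its return value are placeholders rather than MLTP code.

```python
import ray

ray.init()  # connects to an existing cluster, or starts a local one

@ray.remote
def train_fold(fold_id: int, learning_rate: float) -> float:
    # Stand-in for per-fold training; a real MLTP runtime would fit
    # a LightGBM or TensorFlow model here and return its metric.
    return 1.0 / (fold_id + 1) * learning_rate

# Fan eight training tasks out across the cluster in parallel
# and gather their validation scores.
futures = [train_fold.remote(i, 0.05) for i in range(8)]
scores = ray.get(futures)
print(f"mean validation score: {sum(scores) / len(scores):.4f}")
```

Because Ray schedules these tasks across whatever nodes the Kubernetes cluster provides, the same code scales from a laptop to a fleet of machines without changes, which is what allows MLTP to expose training capacity through configuration alone.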
The machine learning serving platform (MLSP) provides streamlined and automated model artifact storage, model deployment, and provisioning of inference services. MLSP allows fine-tuning of service resources and scalability configuration, resulting in a quick, low-maintenance way to make ML models available at scale.
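The source describes these resource and scalability knobs without showing them, so the following sketch illustrates what such a deployment spec could look like; the `griffin_sdk` module, `deploy_model` function, and every field name are hypothetical, chosen to reflect typical serving-platform configuration rather than MLSP's real interface.

```python
# Hypothetical SDK module and deployment spec; names and fields are
# illustrative of typical serving-platform configuration, not MLSP's.
from griffin_sdk import deploy_model  # assumed import, not a real package

deployment = deploy_model(
    model_artifact="s3://models/item-availability/v42",
    service_name="item-availability-inference",
    resources={"cpu": "2", "memory": "4Gi", "gpu": 0},
    autoscaling={
        "min_replicas": 2,       # keep a baseline for low-latency serving
        "max_replicas": 20,      # cap cost during traffic spikes
        "target_cpu_percent": 60,
    },
)
print(deployment.endpoint_url)
```

Keeping deployment down to a declarative spec like this is what makes the approach low-maintenance: engineers state resource and scaling intent, and the platform handles provisioning, rollout, and autoscaling behind the scenes.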
The feature store supports feature computation, ingestion, discoverability, and shareability. With the new UI-based workflow, users can configure new feature sources and fine-tune feature computation. Feature data validation helps catch errors much earlier, and storage optimization provides low-latency access to features.
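As a rough sketch of what a UI-configured feature source plus validation might translate to, the example below pairs a hypothetical feature-source definition with a simple bounds check; all keys and the `validate_row` helper are assumptions, not Griffin's actual schema.

```python
# Hypothetical feature-source configuration; keys are illustrative
# of common feature-store concepts, not Griffin's actual schema.
feature_source = {
    "name": "user_order_stats",
    "source": {"type": "snowflake", "table": "analytics.user_orders"},
    "features": [
        {"name": "orders_last_30d", "dtype": "int", "aggregation": "count"},
        {"name": "avg_basket_size", "dtype": "float", "aggregation": "mean"},
    ],
    "validation": {
        "orders_last_30d": {"min": 0},          # catch negative counts early
        "avg_basket_size": {"min": 0.0, "max": 500.0},
    },
    "materialization": {"online_store": "redis", "ttl_hours": 24},
}

def validate_row(row: dict, rules: dict) -> list[str]:
    """Return a list of violations so bad feature data fails fast."""
    errors = []
    for feature, bounds in rules.items():
        value = row.get(feature)
        if value is None:
            errors.append(f"{feature}: missing")
        elif "min" in bounds and value < bounds["min"]:
            errors.append(f"{feature}: below min")
        elif "max" in bounds and value > bounds["max"]:
            errors.append(f"{feature}: above max")
    return errors

print(validate_row({"orders_last_30d": -1, "avg_basket_size": 12.3},
                   feature_source["validation"]))
```

Running checks like these at ingestion time is what lets bad feature data surface before it reaches training or online inference, rather than showing up later as degraded model quality.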
Griffin 2.0 Architecture Overview (Source: Instacart Technology Blog)
All platform capabilities are available to ML engineers in a unified web UI application where they can create ML services end-to-end, with built-in validation at different stages helping them rectify errors early in the process.
Rajpal Paryani, engineering manager at Instacart, summarizes the company’s journey in building and operating ML platforms:
Since our early days, we’ve witnessed rapid advancements in MLOps. The emergence of technologies like ChatGPT has revolutionized the utilization of Large Language Models (LLMs) across various industries. Our company is at the forefront of these developments. The guiding principles behind Griffin 2.0 [...] ensure that our ML infrastructure is well-prepared for advanced applications like LLM training, fine-tuning, and serving in the future.