Instacart Creates a Self-Serve Apache Flink Platform on Kubernetes

Instacart moved its Apache Flink workloads from AWS EMR to Kubernetes to meet growing internal demand for Flink-based data processing, after EMR proved problematic for the many teams with differing requirements. The move made the platform easier to use and reduced both operational and infrastructure costs.

The company has been using Apache Flink on AWS EMR since 2021 for several use cases, ranging from real-time decision-making and data augmentation to machine learning feature generation and OLAP data ingestion. Flink usage grew gradually, to the point where 50 product teams run hundreds of pipelines on it.

Source: https://tech.instacart.com/building-a-flink-self-serve-platform-on-kubernetes-at-scale-c11ef19aef10

As more teams adopted the technology, the limitations of EMR became apparent: it lacked secret and config management, fine-grained permissions, support for multiple Flink versions, and CI/CD support. Furthermore, cluster autoscaling and job failure recovery did not work well on EMR, and users had to interact with cluster nodes over SSH without any security or auditing tools.

Sylvia Lin, a data infrastructure engineer at Instacart, highlights the challenges of scaling Flink on EMR:

"To meet the high demand, we needed to delegate job ownership to product teams and make our platform self-serve. Running Flink on EMR did not scale to meet such a high demand. In addition, the lack of native tooling makes Flink self-serve difficult for running on EMR."

The team decided to run Flink clusters on Kubernetes primarily because of its built-in fault tolerance and autoscaling, the available tooling, and a strong community. They adopted the Flink Kubernetes operator, released in early 2022, and built a custom controller to help with service provisioning and onboarding. Moving to Kubernetes gave them more control over Flink development and deployment, including automatic builds and deployments, secret and config management, and service isolation.
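To make the operator-based setup more concrete, the following is a minimal sketch of how a self-serve layer might turn a few team-supplied parameters into a FlinkDeployment resource for the Flink Kubernetes operator. It is not Instacart's controller, whose implementation the source does not describe. The function name `build_flink_deployment`, the default resource sizes, the `flink` service account, and all example values are assumptions made for illustration; only the resource kind and field names follow the operator's FlinkDeployment custom resource.

```python
from typing import Any, Dict


def build_flink_deployment(
    name: str,
    namespace: str,
    image: str,
    jar_uri: str,
    parallelism: int = 2,
    flink_version: str = "v1_17",
) -> Dict[str, Any]:
    """Build a FlinkDeployment manifest as a plain dict.

    Hypothetical helper: the defaults below are illustrative and not
    taken from Instacart's platform.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return {
        "apiVersion": "flink.apache.org/v1beta1",
        "kind": "FlinkDeployment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "image": image,
            "flinkVersion": flink_version,
            "flinkConfiguration": {"taskmanager.numberOfTaskSlots": "2"},
            # Assumed service account name; it must exist in the namespace.
            "serviceAccount": "flink",
            "jobManager": {"resource": {"memory": "2048m", "cpu": 1}},
            "taskManager": {"resource": {"memory": "2048m", "cpu": 1}},
            "job": {
                "jarURI": jar_uri,
                "parallelism": parallelism,
                "upgradeMode": "stateless",
            },
        },
    }


# Example with made-up values for a hypothetical team pipeline.
manifest = build_flink_deployment(
    name="orders-enrichment",
    namespace="team-orders",
    image="registry.example.com/orders-enrichment:1.0.0",
    jar_uri="local:///opt/flink/usrlib/orders-enrichment.jar",
    parallelism=4,
)
```

Serializing the resulting dict to YAML and applying it to a cluster where the operator is installed would hand cluster creation, job submission, and upgrades to the operator, which is the kind of per-team isolation and automation the move to Kubernetes was meant to enable.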

The team uses Karpenter for Kubernetes cluster node management to meet complex resource-isolation requirements. Karpenter can provision right-sized nodes from the start and provides better bin packing for Flink tasks than the standard cluster autoscaler.
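The bin-packing argument can be illustrated with a toy calculation. The sketch below is not Karpenter's algorithm or API. It only compares, under simplified assumptions, packing pods onto nodes of one fixed size against choosing a single node sized for the pending pods. The function names, the node-size catalog, and the pod CPU requests are all hypothetical.

```python
from typing import List


def first_fit_decreasing(pod_cpus: List[int], node_cpu: int) -> List[List[int]]:
    """Pack pods onto identical fixed-size nodes, largest pods first."""
    nodes: List[List[int]] = []
    for cpu in sorted(pod_cpus, reverse=True):
        if cpu > node_cpu:
            raise ValueError("pod does not fit on a single node")
        for node in nodes:
            if sum(node) + cpu <= node_cpu:
                node.append(cpu)
                break
        else:
            # No existing node has room, so add another fixed-size node.
            nodes.append([cpu])
    return nodes


def smallest_fitting_node(pod_cpus: List[int], sizes: List[int]) -> int:
    """Pick the smallest node size from a catalog that holds all pods."""
    total = sum(pod_cpus)
    for size in sorted(sizes):
        if size >= total:
            return size
    raise ValueError("no single node size fits all pods")


# Hypothetical CPU requests for three Flink TaskManager pods.
pods = [3, 3, 2]

# Fixed 4-CPU nodes: no two of these pods share a node.
fixed = first_fit_decreasing(pods, node_cpu=4)
fixed_utilization = sum(pods) / (len(fixed) * 4)

# Right-sized: one node chosen from an assumed catalog of sizes.
right_sized = smallest_fitting_node(pods, sizes=[4, 8, 16])
right_sized_utilization = sum(pods) / right_sized
```

With these made-up numbers, the fixed-size approach needs three 4-CPU nodes for 8 CPUs of requests, while a single 8-CPU node holds all three pods with no idle capacity. Real schedulers also account for memory, daemon overhead, and isolation constraints, which this sketch ignores.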

Following the transition from EMR to Kubernetes, pipeline onboarding time dropped from one week to a few minutes, and the removal of many manual steps reduced issues caused by human error. The team calculated that it had saved 50 weeks of development effort, reduced operational effort by 20%, and improved developer productivity by 15%. Additionally, they reduced infrastructure costs for compute instances by 50-70%. Lastly, thanks to automatic failure recovery, the team no longer has to respond to critical night-time incidents, which were previously common.

About the Author

Rafal Gancarz
