Figma migrated its compute platform from AWS ECS to Kubernetes (EKS) in less than 12 months with minimal customer impact. The company decided to adopt Kubernetes to run its containerized workloads primarily to take advantage of the large ecosystem supported by the CNCF. The move was additionally motivated by the pursuit of cost savings, improved developer experience, and increased resiliency.
Figma had moved its application services into containers and adopted Elastic Container Service (ECS) as its container orchestration platform by early 2023. ECS allowed the company to roll out containerized workloads quickly, but engineers have since run into its limitations, mainly the lack of support for StatefulSets and Helm charts and the difficulty of running open-source software like Temporal.
Moreover, the company recognized it was missing out on the wide range of capabilities offered for Kubernetes within the CNCF community, including advanced autoscaling with KEDA or Karpenter, service meshes such as Istio/Envoy, and numerous other tools and features. The organization also considered the substantial engineering effort required to customize ECS for its needs, as well as the availability of engineers with Kubernetes experience in the job market.
Kubernetes Migration Timeline (Source: Figma Engineering Blog)
After deciding to switch to Kubernetes (EKS), the team agreed on the scope of the migration, focusing on minimizing the changes required to services to avoid delays and risks. Despite limiting the project's scope, the company wanted to include specific improvements, such as simplified resource definitions for a better developer experience and improved reliability by splitting the deployment across three Kubernetes clusters to limit the blast radius of bugs and operator errors.
Ian VonSeggern, software engineering manager at Figma, discusses the cost optimization goals of the migration project:
We didn’t want to tackle too much complex cost-efficiency work as part of this migration, with one exception: We decided to support node auto-scaling out of the gate. For our ECS on EC2 services, we simply over-provisioned our services so we had enough machines to surge up during a deploy. Since this was an expensive setup, we decided to add this additional scope to the migration because we were able to save a significant amount of money for relatively low work. We used the open-source CNCF project Karpenter to scale up and down nodes dynamically for us based on demand.
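The article does not include Figma's configuration, but node autoscaling with Karpenter is typically declared through a NodePool resource that tells Karpenter which kinds of EC2 capacity it may launch and when it may consolidate nodes. The sketch below is a hypothetical example; the name, requirements, and limits are illustrative, and the exact schema depends on the Karpenter version in use:

```yaml
# Hypothetical Karpenter NodePool: lets Karpenter launch and remove EC2 nodes
# based on pending pods instead of keeping an over-provisioned static fleet.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose            # illustrative name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
      nodeClassRef:                # references an EC2NodeClass with AMI/subnet/security-group settings
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"                    # illustrative cap on total provisioned CPU
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

With a definition like this, Karpenter launches right-sized nodes for unschedulable pods and consolidates or removes them as demand drops, replacing the static over-provisioning described in the quote.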
To ensure a successful outcome, Figma created a well-staffed team to drive the migration effort and engage with the broader organization to secure its buy-in. The engineers prepared for the production rollout by load-testing the Kubernetes setup to avoid surprises, implementing an incremental switchover mechanism using weighted DNS entries, and deploying services into the staging Kubernetes cluster early in the process to iron out any issues. The compute platform team worked with service owners to provide a golden path and ensure consistency and ease of maintenance.
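Figma's exact switchover tooling is not described, but weighted DNS cutovers on AWS are commonly implemented with Route 53 weighted record sets, pointing a small share of traffic at the new endpoint and increasing the weight as confidence grows. A hypothetical sketch using the AWS CLI (the zone ID, record names, and weights are illustrative):

```bash
# Hypothetical weighted switchover: send 10% of traffic to the EKS endpoint,
# keep 90% on ECS, then adjust the weights on subsequent runs.
aws route53 change-resource-record-sets \
  --hosted-zone-id ZEXAMPLE123 \
  --change-batch '{
    "Changes": [
      { "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "service.internal.example.com", "Type": "CNAME",
          "SetIdentifier": "eks", "Weight": 10, "TTL": 60,
          "ResourceRecords": [{ "Value": "service-eks.internal.example.com" }] } },
      { "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "service.internal.example.com", "Type": "CNAME",
          "SetIdentifier": "ecs", "Weight": 90, "TTL": 60,
          "ResourceRecords": [{ "Value": "service-ecs.internal.example.com" }] } }
    ]
  }'
```

Because each record carries its own SetIdentifier and Weight, the rollout can be advanced gradually or rolled back simply by re-running the command with different weights.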
The initial migration took less than 12 months, and after migrating the core services, the team started looking into follow-up work such as introducing KEDA-based autoscaling. Additionally, based on user feedback, engineers simplified developer tooling to work with the three Kubernetes clusters and the new fine-grained RBAC roles.
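KEDA-based autoscaling is usually configured per workload with a ScaledObject that scales a Deployment on an external metric rather than CPU alone. The following is a minimal, hypothetical sketch, not Figma's actual configuration; the target Deployment, queue, and thresholds are illustrative:

```yaml
# Hypothetical KEDA ScaledObject: scales a worker Deployment based on the
# depth of an SQS queue.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: async-worker-scaler
  namespace: workers
spec:
  scaleTargetRef:
    name: async-worker             # illustrative Deployment name
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-west-2.amazonaws.com/111122223333/jobs   # illustrative
        queueLength: "100"         # target messages per replica
        awsRegion: us-west-2
      authenticationRef:
        name: keda-aws-credentials # hypothetical TriggerAuthentication
```

The fine-grained access control mentioned above would similarly be expressed with namespaced Role and RoleBinding objects scoped to individual teams or clusters.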