Netflix has successfully implemented a federated GraphQL API at scale. In a recent blog post series and QConPlus talk, engineers from Netflix describe their journey and the lessons learned in the process.
Netflix's software system is composed of hundreds of independent microservices that evolve at different paces and scale separately. It employs a unified API aggregation layer (or API gateway) that encapsulates the service structure and hides its complexity from UI developers. However, as the system grew in complexity, the API gateway became increasingly hard for a single team to maintain. It had effectively become a monolith.
To solve this problem, Netflix decided to employ a solution based on GraphQL Federation. With this approach, the API gateway implementation is distributed to the backend teams that own the individual domain services, via Domain Graph Services (DGSs) that those teams implement. This change allows the federated gateway to delegate all domain-specific business logic to the DGSs. The gateway itself only handles query planning and centralized tasks such as logging and monitoring.
Source: https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-1-ae3557c187e2
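To make the division of labor concrete, here is a minimal sketch of a gateway that holds no business logic of its own: it only records which service owns which field, delegates to that owner, and adds centralized logging. This is an illustration under stated assumptions, not Netflix's implementation; the `Gateway` class, the service names, and the field names are all hypothetical.

```python
from typing import Callable, Dict, List

# A resolver is a plain function owned by a domain team
# (a hypothetical stand-in for a real DGS endpoint).
Resolver = Callable[[dict], object]


class Gateway:
    """Toy gateway: routes each field to its owning service and logs the call."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}          # field name -> owning DGS name
        self._resolvers: Dict[str, Resolver] = {}  # field name -> DGS resolver
        self.log: List[str] = []                   # centralized log, kept in memory here

    def register(self, dgs_name: str, fields: Dict[str, Resolver]) -> None:
        """A domain team registers the fields it owns; double ownership is rejected."""
        for field, resolver in fields.items():
            if field in self._owners:
                raise ValueError(f"{field} is already owned by {self._owners[field]}")
            self._owners[field] = dgs_name
            self._resolvers[field] = resolver

    def execute(self, field: str, args: dict) -> object:
        """Delegate to the owning DGS; the gateway contributes only logging."""
        if field not in self._owners:
            raise KeyError(f"no DGS owns field {field!r}")
        self.log.append(f"{field} -> {self._owners[field]}")
        return self._resolvers[field](args)


gateway = Gateway()
gateway.register("movies-dgs", {"movieTitle": lambda args: f"title-of-{args['id']}"})
gateway.register("talent-dgs", {"actorName": lambda args: f"actor-{args['id']}"})
```

The point of the design is that adding a new domain only requires a new `register` call by the owning team; the gateway code itself does not change.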
They decided to implement the first federated gateway for the Netflix Studio API. The Studio API is responsible for managing business processes from the time a TV show or a movie is pitched to when it’s available on Netflix. The engineers state that "GraphQL and Federation have been a productivity multiplier." They summarize the process as follows:
"Despite our positive experience, GraphQL Federation is early in its maturity lifecycle and may not be the best fit for every team or organization. Learning GraphQL and DGS development, running a federation layer, and doing a migration requires high commitment from partner teams and seamless cross-functional collaboration. (...) For ecosystems like ours with a large swath of microservices that need to be aggregated together, the development velocity and improved operability has made the transition worth it."
The following summarizes Netflix's evolution to a Federated GraphQL implementation:
- Initially, a monolithic backend implemented all of Netflix's functionality. The complexity of the monolith grew over time until it was too hard to maintain.
- Later, the backend monolith was decomposed into multiple microservices, which were exposed directly to UI developers. This decomposition proved worthwhile for backend development but dramatically increased the complexity for clients.
- A gateway aggregation layer (API gateway) was added as an abstraction to hide the complexity from the microservices' clients. This layer itself grew into a monolith as the API complexity increased.
- Federated GraphQL is used to decompose the gateway monolith, allowing domain developers to maintain their portion of the gateway while still exposing a unified API to clients.
Netflix's engineers admit that migrating to GraphQL Federation has had its challenges:
"The biggest challenge was aligning on this strategy across the organization. Initially, there was a lot of skepticism and dissent; the concept was fairly new and would require high alignment across the organization to be successful. Our team spent a lot of time addressing dissenting points and making adjustments to the architecture based on feedback from developers. Through our prototype development and proactive partnership with some key critical voices, we were able to instill confidence and close crucial gaps."
They detail some steps that they took to handle these challenges better:
- Core infrastructure - their GraphQL gateway is written in Kotlin. This choice gives them access to Netflix's Java ecosystem while keeping Kotlin's more expressive language features. They also developed a schema registry, itself written in Kotlin, for managing the GraphQL schemas; it uses the event sourcing pattern on top of the Cassandra database.
- Developer experience - with the new architecture, every DGS team needs to learn and build GraphQL services. To improve the experience, they made a framework for easier authoring of these services. The framework is due to be open-sourced in early 2021.
- Schema governance - they had a Studio Data Architect focused on data modeling and alignment across the organization. Also, they employ a collaborative design process that involves feedback and reviews across team boundaries.
- Observability - Netflix's engineers "integrated the Gateway and DGS architectural components with Zipkin, the internal distributed tracing tool Edgar, and application monitoring tool TellTale." Also, they employed a distributed log correlation mechanism in the gateway to help with debugging more complex server issues.
- Security - authorization is delegated to DGS owners, in contrast to past situations where different layers might implement the same rule. There is a single implementation for each authorization rule, resulting in consistent authorization for the same user across various applications.
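The event sourcing pattern mentioned under core infrastructure can be sketched as follows: schema changes are stored as an append-only sequence of events, and the current set of schemas is rebuilt by replaying them. This is a minimal illustration only; the `SchemaEvent` and `SchemaRegistry` names and the event shape are hypothetical, and an in-memory list stands in for the Cassandra-backed store described in the article.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SchemaEvent:
    """One immutable fact about a service's schema (hypothetical event shape)."""
    service: str
    kind: str       # "published" or "removed"
    sdl: str = ""   # schema text; left empty for "removed" events


class SchemaRegistry:
    def __init__(self) -> None:
        # Append-only event log; events are never updated or deleted.
        self._events: List[SchemaEvent] = []

    def publish(self, service: str, sdl: str) -> None:
        self._events.append(SchemaEvent(service, "published", sdl))

    def remove(self, service: str) -> None:
        self._events.append(SchemaEvent(service, "removed"))

    def current(self, up_to: Optional[int] = None) -> Dict[str, str]:
        """Rebuild state by replaying the first `up_to` events (all if None)."""
        state: Dict[str, str] = {}
        for event in self._events[:up_to]:
            if event.kind == "published":
                state[event.service] = event.sdl
            else:
                state.pop(event.service, None)
        return state


registry = SchemaRegistry()
registry.publish("movies", "type Movie { id: ID }")
registry.publish("movies", "type Movie { id: ID title: String }")
registry.publish("talent", "type Talent { id: ID }")
registry.remove("talent")
```

Because the log is never rewritten, `registry.current(up_to=1)` reconstructs the registry as it was after the first change, which is one reason the pattern is attractive for auditing schema history.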
At the base of the implementation is GraphQL Federation. GraphQL Federation works by splitting up the responsibility and query execution for a schema between multiple services. The gateway then merges these services' query results using agreed-upon identifiers, hiding the complexity from clients. Netflix's engineers provide an example composed of three primary services - a Movies DGS responsible for movie data, a Production DGS responsible for managing production-related data, and a Talent DGS managing the data on talent working on a movie (actors, directors, and so on). Given these services, a simple query for the movie title along with the production id and actors' names will produce the following query plan:
(Figure: query plan for the example query. Source: https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-1-ae3557c187e2)
The federated gateway is responsible for breaking the client query into separate queries for the involved DGSs, executing them sequentially or in parallel in an optimized manner, and then stitching the results back together, abstracting away the complexity. Since domain teams are responsible for their business logic, the API as a whole can evolve at a much faster pace.
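The split, execute, and stitch steps can be sketched for the three-service example as follows. This is a simplified illustration, not the plan the real gateway produces: the three functions are hypothetical in-memory stand-ins for the Movies, Production, and Talent DGSs, the data is made up, and `movieId` is assumed to be the agreed-upon identifier.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for the three DGSs; a real gateway would call them over the network.
def movies_dgs(movie_id: str) -> dict:
    return {"movieId": movie_id, "title": "Example Movie"}


def production_dgs(movie_id: str) -> dict:
    return {"movieId": movie_id, "production": {"productionId": "prod-1"}}


def talent_dgs(movie_id: str) -> dict:
    return {"movieId": movie_id, "actors": [{"name": "Actor A"}, {"name": "Actor B"}]}


def execute_plan(movie_id: str) -> dict:
    # Step 1 (sequential): fetch the movie first, since the other lookups need its key.
    movie = movies_dgs(movie_id)
    key = movie["movieId"]

    # Step 2 (parallel): the remaining services are independent of each other.
    with ThreadPoolExecutor(max_workers=2) as pool:
        production_future = pool.submit(production_dgs, key)
        talent_future = pool.submit(talent_dgs, key)
        parts = [production_future.result(), talent_future.result()]

    # Step 3 (stitch): merge the partial results on the shared identifier.
    result = {"movieId": key, "title": movie["title"]}
    for part in parts:
        if part["movieId"] != key:
            raise ValueError("partial result does not match the requested movie")
        result.update({k: v for k, v in part.items() if k != "movieId"})
    return result
```

Calling `execute_plan("m1")` returns a single merged dictionary, so the caller never sees that three services were involved.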