Evolving the Federated GraphQL Platform at Netflix

Key Takeaways

  • Federated GraphQL distributes the ownership of the graph across several teams. This requires all teams to adopt and learn federated GraphQL and can be accomplished by providing a well-rounded ecosystem for developer and schema workflows.
  • Before you start building custom tools, it helps to use existing tools and resources adopted by the community and gradually work with initial adopters to identify gaps.
  • The (D)omain (G)raph (S)ervices (DGS) Framework is a Spring Boot-based Java framework that allows developers to easily build GraphQL services that can then be part of the federated graph. It provides many out-of-the-box integrations with the Netflix ecosystem, but the core is also available as an open-source project to the community.
  • As more teams came on board, we had to keep up with the scale of development. In addition to helping build GraphQL services, we also needed to replace manual workflows for schema collaboration with more tools that address the end-to-end schema development workflow to help work with the federated graph.
  • Today we have more than 200 services that are part of the federated graph. Federated GraphQL continues to be a success story at Netflix. We are migrating our Streaming APIs to a federated architecture and continue to invest in improving the performance and observability of the graph.

In this article, we describe our migration journey toward a Federated GraphQL architecture. Specifically, we cover the GraphQL platform we built, which consists of the Domain Graph Services (DGS) Framework for implementing GraphQL services in Java using Spring Boot and graphql-java, along with a set of tools for schema development. We also describe how the ecosystem has evolved at various stages of adoption.

Why Federated GraphQL?

Netflix has been evolving its slate of original content over the past several years. Many teams within the studio organization work on applications and services to facilitate production, such as talent management, budgeting, and post-production.

In a few cases, teams had independently created their APIs using gRPC, REST, and even GraphQL. Clients would have to talk to one or more backend services to fetch the data they needed. There was no consistency in implementation, and we had many ways of fetching the same data resulting in multiple sources of truth. To remedy this, we created a unified GraphQL API backed by a monolith. The client teams could then access all the data needed through this unified API, and they only needed to talk to a single backend service. The monolith, in turn, would do all the work of communicating with all the required backends to fetch and return the data in one GraphQL response.

However, this monolith did not scale as more teams added their data behind this unified GraphQL API. It required domain knowledge to determine how to translate the incoming requests to corresponding calls out to the various services. This created maintenance and operational burden on the team maintaining this graph. In addition, the evolution of the schema was also not owned by the product teams primarily responsible for the data, which resulted in poorly designed APIs for clients.

We wanted to explore a different ownership model, such that the teams owning the data could also be responsible for their GraphQL API while still maintaining the unified GraphQL API for client developers to interact with (see Figure 1).

Figure 1: Federated ownership of graph

In 2019, Apollo released the Federation Spec, allowing teams to own subgraphs while remaining part of a single graph. In other words, the ownership of the graph could be federated across multiple teams by breaking apart the various resolvers handling the incoming GraphQL requests. Instead of the monolith, we could now use a simple gateway that routes requests to GraphQL backends, called Domain Graph Services (DGSs), each of which serves a subgraph. Each DGS handles fetching the data from the corresponding backends owned by the same team (see Figure 2). We started experimenting with a custom implementation of the Federated GraphQL Gateway and worked with a few teams to migrate to this new architecture.

Figure 2: Federated GraphQL Architecture

A couple of years later, we now have more than 200 services that are part of this graph. Not only have we expanded the Studio Graph, but we have also created graphs for other domains: one for our internal platform tools and another for migrating our streaming APIs.

The Early Adoption Phase

When we started the migration, around 40 teams were already part of the unified graph served by our GraphQL monolith. We asked all these teams to migrate to a completely new architecture, which required knowledge of Federated GraphQL - an entirely new concept for us as well. Providing a great developer experience was key to successfully driving adoption at this scale.

Initially, a few teams opted to onboard onto the new architecture. We swarmed with the developers on the team to better understand the developer workflow, the gaps, and the tools required to bridge the knowledge gap and ease the migration process.

Our goal was to make it as easy as possible for adopters to implement a new GraphQL service and make it part of the federated graph. We gradually built out a GraphQL platform consisting of several tools and frameworks, and we continued to evolve it through the various stages of adoption.

Our Evolving GraphQL Ecosystem

The Domain Graph Services (DGS) Framework is a Spring Boot library based on graphql-java that allows developers to easily wire up GraphQL resolvers for their schema. Initially, we created this framework to provide Netflix-specific integrations for security, metrics, and tracing out of the box for developers at Netflix. In addition, we wanted to eliminate the manual wiring of resolvers, which we could streamline using custom DGS annotations as part of the core. Figure 3 shows the modular architecture of the framework with several opt-in features for developers to choose from.

Figure 3: DGS Framework Architecture

When using the DGS Framework, developers can simply focus on the business domain logic and less on learning all the specifics of GraphQL. In addition, we created a code generation Gradle plugin for generating Java or Kotlin classes that represent the schema. This eliminates the manual creation of these classes.
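To make the idea of annotation-based resolver wiring more concrete, here is a minimal sketch using only the Java standard library. It is not the DGS Framework's implementation or API: the `@ResolverFor` annotation, the `ResolverRegistry` class, and the `showTitle` field are all hypothetical names chosen for illustration. The sketch only shows the general technique of discovering annotated methods through reflection so that developers do not have to register each resolver by hand.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical marker annotation; this is NOT one of the DGS Framework's annotations.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface ResolverFor {
    String type();
    String field();
}

// A team's business logic: plain methods tagged with the schema field they serve.
class ShowResolvers {
    @ResolverFor(type = "Query", field = "showTitle")
    public String showTitle(String id) {
        return "Title for show " + id;
    }
}

// Scans a component for annotated methods and registers each one under "Type.field".
class ResolverRegistry {
    private final Map<String, Function<String, Object>> resolvers = new HashMap<>();

    void register(Object component) {
        for (Method method : component.getClass().getDeclaredMethods()) {
            ResolverFor mapping = method.getAnnotation(ResolverFor.class);
            if (mapping == null) {
                continue; // not a resolver method
            }
            method.setAccessible(true);
            String key = mapping.type() + "." + mapping.field();
            resolvers.put(key, argument -> {
                try {
                    return method.invoke(component, argument);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Resolver failed: " + key, e);
                }
            });
        }
    }

    Object resolve(String type, String field, String argument) {
        Function<String, Object> resolver = resolvers.get(type + "." + field);
        if (resolver == null) {
            throw new IllegalArgumentException("No resolver for " + type + "." + field);
        }
        return resolver.apply(argument);
    }
}

class WiringDemo {
    public static void main(String[] args) {
        ResolverRegistry registry = new ResolverRegistry();
        registry.register(new ShowResolvers());
        // prints "Title for show 42"
        System.out.println(registry.resolve("Query", "showTitle", "42"));
    }
}
```

In this sketch, the team writing `ShowResolvers` never touches the registry's internals; adding a new field means adding one annotated method. A real framework would also handle typed arguments, nested types, and error reporting, which are omitted here.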

Over time, we added more features that were more generally useful, even for service developers outside Netflix. We decided to open-source the DGS Framework and the DGS code generation plugin in early 2021 and continue evolving it. We also created the DGS IntelliJ plugin that provides navigation from schema to implementation of data resolvers and code completion snippets for implementing your DGS.

Having tackled implementing a GraphQL service, the next step was to register the DGS so it could be part of the federated graph. We implemented a schema registry service to map the schemas to their corresponding services. The federated gateway uses the schema registry to determine which services to reach out to, given an incoming query. We also implemented a self-service UI for this schema registry service to help developers manage their DGSs and discover the federated schema.
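The role of the schema registry in routing can be illustrated with a small, self-contained sketch. This is an assumption-laden simplification, not the actual registry or gateway: the `SchemaRegistry` class, the service names (`talent-dgs`, `budget-dgs`), and the field names are hypothetical, and a real federated gateway plans queries against a composed schema rather than a flat field-to-service map.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory registry: maps each top-level schema field to the service that owns it.
class SchemaRegistry {
    private final Map<String, String> ownerByField = new LinkedHashMap<>();

    // Registers the fields a service contributes; rejects a field already owned by another service.
    void register(String serviceName, List<String> fields) {
        for (String field : fields) {
            String existing = ownerByField.putIfAbsent(field, serviceName);
            if (existing != null && !existing.equals(serviceName)) {
                throw new IllegalStateException(field + " is already owned by " + existing);
            }
        }
    }

    // Returns the distinct services a gateway would need to call for the requested fields.
    Set<String> servicesFor(List<String> requestedFields) {
        Set<String> services = new LinkedHashSet<>();
        for (String field : requestedFields) {
            String owner = ownerByField.get(field);
            if (owner == null) {
                throw new IllegalArgumentException("Unknown field: " + field);
            }
            services.add(owner);
        }
        return services;
    }
}

class RegistryDemo {
    public static void main(String[] args) {
        SchemaRegistry registry = new SchemaRegistry();
        registry.register("talent-dgs", List.of("Query.talent"));
        registry.register("budget-dgs", List.of("Query.budget"));
        // A query touching both fields fans out to both services.
        System.out.println(registry.servicesFor(List.of("Query.talent", "Query.budget")));
    }
}
```

Rejecting conflicting ownership at registration time is one possible design choice; it surfaces schema collisions when a service registers rather than when a client query fails.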

Finally, we enhanced our existing observability tools for GraphQL. Our distributed tracing tool allows easy debugging of performance issues and request errors by providing an overall view of the call graph, in addition to the ability to view correlated logs.

Scaling the Federated GraphQL Platform

We started this effort in early 2019, and more than 200 teams now participate in the federated graph. The adoption of the federated GraphQL architecture has been so successful that we ended up creating more graphs for other domains in addition to our Studio Graph. We now have one for internal platform tooling, which we call the Enterprise Graph, and another for our streaming APIs.

After introducing the new Enterprise graph, we quickly realized that teams were interested in exposing the same DGS as part of the Enterprise and Studio graphs. Similarly, clients were interested in fetching data from both graphs. We then merged the Studio and Enterprise graphs into one larger supergraph. This created a different set of challenges for us related to scaling the graph for the size and number of developers.

The larger graph made it harder to scale our schema review process, which had been overseen mostly manually by members of our schema review group. We needed to create more tooling to automate some of these processes. We created a tool called GraphDoctor to lint the schema and automatically comment on PRs related to schema changes for all enrolled services. To help with schema collaboration, we created GraphLabs, which stages a sandboxed environment to test schema changes without affecting the other parts of the graph. This allows both front-end and back-end developers to collaborate on schema changes more rapidly (see Figure 4).

Figure 4: Schema Development Workflow
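As a rough illustration of what automated schema linting can look like, the sketch below checks a simplified schema model against two naming conventions. It does not reflect GraphDoctor's actual rules or implementation, which are not described here: the `SchemaLinter` class, the PascalCase and camelCase rules, and the map-based schema representation are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical linter over a simplified schema model (type name -> field names).
class SchemaLinter {
    private static final Pattern PASCAL_CASE = Pattern.compile("[A-Z][A-Za-z0-9]*");
    private static final Pattern CAMEL_CASE = Pattern.compile("[a-z][A-Za-z0-9]*");

    // Returns one human-readable finding per naming violation, in schema order.
    static List<String> lint(Map<String, List<String>> schema) {
        List<String> findings = new ArrayList<>();
        for (Map.Entry<String, List<String>> type : schema.entrySet()) {
            if (!PASCAL_CASE.matcher(type.getKey()).matches()) {
                findings.add("Type '" + type.getKey() + "' should be PascalCase");
            }
            for (String field : type.getValue()) {
                if (!CAMEL_CASE.matcher(field).matches()) {
                    findings.add("Field '" + type.getKey() + "." + field + "' should be camelCase");
                }
            }
        }
        return findings;
    }
}

class LintDemo {
    public static void main(String[] args) {
        Map<String, List<String>> schema = new LinkedHashMap<>();
        schema.put("Show", List.of("title", "release_date"));
        // Each finding could be posted as a comment on the PR that introduced the change.
        for (String finding : SchemaLinter.lint(schema)) {
            System.out.println(finding);
        }
    }
}
```

Returning findings as plain strings keeps the check independent of where results are reported, so the same rules could run locally, in CI, or as PR comments.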

Developer Support Model

We built the GraphQL platform to facilitate implementing domain graph services and working with the graph. However, this alone would not have been sufficient. We needed to complement the experience with good developer support. Initially, we offered a white-glove migration experience by swarming with the teams and doing much of the migration work for them. This provided many insights into what we needed to build to improve the developer experience. We identified gaps in the existing solutions and filled them with tooling that speeds up implementation by letting developers focus on the business logic and eliminating repetitive code setup.

Once we had a fairly stable platform, we could onboard many more teams at a rapid pace. We also invested heavily in providing our developers with good documentation and tutorials on federation and GraphQL concepts so that they could self-serve easily. We continue to offer developer support on our internal Slack channels during business hours to answer questions and troubleshoot issues as they arise.

Developer Impact

The GraphQL platform provides a completely paved path for the entire workflow: designing the schema, implementing it in a GraphQL service, registering the service as part of the federated graph, and operating the service once deployed. This has helped more teams adopt the architecture, making GraphQL more popular than traditional REST APIs at Netflix. Specifically, Federated GraphQL greatly simplifies data access for front-end developers, allowing teams to move quickly on their deliverables.

Our Learnings

By investing heavily in developer experience, we were able to drive adoption at a much more accelerated pace than we would have otherwise. We started small by leveraging community tools. That helped us identify gaps and where we needed custom functionality. We built the DGS Framework and an ecosystem of tools, such as the code generation plugin and even one for basic schema management.

Having tackled the basic workflow, we could focus our efforts on more comprehensive schema workflow tools. As adoption increased, we were able to identify problems and adapt the platform to make it work with larger graphs and scale it for an increasing number of developers. We automated a part of the schema review process, which has made working with larger graphs easier. We continue to see new use cases emerge and are working to evolve our platform to provide paved-path solutions for the same.

What’s ahead?

So far, we have migrated our Studio architecture to federated GraphQL and merged a new graph for internal platform teams with the Studio Graph to form one larger supergraph. We are now migrating the Netflix streaming APIs that power the discovery experience in the Netflix UI to the same model. This new graph comes with a different set of challenges. The Netflix streaming service is supported across various devices, and the UI is rendered differently on each platform. The schema needs to be well-designed to accommodate the different use cases.

Another major difference is that, unlike the other existing graphs, the streaming services need to handle significantly higher requests per second (RPS). We are identifying performance bottlenecks in the framework and tooling to make our GraphQL services more performant. In parallel, we are also improving our observability tooling to ensure we can operate these services at scale.
