Agoda recently described their unconventional approach to transitioning from a monolithic GraphQL API to a microservices architecture. Unlike traditional methods focusing on breaking down server-side components first, Agoda adopted a client-first strategy, preparing their client applications to handle both the monolith and the microservices in parallel using an in-house smart orchestrator library.
Preparing clients before the server migration reduced both the risks and the coordination effort required. Numan Hanif, Associate Development Manager at Agoda, told InfoQ, "This approach minimized disruptions, empowered our teams with greater control over the full stack, and better aligned the architecture with agile and modern development principles."
Breaking the GraphQL Monolith (Source)
An in-house Smart Orchestrator was a critical component of Agoda's client-first migration strategy, designed to bridge client applications with the backend during the transition. It acted as a dynamic routing layer, directing requests to the monolithic GraphQL API or the newly deployed microservices based on configuration. Agoda's engineers favoured this approach over Apollo Federation, which Netflix, for example, used in a similar scenario.
Publishing the same interfaces as the monolith allowed client applications to operate without modifications, reducing the need for extensive changes during the migration. This setup ensured client applications could handle the mixed backend environment, providing a foundation for parallel microservices development.
The Smart Orchestrator routes client queries based on configuration (Source)
In addition to routing, the Smart Orchestrator managed schema updates and mappings automatically, supporting the incremental nature of the migration. Initially, all schemas pointed to the monolith, but as domains were split, the orchestrator updated mappings to route queries to the correct microservices.
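The routing behaviour described above can be illustrated with a minimal sketch. It assumes routing is decided per top-level GraphQL field and that unmapped fields fall back to the monolith; the class name, URLs, and field names are hypothetical and are not taken from Agoda's library.

```python
from dataclasses import dataclass, field

# Hypothetical endpoint; every field starts out served by the monolith.
MONOLITH_URL = "https://monolith.example.internal/graphql"


@dataclass
class Orchestrator:
    """Routes each top-level GraphQL field to a backend by configuration."""

    # Field name -> endpoint URL. Anything unmapped falls back to the monolith.
    routes: dict[str, str] = field(default_factory=dict)

    def resolve(self, root_field: str) -> str:
        """Return the endpoint that should serve this root field."""
        return self.routes.get(root_field, MONOLITH_URL)

    def migrate(self, root_fields: list[str], service_url: str) -> None:
        """Repoint a migrated domain's fields at its new microservice."""
        for name in root_fields:
            self.routes[name] = service_url


orchestrator = Orchestrator()
print(orchestrator.resolve("bookings"))    # served by the monolith at first
orchestrator.migrate(["bookings"], "https://bookings.example.internal/graphql")
print(orchestrator.resolve("bookings"))    # now served by the microservice
print(orchestrator.resolve("properties"))  # not yet migrated, still the monolith
```

Because the fallback is the monolith, a client using such a mapping keeps working before, during, and after each domain is split out.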
Agoda executed the migration to microservices incrementally. Each newly created microservice was rigorously tested against the monolith to verify data correctness, ensuring that responses from both systems matched before transitioning client traffic. In addition, Agoda implemented an accuracy test system to continuously compare outputs from the monolith and microservices during the incremental rollout to validate the migration further.
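A comparison of this kind could look roughly like the following sketch, which walks two JSON-like responses and reports the paths where they differ. The function name and sample payloads are illustrative assumptions; Agoda has not published the details of its accuracy test system.

```python
from typing import Any


def diff_responses(monolith: Any, service: Any, path: str = "$") -> list[str]:
    """Return the paths at which two JSON-like responses differ."""
    if isinstance(monolith, dict) and isinstance(service, dict):
        diffs: list[str] = []
        for key in sorted(monolith.keys() | service.keys()):
            if key not in monolith or key not in service:
                diffs.append(f"{path}.{key}: present in only one response")
            else:
                diffs.extend(
                    diff_responses(monolith[key], service[key], f"{path}.{key}")
                )
        return diffs
    if isinstance(monolith, list) and isinstance(service, list):
        if len(monolith) != len(service):
            return [f"{path}: list length {len(monolith)} != {len(service)}"]
        diffs = []
        for i, (a, b) in enumerate(zip(monolith, service)):
            diffs.extend(diff_responses(a, b, f"{path}[{i}]"))
        return diffs
    # Scalars (or mismatched container types): compare directly.
    return [] if monolith == service else [f"{path}: {monolith!r} != {service!r}"]


# Hypothetical sample responses for the same query.
monolith_response = {"booking": {"id": 1, "status": "CONFIRMED"}}
service_response = {"booking": {"id": 1, "status": "PENDING"}}
print(diff_responses(monolith_response, service_response))
```

An empty list means the two backends agree for that query; any non-empty result would block moving client traffic for the affected domain.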
Agoda employed an automated data-driven approach to analyze and prepare GraphQL queries across their 100 application Git repositories to ensure client readiness during the migration. Using production data, Agoda categorized queries into simple, moderate, and complex types based on their level of domain interdependence.
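As a rough illustration of such a categorization, the sketch below classifies a query by how many domains it touches and whether domains are nested inside one another. The exact criteria, the nested-dict query representation, and the field-to-domain tags are assumptions made for the example, not Agoda's published rules.

```python
# Hypothetical tags mapping GraphQL fields to the domain that owns them.
FIELD_DOMAIN = {
    "booking": "bookings",
    "payment": "payments",
    "property": "properties",
}


def domains_under(name: str, sub: dict, parent_domain: str) -> set[str]:
    """Collect the domains touched by a field and everything nested in it."""
    # Untagged fields (plain attributes) inherit the domain of their parent.
    domain = FIELD_DOMAIN.get(name, parent_domain)
    found = {domain}
    for child, child_sub in sub.items():
        found |= domains_under(child, child_sub, domain)
    return found


def classify(selection: dict) -> str:
    """Classify a query given as {field: nested selection, ...}."""
    per_root = [
        domains_under(name, sub, "untagged") for name, sub in selection.items()
    ]
    if any(len(domains) > 1 for domains in per_root):
        return "complex"   # one root field nests fields from another domain
    if len(set().union(*per_root)) > 1:
        return "moderate"  # independent root fields from different domains
    return "simple"        # everything belongs to a single domain


print(classify({"booking": {"id": {}, "status": {}}}))
print(classify({"booking": {"id": {}}, "payment": {"amount": {}}}))
print(classify({"booking": {"property": {"name": {}}}}))
```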
"Unstitching" cross-domain queries was a critical step in Agoda's transition. Simple queries required no changes, while Agoda engineers split moderate queries into separate independent requests for each domain and merged the result on the client. Complex queries involving tightly coupled nested domains, often requiring sequential execution when one query depended on the results of another, necessitated more extensive restructuring.
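For the moderate case, splitting and merging might be sketched as follows. The query is represented as a nested dict of root fields, and `execute` stands in for whatever function actually sends a request to a domain's backend; all names here are hypothetical.

```python
from collections import defaultdict
from typing import Callable


def split_by_domain(selection: dict, field_domain: dict[str, str]) -> dict[str, dict]:
    """Group a query's root fields into one sub-query per owning domain."""
    parts: dict[str, dict] = defaultdict(dict)
    for root, sub in selection.items():
        parts[field_domain[root]][root] = sub
    return dict(parts)


def run_unstitched(
    selection: dict,
    field_domain: dict[str, str],
    execute: Callable[[str, dict], dict],
) -> dict:
    """Send one independent request per domain and merge the results."""
    merged: dict = {}
    for domain, part in split_by_domain(selection, field_domain).items():
        merged.update(execute(domain, part))
    return merged


# Stand-in for a real network call, used only to make the sketch runnable.
def fake_execute(domain: str, part: dict) -> dict:
    return {root: f"data from {domain}" for root in part}


query = {"booking": {"id": {}}, "payment": {"amount": {}}}
tags = {"booking": "bookings", "payment": "payments"}
print(run_unstitched(query, tags, fake_execute))
```

This only covers independent root fields. A complex query, where one request needs the result of another, would instead require sequential calls and is not handled by this sketch.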
"Unstitching" a complex query (Source)
While breaking down the monolith, Agoda focused on moving the existing code as it was, which carried old technical problems over into the new microservices. For example, Hanif told InfoQ that Agoda encountered issues such as missing client rate limiting, which frequently impacted its operations. In hindsight, he said, "To better achieve this balance, we would implement refactoring sprints as part of the migration process, ensuring a cleaner, more efficient architecture post-migration."
InfoQ spoke with Hanif about Agoda's microservices migration, the methods used, and the lessons learned.
InfoQ: Can you provide more details about why you decided to migrate from a monolith to microservices? What were the particular growth metrics or pain points that served as tipping points for initiating the shift?
Numan Hanif: The Supplier Management System was managed by a dedicated engineering team of 7, overseeing over 70 client services interacting through multiple interfaces, including GraphQL, gRPC, and REST. This monolith was critical in handling vital processes, including bookings through the monolith for reservation confirmations.
Within this monolithic system, we experienced a 12% quarter-on-quarter increase in Mean Lead Time for Change (a developer experience metric used at Agoda), with an average of 40 merge requests (MRs) created monthly. Our extensive testing environment includes 9,200 unit and integration tests that take approximately one hour to run and achieve around 73% stability in scheduled runs.
The monolithic setup also had heavy dependencies on various infrastructure components and struggled with managing unrelated domains, leading to architectural complexities. With a backlog of 210 operational tickets per quarter and high operational overhead, there was a steep learning curve due to knowledge silos and a large codebase.
These issues underscored the necessity for a scalable solution, making microservices an ideal choice to enhance manageability, scalability, and agility in the development processes for this system.
InfoQ: Did you follow any specific methodology or framework to define the boundaries for each microservice, such as Domain-Driven Design (DDD) or similar practices?
Hanif: Rather than adopting Domain-Driven Design (DDD) as the default framework for our microservice boundaries, we developed a custom approach tailored to our specific needs and the unique aspects of our GraphQL API. We started with a bottom-up method, analyzing database schemas and categorizing tables into specific domains by tagging them. This provided a foundational structure that aligned with our data management patterns, ensuring a consistent basis for identifying service boundaries.
Once we tagged the tables, we aligned each GraphQL endpoint with the corresponding table domains to maintain consistency across data and service boundaries. While 75% of the GraphQL endpoint domain tagging was straightforward, the remaining 25% presented challenges due to complex schemas involving stored procedures accessing tables from multiple domains. This complexity required additional refinement to ensure a cohesive service structure aligned with our architectural goals.
Our process shares elements with DDD, such as aligning services with business functions, but it differs in its data-driven, bottom-up approach. Unlike the top-down, domain-expert-led method of DDD, we focused on existing data structures and used domain tagging to address the unique challenges of GraphQL. This enabled us to create microservices that were coherent and functionally aligned while addressing the operational intricacies of our infrastructure.
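The bottom-up tagging Hanif describes can be illustrated with a minimal sketch: tables are tagged with a domain, and each endpoint inherits the domain of the tables it reads, with multi-domain endpoints flagged for manual refinement. The table names, endpoint names, and data structures are hypothetical and are not drawn from Agoda's system.

```python
# Hypothetical table -> domain tags produced by analyzing the database schema.
TABLE_DOMAIN = {
    "bookings": "booking",
    "booking_items": "booking",
    "invoices": "finance",
}

# Hypothetical record of which tables each GraphQL endpoint touches,
# including tables reached indirectly through stored procedures.
ENDPOINT_TABLES = {
    "getBooking": {"bookings", "booking_items"},
    "getBookingInvoice": {"bookings", "invoices"},
}


def tag_endpoints(
    endpoint_tables: dict[str, set[str]],
    table_domain: dict[str, str],
) -> tuple[dict[str, str], dict[str, list[str]]]:
    """Tag each endpoint with its domain; flag those spanning several."""
    tagged: dict[str, str] = {}
    ambiguous: dict[str, list[str]] = {}
    for endpoint, tables in endpoint_tables.items():
        domains = {table_domain[table] for table in tables}
        if len(domains) == 1:
            tagged[endpoint] = next(iter(domains))
        else:
            ambiguous[endpoint] = sorted(domains)  # needs manual refinement
    return tagged, ambiguous


tagged, ambiguous = tag_endpoints(ENDPOINT_TABLES, TABLE_DOMAIN)
print(tagged)
print(ambiguous)
```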
InfoQ: What inspired adopting the "client-first" approach for the migration? Was this strategy based on prior experience, theoretical advantages, or lessons learned from previous projects, and were there any specific challenges in Agoda's existing architecture or team structure that made this approach particularly suitable?
Hanif: The decision to adopt a "client-first" approach for migrating to microservices was influenced by both theoretical and practical experiences. Initially, we considered using the monolith as a router for client communication, but this conflicted with our goal of achieving true vertical (client-database) ownership. The router would end up being a soft monolith, complicating long-term architectural changes and maintaining centralized control. This insight emphasized the need for a more decentralized strategy.
Past experiences with a server-first migration approach highlighted several challenges, including prolonged transitions due to client migration delays. These bottlenecks hindered our ability to fully decommission the monolith and temporarily delayed the benefits of a microservices architecture. These lessons underscored the importance of preparing clients first, ensuring they could handle backend changes and facilitating a seamless transition.
The client-first strategy was designed to reduce coordination efforts with external teams by allowing backend developments to occur simultaneously with client readiness activities. This approach minimized disruptions, empowered our teams with greater control over the full stack, and better aligned the architecture with agile and modern development principles. It also addressed challenges like stitching queries proactively, enabling parallel development and reducing dependencies, which was particularly beneficial given the complexity of our existing architecture.
InfoQ: Was the Smart Orchestrator client library built in-house, or did you adapt an existing tool for your needs? What were some of the main challenges in creating and maintaining it, and are there any specific features you found essential for handling your GraphQL schema and routing complexities?
Hanif: The Smart Orchestrator client library was developed in-house to address specific needs that existing solutions couldn't meet. This decision allowed for a customized tool that efficiently managed schema updates and request routing across microservices without altering client interfaces. Key features like automatic schema mapping updates were crucial for handling GraphQL complexities, ensuring seamless integration with existing client applications while avoiding embedded business logic.
Developing the Smart Orchestrator presented challenges, such as updating multiple repositories with new library versions whenever issues arose during migration, necessitating careful coordination and testing. Additionally, enabling GraphQL introspection queries in the production environment, typically restricted due to security and performance concerns, was essential. By navigating these challenges, we ensured the library remained lightweight while effectively supporting cross-domain queries and adapting to our architectural needs.