
The Human Side of Airbnb’s Microservice Architecture


Summary

Jessica Tai discusses lessons learned by Airbnb from its migration to microservices, covering cross-team collaboration strategies, designing observability access control, and planning for unified APIs.

Bio

Jessica Tai was previously a staff engineer on Airbnb’s Core Services infra team. She has given multiple talks at QCon about the technical design and scaling challenges with the migration to service-oriented architecture. Now an engineering manager of the User and Foundation infrastructure teams, she spends more time thinking about how to grow and scale the humans of Airbnb.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Tai: One of the pre-pandemic things I miss most about work is the dance classes. Pictured here is a dance flash mob that we did in our office a few years ago. This is an example of what happens when many individuals work together: we were able to create something fun, a coordinated dance set. Such collaboration amongst so many humans is really challenging. It requires intentional design, effort, and planning. This is not something that happens spontaneously overnight. Like any good flash mob, a technical architecture revamp is also a big human coordination project. Architecture migrations require a lot of planning, time, and extensive collaboration in order to be successful.

Background

I've been at Airbnb for seven years, and have seen the architecture evolve and the amount of human power it takes. My name is Jessica. I worked extensively in our monolith and on our first microservices at Airbnb. I'm now an engineering manager on two of our platform infrastructure teams.

Outline

Having gone through several architecture migrations at Airbnb firsthand, I'll give a brief overview covering the three major stages and the changes that we saw within teams to support these migrations. I'll then cover some of the highlights and lowlights that we've seen in our migration journey. Then I'll end with what we're working on now, and the next challenges that we see on the horizon.

Architecture Evolution - Monolith (2008 - 2017)

Architecture reflects the needs of your company and product. For Airbnb, it began as a simple marketplace for hosts to open up their homes to guests. This could be done in a Ruby on Rails application, our monolith. For a number of years, our monolith worked really well for us. Most of our engineers were full stack. In our monolith, you could do frontend work, the API layer, or database migrations all within a single repository, and engineers were able to execute end-to-end features themselves. Because features could be completed within a team, there were fewer inter-team dependencies. This is because all engineers could access the whole codebase in the monolith, and did not need to depend on another team to make a change in a certain part of the codebase. However, as Airbnb entered hypergrowth, there were many more engineers and many more teams. It was no longer possible for a single person or team to have context on the whole codebase; ownership and team boundaries were needed. We attempted to make more granular teams focused on clear product surface areas. However, the monolith was tightly coupled as a result of moving quickly and prioritizing fast product iteration. Code changes in one team could have unintended consequences for another team. Team ownership was confusing for different pieces of code, and there were areas that were completely unowned.

As Airbnb grew to hundreds of engineers, our monolith encountered more of these scaling challenges, until we reached a breaking point where we could no longer be productive. The largest pain point that we experienced was the slow deploys. When I first joined Airbnb, I could deploy our monolith within a matter of 20 or 30 minutes. At its worst, it was taking many hours, sometimes over a day, to get a single deploy done. The slow deploys thus contributed greatly to a slower developer velocity, hurting both engineering morale and the company business. It's difficult to have hundreds of engineers contributing to the same application. Given the way the monolith was structured, it was clear that it wasn't going to last as an architecture for many more years.

Microservices (2017 - 2020)

Like other companies, we decided to move to microservices, and wanted to be a bit more disciplined about our approach. We aligned on having four different service types. One, a data fetching service to read and write data. Two, a business logic service that applies functions and combines different pieces of data together. Three, a workflow service for write orchestration. This is important for handling operations that touch data across multiple services. Four, the last type of service is our UI aggregation service, putting data together from the services beneath it. These services all sit behind our API gateway. We did a similar migration of our monolithic frontend into separate React services.
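To make that layering concrete, here is a minimal TypeScript sketch of the four service types. The interface and type names are hypothetical, chosen for illustration rather than taken from Airbnb's actual services.

```typescript
// 1. Data service: owns reads and writes for a single entity.
interface ReservationDataService {
  getReservation(id: string): Promise<Reservation>;
  createReservation(input: CreateReservationInput): Promise<Reservation>;
}

// 2. Business logic service: combines data from several data services.
interface PricingLogicService {
  quoteForStay(listingId: string, checkIn: string, checkOut: string): Promise<Quote>;
}

// 3. Workflow service: orchestrates writes that span multiple services.
interface BookingWorkflowService {
  book(guestId: string, listingId: string, quoteId: string): Promise<Reservation>;
}

// 4. UI aggregation service: assembles data for a screen, reached via the API gateway.
interface TripPagePresentationService {
  tripPage(reservationId: string): Promise<TripPageViewModel>;
}

// Placeholder types so the sketch is self-contained.
type Reservation = { id: string; guestId: string; listingId: string };
type CreateReservationInput = Omit<Reservation, "id">;
type Quote = { id: string; total: number; currency: string };
type TripPageViewModel = { reservation: Reservation; quote: Quote };
```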

Since ownership was a problem in the monolith, we made sure that each service had only one owning team. A team could own many different services, and this was a lot easier to reason about because the clear API boundaries and code surface of each service were much easier to enforce. We could easily add mandatory reviewers to a project, so that someone from the owning team had to approve a pull request before it was merged. There was a change in the way that we structured our teams as well. Instead of having full-stack teams able to handle anything, we saw teams that were focused just on the backend. We now had teams that specialized in certain data services, and other teams that focused on the more complex pieces of business logic. We even created a team dedicated specifically to the migration from monolith to microservices. This team was responsible for building tools that helped compare read and write endpoints for the migration, as well as embedding with other teams to teach them the best practices of service building and operations.

We saw a change in collaboration too as we moved to microservices. Features now required changes across multiple services, and these services could belong to different teams. No longer was it common for a feature to be executed completely within its own team. Now teams needed to be much more aware of the larger service ecosystem to understand where the dependencies may lie and what interactions they need to have to make sure that priorities align for a particular end-to-end feature to be built. We migrated to many microservices, to the point of having hundreds. A few years into the migration, new challenges were emerging. It was now difficult to manage so many services and their dependencies.

Micro + Macroservices (2020 to Present)

Trying to address some of those challenges, we decided to have a hybrid between micro and macroservices. This is what we're working on now. The micro and macroservice hybrid model focuses on unification of APIs: how can we consolidate and make things easier, with one clear place to go for certain pieces of data or functionality? Our backend services, which previously could call any other service in our microservice world, now go through a GraphQL interface and get data only from our central data aggregator. This central data aggregator then federates its schema, also over a simple GraphQL interface, out to our service blocks. Our service blocks have a facade API that abstracts away the microservices beneath it. In our previous microservice-only world, our services talked to each other with Thrift. We preserved that within the blocks and allowed our existing microservices to continue using Thrift, but with this abstraction layer encapsulating them in the service block. Our team structure changed as well. Now our teams are more specialized. Our product teams are focusing on just that, the product. They no longer need to be responsible for optimizing the performance of data fetching, because now we have a dedicated team to do this. We also have dedicated teams for our service blocks, which is really important for cohesive ownership over the larger domains that power the Airbnb product.
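The following is a minimal sketch of that layering, with hypothetical schema, resolver, and client names (not Airbnb's actual code): product callers query the central data aggregator over GraphQL, the aggregator federates entity ownership out to a service block, and the block's facade resolver is the only layer that still calls the existing Thrift microservices underneath.

```typescript
// Stand-in for a generated Thrift client (hypothetical).
declare const reservationThriftClient: {
  getReservation(req: { id: string }): Promise<{
    id: string;
    guestId: string;
    listingId: string;
    status: string;
  }>;
};

// Sub-schema owned by a hypothetical "reservations" service block; the central data
// aggregator federates the overall GraphQL schema out to blocks like this one.
const reservationBlockSchema = /* GraphQL */ `
  type Reservation {
    id: ID!
    guestId: ID!
    listingId: ID!
    status: String!
  }

  type Query {
    reservation(id: ID!): Reservation
  }
`;

// Facade resolver: everything above this layer sees only GraphQL; Thrift stays
// encapsulated inside the block.
const reservationBlockResolvers = {
  Query: {
    reservation: async (_parent: unknown, args: { id: string }) => {
      const res = await reservationThriftClient.getReservation({ id: args.id });
      return res;
    },
  },
};
```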

Migration Hits and Misses

Let's switch a little bit, go into reverse, and think about how our journey came to be, what we did well, and what we did not do so well that we'd like to adjust in our upcoming work. When thinking about how to frame this, I realized that there are two major cycles that happened at Airbnb, along with a bunch of similar mini cycles. One, it starts with a problem. Two, we then create a solution to make this problem better. Three, we need to get people to adopt the solution, to motivate them and show them why it's impactful. Four, if we're successful at that, there are going to be tons of people using that particular solution. Five, with so many people using it, it might be pushed to a breaking point, creating a new scaling challenge, at which point we need to go back and figure out how to address this new state of the world.

From Monolith to Services

One of these cycles starts with the move from monolith to services. In the beginning, there was a lot of excitement: "Yes, we're going to move to services. It'll be so great. Our deploys will be faster." The services were really hard to set up and maintain in the beginning. It took three or four weeks to do all the steps in this diagram, and then some, just to get a health endpoint up and running. It's pretty difficult to convince somebody to migrate to services if you tell them they have to add an extra month of time just to get that service up and running. Recognizing that we needed to make this easier for our engineers in order to get the migration going, we invested heavily in our service infrastructure. We developed a way to have service infrastructure as code. This configuration lives alongside the service in the same repository, and manages configurations, including our Kubernetes setup, all within a simple configuration-based format. The hope was that this would make it really easy for people to get a service up and running, so that they would be motivated by the benefits of having an isolated service and a fast deployment process.
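As a rough illustration of what "service infrastructure as code" can look like, here is a sketch of a declarative config checked into the service's own repository. The field names and structure are assumptions for illustration, not Airbnb's actual framework.

```typescript
// Hypothetical shape of a per-service infrastructure config file.
interface ServiceInfraConfig {
  serviceName: string;
  owningTeam: string; // used for alert routing and mandatory code review
  serviceType: "data" | "business-logic" | "workflow" | "presentation";
  kubernetes: {
    replicas: number;
    cpu: string;    // e.g. "500m"
    memory: string; // e.g. "1Gi"
  };
  deploy: {
    canaryPercent: number;           // traffic share for the canary before full rollout
    autoRollbackOnErrorRate: number; // error-rate threshold that triggers rollback
  };
}

// Example config living next to the service's code in the same repository.
const reservationDataServiceConfig: ServiceInfraConfig = {
  serviceName: "reservation-data",
  owningTeam: "reservations-platform",
  serviceType: "data",
  kubernetes: { replicas: 6, cpu: "500m", memory: "1Gi" },
  deploy: { canaryPercent: 5, autoRollbackOnErrorRate: 0.01 },
};
```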

We further devised ways for all teams to use this new tool to build services. We increased adoption by enabling people to migrate incrementally out of the monolith, in parallel with each other. Again, this was really important because we were hitting that breaking point where the deploys in our monolith were so slow and painful. Doing this migration in parallel did enable many services to be built. With the benefits of that better developer experience, we migrated field by field, endpoint by endpoint. Soon, hundreds of people were making services, and we had hundreds of services in our network.

That leads us to the scaling challenge. Many services were created in a short amount of time, and they got the benefit of deployments going from hours or days down to minutes. However, the dependency graph became really complicated and hard for both humans and services to navigate. Furthermore, we saw that quick developer velocity and a desire to ship features and iterate fast led to mixed functionality in our services. Our services and their APIs were becoming too complicated. The complexity came from not keeping those four initial service types clearly defined in each service. Instead, what used to be a UI presentation service was now fetching data. A data service might also be orchestrating like a workflow service, or doing a lot of complicated business logic. Services had circular dependencies between each other. It was really difficult for us to understand the topology of our hundreds of services and reason about them. Code was being duplicated because it was difficult to discover what services existed, what functionality was in the monolith, or what could be extended from something that already existed.

Services at Scale (Retrospective)

This leads us to the second cycle of that migration challenge journey. Our starting problem now is that the dependency graph is really difficult for humans and services to navigate. Features required collaboration across many different teams, so we needed a way to work in this new world where it was rare for a feature to be finished by your own team alone. It needed to be easy for engineers to understand how the other services interact with each other, and their capabilities. We had a set of solutions aimed at making this better. We wanted to improve productivity and observability with ownership tools and automation, such as code generation. Distributed systems across distributed teams create more communication overhead. Knowing your call graph, knowing which teams are your downstream dependencies, and knowing how your service is being used by its clients are really important for service ownership and maintainability. We wanted to make this information really easy for our engineers to understand so they could work with each other more effectively.

An example of this is our ownership tool. It's really simple metadata for each of our services, going line by line showing the best communication channels to reach the owners, tagging what type of service it is, and providing links to the code repository where it lives. This metadata helps engineers know who to contact during incidents and what the purpose of a service is, and is also used to programmatically create trackers for our migration tools and ongoing processes. For example, a framework-level migration may require migrating all services, and it's not easy to track down 500-plus services. With this tool, we have a better idea of which services exist and how to find the people who are the experts at maintaining them. We created a Service API Explorer tool. This not only surfaces information about the API itself in an easily searchable way, but also has information useful for the humans to know how to contact each other: their Slack channels, dashboards to look at health, and design docs to understand what's going on, all provided as Thrift annotations right alongside the API for the service. This documentation is in Thrift, deployed with the service, so it's really easy for the developers to keep it up to date.
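A minimal sketch of that kind of ownership metadata is shown below. The field names and the helper function are hypothetical; the point is that the same record serves both humans (who to page, where the dashboard is) and tooling (tracking a migration across hundreds of services).

```typescript
// Hypothetical per-service ownership metadata record.
interface ServiceOwnershipMetadata {
  serviceName: string;
  serviceType: "data" | "business-logic" | "workflow" | "presentation";
  owningTeam: string;
  slackChannel: string;       // best channel to reach the owners during an incident
  repositoryUrl: string;      // where the code lives
  healthDashboardUrl: string; // standard dashboard generated by the service framework
  designDocUrl?: string;      // context on why the service exists
}

// Tooling can then answer questions like "which teams still need to do migration X?"
function teamsPendingMigration(
  services: ServiceOwnershipMetadata[],
  migratedServiceNames: Set<string>,
): Set<string> {
  const pending = new Set<string>();
  for (const s of services) {
    if (!migratedServiceNames.has(s.serviceName)) pending.add(s.owningTeam);
  }
  return pending;
}
```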

Our services are built using that one setup, and it gives us a lot of code for free, scaffolding both the client and server side. All this boilerplate generated from the service framework comes with pre-populated metrics and standard dashboards. The one I use a lot shows the callers of my particular service. Another learning from our challenge journey is that when creating a network of services, it's really important to know how those services are talking to each other, which includes passing down the ID of the caller to the service that it is calling. However, in the beginning, we did not have this standardized or required across our various services. Each service was doing it a little bit differently. This resulted in not having these standardized metrics, and in some of the caller IDs being unpopulated, hence, unknown. This has made it really difficult for us to track down the source of these unknown calls and understand how those unknown callers are trying to use our service. Tip for next time: make sure that your request context is set and standardized ahead of time.
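Here is a minimal sketch of that tip, assuming an HTTP transport: every outbound call carries a standardized caller identity so callee metrics never end up bucketed as "unknown". The header name and helper functions are hypothetical, not Airbnb's actual framework.

```typescript
// Hypothetical standardized header carrying the caller's service ID.
const CALLER_ID_HEADER = "x-caller-service";

interface RequestContext {
  callerService: string; // stable ID of the calling service
  requestId: string;     // for tracing a request across the call graph
}

// Wrap fetch so the request context is attached to every outbound call by default.
function makeServiceClient(ctx: RequestContext) {
  return async function call(url: string, init: RequestInit = {}): Promise<Response> {
    const headers = new Headers(init.headers);
    headers.set(CALLER_ID_HEADER, ctx.callerService);
    headers.set("x-request-id", ctx.requestId);
    return fetch(url, { ...init, headers });
  };
}

// On the callee side, per-caller metrics can be emitted instead of "unknown".
function callerFromHeaders(headers: Headers): string {
  return headers.get(CALLER_ID_HEADER) ?? "unknown";
}
```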

Solution to Make it Better

Going back to the solutions, another solution to make things better was replacing some of our bespoke Thrift schemas with a single unified GraphQL schema. We wanted to move to GraphQL to get a richer set of features that more closely resemble the way that our API was being used. However, we recognized that this is yet another migration. We needed to invest in ways to make it easy for people to migrate away from the Thrift they were used to using when developing services. When introducing new technologies to your tech stack, the education piece and onboarding cost are important to factor in as well. We invested heavily in creating GraphQL training courses to get our engineers more familiar with the best practices there.

Here's an example of a reservation page and how we would start to translate it to reservations in GraphQL. To get people to adopt such a schema, we wanted to focus on how we could improve developer velocity. We do this through GraphQL annotations, and by generating a lot of behind-the-scenes code from these annotations. If we take a look at that reservation snippet again, we'll have an annotation around owners. This allows our alerts to be automatically generated and to page the right team. It is also a way for us to automatically tag the correct team to review code when anything in the schema changes. We want to be really strict about who the owner is for a particular schema so that things don't get out of hand, with people adding fields in places that might not make the most sense. Another annotation that we found really useful is our service-backed node annotation. This is helpful because it auto-generates a bunch of code for us in the resolver to call our internal microservice with Thrift. This is important because the migration price to pay to integrate a service into this data aggregation layer is high if everyone needs to make these changes manually. If we're able to just add this annotation, using an existing service name and an existing endpoint name, the rest is done for us.
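A minimal sketch of what such annotations could look like is shown below, written as hypothetical GraphQL directives embedded in a TypeScript string (the directive and service names are illustrative, not Airbnb's actual ones). An ownership directive drives alert routing and mandatory review, and a service-backed directive tells the code generator which existing Thrift service and endpoint the generated resolver should call.

```typescript
const reservationSchema = /* GraphQL */ `
  # Hypothetical directives standing in for the annotations described in the talk.
  directive @owner(team: String!, slack: String!) on OBJECT | FIELD_DEFINITION
  directive @serviceBacked(service: String!, endpoint: String!) on FIELD_DEFINITION

  type Reservation @owner(team: "reservations-platform", slack: "#reservations-oncall") {
    id: ID!
    guestId: ID!
    checkIn: String!
    checkOut: String!
  }

  type Query {
    # The code generator can emit a resolver that calls the named Thrift endpoint.
    reservation(id: ID!): Reservation
      @serviceBacked(service: "reservation-data", endpoint: "getReservation")
  }
`;
```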

There are many other annotations. A few include marking what type of data classification each field has. Is this field PII? Is it other data? What type of security and privacy methods do we need to apply to this field? The permissions piece is particularly useful because this logic previously lived in our monolith at the application level. There was a lot of complex logic scattered across many places. Some people had migrated it out into their own permission services in our microservice world. We created a separate permission service that tried to be the central connection for all the permission checks, but we found it was difficult to get traction on that migration too. By introducing this as a simple annotation, we're able to get permissions easily defined inline, and it's really clear how the permission logic maps to each particular field.
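Continuing the sketch, data classification and per-field permissions could be declared the same way; again the directive names and check identifiers are assumptions for illustration, so privacy handling and access checks sit next to the field rather than being scattered through application code.

```typescript
const reservationFieldPolicySchema = /* GraphQL */ `
  # Hypothetical directives for data classification and per-field permission checks.
  directive @dataClassification(level: String!) on FIELD_DEFINITION
  directive @permission(check: String!) on FIELD_DEFINITION

  type Reservation {
    id: ID!
    guestEmail: String
      @dataClassification(level: "PII")
      @permission(check: "viewer_is_guest_or_host")
    internalNotes: String
      @permission(check: "viewer_is_admin")
  }
`;
```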

With all these annotations, it was a lot easier for people to get started and see the benefits of doing things in GraphQL. Now we have migrated hundreds of data entities to this unified GraphQL schema, which then leads us to what's next: the scaling challenge. It's slower to build and deploy the data aggregation service now that so many engineers are actively contributing to it and using it. Recognizing this slowdown as a sign of what happened to our monolith, we are doubling down and investing heavily in our developer infrastructure and tools to make sure that the developer experience for the data aggregation service remains at high productivity levels.

What's Next?

What's next for Airbnb? What are the upcoming challenges? As described, Airbnb is moving towards that hybrid mode of having aggregators act as macroservices that abstract the microservices. Initially, when we did our microservice migration, we were trying to run away from any centralization. This ended up resulting in a lot of different free-form services that had a little too much flexibility in what the dependency graph could look like and in their internal functionality. Instead, we're trying to be more stringent about enforcing a paved path. For this micro and macroservice architecture, the paved path looks like this: our internal backend services get their data from the data aggregation service, using the unified GraphQL schema. That large GraphQL schema is then federated out into smaller ones that are owned by the service blocks. The service blocks are the units that encapsulate some of our larger business domains, core to the Airbnb product. The service blocks have facade APIs that encapsulate a collection of microservices. We only have a few service blocks, as they're really for the larger product entities, such as reservations, users (our guests and hosts), homes, reviews, and pricing and availability.

Beneath the service blocks, we have a unified datastore layer and API. This is a way to further specialize the optimizations needed at the data level, and to provide that unified schema throughout the whole stack. We saw changes in how our teams are structured as well. We wanted to clarify responsibilities by specializing our engineering teams more. Our product teams should be focused on just that, building the product. They shouldn't need to worry about optimizing data for performance or figuring out the permissions for particular pieces. Instead, that's the focus of the data aggregation team, stitching together data from across the company in an easy-to-consume and easy-to-understand way.

Our platforms, our service blocks, are run by separate teams as well. It's important to have teams focused on these holistic domains, to be stewards of these important pieces of our product. This will help reduce the duplication that we saw in the microservice world, where there wasn't this type of centralization. We also want to make sure that we are planning a little bit more ahead and getting in front of some foreseeable challenges. At the product team level, we know that there is a need for quick product iteration, so we need to design our architecture in a way that doesn't stifle that iteration cycle speed, while at the same time not accruing tech debt with each new attempt. At the data aggregation layer, the challenge that we foresee is that it becomes a new monolith, with lower productivity, slower developer velocity, and tight coupling. We want to make sure that we are disciplined about what belongs where in this data aggregation layer, so it doesn't become the new monolith.

At the service block layer, the challenge that we foresee is how to define these schema boundaries in a clean way. There are many pieces of data or logic that can span multiple entities. For example, where would you put the number of completed reservations that a host has: in the reservation block or the user block? These types of questions are things that we're working out now to make sure that we have a paved path for these common use cases. Finally, there's a whole other slew of challenges that exist in the offline or asynchronous world. The ones covered here are just the online use cases.
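To make that boundary question concrete, here is a rough sketch of the two candidate shapes, using hypothetical types. The paved path is about choosing one of these consistently, not about either being the "right" answer.

```typescript
// Option A: the user block owns the field directly on User, and is responsible for
// computing it (likely by calling the reservation block).
const optionA = /* GraphQL */ `
  type User {
    id: ID!
    completedReservationCount: Int!
  }
`;

// Option B: the reservation block exposes an aggregate keyed by host, and the data
// aggregator (or the user block, via federation) stitches it onto User.
const optionB = /* GraphQL */ `
  type Query {
    completedReservationCount(hostId: ID!): Int!
  }
`;
```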

If I had to describe our future looking forward, I see a lot of tracks. Our migrations are on track, but there are a lot of parallel tracks, with many moving pieces going on at once. We're migrating our internal services to call that data aggregation layer. We're building out that new data aggregation layer. We're building service blocks, migrating callers to those service blocks, and building out our datastore. With many migrations happening at the same time, they need to be not only backward compatible, but also compatible with each other. We're also wary of the fatigue that comes from engineers doing migration after migration. It's important that we actually fully complete the migrations because, if there's any long tail remaining, then we're stuck having to maintain various states of our infrastructure.

Deprecation Working Group

Putting this together means that the deprecation of a monolith takes a very long time. We are in our third year of the migration process, and there's still a lot to happen. We have a working group that put together some proposals to help move this migration process along. Our mobile apps are going to be automatically deprecated after 12 months. This will help us get rid of features that aren't being supported in parity on web anymore. We elevate and track our monolith tech debt alongside our other business metrics. We plan to sunset low-usage endpoints, and we are raising the importance of, and finding long-term owners for, pieces of the monolith. Finally, from a more human perspective, we're recognizing that the deprecation work is impactful. It might not be as flashy as building a new product, but cleaning up our tech debt will enable us to build better, cleaner products in the future with higher efficiency. We're keeping at it as a team effort, and we've learned from previous migration experiences.

Takeaways

All the migration experience has really settled around one question: what does it take to improve engineering productivity? One way is restructuring teams to collaborate well whenever there's a new architecture. Another is to invest heavily in developer productivity with tools and automation, such as code generation. Migrations are tiring work, so it's important to make them as easy for engineers as possible. We're increasing the focus of teams with more platformization of teams and central domains. Finally, build with the expectation that there are going to be new challenges. As the company continues to evolve and grow, the challenge cycle will start once again.

Wrap-up

Airbnb decentralizing into microservices created both technical and human challenges. As we look back, many of these growing pains were necessary parts of our evolving journey to where we are right now. We didn't know what we didn't know at the time, and we needed the migration and the incremental steps of our services to show us how our system worked, where the pain points were, and how we can continue improving into the future.

Questions and Answers

Watt: Could you maybe just tell us a little bit more about the dedicated service migration team, and what tasks they were actually performing? There's a related question which says that if there was a dedicated team to do this, it sounds like the team creating the microservices was not the team actually owning the service in the long term. How did that work?

Tai: We formed this centralized team with about four engineers. They were the ones building the initial services, working with our service framework team and infrastructure team to make it a lot easier. Then, once we felt comfortable building a few services within that centralized team, they went to work with our product teams to help get them onto services. We focused on our core flows for Airbnb: our search page, our homes description page, and our payments page. We had an engineer from the core team go [inaudible 00:33:53] and work with them, as a way to better teach the best practices for service building, so we didn't have people building services in more bespoke ways. We wanted to have a standardized way and to encourage folks to do this migration, as we recognized that it was a bit more of a lift to migrate to a service instead of just continuing to develop in the monolith. This was a way to help some of our critical pieces get started. Then the team disbanded after two years, once the migration had gained enough traction.

Watt: It wasn't that the migration team actually just did it and handed it over the fence. It was more collaborative, helping with best practices and things like that.

Tai: Yes. It was pitching best practices, and also taking the learnings from what the product teams were struggling with and being a liaison with our infrastructure teams, to ensure that we were building the right infrastructure to support the actual needs of the service owners.

Watt: Why not break the microservices down following a domain-driven approach?

Tai: We did do that a bit. I wouldn't say we were so strict as to call it domain driven. I think a challenge we had is that we actually didn't have much understanding of what the major domains were in our monolith. We had an idea of how some of the larger files were being used, but there were a lot of different features that had been appended onto these main domains in the monolith. It was difficult for us to just go about it with a clear approach. Breaking it into services gave us better observability into how the different pieces interacted, and which parts were core and which parts were not.

 


 

Recorded at:

Nov 05, 2021
