Transcript
It was 2014 when I joined Airbnb. The engineering team was a lot smaller then, around 90 people, small enough to fit on one side of one floor in one building. During my first week there, I heard the sound of a gong ring out through the hallways. I then saw engineers scrambling towards their desks. It turns out we hit the gong whenever the site went down. Because a majority of our engineers at the time were working in our monolithic application, they wanted to help debug the incident. Now in 2018, just four years later, we have over a thousand engineers at Airbnb. We are spread across multiple floors, buildings, and cities.
My name is Jessica and I am an ex-monolith engineer. I currently work on our core services team, building out the infrastructure for our migration. I'll be discussing how Airbnb managed to grow its team by over 10x in the past four years by redesigning its technical architecture. I'll begin with describing how life was with the monolith, and then go into how we experienced growing pains as our engineering team expanded. I'll cover some of the service design principles we wanted to have when creating services and then how we began migrating. I'll end with some of the best practices we developed along the way and share some of the results we've had so far.
Monorail, Our Ruby on Rails Monolith
Our Ruby on Rails monolith is known as Monorail. A monolith is a single-tier unit that's responsible for both server-side and client-side functionality. This means that the model, view, and controller layers are together in a single repository. And monoliths are really easy to get started with. They're a great place for small teams to start and iterate quickly. There's a client that would make a call to the monolithic application, and it would call the database directly.
In 2014, one of my new hire tasks was to add a feature to require a guest to send a message to a host. This involved a few components. It involved accessing the model, getting the host's first name and saving a message. It involved modifying the view template a little bit: "Tell your host 'Hello.'" It also involved modifying our controller endpoint to add some basic server-side validation. And together, the model, view, and controller changes were made within Monorail.
Why Decide to Migrate?
Our application was simple enough that our developer infrastructure projects were done on a volunteer basis or in hackathons. We had volunteer sysops, engineers who voluntarily were on call for the entire Airbnb site. Engineers were fairly productive and happy, life seemed simple. So why did we decide to migrate? Why spend the time and the engineering manpower to move to a different technical architecture? This talk will also be describing animals that migrate. Birds migrate, and the Arctic tern has the greatest migration path of 1.5 million miles in its lifetime. That's like going from here to the moon and back three times. For us, we were about to begin our million-mile journey of migrating from the monolith to microservices.
Monoliths can theoretically have well-encapsulated modules or fit services inside of them, but at a certain point, they become difficult to scale as they do not enforce such encapsulation. At Airbnb, we began to experience tight coupling between the modules as our engineering team grew larger. Modules began to assume too many responsibilities and became highly dependent on one another. As our engineering team continued to grow, so did the spaghetti entanglement of Monorail code. It became harder to debug, navigate, and deploy Monorail, and it became a source of frustration for engineers.
Our lines of code in Monorail began to grow as well, and the growth curve was shaped similarly to our engineering team's growth. We were adding more features to our Monorail, but we had a single database, which meant these additional dependencies were making our database less reliable. We began to feel these growing pains around 2015. By then, we had over 200 engineers. We were deploying 200 commits to our Monorail to production every day. However, we were experiencing around 15 hours on average of time where our Monorail was blocked. We couldn't deploy due to reverts or rollbacks.
Let's look at that checkout page again. For one, we did a rebrand in 2014, so this page got a revamp, but now there were many teams with different features on this single page. Ownership and accountability were difficult, because there were many different components that were part of a single feature. We tried adding a Git-level mechanism to require reviewers at the file or directory level, which helped a little bit, but then our modules and files became big as well. This message module now had over 400 contributors and thousands of lines.
We began to see more incidents. It was not uncommon for an engineer to make a change that they thought was localized to their feature, but then it ended up breaking something unrelated. Our deploy trains became slower as well, on the order of hours. I personally liked to deploy my Monorail changes in the morning before the other engineers woke up and got into the office. This reduced my chance of having merge conflicts or dealing with other people's reverts. But overall, engineers were getting frustrated, and Monorail deploys were a huge pain point that decreased our developers' productivity.
Our Solution: Service-Oriented Architecture (SOA)
So we looked towards service-oriented architecture, or SOA, as a possible solution to help alleviate our pain. SOA is a network of loosely-coupled services. The client will instead go through some API gateway, which makes calls out to various services. Each service can be built and deployed independently and scaled separately as well. Ownership within a service is a lot clearer as it's defined within the supported service's API. Requests to these services can be made in parallel. This seemed to be a promising solution to the pains that we were feeling.
If we look back at that checkout page now with an SOA lens, we might have a home service, a reservation service, a messaging service. You could have a home demand service, a business travel service, a cancellation service, so now this is looking like a ton of services in a different type of spaghetti mess. But we had faith that we could give it a shot. I'm sure many of you in the audience have experienced successful migrations from a monolith to microservices, and so have other companies such as Uber, Twitter, Amazon, and Netflix. So we were confident that it was our turn to give this a shot.
SOA Design Tenets
We wanted to start with a key set of shared understanding and design principles for building services. Penguins migrate and when they do, the various colonies have a shared understanding of where to go and meet at the same time. We wanted our engineers to have a shared understanding of how to build services in a standardized way so we could help scale them quicker.
The first of these principles was that services should own both reads and writes to their own data. This means if multiple services are interested in a certain data set, they must go through the gatekeeper service's API to access that data. This principle helps with data consistency as well as encapsulation and isolation.
Services should address a specific concern. We wanted to avoid going from a monolith, breaking it apart, but putting so much functionality in another service that it becomes the new monolith. We also wanted to avoid going from a monolith, breaking it apart into many smaller services, and ending up with a polylith. So somewhere in the middle is what we were going for, where a service had a large enough scope, but was still focused and did not duplicate other services' functionality.
This was important because in Monorail, files had access to many other files and it was easy to share code in that way, but it was also easy for dependencies to creep in. When we break this apart into services, instead of duplicating code in various services, we wanted to develop shared services and shared libraries.
Another design principle we came up with was that data mutations should publish via standard events. This came from looking at our code base and seeing Monorail had a lot of callback methods in them. Callbacks are hooks that are executed during various parts of an object's [inaudible 00:10:44] life cycle. For example, if I were to book a home on Airbnb, once the reservation transaction is completed, then we mark the home's availability as busy for those dates. Breaking this apart into services meant we could no longer use Rails callback methods. Instead, we developed SpinalTap, which is open sourced by Airbnb. SpinalTap is a Change Data Capture service. It listens to various changes on databases, and then publishes them to a standard queue. In our case, we use Kafka. Other services can then listen for the standard event, consume it and react accordingly, such as applying business logic or modifying their own data.
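To make the callback replacement concrete, here is a minimal sketch of the publish-and-consume pattern described above. It is not SpinalTap's API: the `MutationEvent` shape, the `InMemoryQueue` class (a stand-in for Kafka), and the `mark_home_busy` consumer are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class MutationEvent:
    """Hypothetical standard event describing one row-level change."""
    table: str
    operation: str  # "insert", "update", or "delete"
    row: dict


class InMemoryQueue:
    """Stand-in for a durable queue such as Kafka; topics are keyed by table."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[MutationEvent], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[MutationEvent], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, event: MutationEvent) -> None:
        # Deliver the event to every consumer registered for this table.
        for handler in self._subscribers[event.table]:
            handler(event)


# State owned by a separate (hypothetical) availability service.
busy_dates: Dict[int, Set[str]] = defaultdict(set)


def mark_home_busy(event: MutationEvent) -> None:
    """React to a new reservation instead of relying on a Rails callback."""
    if event.operation == "insert":
        busy_dates[event.row["home_id"]].update(event.row["dates"])


queue = InMemoryQueue()
queue.subscribe("reservations", mark_home_busy)

# The change-data-capture side would publish this after the reservation commits.
queue.publish(MutationEvent(
    table="reservations",
    operation="insert",
    row={"home_id": 7, "dates": ["2018-11-01", "2018-11-02"]},
))
print(sorted(busy_dates[7]))
```

The reservation writer never calls the availability code directly, which is the decoupling that in-process callbacks could not provide once the models lived in different services.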
Getting Started with the Migration
With our principles in hand, we set out to get started with the actual implementation, but some of our initial ideas were not the ones that we ended up going with. Our ideas evolved over time. Monarch butterflies migrate and they have an interesting migration path because their migration cycle is longer than any one butterfly's lifespan. This means the butterfly that starts the migration is not the one that ends the migration. Similarly, the initial services that we built are not considered the "poster children" right now. We've learned from mistakes and have developed best practices along the way.
When beginning, we wanted to start with something that was crucial to Airbnb, that was part of our core product. At the time, Airbnb was focused on homes, so we decided to pick the homes data model as the first service to take out and put into its own service. The homes data model was accessed by almost every feature in Airbnb, so if we could pull this out of the monolith and make it its own microservice, it gave us confidence that we could continue tackling the rest of the monolith.
First Attempts to Break Apart Monorail
Some of our first attempts to break apart the monolith started with looking at the various call sites in Monorail where our homes data was accessed. This was an attempt to consider replacing each of these methods with a service call to our home service, but upon looking, we found that there were thousands of such call sites, so it was not practical for us to manually go through each one and change it to a service RPC. Instead, we looked a layer deeper and considered applying Ruby metaprogramming to override these data access methods. Instead of changing the code at thousands of call sites, we would override these methods with versions that call our service.
Migrating Rails’ ActiveRecord
However, we encountered difficulty with relations with other models, and breaking apart the joins is difficult, so we went one layer deeper and looked into ActiveRecord. ActiveRecord has some access methods. This one, for example, home.find_by_host_id. It then gets translated by the ActiveRecord wrapper, which is a Ruby library that implements read and write access to business objects that are persisted to a relational database. ActiveRecord translates that method into a raw SQL query, select * from homes where host_id = 4. That raw SQL query gets sent to an ActiveRecord adapter layer, and by default for us, it was a MySQL adapter that sent the query directly to the database.
Custom ActiveRecord Adapter
Instead, we wanted to change this by writing a custom adapter, and this custom adapter created a query object. So if we expand into what this query object is, we broke apart that raw SQL query into components. There is a type, the query type, in this case, is "select." There's a table name, "homes," and the filter where the host ID is equal to four. It also supplies the fields of interest. We could then map this query object to a request object. From the query type and the table, we're able to know that we're interested in loading homes data to fill out that endpoint. The filters and the field could then help fill out the request body.
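The mapping from query object to request object can be sketched as follows. This is an illustration in Python rather than the Ruby adapter from the talk, and the `QueryObject` fields, the `ENDPOINTS` routing table, and the `/loadHomes` endpoint name are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class QueryObject:
    """Components pulled out of a raw SQL query by the custom adapter."""
    query_type: str          # e.g. "select"
    table: str               # e.g. "homes"
    filters: Dict[str, Any]  # e.g. {"host_id": 4}
    fields: List[str] = field(default_factory=lambda: ["*"])


# Hypothetical routing table: (query type, table) -> service endpoint.
ENDPOINTS: Dict[Tuple[str, str], str] = {
    ("select", "homes"): "/loadHomes",
}


def to_request(query: QueryObject) -> Dict[str, Any]:
    """Build a service request from the query type, table, filters, and fields."""
    key = (query.query_type, query.table)
    if key not in ENDPOINTS:
        # Anything the service cannot serve yet is surfaced explicitly.
        raise ValueError(f"no service endpoint for {key}")
    return {
        "endpoint": ENDPOINTS[key],
        "body": {"filters": query.filters, "fields": query.fields},
    }


# Equivalent of: select * from homes where host_id = 4
query = QueryObject(query_type="select", table="homes", filters={"host_id": 4})
request = to_request(query)
print(request)
```

Raising on an unmapped query type keeps unsupported access patterns visible, which fits the goal of discovering which SQL patterns the service API needs to support.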
Re-Route Queries to Services
So now with this query object and custom adapter, instead of going directly to the database, we could create a request object and send that to our home service. The home service would then read from the database. This may seem like a roundabout way to get data, but it provided us a few things. By doing the ActiveRecord adapter at the lowest level, we're able to capture the raw SQL patterns, ensuring that we create a flexible enough API on our service to support the use cases that Monorail had. It also gave us proof that we didn't need to go to the thousands of call sites and manually change them to service RPCs. By having it at the bottom layer, this allowed product engineers to continue with their interactions of the ActiveRecord methods and not need to change the workflow, while under the hood, we were calling our new service.
Service Interaction Design
We wanted to build out these services, but ensure that they could talk to each other and interact in a certain pattern. Jellyfish have an interesting migration pattern: they follow the sunlight in a specific direction from east to west. We wanted our services to interact in a specific direction as well, to have a strict flow of dependencies, and this begins with the service requests originating at our clients. In the interim, the client's requests would be routed to our Monorail application, which would send the request to our service network. The future plan was to go through an API gateway instead and then send that to our service network.
Starting at the bottom of the service network, we have different service types. The data service, as mentioned before, is the gatekeeper to their own data. Only that service can read and write to its dataset. A layer on top of that is the derived data service. This reads from both the data service as well as its own derived store, combines this, applies business logic and then it can be shared among multiple product contexts. The highest level of service is the presentation service. This reads from the data service, as well as the derived data service, to synthesize information that is shown to the user in our product.
As we began to build out more of these initial services, we realized the need for an additional type of service. This middle-tier service serves as a layer for the shared validation logic for products to read and write to our particular data service. We didn't want to bloat our data service with product business logic, so instead, we put it into this separate utility service.
Let's look at an example from the checkout page. There are many data services involved in that feature, including the reservation data service and the homes data service. They each read from their own separate isolated databases that only those services have access to. A home demand derived data service hydrates from these services; perhaps it wants the reservation dates or the location of the home, and it combines this with its own data store, some offline booking trend statistics. It can then combine these together to compute a statistic, such as the likelihood of your home being booked for a certain date, and that would be shown to the user. So the checkout page gets data from the home demand derived data service, the homes data service, and the reservation service. Say I do book the reservation. That means we're going to write to our reservation model. This goes through the reservation validation middle-tier service, which has the shared logic for how to validate writes to the reservation data.
Compare for Differences
With all these theories in mind, we wanted to set out and implement, but ensure that we didn't break anything along the way. So we were cautiously comparing for differences between our Monorail and our services. Walruses are migratory animals and they have two different ways that they migrate. One is swimming to their end migration location and the other is floating on ice sheets. For us, we wanted to compare two different ways of accessing the same information. One, through the Monorail and two, through our service.
And we start with reads. Reads are idempotent, which means that we can issue multiple identical requests and it'll have the same effect as issuing a single request. We compare our reads from read path A, the existing path of the Monorail going to the database, against read path B, going through our service, and we put this dual read behind a gate that's configurable in an admin UI tool. This is really important for us because it allows us to ramp up or completely turn off a feature without needing to make a code change, review and deploy it. Instead, we can do it simply with a click of a button.
We begin the comparisons for this dual read starting with 1% of traffic. We look for mismatches between the responses of the two read paths and address them along the way. Once the comparisons are clean, we then ramp up the production traffic going through these two read paths. We do this slowly, going from 5%, to 10%, to 25%, 50%, and then 100%; each step comparing, waiting and comparing again. Well, at 100%, we wait some more. So it's important to get enough traffic to ensure that the various access patterns are covered and to ensure that your service can handle 100% of the load that Monorail was supporting.
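The dual-read comparison behind a ramp gate might look like the sketch below. The `RampGate` class, the `dual_read` function, and the in-memory `mismatches` list are hypothetical; in practice the percentage would come from the admin tool and mismatches would go to logs or metrics.

```python
import random
from typing import Any, Callable, List, Optional, Tuple


class RampGate:
    """Runtime-configurable gate; `percent` stands in for the admin UI setting."""

    def __init__(self, percent: float, rng: Optional[random.Random] = None) -> None:
        self.percent = percent
        self._rng = rng or random.Random()

    def allows(self) -> bool:
        # random() is in [0.0, 1.0), so 0 never allows and 100 always allows.
        return self._rng.random() < self.percent / 100.0


mismatches: List[Tuple[Any, Any, Any]] = []


def dual_read(key: Any,
              read_path_a: Callable[[Any], Any],
              read_path_b: Callable[[Any], Any],
              gate: RampGate) -> Any:
    """Serve from path A; when the gate allows, also read path B and compare."""
    result_a = read_path_a(key)
    if gate.allows():
        try:
            result_b = read_path_b(key)
        except Exception as exc:  # a failing new path must not break the request
            result_b = f"error: {exc!r}"
        if result_a != result_b:
            mismatches.append((key, result_a, result_b))
    return result_a  # path A stays the source of truth until the switch-over


monolith_rows = {1: {"name": "Ana"}, 2: {"name": "Bo"}}
service_rows = {1: {"name": "Ana"}, 2: {"name": "bo"}}  # deliberate mismatch

gate = RampGate(percent=100)
for user_id in (1, 2):
    dual_read(user_id, monolith_rows.get, service_rows.get, gate)
print(mismatches)

gate.percent = 0  # "click of a button": comparisons are turned off
dual_read(2, monolith_rows.get, service_rows.get, gate)
```

Because the caller always receives the path A result, a buggy or slow path B can only produce a recorded mismatch, never a user-facing change.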
Once the comparisons at 100% are clean, we can then switch over to move all of our reads through our data service. Writes are a little bit different because we can't dual write to the same database, so instead, we write to a separate database. Say we have the monolith already hooked up to a presentation service and the presentation service is writing to the production database. Now we want to introduce a middle-tier service and pull out some of that validation logic from the presentation service. We call this write path B, and we compare it by having the middle-tier service write to a separate shadow database. The middle-tier service can then issue strongly consistent reads against the production database and the shadow database, compare the responses to look for any mismatches. Similar process; this is behind a UI configurable gate. We ramp up slowly, wait till we're at 100% and once all the comparisons are clean, then we switch over to moving all of our writes through this write validation middle-tier service.
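The shadow-write comparison can be sketched in the same spirit. The dictionaries standing in for the production and shadow databases, and the two write functions, are assumptions for illustration; the deliberate difference in how `guests` is normalized exists only to show a mismatch being caught.

```python
from typing import Dict, List

production_db: Dict[int, dict] = {}  # written by the existing path (write path A)
shadow_db: Dict[int, dict] = {}      # written only by the new path (write path B)
write_mismatches: List[int] = []


def write_path_a(reservation: dict) -> None:
    """Existing logic, as it lives in the presentation service today."""
    production_db[reservation["id"]] = dict(reservation)


def write_path_b(reservation: dict) -> None:
    """New middle-tier validation logic; here it also normalizes `guests`."""
    normalized = dict(reservation)
    normalized["guests"] = int(normalized["guests"])
    shadow_db[reservation["id"]] = normalized


def shadow_write(reservation: dict) -> None:
    """Write through both paths, then read both stores back and compare."""
    write_path_a(reservation)
    write_path_b(reservation)
    rid = reservation["id"]
    if production_db.get(rid) != shadow_db.get(rid):
        write_mismatches.append(rid)


shadow_write({"id": 1, "guests": 2})    # both paths store the same record
shadow_write({"id": 2, "guests": "3"})  # path B stores 3, path A stores "3"
print(write_mismatches)
```

Production data is never touched by the new path, so a mismatch here is a finding to investigate rather than an incident.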
Incremental Migration
This may seem like we took a lot of careful steps, but we made our migration process even more incremental. At Airbnb, we value democratic deploys. This is where we empower each engineer to be responsible for testing their own change and deploying it to production. We don't have a centralized ops team to do this for us. It's the responsibility of each engineer.
Because Monorail deploys were a big pain point, we wanted to move teams as quickly as possible to services so they can get the benefits of the separate build and deploy process. This meant that we wanted to build services and allow production traffic to go through them even if they were in an incomplete state. So one way that a service could be incomplete is that not all of the endpoints are ready, but that's okay because we migrate one endpoint at a time. This allows us to unblock other clients who are interested in that one endpoint and give us time to build out other functionality for that service.
For example, I work on the user service and when we first made this microservice, we only had one endpoint, /loadUsers, and only had one functionality. It fetched users from our one MySQL table by ID. This may not seem like very much, but it unblocked 10 other services within the first few months, giving them the ability to continue working while we added additional functionality and data sources to the user service.
We also migrate on a per-attribute level. Say we have a presentation service and it accepts some production traffic. If it's interested in 10 attributes that it needs to show to the user but, say, only three of them are ready, we'll get those three attributes that are already migrated from the SOA network and then fetch the remaining seven from Monorail. This allows all teams, including the presentation service, to get production traffic and have smaller change sets while they're building out their services.
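A per-attribute fallback could be sketched like this. The attribute names, the `MIGRATED_ATTRIBUTES` set, and the two fake backends are hypothetical; the point is only that one logical read is split between the new service and Monorail.

```python
from typing import Callable, Dict, Iterable, Set

# Attributes already served by the new service (hypothetical names).
MIGRATED_ATTRIBUTES: Set[str] = {"first_name", "language", "photo_url"}

Fetcher = Callable[[int, Set[str]], Dict[str, str]]


def load_attributes(user_id: int,
                    wanted: Iterable[str],
                    from_service: Fetcher,
                    from_monorail: Fetcher) -> Dict[str, str]:
    """Fetch migrated attributes from the service and the rest from Monorail."""
    wanted_set = set(wanted)
    migrated = wanted_set & MIGRATED_ATTRIBUTES
    remaining = wanted_set - MIGRATED_ATTRIBUTES
    result: Dict[str, str] = {}
    if migrated:
        result.update(from_service(user_id, migrated))
    if remaining:
        result.update(from_monorail(user_id, remaining))
    return result


# Fake backends that tag each value with where it came from.
def fake_service(user_id: int, attrs: Set[str]) -> Dict[str, str]:
    return {name: f"service:{name}" for name in attrs}


def fake_monorail(user_id: int, attrs: Set[str]) -> Dict[str, str]:
    return {name: f"monorail:{name}" for name in attrs}


profile = load_attributes(
    42, ["first_name", "language", "trips_count"], fake_service, fake_monorail
)
print(profile)
```

Migrating another attribute then becomes a one-line change to the set, which keeps each change set small.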
So remember that gong that we used to hit when the site went down? Well, even though we did all these precautionary steps, the gong was hit a few times in the beginning, but it's okay. We've learned from our mistakes and often, the first services tend to have some rough patches.
SOA Best Practices
From our learnings, we developed best practices. Wildebeests have a dangerous migration path. They've developed a best practice to keep their young safe by putting them in the center of the herd while they migrate. Similarly, we wanted to create best practices to keep our services safe, alive, functional, and able to scale. So we focused on standardizing service building and ensuring consistency across our services. We did this through the use of frameworks which auto-generate code for us, and through standard testing and deployment processes that utilize replay traffic. Replaying production traffic means that when a client makes a request to our production service, we take a copy of that request and send it to other targets that we can use internally.
Observability; it was important that we had a standard way to look at the health and functionality of each of our services. Digging into these more, an engineer would create a service because ultimately they want to support some sort of business logic, but then they will need to add an endpoint to expose this to clients. At Airbnb, we support both Java and Ruby as first class language services.
Now that we have our clients, we may need to add additional information such as server diagnostics, metrics, data validation, resilience on both the client and the server side, error handling, and for Ruby, some type checking. Now that our service is in production, we want to ensure that we have a dashboard, alerts and easy runbook documentation so other engineers know how to use the service.
This is beginning to seem like a lot of work just to run that one business logic. So instead, we invested in creating a service framework team. Their responsibility was to create solutions to help build our services in an automated way so that engineers could focus on the actual functionality of the service, not creating all the boilerplate. To do this, they use an IDL, or interface description language, and picked Thrift. So now, an engineer can focus on the business logic without needing to manually write all of the Java, Ruby, and client code. The IDL would do this boilerplate for us and allow us to do this in a standardized way.
So before, when we were coding services, it was difficult and engineers didn't really want to do it. They would rather stay in the monolith because it took weeks just to get a service boilerplate running. We had to deploy code in three different repositories in the correct order, just to get the basics of a service running. Once the service was running, we needed to maintain both the Java and the Ruby clients manually, otherwise there'd be inconsistencies between the two. But now, with our service frameworks, we get a lot of this auto-generated for free. The API is clearly defined in a Thrift file, where any engineer can go to an IDL service and know the endpoints it supports. Now, with the click of a button, we're able to create Ruby gem clients. Instead of having to manually type the Ruby client, we just click "deploy Ruby gem" and it looks at the IDL configuration and generates a client for us.
Here's an example of what our Thrift looks like. We might have some request represented as a Thrift struct. It takes in some strongly typed data, which was important for us due to the Ruby and Java inconsistencies. We can specify that a field is required or optional and a response is similarly defined as a Thrift struct. Endpoints are defined in our Thrift files as well. We can specify a response type in some struct and the name of the endpoint. We then could specify the request struct that it takes in as input and then what exceptions it throws. We can add additional annotations in our IDL as well, such as accepting replay traffic or has rate limiters.
And our testing deploying process benefited from the IDL as well. Previously, there was an element of uncertainty when creating a service. A lot of the testing was done manually such as curling or clicking around in the UI to trigger a request to our service. Instead of the "Let's hope for the best" mentality, we now have a more structured preproduction process with replay traffic as the center of it all.
Starting from the local development, we support services in our local dev to ensure that we can test our code and its dependencies before needing to merge and deploy. Once our code is approved, we can then merge into staging, and staging is a preproduction environment that utilizes the replay traffic. It takes a copy of the production traffic and sends it to staging. Our staging services then interact with other staging services as well, so we have a whole network of services before going to production. This allows us to determine if we've broken any of our downstream or upstream services before pushing it to prod.
Diffy is a really cool tool, open sourced by Twitter, and it allows us to compare the responses from staging with our new code against production with the last known good code. So let's take a detour and dive into Diffy. Diffy takes replayed traffic as input, and sends it to three targets: one being staging with the new code; two being primary, running the old code which is on production; and three, secondary, running the exact same old code as primary. We then compare the responses in pairs. Comparing staging and primary, we get raw response differences. And then we compare primary against secondary, and these differences we consider non-deterministic noise, as primary and secondary are running the exact same code. We filter out the noise and we're left with the response differences that can be attributed to the change that was just introduced on staging. This has been really helpful for our team to ensure that we haven't broken any of the existing functionality, or if we were doing a bug fix, that the change was reflected in the responses.
Diffy is not an SOA-specific tool, but it's much more practical to use when the service supports a much smaller endpoint set, and the engineers working on it are fewer as well. So from a Monorail perspective, if we were to use Diffy, there would be thousands of endpoints, and because the code is so entangled, it would be hard to debug. But with each service, we have a standard process where Diffy is included before moving to production. Once we've gotten enough traffic on Diffy and are comfortable to move forward, we then move to Canary, and then from Canary, which is a single instance of production, we move on to the rest of the production fleet.
Observability has gotten a lot of benefits from our standard frameworks as well. Previously, service owners were responsible for defining their own metrics and dashboards, and this ended up being done in a non-standard way. Alerts were inconsistent or incomplete, but now with our IDL auto-generating a lot of code for us, it gives us templated metrics that we don't even need to define. It's all done in the auto-generated template framework for us. With these templated metrics, we get templated dashboards as well, and the IDL allows for alert annotations. We can specify the thresholds right there in the IDL file. For example, we can set thresholds on the P95 latency: what is too high of a latency, what is too high of an error rate, or what is too low of a queries per second, or QPS, all within that IDL file.
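The noise-filtering idea can be illustrated for a single replayed request as below. This is not Diffy's implementation or API (Diffy aggregates differences over many requests); the function names and the sample responses are made up for the example.

```python
from typing import Any, Dict, Set

_MISSING = object()  # distinguishes "field absent" from "field is None"


def differing_fields(left: Dict[str, Any], right: Dict[str, Any]) -> Set[str]:
    """Names of top-level fields whose values differ between two responses."""
    return {
        key
        for key in left.keys() | right.keys()
        if left.get(key, _MISSING) != right.get(key, _MISSING)
    }


def attributable_differences(candidate: Dict[str, Any],
                             primary: Dict[str, Any],
                             secondary: Dict[str, Any]) -> Set[str]:
    """Raw differences (candidate vs primary) minus noise (primary vs secondary)."""
    raw = differing_fields(candidate, primary)
    noise = differing_fields(primary, secondary)
    return raw - noise


# Responses to the same replayed request from the three targets.
primary = {"id": 1, "price": 100, "served_at": "12:00:01"}
secondary = {"id": 1, "price": 100, "served_at": "12:00:02"}  # same code as primary
candidate = {"id": 1, "price": 110, "served_at": "12:00:03"}  # new code on staging

suspect = attributable_differences(candidate, primary, secondary)
print(suspect)
```

The `served_at` field differs between two instances running identical code, so it is treated as noise, while `price` is left as the difference attributable to the new code.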
Here's an example of one of our standard graphs. All we need to do is change the dropdown to be the name of our service, and then the rest of the graphs get populated for us. Because each IDL service has the same graph, it's really easy for us to get a good overview of the health and functionality of each service during an incident.
How is the Migration Going So Far?
This seems like a lot of work, so how is the migration going so far? Humpback whales have the longest migration path of all mammals. Similarly, the Airbnb migration path is long and we're not quite done yet. We're actually in the early stages, but we've seen a lot of great results so far. We've onboarded over 250 services using our IDL framework. We support over a thousand endpoints with these IDL services, and we've seen some great results in terms of improved build and deploy times. Instead of hours in Monorail, we now have deploys on the order of minutes within services.
We're seeing fewer reverts as well, which helps decrease deploy train times. Service ownership is much clearer and there are fewer engineers working on each service, so we aren't stepping on each other's toes as much. We're seeing bug fixes done more quickly and we're able to meet our SLAs faster. And we've also seen some interesting results in terms of latency. Performance gains are not, by themselves, a reason to move to SOA, but our Ruby monolith was single-threaded. We moved a lot of our services into Java, which is multithreaded, giving us the ability to benefit from parallelization of requests. This helped lower our latency a lot. Our search page is three times faster and our home description page is over 10 times faster.
Last year we had around 800 engineers and were doing 3,000 deploys a week. This is about one every three and a half minutes. Now, we've increased to around 1,000 engineers, but we're doing 10,000 deploys a week. This is about one every minute, so we're able to get engineers' code out to production faster and more reliably. Previously, we had our product engineers focused on front-end and user interface type features and they were separated from our infrastructure engineers who worked on back-end and some services. But now all teams are service owners. We've abolished our on-call rotation, the sysops, and instead, we have per team on-call. Each team is responsible for their own services and keeping them alive and functional.
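The deploy cadence figures above can be checked with a few lines of arithmetic. The sketch assumes deploys are spread evenly around the clock, and the per-engineer figures are derived here rather than quoted in the talk.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def minutes_between_deploys(deploys_per_week: int) -> float:
    """Average gap between deploys, assuming they are spread evenly over the week."""
    return MINUTES_PER_WEEK / deploys_per_week


def deploys_per_engineer(deploys_per_week: int, engineers: int) -> float:
    """Average weekly deploys per engineer."""
    return deploys_per_week / engineers


last_year_gap = round(minutes_between_deploys(3_000), 2)
this_year_gap = round(minutes_between_deploys(10_000), 2)
print(last_year_gap, this_year_gap)

print(deploys_per_engineer(3_000, 800), deploys_per_engineer(10_000, 1_000))
```

The gaps come out to roughly 3.4 minutes and 1 minute, matching the figures quoted, and the deploy rate per engineer rises from about 3.75 to 10 per week.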
If we go back to that checkout page, there're still a lot of services on there. We have multiple data services and multiple derived data services. To top it off, we have a checkout presentation service that reads from all of them. But if I were to make a similar change now, I would make it only in the checkout presentation service. I would be able to go to the message data service, look at its IDL file and know how to call its API with the correct Thrift structs.
Caution: SOA is Not for Everyone?
But I'd like to caution that even though SOA may seem great, it may not be for everyone. It has a lot of drawbacks, and a monolith is a great place to start, especially for a small team with quick iteration development. SOA introduces distributed services. This means that multiple network requests are required, whereas previously there was just one to the monolith. Each network request is a potential source of higher latency or failure. Consistency becomes difficult when we spread across multiple databases and services. Observability becomes harder and we may need to instrument distributed tracing.
Service orchestration becomes complex as well with SOA. We needed to onboard engineers to learn to be service owners and maintain their own clusters. At Airbnb, we started with EC2, but as we're building out more services, we're beginning to see scaling problems, so we're moving towards using Kubernetes.
SOA has a high investment cost. You can see there is a lot of tooling and additional frameworks that are needed, and having clear documentation is helpful as well because there are going to be hundreds of services, instead of that one monolith.
But the SOA migration is working out for us. Be prepared for a long commitment and journey, and ensure you don't break functionality on the way by comparing slowly, carefully, and piecemeal. Focus on the standardization of services, especially prioritizing frameworks, tools, and documentation. So, look both ways before you begin. Airbnb is having a positive experience so far. Thank you for listening to our migration story.