Amazon DynamoDB: Evolution of a Hyperscale Cloud Database Service


Summary

Akshat Vig presents Amazon’s experience operating DynamoDB at scale and how the architecture continues to evolve to meet the ever-increasing demands of customer workloads.

Bio

Akshat Vig is a Principal Engineer at AWS. Akshat has been working on DynamoDB since its inception. He is one of the primary authors on the DynamoDB paper published at USENIX. He has filed close to 100 patents, served on IEEE program committees, and has given keynotes around the world.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Vig: How many of you have ever wanted a database that provides predictable performance, high availability, and is fully managed? What I'm going to do is talk about the evolution of a hyperscale cloud database service, DynamoDB, and talk through the lessons that we have learned over the years while building this hyperscale database. I am Akshat Vig. I'm a Principal Engineer on the Amazon DynamoDB team. I've been with DynamoDB right from its inception.

Why DynamoDB?

AWS offers 15-plus purpose-built database engines to support diverse data models, including relational, in-memory, document, graph, and time-series. The idea is that you as a customer can choose the right tool for the use case that you're trying to solve. Here we are zooming in on DynamoDB, which is a key-value database. The first question that comes to mind is, why DynamoDB? Let's go back to history. During the 2004-2005 timeframe, amazon.com was facing scaling challenges caused by the relational database that the website was using. At Amazon, whenever we have these service disruptions, one thing we do as a habit, as a culture, is COEs, which are basically corrections of errors. In that COE, we ask questions like, how can we make sure that the issue that happened does not happen again? The use case for which that particular outage happened was related to the shopping cart. One of the questions that we asked in the COE was, why are we using a SQL database for this specific use case? What are the SQL capabilities that are actually needed? It turns out, not many. Choosing the right database technology is the key to building a system for scale and predictable performance. At that time, when we asked this question, if not a SQL database, what exactly would we use? No other database technology existed then that met the requirements we had for the shopping cart use case.

Amazon created Dynamo. It was between 2004 and 2007 that Dynamo was created. Finally, in 2007, we published the Dynamo paper, after letting it run in production and be used not just by the shopping cart use case but by multiple amazon.com services. Dynamo was created in response to the need for a highly available, scalable, and durable key-value database for the shopping cart, and then more teams started using it. Dynamo was a software system that teams had to take and run as installations on resources that they owned, and it became really popular inside Amazon across multiple teams. One thing we heard from all these teams was: Dynamo is amazing, but what if you made it a service, so that the many teams trying to become experts in running these Dynamo installations would have an easier time? That led to the launch of DynamoDB.

Dynamo and DynamoDB are different. Amazon DynamoDB is a result of everything we have learned about building scalable, large-scale databases at Amazon, and it has evolved based on the experience we have gained while building these services. There are differences between Dynamo and DynamoDB. For example, Dynamo was single tenant: as a team, you would run an installation and you would own the resources used to run that service. DynamoDB is multi-tenant; it's basically serverless. Dynamo provides eventual consistency. DynamoDB is opinionated about it and provides both strong and eventual consistency. Dynamo prefers availability over consistency, whereas DynamoDB prefers consistency over availability. In Dynamo, routing and storage are coupled; as we'll see, routing and storage in DynamoDB are decoupled. Custom conflict resolution was supported in Dynamo; in DynamoDB we have last writer wins.

Coming back to the question of why DynamoDB: if you ask this question today, 10 years later, customers will still say they want consistent performance. They want better performance. They want a fully managed serverless experience. They want high availability from the service. Consistent performance at scale is one of the key durable tenets of DynamoDB. As DynamoDB has been adopted by hundreds of thousands of customers, and as request rates keep increasing, customers running mission-critical workloads on DynamoDB keep seeing consistent performance at scale. Proof is in the pudding: one of our customers, Zoom, saw unprecedented usage in early 2020 that grew from 10 million to 300 million daily meeting participants, and DynamoDB was able to scale with just a click of a button and still provide predictable performance to them.

DynamoDB is fully managed; what does that mean? DynamoDB was serverless even before the term serverless was coined. You pay for whatever you use in DynamoDB, and you can scale down to zero essentially: if you're not sending any requests, you don't get charged. It is built with separation of storage and compute. As a customer, in case you run into logical corruptions where you accidentally deleted some items or deleted your table, you can do a restore. DynamoDB also provides global active-active replication for use cases where you want the data closer to the user: you can run a DynamoDB table as a global table.

On availability, DynamoDB offers an SLA of four nines of availability for a single-region setup. If you have a global table, you get five nines of availability. To understand the magnitude of scale, take amazon.com, one of DynamoDB's customers: during the 2022 Prime Day, amazon.com and all the different websites generated 105.2 million requests per second. This is just one customer, which helps you understand the magnitude at which DynamoDB runs. Throughout all of this, they saw predictable single-digit-millisecond performance. It's not just amazon.com; hundreds of thousands of customers have chosen DynamoDB to run their mission-critical workloads.

DynamoDB, Over the Years

That was an introduction to DynamoDB: how it is different from Dynamo, and which properties are the durable tenets of the service. Let's look at how it has evolved over the years. DynamoDB was launched in 2012; working backward from customers is how Amazon operates. It started as a key-value store. We first launched DynamoDB in 2012, where you as the customer can do Puts and Gets, and it scales. Foundationally, it was very strong. Then we started hearing from customers that they wanted more query capabilities in DynamoDB, and we added indexing. Then customers started asking about JSON documents, so we added that so they can preserve complex and possibly nested structures inside DynamoDB items. Then, in 2015, a lot of customers were asking us: can you provide us materialized views? Can you provide backup and restore? Can you provide global replication? We said, let's take a step back and figure out what common building block we need to build all these different things customers are asking for. We launched DynamoDB Streams so that, while we built all these native features inside DynamoDB, customers could innovate on their own, and a lot of customers actually used the basic scan operation and streams to do exactly that. Most recently, we launched easier ingestion of data into DynamoDB and easier export of data out of DynamoDB. Over the years, the asks from customers around features, predictable performance, availability, and durability have been constant.

How Does DynamoDB Scale and Provide Predictable Performance?

How does DynamoDB scale and provide predictable performance? Let's try to understand this aspect of DynamoDB by understanding how exactly a PutItem request works. As a client, you send a request; you might be in the Amazon EC2 network or somewhere on the internet, it doesn't matter. As soon as you make a PutItem request, it lands on the request router, which is the first service that you hit. Like every AWS call, this call is authenticated and authorized using IAM. Once the request is authenticated and authorized, we look at the metadata. We try to figure out where exactly to route the request, because the address of where the data for this particular item has to finally land is stored in a metadata service, which is what the request router consults. Once it knows where to route the request, the next thing it does is verify whether the table that the customer is trying to use has enough capacity. If it has enough capacity, the request is admitted; if the capacity is not there, the request is rejected. This is basically admission control done at the request router layer. Once all that goes through, the request is sent to the storage node. For every item in DynamoDB, we maintain multiple copies of the data. Among the DynamoDB storage nodes, one is the leader storage node; the other two are follower storage nodes. Whenever you make a write request, it goes to the leader and gets written on at least one more follower before the write is acknowledged back to the client.
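
To make the flow above concrete, here is a minimal Python sketch of the request-router steps just described (authorize, look up routing metadata, apply admission control, forward to the leader). All of the helper names and wiring here are hypothetical stand-ins for illustration, not the actual DynamoDB implementation.

    # A sketch of the PutItem path through a request router, under invented names.
    def handle_put_item(request, authorize, find_partition, try_admit, replicate):
        if not authorize(request):                       # IAM authentication/authorization
            raise PermissionError("AccessDenied")
        partition = find_partition(request["table"], request["key"])   # metadata lookup
        if not try_admit(request["table"], cost=1):      # admission control: table capacity
            raise RuntimeError("ProvisionedThroughputExceeded")
        # Forward to the leader; the write is acknowledged only after at least
        # one follower has also persisted it.
        return replicate(partition["leader"], partition["followers"], request["item"])

    # Toy wiring just to show the flow end to end.
    result = handle_put_item(
        {"table": "customers", "key": "cust#1", "item": {"name": "Jane"}},
        authorize=lambda r: True,
        find_partition=lambda t, k: {"leader": "sn-1", "followers": ["sn-2", "sn-3"]},
        try_admit=lambda t, cost: True,
        replicate=lambda leader, followers, item: {"status": "OK", "leader": leader},
    )
    print(result)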

We don't have just a single request router and just three storage nodes; the service consists of many thousands of these components. Whenever a client makes a request, it lands on one of the request routers, which then sends it to the right storage nodes. Like any well-architected AWS service, DynamoDB is designed to be fault tolerant across multiple availability zones. In each region, there are request routers and storage nodes in three different availability zones. We maintain three different copies of data for every item that you store in a DynamoDB table. The request router essentially does a metadata lookup to find out where exactly to route the request. It takes away from the clients the burden of doing the routing. When I said storage and routing are decoupled, that's what I meant: the clients don't have to know where to route the request. It is all abstracted away in the request router.

Whenever the request router gets a request, it finds the storage nodes that are hosting the data and connects to the leader storage node. The leader storage node processes the request and, once it gets an acknowledgment from at least one more replica, acknowledges it back to the client.

Data is replicated to at least two availability zones before it is acknowledged. DynamoDB uses Multi-Paxos to elect a leader, and the leader continuously heartbeats with its peers. The reason for doing this is that if a peer fails to hear heartbeats from the leader, a new leader can be elected so that availability is not impacted. The goal is to detect failures and elect a new leader as quickly as possible.
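
As a rough illustration of the heartbeat-based failure detection described above, here is a toy sketch in which a replica that stops hearing leader heartbeats within a timeout would trigger a new election. The timeout value and class names are invented; the real protocol is Multi-Paxos with its own timing.

    import time

    class Replica:
        HEARTBEAT_TIMEOUT = 1.5  # seconds; purely illustrative

        def __init__(self, name):
            self.name = name
            self.last_heartbeat = time.monotonic()

        def on_heartbeat(self):
            # Called whenever a heartbeat arrives from the current leader.
            self.last_heartbeat = time.monotonic()

        def leader_suspected_failed(self):
            return time.monotonic() - self.last_heartbeat > self.HEARTBEAT_TIMEOUT

    follower = Replica("storage-node-2")
    follower.on_heartbeat()
    if follower.leader_suspected_failed():
        print("trigger leader election")   # in the real system, via Multi-Paxos
    else:
        print("leader healthy")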

Tables

Now that we understand the scale at which DynamoDB operates and how the request routing logic works, let's look at the logical construct, the table, and how exactly DynamoDB automatically scales as your traffic and data size increase. As a customer, you create a table, and for each table you specify a partition key. In this particular example, each customer has a unique customer identifier, and we are storing customer information in this table. Customer ID is your partition key. You also store other customer information, like name and city, as other attributes in the item. DynamoDB scales by partitioning. How that happens is, behind the scenes, whenever you make a call to DynamoDB with the customer ID, or whatever is your partition key, DynamoDB runs a one-way hash. The reason for doing that one-way hash is that it results in random distribution across the total hash space associated with that table. A one-way hash cannot be reversed; it's not possible to determine the input from the hashed output. The hashing algorithm results in highly randomized hash values, even for inputs that are very similar.

A table is partitioned into smaller segments based on the overall capacity you have asked for and the size of the table. Each partition contains a contiguous range of key values. For example, in this case, we have a green partition that covers values roughly from 0 to 6. Similarly, we have the orange partition, which covers values from 9 to B, and then the pink partition, which covers values from E to F. Essentially, given the hashed value of an item's partition key, a request router can determine which hash segment that particular item falls into, and from the partition metadata service it can find the three storage nodes holding the copies of that particular item and then send the request to that particular set of nodes.

As I explained previously, we have three copies of data in three different availability zones. If we have these three partitions, we essentially have three green partitions in three different zones, three orange partitions, and three pink partitions. The metadata about where exactly these partitions are is stored in a metadata service, and that particular piece of metadata is called a partition map. A partition map is essentially the key ranges that each partition supports, plus green1, green2, green3, which are the addresses of the three storage nodes where that partition is actually hosted. Think about it: when Zoom comes and asks for a table with 10 million read capacity units, we would essentially add more partitions. If they suddenly increase their throughput to 100 million, we would add more partitions correspondingly and update the metadata, and that's how DynamoDB scales.
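
Here is a simplified sketch of that partition-map lookup, assuming a toy 16-value hash space with invented range boundaries and replica names; the real hash space, hash function, and metadata layout are different.

    import bisect
    import hashlib

    # (range_end, replica addresses), sorted by range_end over a toy 0..15 hash space.
    PARTITION_MAP = [
        (0x6, ["green1", "green2", "green3"]),
        (0xB, ["orange1", "orange2", "orange3"]),
        (0xF, ["pink1", "pink2", "pink3"]),
    ]

    def one_way_hash(partition_key: str) -> int:
        # Any one-way hash with good dispersion works for the illustration.
        return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % 16

    def find_replicas(partition_key: str):
        h = one_way_hash(partition_key)
        range_ends = [end for end, _ in PARTITION_MAP]
        idx = bisect.bisect_left(range_ends, h)   # first range whose end covers h
        return PARTITION_MAP[idx][1]

    print(find_replicas("customer-42"))  # e.g. ['orange1', 'orange2', 'orange3']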

Predictable Performance and Data Distribution Challenges

What challenges are there? DynamoDB is a multi-tenant system. What are the different challenges that come into picture that we have to solve? What are the lessons that we have learned to provide predictable performance? One of the common challenges in a multi-tenant system is workload isolation, because it's not just one customer that we have, we have multiple customers. These customers, their partitions are installed on the storage nodes which are multi-tenant. If isolation is not done right, it can cause performance impact to these customers. Let's jump into how exactly we solve that. In the original version of DynamoDB that was released in 2012, customers explicitly specified the throughput that the table required in terms of read capacity units and write capacity units. Combined, that is what is called as provisioned throughput of the table.

If a customer reads an item that is up to 4 KB, that means it has consumed one read capacity unit. Similarly, if a customer does a write of a 1 KB item, that means one write capacity unit is consumed. Recall from the previous example that the customers table had three partitions. If a customer asked for 300 read capacity units in the original version of DynamoDB, we would assign 100 RCUs to each of the partitions, for 300 RCUs in total for the table, assuming that your workload is uniform, that is, your traffic goes to the three different partitions at a uniform rate. To provide workload isolation, DynamoDB uses the token bucket algorithm. A token bucket tracks the consumption of the capacity that a table, and each partition, has, and enforces a ceiling on it.
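
As a small illustration of that capacity-unit accounting, reads are billed in 4 KB increments and writes in 1 KB increments; the helper below is a sketch of that arithmetic for the strongly consistent read case described in the talk, not an official pricing calculator.

    import math

    def read_capacity_units(item_size_bytes: int) -> int:
        # One RCU per 4 KB (rounded up) for a strongly consistent read.
        return max(1, math.ceil(item_size_bytes / 4096))

    def write_capacity_units(item_size_bytes: int) -> int:
        # One WCU per 1 KB (rounded up).
        return max(1, math.ceil(item_size_bytes / 1024))

    print(read_capacity_units(3500))   # 1 RCU  (item up to 4 KB)
    print(write_capacity_units(2500))  # 3 WCUs (2.5 KB rounds up to 3 KB)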

Looking at one of these partitions: we had token buckets at the partition level in the original version of DynamoDB. Each second, we are refilling tokens at the rate of the capacity assigned to the partition, which is the bucket in this particular case. When read requests are served, we continuously deduct RCUs based on the consumption. If you do one request, we consume one token from this bucket, and the bucket keeps getting refilled at a constant rate. If the bucket is empty, obviously, we cannot accept the request and we ask customers to try again. Overall, let's say a customer is sending requests: while there are tokens, up to 100, the requests will get accepted for the green partition. As soon as the consumed rate goes above 100 RCUs, in this particular example, as soon as it reaches 101 RCUs, the request will get rejected, because there are no tokens left in that token bucket. This is the high-level idea of how token buckets work.
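
Here is a minimal token-bucket sketch along those lines: tokens refill at the partition's assigned rate, each request deducts tokens, and an empty bucket means the request is throttled. The class and parameter names are illustrative, not DynamoDB's internals.

    import time

    class TokenBucket:
        def __init__(self, refill_rate_per_sec: float, capacity: float):
            self.rate = refill_rate_per_sec
            self.capacity = capacity
            self.tokens = capacity
            self.last_refill = time.monotonic()

        def _refill(self):
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now

        def try_consume(self, units: float = 1.0) -> bool:
            self._refill()
            if self.tokens >= units:
                self.tokens -= units
                return True
            return False   # no tokens left: throttle and ask the caller to retry

    bucket = TokenBucket(refill_rate_per_sec=100, capacity=100)  # a 100-RCU partition
    print(bucket.try_consume(1))  # True while tokens remain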

What we found out when we launched DynamoDB is that uniform distribution of workloads is very hard. Getting a uniform workload for the full duration that your application is running and your table exists is very hard for customers to achieve. That's because traffic tends to come in waves or spikes. For example, let's say you have an application for serving coffee. You will see a spike early in the morning, when your traffic suddenly increases. Then, as most of the customers get their coffee and go to the office, the traffic suddenly drops. Traffic is not uniform; it changes with time. Sometimes it is spiky, sometimes there is not much traffic in the system. If you create a table with 100 RCUs and you see a spike of traffic greater than 100 RCUs, then whatever is above 100 RCUs will get rejected. That's what I mean by non-uniform traffic over time. Essentially, maybe your traffic is getting distributed across all the partitions, or maybe it's going to a subset of partitions, but it is not uniform across time. Which means, if you have provisioned the table at 100 RCUs, any request sent above the 100 RCU limit will get rejected.

Another challenge we saw was how customers solved this problem of getting throttled. What they did was start provisioning for the peak. Instead of asking for 100 RCUs, they would ask for 500 RCUs, which meant they could handle the peak workload they would see in the day, but for the rest of the day there was a lot of waste in the system. This meant a lot of capacity was unused and wasted, which incurred cost for the customers. Customers asked us, can you solve this problem? We said, what if we let customers burst beyond the capacity of the bucket? To help accommodate spikes in consumption, we launched bursting, where we allow customers to carry over their unused throughput in a rolling 5-minute window. It's very similar to unused minutes in a cellular plan: you're capped, but if you don't use minutes in the last cycle, you can move them to the next one. That's what we called the burst bucket. Effectively, the increased capacity of the bucket helped customers absorb their spikes. This was in the 2013 timeframe when we introduced bursting: unused provisioned capacity was banked to be used later, and when you exercised those tokens, they were spent. Finally, we were able to solve that particular problem of non-uniform workload over time.
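
A rough sketch of the bursting idea, assuming a simplified model in which unused per-second capacity is banked into a burst bucket capped at roughly five minutes' worth of throughput and spent during spikes. The exact mechanics in DynamoDB differ; this is only an illustration.

    class BurstingPartition:
        BURST_WINDOW_SECONDS = 300  # roughly a rolling 5-minute allowance (assumed)

        def __init__(self, provisioned_per_sec: int):
            self.provisioned = provisioned_per_sec
            self.burst_tokens = 0.0

        def tick(self, consumed_this_second: float) -> bool:
            """Account for one second of traffic; return False if throttled."""
            if consumed_this_second <= self.provisioned:
                unused = self.provisioned - consumed_this_second
                cap = self.provisioned * self.BURST_WINDOW_SECONDS
                self.burst_tokens = min(cap, self.burst_tokens + unused)  # bank unused capacity
                return True
            overage = consumed_this_second - self.provisioned
            if overage <= self.burst_tokens:
                self.burst_tokens -= overage   # spend banked tokens to absorb the spike
                return True
            return False                       # spike larger than the banked capacity

    p = BurstingPartition(provisioned_per_sec=100)
    for _ in range(60):
        p.tick(40)          # a quiet minute banks unused capacity
    print(p.tick(500))      # a short spike above 100/sec is absorbed: True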

We talked about non-uniform distribution over time; let's talk about non-uniform distribution over keys. Let's say that you're running a census application for Canada, and the data of the table is partitioned based on ZIP codes. You can see in this map that 50% of Canadians live below that line, and 50% of Canadians live north of that line. What you will see is that most of your data is going to a subset of the partitions, which means your traffic on those partitions will be higher compared to your traffic on the other partitions. In this example, we have 250 RCUs going to the green partition, and 10 RCUs each going to the orange and pink partitions. Overall, the partitions are not seeing uniform traffic. The takeaway we had from bursting and non-uniform distribution over space was that we had tightly coupled how much capacity a partition gets to where those partitions physically land. We had coupled partition-level capacity to admission control, and admission control was distributed and performed at the partition level. What that resulted in, pictorially, is that you would see all the traffic go to a single partition, and then, since there is not enough capacity on that partition, requests would start getting rejected.

The key point to note here is that even though the customer's table has enough capacity, for example 300 RCUs in this case, that particular partition was only assigned 100 RCUs, which is why the requests are getting rejected. Customers would say: I have enough capacity on my table, why is my request getting rejected? This was called throughput dilution. The next thing we had to do was solve throughput dilution. To do that, we launched global admission control. We realized it would be beneficial to remove admission control from the partition level, move it up to the request router layer, and let all these partitions burst. We still keep a maximum capacity that a single partition can serve, for workload isolation, but we moved the token buckets from the partition level to a global, table-level token bucket. In the new architecture, we introduced a new service called GAC, global admission control as a service. It's built on the same idea of token buckets, but the GAC service centrally tracks the total consumption of table capacity, again in terms of tokens. Each request router maintains a local token bucket so that admission decisions are made independently, and it communicates with GAC to replenish the tokens at regular intervals. GAC essentially maintains an ephemeral state computed on the fly from the client requests. Going back to the 300 RCU example: now customers could drive that much traffic even to a single partition, because we moved the token bucket from the partition level to a global, table-level bucket. With that, no more throughput dilution. A great win for customers.
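
A simplified sketch of that global admission control idea: a hypothetical central GAC service tracks table-level tokens, and each request router keeps a small local bucket that it replenishes in batches so per-request decisions stay local. All names and numbers are invented for illustration.

    class GacService:
        def __init__(self, table_capacity_per_sec: float):
            self.available = table_capacity_per_sec  # would be refilled every second in reality

        def request_tokens(self, wanted: float) -> float:
            granted = min(wanted, self.available)
            self.available -= granted
            return granted

    class RequestRouter:
        def __init__(self, gac: GacService, batch: float = 50):
            self.gac = gac
            self.batch = batch        # how many tokens to fetch from GAC at a time
            self.local_tokens = 0.0

        def admit(self, cost: float = 1.0) -> bool:
            if self.local_tokens < cost:
                self.local_tokens += self.gac.request_tokens(self.batch)
            if self.local_tokens >= cost:
                self.local_tokens -= cost
                return True
            return False              # the table, not one partition, is out of capacity

    gac = GacService(table_capacity_per_sec=300)
    routers = [RequestRouter(gac) for _ in range(3)]
    print(sum(r.admit() for r in routers for _ in range(150)))  # about 300 admitted in total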

That particular solution was amazing: we had launched bursting and we did global admission control, and it helped a lot of use cases in DynamoDB. Still, there were cases where a customer, due to non-uniform distribution over time or space, could run into scenarios where traffic to a specific partition reached its maximum. If a partition can do 3000 RCUs maximum, and a customer wants to do more on that partition, requests above 3000 RCUs would get rejected. We wanted to solve that problem as well. What we did was, as traffic increases on a partition, we actually split the partition. Instead of throttling the customer, we started doing automatic splits. The idea behind automatic splits was to identify the right midpoint, one that would actually help redistribute the traffic between the two new partitions. If customers sent more traffic to one of the partitions, we would further split that into smaller partitions and route the traffic to the new partitions.

Now you have partitions that are equally sized, or balanced, essentially. You as a developer did not have to do a single thing. AWS is literally adjusting the service to fit your needs based on the specific usage pattern that you are generating. All this magic happens to solve both problems, whether you have non-uniform traffic over time or non-uniform traffic over space. This is not something we got right from day one. As more customers built on top of DynamoDB, we analyzed their traffic, understood the problems they were facing, and solved those problems by introducing bursting, split for consumption, and global admission control. Going back to the picture: if the customer is driving 3000 requests per second to the green partition, we would automatically split it, identify exactly the right place to split, and split it so that the traffic divides 1500 RCUs and 1500 RCUs between the two. This was, again, amazing. We essentially did a bunch of heavy lifting on behalf of the customers.
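
A toy sketch of the split-for-consumption idea: rather than splitting a hot partition at the literal middle of its key range, pick the point that roughly halves the observed traffic. The heuristic below is a simplification for illustration only, not the actual split logic.

    def choose_split_point(key_traffic: dict) -> str:
        """Return the key at which to split so each side gets about half the traffic."""
        total = sum(key_traffic.values())
        running = 0.0
        for key in sorted(key_traffic):
            running += key_traffic[key]
            if running >= total / 2:
                return key
        return max(key_traffic)

    # Traffic heavily skewed toward one key inside the partition's range.
    observed = {"a": 5, "b": 10, "c": 900, "d": 40, "e": 45}
    print(choose_split_point(observed))  # "c": the hot-key boundary, not the middle key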

One thing we were still hearing from customers was: DynamoDB has figured out a lot of things for us, but we come from a world where we have always been thinking about servers, and now you're asking us to think in terms of read capacity units and write capacity units. Can you simplify that further? Instead of asking customers to specify provisioning at the time of table creation, we launched on-demand, where you don't even have to specify that. All the innovations we did around bursting, split for consumption, and global admission control enabled us to launch on-demand mode on tables, where you just create a table, start sending requests, and pay per request for that table.

Key Lesson

The key lesson here is that designing a system that adapts to the customer's traffic pattern is the best experience you can provide to the customer of a database, and DynamoDB strives for that. We did not get this right in the first place. We launched with the assumption that traffic would be uniformly distributed, but realized there is non-uniform traffic distribution over time and space. Then, analyzing those problems and making educated guesses, we evolved the service and solved these problems so that all the heavy lifting, all the essential complexity, is moved away from the customer into the service, and customers just get a magical experience.

How Does DynamoDB Provide High Availability?

DynamoDB provides high availability; let's look at how it does that. DynamoDB has evolved, and a lot of customers have moved their mission-critical workloads onto DynamoDB. In AWS, service disruptions happen, and in 2015 DynamoDB also had a service disruption. As I said in the beginning, whenever there is a service disruption, we try to learn from it. The goal is to make sure that the impact we saw doesn't repeat, that the system's weaknesses are actually fixed, and that we end up with a more highly available service. When this issue happened, one of the learnings from that particular COE was that we identified a weak link in the system. That link was related to caches, specifically the metadata caches that we had in the DynamoDB system. One thing about caches is that they are bimodal: there are essentially two paths a request can take. One is a cache hit, where requests are served from the cache; in the case of metadata, all the metadata that the request routers wanted was being served from the cache. Then you have the cache miss case, where all the requests go back to the database. That's what I mean by the bimodal nature of caches. Bimodality in a distributed system is a volcano waiting to erupt. Why do I say that?

Going back to our PutItem request: whenever a customer made a request to DynamoDB to put or get an item, the request router is the first service where that request lands. The request router has to find out where to route the request, that is, what the storage nodes are for that particular customer table and partition, so it hits a metadata service. To optimize that, DynamoDB also had a partition map cache in the request routers. The idea is that since the partition metadata doesn't change that often, it's a highly cacheable workload. DynamoDB actually had about a 99.75% cache hit ratio on these caches on the request router. Whenever a request lands on a brand-new request router, it has to go and find out the metadata. Instead of asking for the metadata of just the specific partition, it would ask for the metadata of the full table, assuming that the next time the customer makes a request for a different partition, it already has that information. At a maximum, that could be a 64 MB response.

We don't have just one request router; as I said, we have multiple request routers. If a customer creates a table with millions of partitions and starts sending that many requests, they'll hit not just one request router but multiple request routers. All those request routers would then start asking about the same metadata, which means you have reached a system state where you have a big fleet talking to a small fleet. The fundamental problem is that if there is nothing in the caches, that is, the cache hit ratio drops to zero, you have a big fleet driving a sudden spike of traffic to a small fleet. In steady state the miss traffic was 0.25%; if the cache hit ratio becomes zero, the caches become ineffective and traffic jumps to 100%, which means a 400x increase in traffic, and that can further lead to cascading failures in the system. The thing we wanted to do was remove this weak link from the system so that the system can always operate in a stable manner.

How did we do that? We did two things. First, as I said, in the first version of DynamoDB, whenever the request router found it had no information about the partition it wanted to talk to, it would load the full partition map for the table. The first change was that, instead of asking for the full partition map, it asks only for the particular partition it is interested in. That was the simpler change, and we were able to do it faster. Second, we built an in-memory distributed datastore called MemDS. MemDS stores all the metadata in memory; think of it like an L2 cache. All the data is stored in a highly compressed manner. The MemDS processes on a node encapsulate a Perkle data structure. It can answer questions like: for this particular key, which partition does it land in? MemDS responds to the request router with that information, and the request router can then route the request to the corresponding storage node.

We still do not want to impact performance. We do not want to make an off-box call for every request the customer is making, so we still want to cache results. We introduced a new cache, called the MemDS cache, on these request routers. The one thing that is different and critical, the most important thing we did differently with these caches, is that even when there is a hit in the MemDS cache, we still send the traffic to the MemDS system asking for the information. What that does is ensure the system always generates a constant load on the MemDS system, so there is no case where the caches suddenly become ineffective and traffic suddenly rises: it always acts as if the caches were ineffective. That's how we solved the weak link of the metadata fleet getting impacted by requests landing from multiple request routers onto the metadata nodes. Overall, the lesson here is that designing systems for predictability over absolute efficiency improves the stability of a system. While systems like caches can improve performance, do not allow them to hide the work that would be performed in their absence; ensure that your system is always provisioned to handle the unexpected load that can occur when the caches become ineffective.
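
Here is a sketch of that "constant work" caching pattern, assuming a hypothetical local cache that answers from memory but still issues a lookup to MemDS on every request, so the downstream load does not depend on the hit ratio. In the real system the refresh is done asynchronously; this synchronous version is only for illustration.

    class ConstantLoadCache:
        def __init__(self, fetch_from_memds):
            self.fetch = fetch_from_memds
            self.entries = {}
            self.downstream_calls = 0

        def get(self, key):
            cached = self.entries.get(key)
            # Refresh unconditionally -- hit or miss -- so MemDS always sees the
            # same request rate it would see if the cache were empty.
            self.downstream_calls += 1
            fresh = self.fetch(key)
            self.entries[key] = fresh
            return cached if cached is not None else fresh

    cache = ConstantLoadCache(fetch_from_memds=lambda k: {"partition": "green", "key": k})
    for _ in range(5):
        cache.get("customers:42")
    print(cache.downstream_calls)  # 5 -- load on MemDS is independent of the hit ratio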

Conclusions and Key Takeaways

We talked about the first and second conclusions. The first is that adapting to customer traffic patterns improves their experience. We saw how we solved different problems by introducing bursting, global admission control, and finally on-demand. Second, designing systems for predictability over absolute efficiency improves system stability. That's the one we just saw with caches: caches are bimodal, and it's important to make sure the system generates predictable load; failure scenarios are tamed by making sure your system is provisioned for the maximum load it would have to handle. The third and fourth are two more things that we talk about in much more detail in the paper. The third is that DynamoDB is a distributed system, with multiple storage nodes and multiple request routers, and performing continuous verification of data at rest is necessary; that's the best way we have found to ensure we meet high durability goals. Last, maintaining high availability as the system evolves might mean that you have to touch the most complex parts of your system. In DynamoDB, one of the most complex parts is the Multi-Paxos protocol, and to improve availability we had to make changes in that protocol layer. The reason we were able to make those changes easily was that we had formal proofs of these algorithms, written right from the original days of DynamoDB. That gave us high confidence: since we had a proof of the working system, we could tweak it and make sure that the new system still ensures correctness and that all the invariants are met.

Questions and Answers

Anand: What storage system does metadata storage use?

Vig: DynamoDB uses DynamoDB for its storage needs. Think about it: the software that is running on the storage nodes is the same software that is running on the metadata nodes. As we have evolved, as I talked about with MemDS being introduced, think of MemDS as an L2 cache on top of the metadata that is used to serve partition metadata, based on the lessons we learned about scaling bottlenecks with the metadata.

Anand: You're saying it's just another Dynamo system, or it's a storage system? Does the metadata service come with its own request router and storage tiers?

Vig: No, it's just the storage node software. The request router basically has a client that talks to the storage nodes. It uses the same client to talk to the metadata nodes, so whenever a request comes in, it's exactly the same code that we run for a production customer accessing DynamoDB.

Anand: Is MemDS eventually consistent? How does a change to a cached value, or the addition of a new value on one MemDS node, get replicated to the other nodes?

Vig: Think of MemDS like an L2 cache, and that cache is updated not as a write-through cache, but as a write-around cache. Writes on the metadata fleet are very low throughput; partitions don't change often. Whenever a write happens on the partition table in the metadata store, those writes flow through streams and are consumed by a central system, which pushes the updated values to all the MemDS nodes. This whole process happens within milliseconds. As a change comes in, it gets replicated to all the boxes. The system in the center is responsible for making sure all the MemDS nodes get the latest values; until a change is acknowledged by all the nodes, that system doesn't move forward.

Anand: Do you employ partitioning and master-slave models there, or are there agreement protocols? Does every MemDS node have all of the data?

Vig: MemDS has all the metadata. It's basically vertically scaled, because the partition metadata, as I said, is not that big; it's tiny. The main thing is the throughput that we have to support, which is very high. That's why we just keep adding more read replicas. It's not a leader-follower configuration; every node is eventually consistent. It's a cache. The nodes are scaled specifically for the reads that have to be served whenever customers are sending requests, or whenever any other system wants to access the partition metadata.

Anand: So there's some coordinator in the MemDS system; when a new write request is sent, that coordinator is responsible for ensuring that the entire fleet gets that data.

Vig: Exactly.

Anand: It's the coordinator's job; it's not a peer-to-peer, gossip-type protocol. That coordinator is responsible for durable state and consistent state. Each node, once they get it, they're durable, but that's just for consistent state transfers.

Vig: Yes. Durable in the sense that it's a cache, so when it crashes, it will just ask another node in the system to get up to speed and then start serving requests.

Anand: That coordinator, how do you make that reliable?

Vig: All these writes are idempotent. You could have any number of these guys running in the system at any time, and writing to the destination. Since the writes are idempotent, they always monotonically increase. It never goes back. If a partition has changed from P1 to P2, if you write again, it will say, "I have the latest information. I don't need this anymore." We don't need a coordinator. We try to avoid that.
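
As a small sketch of what such idempotent, monotonically versioned updates can look like: a replica applies a metadata record only if it is newer than what it already holds, so any number of writers can push the same update safely. The field names below are invented for illustration, not the actual MemDS schema.

    def apply_update(store: dict, partition_id: str, version: int, value: dict) -> bool:
        current = store.get(partition_id)
        if current is not None and current["version"] >= version:
            return False          # "I already have the latest information"
        store[partition_id] = {"version": version, "value": value}
        return True

    replica = {}
    print(apply_update(replica, "P1", 1, {"nodes": ["a", "b", "c"]}))  # True
    print(apply_update(replica, "P1", 2, {"nodes": ["a", "b", "d"]}))  # True: newer version
    print(apply_update(replica, "P1", 1, {"nodes": ["a", "b", "c"]}))  # False: stale, ignored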

Anand: You don't have a coordinator service that's writing to these nodes?

Vig: We don't have the leader and follower configuration there.

Anand: You do have a coordinator that all requests go to, and that keeps track of the monotonicity of the writes, like a write ahead log. It's got some distributed, reliable write ahead log. It's just somehow sending it.

When the partition grows and splits, how do you ensure the metadata of MemDS cache layers get updated consistently?

Vig: It's eventually consistent. The workflows that actually do partition splits and all these things wait until MemDS is updated before flipping the information there. The other thing is, even if the data is not there in MemDS, storage nodes also have a protocol to respond back to the request router saying, "I am updated. I don't host this partition anymore, but this is the hint that I have, so maybe you can go talk to that node." For the edge cases where this information might get delayed, we still have mechanisms built into the protocol to update the coordinator.

Anand: Each partition also has some information of who else has this data?

Vig: MemDS node, yes.

Anand: Also, it's a little easier because, unlike DynamoDB itself, this can be eventually consistent. In DynamoDB, people care about consistency over availability and other things. Essentially, we're saying that when you split, it doesn't have to be replicated everywhere immediately. It's done over time.

 


 

Recorded at:

Jun 16, 2023
