Transcript
Danilov: We'll talk about AWS Lambda, how it's built, how it works, and why it's so cool. My name is Mike Danilov. I'm a Senior Principal Engineer at AWS Serverless. A decade ago, I joined the EC2 networking team, and it was a fantastic ride. Then, five years back, I heard about Lambda. I really liked the simplicity of the idea: we run your code in the cloud, no servers needed, so I joined the team. I'm excited to share some of the great stuff I've learned along the way.
Outline
First, I will do an introduction to Lambda itself, covering the key concepts about the service and its fundamentals. That will help us in the deep dive. Then we will talk about the invoke routing layer. It is a super important part of the system, which connects all the microservices and makes the system work together. Then, we'll talk about the compute infrastructure. This is where your code runs. This is the serverless in serverless. Alongside our conversation, I will tell you a story about cold starts.
AWS Lambda Overview
What is Lambda? Lambda is a serverless compute system. It allows you to execute your code on-demand. You don't need to own, provision, or manage servers. Lambda has a set of built-in languages, so you provide only your code and Lambda will execute it. Lambda scales fast in response to your traffic demand. Lambda launched many years ago, and today, millions of users use it monthly. Together they generate trillions of invokes every month. It's a big volume. Let's see how it works.

These are examples of very simple Hello World functions. You only write your code; you can pass your data in an event, and the context will provide you some runtime information, like the request ID and the remaining execution time. Configuring Lambda is very simple: you only pick the preferred memory size, and all the rest of the resources, including CPU, will be allocated proportionally.

Lambda supports two invocation models. First, the synchronous invoke. In that case, you send a request, it gets routed to your execution environment, your code gets executed, and you receive the response synchronously on the same timeline. In an asynchronous invoke, your request will be queued first, and then it will be executed on a different timeline by poller systems. It's important to highlight that the execution of an asynchronous invoke is no different from a synchronous one. That matters because today, we'll only talk about synchronous invokes.

It's also very important to talk about Lambda's tenets, because they help us make technical tradeoffs, technical choices, and design decisions. First is availability: we want to ensure that our service is available, and every time you send us a request, you receive a reliable response. Efficiency is super important in on-demand systems, because they need to be able to provide resources very quickly and release them very quickly as well to avoid waste. Scale: as I said, Lambda scales up very fast in response to your demand, and it also scales down very fast, again to reduce waste. Security is job number one at AWS, and Lambda is no exception: we want to provide you a safe and secure execution environment where you can run and trust your code. Finally, performance: we're building a compute system, which means we want to add very little overhead on top of your application's business logic. We want our system to be invisible.
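As a rough illustration of the Hello World example above, here is a minimal Java handler sketch. The class name and event shape are illustrative; it uses the aws-lambda-java-core Context to read the request ID and remaining execution time mentioned earlier.

```java
// build dependency (assumed): com.amazonaws:aws-lambda-java-core
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.util.Map;

// Illustrative Hello World handler: the event carries your data, the context
// exposes runtime information such as the request ID and remaining time.
public class HelloWorldHandler implements RequestHandler<Map<String, Object>, String> {

    @Override
    public String handleRequest(Map<String, Object> event, Context context) {
        String name = String.valueOf(event.getOrDefault("name", "World"));
        System.out.println("request id: " + context.getAwsRequestId()
                + ", remaining ms: " + context.getRemainingTimeInMillis());
        return "Hello, " + name + "!";
    }
}
```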
Invoke Request Routing
This is a very high-level overview of Lambda's components. We will only talk about invoke routing and compute infrastructure. Let's start with invoke request routing. This is the layer which connects all the microservices together. It provides availability, scale, and access to the execution environments. Let's start very simple; let's build it ourselves. We have Alice, and she asked us for help to run her code in the cloud. First, we need to add a configuration service of some kind, to store her code and associated config. Then, we need to introduce a frontend. The frontend will take the invoke request, validate it, authorize it, and load the function configuration stored in the database. Now we need a worker, the execution environment, or sandbox, where Alice's actual code will run. Sounds simple, but there is a problem. Remember, we're building an on-demand compute system, which means at the time of the invoke there might not be an available worker or execution environment. The frontend doesn't know where to run the code, where to send the request. Let's add a new system; we'll call it placement. Placement will be responsible for creating execution environments, or sandboxes, on-demand. Now the frontend first needs to call placement to request a sandbox. Only after that can it send the invoke request to the created execution environment. Now it's functionally complete, but there is a problem. We added on-demand initialization in front of every invoke request. This initialization takes many steps: you need to create the execution environment itself, download the customer code, start up the runtime, and initialize the function itself. It all can take several seconds, which wouldn't be a great customer experience.
This is a latency distribution graph. It shows how often we see a particular invoke duration over a period of time. The green part of the invoke latency is our good work. This is what our code does; it's our business logic. It does some useful work and it takes some CPU time, and that's ok. We want our overhead to be very minimal, so we want to be nearby in terms of observed latency. What we see is that we are here instead. This is because of all these initialization steps we're taking in front of every invoke. It's not a great customer experience, so let's fix it. Let's introduce a new system; we'll call it worker manager. Worker manager will be responsible for helping us reuse previously created sandboxes, and use them instead of creating a new one every time. Now, we have two modes. In the first one, the frontend asks worker manager for a sandbox, worker manager returns one, and the frontend sends the invoke without significant interruption; we call these warm invokes. Still, there are some cases where worker manager won't have a sandbox, so it has to call placement again to create a new one. This is our initialization path, the slow path. This is how it looks on the distribution graph. We now see two paths. We see that the majority of invokes have now moved to the right spot. They add very little overhead. They're fast. We call them warm invokes. There is a smaller number of cold invokes still left. By the end of this talk we will completely eliminate cold starts, but there is a lot that needs to be done.
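As a rough sketch of the warm versus cold decision described above, the frontend's logic could look something like this. The interface and class names here are hypothetical, not Lambda's actual APIs.

```java
// Hypothetical interfaces; names are illustrative, not Lambda's actual APIs.
interface WorkerManagerClient {
    // Returns an idle, already-initialized sandbox for the function, or null if none exists.
    Sandbox reserveSandbox(String functionArn);
}

interface PlacementClient {
    // Creates a brand-new execution environment: create the VM, download code,
    // start the runtime, run function init -- the slow, cold path.
    Sandbox createSandbox(String functionArn);
}

record Sandbox(String id, String workerHost) {}

record InvokeResult(String sandboxId, int payloadBytes) {}

class Frontend {
    private final WorkerManagerClient workerManager;
    private final PlacementClient placement;

    Frontend(WorkerManagerClient workerManager, PlacementClient placement) {
        this.workerManager = workerManager;
        this.placement = placement;
    }

    InvokeResult invoke(String functionArn, byte[] payload) {
        Sandbox sandbox = workerManager.reserveSandbox(functionArn);
        if (sandbox == null) {
            // Cold path: no warm sandbox available, ask placement to build one.
            sandbox = placement.createSandbox(functionArn);
        }
        // Warm or cold, the invoke itself is the same from here on.
        return sendInvoke(sandbox, payload);
    }

    private InvokeResult sendInvoke(Sandbox sandbox, byte[] payload) {
        // Transport to the worker is out of scope for this sketch.
        return new InvokeResult(sandbox.id(), payload.length);
    }
}
```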
This is a more realistic representation of Lambda synchronous invoke routing in production. We added multiple availability zones for resiliency. We added a front load balancer to distribute the load. We replaced worker manager with the assignment service; I'll explain why. Everything else looks very similar to what we just built for Alice. Let's deep dive into the worker manager. Worker manager was the original system which was built when Lambda started. In order to track sandboxes, it keeps an in-memory database of all the created sandboxes and their state. A sandbox can be in the busy state, meaning it's serving an invoke, or it can be in the idle state, meaning it's doing nothing. When a request from the frontend comes in, worker manager will pick the first available idle sandbox in the region. It will also mark it as busy, and worker manager will track the progress of the invoke to mark the sandbox as idle at the end of the invoke. This is how worker manager was deployed in a production region. Again, we have multiple zones, and worker manager is deployed in every zone. Because Lambda is a regional service, worker manager was tracking sandboxes in every other AZ, availability zone. The system worked reasonably well, it was available and provided scale, until it didn't. Remember, every worker manager holds sandbox state in memory, which is not durable storage and is not replicated. If for whatever reason a worker manager host goes down, we lose all the sandboxes which were created by that instance. Let's look at the production region. In the case of an availability zone event, we generally expect that, ok, we will lose one-third of the capacity, which resides in that availability zone, and the rest of the availability zones will scale to compensate. Because worker manager tracks sandboxes in multiple availability zones, the real impact will be worse. We will lose capacity in healthy AZs as well, because it was tracked by unavailable worker managers. The compound impact can be quite large, and that's not acceptable. In addition to that, worker managers run in isolation. All the remaining worker managers will start calling the placement service to compensate for the lost capacity, adding additional load and pressure on the system in the middle of an availability event, making it harder to recover.
This is why a year ago, we replaced worker manager with a new service called the assignment service. Functionally, it looks exactly the same. It runs in every availability zone and tracks sandboxes, but there is a difference. We added reliable, distributed, and durable storage; we call it the journal log. This storage tracks sandbox states across the entire region. It's replicated across multiple availability zones and multiple nodes. Now, we were able to organize the assignment service in partitions, usually about three nodes each. They have one leader and two followers. They coordinate their work based on the distributed state. Only the leader is allowed to write to the log, and only the leader is allowed to call placement to create new sandboxes. Followers are read-only replicas, so they just follow and read the journal log. In case of leader failure, a follower can easily take over and become the new leader. No sandboxes will be lost, and it will be able to continue where the previous leader stopped. Thanks to the distributed state, we can easily introduce a new node, and it will rebuild the state from the log and continue to serve. This change helped us significantly increase the availability of the system. It became fault tolerant to single host failures and to availability zone events. We replaced in-memory state with a distributed, consistent, regional sandbox state. We applied a leader-follower architecture for quick failovers. As an additional benefit, the new service was written in Rust. Rust is an awesome language; it helped us increase the efficiency and performance of every host, improving processing volume and reducing overhead latency. This is the end of chapter one about invoke routing. We learnt about cold and warm invokes. We learnt how a consistent state helped us improve availability.
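A highly simplified sketch of the leader-follower idea over a journal log might look like the following. This is illustrative only; in the real service the journal is durable and replicated across availability zones, and leader election is coordinated rather than a local flag.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: a journal of sandbox state transitions shared by a partition.
enum SandboxState { IDLE, BUSY }

record JournalEntry(long sequence, String sandboxId, SandboxState state) {}

class Journal {
    private final List<JournalEntry> entries = new ArrayList<>();

    synchronized long append(String sandboxId, SandboxState state) {
        long seq = entries.size();
        entries.add(new JournalEntry(seq, sandboxId, state));
        return seq;
    }

    synchronized List<JournalEntry> readFrom(long sequence) {
        return new ArrayList<>(entries.subList((int) sequence, entries.size()));
    }
}

class AssignmentNode {
    private final Journal journal;
    private final Map<String, SandboxState> sandboxes = new HashMap<>();
    private long applied = 0;
    private boolean leader;

    AssignmentNode(Journal journal, boolean leader) {
        this.journal = journal;
        this.leader = leader;
    }

    // Only the leader writes: it records every busy/idle transition in the journal.
    void recordTransition(String sandboxId, SandboxState state) {
        if (!leader) throw new IllegalStateException("only the leader may write");
        journal.append(sandboxId, state);
        catchUp();
    }

    // Followers (and the leader itself) rebuild state by replaying the journal.
    void catchUp() {
        for (JournalEntry e : journal.readFrom(applied)) {
            sandboxes.put(e.sandboxId(), e.state());
            applied = e.sequence() + 1;
        }
    }

    // On leader failure a follower promotes itself; no sandboxes are lost,
    // because the full state is already in the journal.
    void promoteToLeader() {
        catchUp();
        leader = true;
    }
}
```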
Compute Fabric
Now it's time to switch to the compute fabric. The compute fabric owns all the infrastructure required to actually run your code. This is the overview. First of all, we have the worker fleet. A worker is an EC2 instance where we create execution environments and run your code. We have a capacity manager, which is responsible for maintaining an optimal worker fleet size. It can scale up and scale down depending on demand. Capacity manager is also responsible for worker health. It can proactively detect workers in an unhealthy state and replace them with healthy ones to minimize traffic disruption. You remember placement from invoke routing. Placement is responsible for creating execution environments. Before it can actually create one, it needs to pick the right worker, a worker in an optimal state to take on new work. Our data science team helps both services, placement and capacity manager, to make smart decisions. The data science team takes a lot of real-time production signals, builds models, and shares them with capacity manager and placement. Both services can forecast and make smart decisions. It all works really well together.
Let's quickly look again at routing. Remember, we helped Alice to run her code in the cloud. Now Bob has asked us to do the same. Question: can we run Bob's code on the same worker as Alice's? The answer is maybe. Let's deep dive into data isolation. The most straightforward thing would be to load Bob's code into the same runtime process as Alice's, but it would be a disaster, because now both functions can write to the same memory and interfere with each other. We want to separate them. The operating system provides us a good logical separation boundary, the process, so we can launch multiple runtime processes, one for each function. In that case, they will be logically isolated. We can even put them in containers for additional isolation, but it's not sufficient, because we run arbitrary code in a multi-tenant system. We don't know what this code is doing. There are plenty of security threats which can escape across this boundary. This is why in AWS Lambda, we believe that only virtual machine isolation provides sufficient guarantees to run arbitrary code in a multi-tenant compute system.
When Lambda launched, we were using EC2 VMs as the isolation boundary, which means single-tenant workers. In this world, no, you cannot put Bob's code on the same worker as Alice's. We would create an execution environment on such a worker and then run the code. It works fine logically and provides good latency, but there were problems. First is waste. Even the smallest EC2 instance is much larger than the smallest Lambda function. That means a lot of unused EC2 resources. We can fix it by creating multiple instances of the same function on the same worker; because they belong to the same account, we're not worried about data isolation. This helped us with resource utilization, but there is a problem. Multiple instances of the same function, of the same code, will very likely access system resources at exactly the same time. They will do compute work, or start doing networking or block device IO, at the same time, overloading the instance. It's very hard to control such an instance and provide consistent execution performance to our customers. Additionally, because we now give every account, every customer, plenty of workers, the worker fleet size became just too large to efficiently manage. Networking IO became very tricky because it hits standard operating system limits. Deployments and operational support all became very tricky.
This is why we came to the EC2 team and asked how they could help us. We worked together, and this is how Firecracker was born. Firecracker is a fast virtualization technology. It allows you to create secure and fast microVMs. It was specifically designed to meet serverless compute needs. This is how Lambda adopts Firecracker. Now, on every worker, we wrap every sandbox, every execution environment, into a tiny virtual machine. This machine is completely independent. It has its own Linux running, its own kernel. It shares nothing with the other processes. It's secure. In this model, we can safely put multiple different accounts on the same worker. In this model, we can actually run Bob's code on the same worker as Alice's, because Firecracker virtualization provides strong data isolation. This is a comparison of the size of the EC2 metal instances which we use as the substrate for the Firecracker VMs. They are really huge. They have plenty of memory. This allows us to multiplex and place thousands of functions from different customers on the same worker, and still provide consistent performance, because now they are all different. It's very unlikely that they will request CPU at exactly the same time. More likely, demand will actually be distributed and decorrelated. Placement, using the data science models, can pick, combine, and group together the most optimal workloads.
Let's summarize the benefits of Firecracker. It provides strong isolation boundaries. It's very fast and adds very little system overhead. It enables us to decorrelate demand for resources and better control the heat of our worker fleet. Let's take a look again at our latency distribution diagram. By switching from EC2 to Firecracker, we were able to significantly reduce the cost of creating a new execution environment. We still have the middle part. We don't control it: it's the size of the code, the startup time of the runtime, the initialization time of the function. How are we going to fix it? Since we switched to Firecracker VMs, we can probably use VM snapshots. This is the idea. What if, at the end of the initialization stage, instead of serving the invoke, we actually create a snapshot of the virtual machine? Then, the next time we need to create a new VM to serve an invoke, instead of initialization we resume it from the snapshot. If resume is fast enough, the total overhead won't be noticeably different from a warm invoke. Let's try to build it. Let's start with a rough design sketch. First, we need a worker where we will launch the VM and initialize the function, run all the initialization steps. At the end we'll create a snapshot. We need to upload it to some distributed storage, let's say S3, and then a different worker would download the snapshot to create a local copy. Now we can create multiple VMs very rapidly. This is the definition of a snapshot. It consists of multiple files on disk. First is the memory file, which is the whole state of the virtual machine accumulated during the initialization phase. Then there is a reference to the source read-only image, which contains the operating system, the runtime, and your code. There is also a read-write overlay, which has all the data that was written to disk during the initialization phase. Then there is a smaller internal file which tracks the VM state itself.
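The snapshot layout just described can be summarized as a small manifest. Field names here are illustrative; they are not Firecracker's actual file names.

```java
import java.nio.file.Path;

// Illustrative description of the files that make up a VM snapshot.
record VmSnapshot(
        Path memoryFile,        // full guest memory accumulated during initialization
        Path sourceImage,       // reference to the read-only image: OS, runtime, function code
        Path readWriteOverlay,  // everything written to disk during initialization
        Path vmStateFile        // small internal file tracking the VM/device state itself
) {}
```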
Let's summarize the tasks to build such a system. We need to find a way to distribute snapshots between workers. We need to be able to resume multiple VMs from the same snapshot file to avoid multiple downloads. It needs to be super fast. We will start with resuming VMs. Firecracker resumes a VM from a snapshot by memory-mapping the snapshot memory file and then attaching disks via the virtio interface. The disk part is very standard; it's no different from an on-demand launched VM. Let's deep dive into the memory map. We need to understand how it works to decide if it's safe to use in our Lambda environment, in our multi-tenant compute system. Memory mapping allows you to literally map the contents of a file on disk into the virtual private address space of a running process. When the process accesses the memory, it doesn't know about the file; it thinks it works with memory. The operating system does all the magic. When the process writes to such memory, the operating system copies the original pages into the process's private memory, which then holds the latest, overwritten pages. This mechanism has worked well for decades, and it definitely supports multiple processes mapping the same file into their address spaces.
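For reference, here is a minimal Java sketch of memory-mapping a file with the private, copy-on-write semantics described above; the file path is just a placeholder.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapSnapshotMemory {
    public static void main(String[] args) throws IOException {
        Path memoryFile = Path.of(args[0]); // path to a snapshot memory file

        try (FileChannel channel = FileChannel.open(memoryFile,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // PRIVATE = copy-on-write: reads are served from the shared page cache,
            // but any page this process writes is copied into its private memory
            // and never written back to the file.
            MappedByteBuffer memory = channel.map(
                    FileChannel.MapMode.PRIVATE, 0, channel.size());

            byte first = memory.get(0);        // read: comes from the page cache
            memory.put(0, (byte) (first + 1)); // write: triggers a private copy of the page
        }
    }
}
```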
Let's dig a little deeper. Access to the file on disk is slow, slower than access to memory. This is why the operating system adds a page cache. It's an in-memory area which holds the content of the original file. When multiple processes map the same file and try to read it, they actually read from this page cache; they're not going to disk. This page cache means that multiple processes reference the exact same physical memory pages, which is what opens up threats in a multi-tenant execution environment. Specifically, these two papers are really great examples of sophisticated side-channel attacks which leverage shared memory, so that one process can influence the CPU cache and, using timing, infer private data of the other process, for example, encryption keys. It means that we cannot use this proven technology in our multi-tenant execution environment. We need something better. We introduced our own indirection layer which enforces a strict copy anytime a process accesses the data; we call it copy-on-read. It guarantees that there is no shared memory, hence no security threats.
We're not done yet. When we resume multiple VMs from the exact same snapshot, they're identical. Every bit of these VMs is identical. They have the same random sequences, the same identifiers, the same timestamps, everything. If we allow them to talk to the internet, the remote host won't be able to distinguish them. It will be confused about who is talking to it. We need to fix this. This is an example of quite real code which uses a UUID as a sandbox ID to uniquely tag writes to a log. If I create multiple VMs with that code from the same snapshot, they will all have the exact same sandbox ID, and logging will be very confused. In order to fix that, we need to restore the uniqueness of every cloned VM after it resumes. It needs to be done at every layer. We need to restore randomness in the Linux kernel. We need to restore uniqueness in system libraries like OpenSSL, for example. The runtime and your own code all need to be fixed. It's a lot of work, and we started incrementally. Together with the Java community, we introduced a callback interface. You can use it in your code to subscribe to the resume notification and use it to restore the uniqueness of your code. We also prepared a patch for OpenSSL to reset the entropy pool in the event of a resume. We're working with the Linux community; we have a patch to reset Linux kernel randomness on resume. There is a lot of work to be done, but we're moving in this direction. Now we can safely resume multiple VMs from the same snapshot, and they will be unique.
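A minimal sketch of both the problem and the fix, based on the org.crac callback interface used for SnapStart's Java runtime hooks (this assumes the org.crac dependency is available; the class and field names are illustrative):

```java
// build dependency (assumed): the org.crac compatibility library
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

import java.util.UUID;

public class SandboxIdentity implements Resource {

    // Problem: this ID is generated once during initialization, so every VM
    // resumed from the same snapshot starts with the exact same value.
    private volatile UUID sandboxId = UUID.randomUUID();

    public SandboxIdentity() {
        // Subscribe to checkpoint/restore notifications.
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) {
        // Nothing to do before the snapshot is taken in this example.
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) {
        // Fix: restore uniqueness right after resume, before serving invokes.
        sandboxId = UUID.randomUUID();
    }

    public UUID sandboxId() {
        return sandboxId;
    }
}
```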
Snapshot Distribution
Let's talk about snapshot distribution. Remember, snapshots can be quite large, up to 30 gigabytes. An M5 metal EC2 instance has a network throughput of 25 gigabits per second, which means it will take at least 10 seconds to download the entire snapshot. That would actually be longer than running on-demand initialization. We need to do something better. When we start watching a video, we're not waiting for the entire video file to be downloaded to our browser. The browser gets a few initial bytes, starts playing, and the rest is downloaded in the background, and we do not notice it. Let's use this as an example and build a similar system. We have one large snapshot; we'll split it into multiple chunks, using a 512-kilobyte chunk size. Now, we only need to download a minimal set of chunks to resume a VM and start the invoke. The rest will be downloaded on demand. This has two benefits. It allows us to amortize the download time over the invoke duration. Also, we'll be downloading only the minimal needed working set. In most cases, the data in the snapshot file is for reference only; it's not actually used during the invoke. A lot of data was written once during initialization time and never accessed again. This on-demand loading helps us to only extract the working set needed for the invoke.
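As a rough illustration of the chunking math described above (class names and the working-set example are illustrative):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative chunk math for a snapshot split into fixed-size chunks.
public final class ChunkMath {
    static final long CHUNK_SIZE = 512 * 1024; // 512 KB

    // Which chunk holds a given byte offset of the snapshot file.
    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    // The minimal set of chunks needed to cover the given byte ranges
    // (e.g. the pages required to resume the VM); everything else can be
    // downloaded on demand while the invoke is already running.
    static Set<Long> requiredChunks(long[][] byteRanges) {
        Set<Long> chunks = new LinkedHashSet<>();
        for (long[] range : byteRanges) {
            for (long c = chunkIndex(range[0]); c <= chunkIndex(range[1]); c++) {
                chunks.add(c);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        long[][] resumeWorkingSet = { {0, 4095}, {1_048_576, 1_052_671} };
        System.out.println("chunks to prefetch: " + requiredChunks(resumeWorkingSet));
    }
}
```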
This is how it will work. Remember, we have VM memory which is mapped to the snapshot file. When the process accesses memory and the page is there, it just immediately gets the data. If the page is not in memory, it falls back to the snapshot file. If the snapshot file has the chunk, the data will be read from it. If the snapshot file doesn't have the chunk yet, it will make a call to our indirection layer and request the chunk. The indirection layer will first look in the local cache on the worker to see if such a chunk is available. If not, it will ask a distributed cache, and if the chunk is not there either, it will go to the regional origin, S3 in our case. This is how on-demand chunk loading works. The efficiency of on-demand chunk loading hugely depends on the cache hit ratio. The local cache is super fast, the distributed cache is fast as well, and the origin is slow. For a consistent customer experience and stable performance, we want to ensure that we hit the caches all the time, or most of the time, and minimize calls to the region.
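A simplified sketch of that tiered lookup, with hypothetical interfaces standing in for the local cache, the distributed cache, and the S3 origin:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the tiered chunk lookup: local cache, then the
// distributed cache, then the regional origin (S3). Names are illustrative.
interface DistributedCache {
    Optional<byte[]> get(String chunkKey);
    void put(String chunkKey, byte[] data);
}

interface OriginStore {
    byte[] fetch(String chunkKey); // slow regional read, e.g. from S3
}

class ChunkLoader {
    private final Map<String, byte[]> localCache = new ConcurrentHashMap<>();
    private final DistributedCache distributedCache;
    private final OriginStore origin;

    ChunkLoader(DistributedCache distributedCache, OriginStore origin) {
        this.distributedCache = distributedCache;
        this.origin = origin;
    }

    byte[] load(String chunkKey) {
        // 1. Local cache on the worker: fastest.
        byte[] chunk = localCache.get(chunkKey);
        if (chunk != null) return chunk;

        // 2. Distributed cache: still fast.
        Optional<byte[]> cached = distributedCache.get(chunkKey);
        if (cached.isPresent()) {
            localCache.put(chunkKey, cached.get());
            return cached.get();
        }

        // 3. Origin (S3): slow; populate both caches on the way back.
        chunk = origin.fetch(chunkKey);
        distributedCache.put(chunkKey, chunk);
        localCache.put(chunkKey, chunk);
        return chunk;
    }
}
```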
In order to maximize the cache hit ratio, it would be good to share common chunks. For example, multiple Java functions all have Java. It's very likely the chunks holding the Java bits will be common, and we should be able to dedupe them. Or multiple container images using the same Linux distribution: we should be able to dedupe those Linux chunks and share them between multiple functions instead of copying them. We need to be careful, because we don't want to share the function data itself. It's private. It's confidential. We can leverage the fact that we know how we created these chunks, because it was our system which loads the operating system first, then the runtime, and then the function code. We can create multiple layered, incremental snapshots. The first would be the base operating system snapshot, which has no customer data; it's just Linux booted. Then we can use this snapshot to resume a VM and, let's say, load Java, or resume a different VM and load Node.js, but they will still have common operating system chunks. Then we can create the next incremental snapshot, which we'll call the runtime snapshot. Now we can create multiple Java functions from the same Java runtime snapshot. They will have unique function data but also common operating system and Java runtime chunks. As a result, we can go through the entire snapshot and tag chunks as operating system chunks, runtime chunks, and function chunks. For every layer, we will use a different encryption key, so layers are not shared with each other. For the function chunks, we will be using a customer managed key to make sure that only this particular customer has access to their chunks, and they're never shared. We can share the operating system and runtime chunks. This is super helpful.
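A rough sketch of the per-layer key selection idea; the types and key-store methods here are hypothetical, not Lambda's actual key management.

```java
// Illustrative sketch of per-layer key selection for snapshot chunks.
enum ChunkLayer { OPERATING_SYSTEM, RUNTIME, FUNCTION }

record ChunkTag(ChunkLayer layer, String runtimeId, String customerId) {}

interface KeyStore {
    byte[] sharedKeyFor(String scope);            // service-owned keys, shareable
    byte[] customerManagedKey(String customerId); // never shared across customers
}

class ChunkKeySelector {
    private final KeyStore keys;

    ChunkKeySelector(KeyStore keys) {
        this.keys = keys;
    }

    byte[] keyFor(ChunkTag tag) {
        return switch (tag.layer()) {
            // OS and runtime chunks contain no customer data, so they can be
            // encrypted with shared keys and deduplicated across functions.
            case OPERATING_SYSTEM -> keys.sharedKeyFor("os");
            case RUNTIME -> keys.sharedKeyFor("runtime:" + tag.runtimeId());
            // Function chunks are private: encrypted with the customer managed key.
            case FUNCTION -> keys.customerManagedKey(tag.customerId());
        };
    }
}
```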
Sometimes we don't know the origin of the file, for example, the read-only disk. We don't know what was written in the Java runtime distribution or in the Linux distribution. Still, intuitively, we think there are some common chunks there, some common bits. We came up with the convergent encryption idea. It means that if two chunks from two different files have the same content in plaintext, they will be equal in the encrypted form as well. In convergent encryption, the key is derived from the content itself. Now we can take different Linux distributions, chunk them and encrypt them, then compare the encrypted chunks, and still be able to deduplicate common bits. It allows us to greatly reuse chunks across container images. All together this increases cache locality, the number of hits to the local and distributed caches, and reduces the latency overhead. This is an overview of the system in production. The only big difference is that the indirection layer is replaced with a sparse file system, which actually does all this magic of requesting chunks and providing them on-demand at the file system level.
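A minimal sketch of the convergent encryption idea: derive the key (and, here, also the IV) from the chunk content, so identical plaintext chunks produce identical ciphertexts and can be deduplicated without comparing plaintext. This illustrates the concept only, not Lambda's actual scheme.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class ConvergentEncryptionSketch {

    // Derive the AES key and the IV deterministically from the chunk content,
    // so equal plaintext chunks always encrypt to equal ciphertexts. The
    // deterministic IV is intentional here: it is what makes dedup possible.
    static byte[] encryptChunk(byte[] chunk) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        byte[] key = sha256.digest(chunk);
        byte[] iv = Arrays.copyOf(sha256.digest(key), 12); // content-derived IV

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE,
                new SecretKeySpec(key, "AES"),
                new GCMParameterSpec(128, iv));
        return cipher.doFinal(chunk);
    }

    public static void main(String[] args) throws Exception {
        byte[] a = "identical Linux chunk".getBytes(StandardCharsets.UTF_8);
        byte[] b = "identical Linux chunk".getBytes(StandardCharsets.UTF_8);

        // Same plaintext content -> same ciphertext -> the chunk can be
        // deduplicated in its encrypted form.
        System.out.println(Arrays.equals(encryptChunk(a), encryptChunk(b)));
    }
}
```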
Have We Solved Cold Starts?
Now we're able to distribute snapshots, and we're able to resume VMs, so everything should be good. Unfortunately, we still see some delay in some number of cold invokes. They're very close to the target place, but still, we see them. Let's understand what's going on here. We need to go back to page caches and memory mapping. Because a file on disk is slower to read, the operating system makes another optimization. When a process reads a single page into memory, the operating system will actually read multiple pages. It's called read-ahead. The expectation is that for normal files the reads will be sequential. That means we read a bunch of pages from the file once, and then we just use memory because it's already there, very fast and very efficient. Remember, though, we mapped memory instead of a regular file. Memory access is random. In that case, we won't hit the pages we already loaded; we will hit a new page every time. As a result, in this example, we randomly request four pages, but it forces the operating system to actually download the entire snapshot file. It's super inefficient. It takes time to download the file, decrypt it, and checksum it. This is what we see on this latency distribution graph. We need to fix it. This is the memory access pattern of 100 VMs, laid one on top of another. You can see that most of the page accesses are quite consistent, in exactly the same spots. We can record this access pattern into a file, a page access log, and attach it to every snapshot. Now when we resume from a snapshot, we know exactly which pages will be needed, and in which order. It becomes fast. Now we have solved the problem of cold starts. The best part: you can try it today yourself. Simply enable Lambda SnapStart on your Java function and see how VM snapshotting works.
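A small sketch of the page access log idea: record the page touch order once, attach it to the snapshot, and replay it as a prefetch list on resume instead of relying on sequential read-ahead. Names and the file format are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Illustrative sketch of a page access log attached to a snapshot.
public class PageAccessLog {
    private final LinkedHashSet<Long> pagesInOrder = new LinkedHashSet<>();

    // Recording side: note each page the first time the resumed VM touches it.
    public void recordAccess(long pageNumber) {
        pagesInOrder.add(pageNumber);
    }

    public void save(Path logFile) throws IOException {
        List<String> lines = new ArrayList<>();
        for (long page : pagesInOrder) {
            lines.add(Long.toString(page));
        }
        Files.write(logFile, lines);
    }

    // Resume side: prefetch exactly these pages, in this order, instead of
    // letting read-ahead pull in the whole snapshot file.
    public static List<Long> loadPrefetchList(Path logFile) throws IOException {
        List<Long> pages = new ArrayList<>();
        for (String line : Files.readAllLines(logFile)) {
            pages.add(Long.parseLong(line));
        }
        return pages;
    }
}
```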
Summary
We talked about the invoke routing layer in Lambda. We were able to improve the availability of our system and maintain its scale. We talked about the compute infrastructure. We introduced Firecracker to greatly increase efficiency while preserving a strong security posture. We also substantially increased the performance of our system and completely eliminated cold starts. All of this work serves a simple idea: we run your code in the cloud, no servers required. This is why we say Lambda is a compression algorithm for experience. I collected additional materials at this link: https://s12d.com/qcon.
Avoiding Hash Collisions in Convergent Encryption
Danilov: How do we avoid hash collisions in the convergent encryption?
We rely on cryptographic hash algorithms to provide us reliable fingerprints of the content; we essentially rely on the strength of the hash algorithm to solve it.
How Function Config Is Loaded
Danilov: How do we actually load function configuration because making an extra hop is inefficient?
It's absolutely true, that would be inefficient. In fact, the frontend maintains a coherent cache of the configuration database nearby; it's a distributed cache in multiple availability zones, so it basically becomes a fast cache lookup.
Using the Same Tech to Boot EC2 Instances Faster
Danilov: Can we use the same technology to boot EC2 instances faster?
Logically, yes, but EC2 instances are more complicated in terms of hardware, and the virtualization layer is more complicated than Firecracker. So yes, logically it's possible; it just requires more work.
How Services Communicate with Each Other
Participant: How do different services, like placement service, availability service talk to each other, like synchronous HTTP calls or asynchronous, how do they communicate with each other?
Danilov: It's basically a mixture. In some cases, it's synchronous request-response communication. In some cases, we leverage gRPC and HTTP/2 streams to stream data instead of a single request-response. It's really a mixture depending on the requirements of a particular communication; we find the one which fits best.
Why Metal Instances Are Used
Danilov: Why do we use metal instances?
Firecracker actually requires either metal instances or nested virtualization. Unfortunately, today, nested virtualization doesn't meet our requirements on overhead and latency. That's why we use only metal instances; they basically fit the need. Again, logically, you could use nested virtualization, but it would be much slower, 10x to 20x slower on every system call.
Handling Lambda Function Updates when Snapshot Is Being Executed
Danilov: How do we handle Lambda function updates while a snapshot is being executed?
Basically, on creation of the function or a function update, we kick off an asynchronous background process to create a snapshot. While it's not finished, you will be using the previous function version. Once the snapshot is finished, we switch to the latest.
Handling Efficiency in the Engineering Process
Danilov: How do we handle efficiency in the engineering process?
We have three big pillars: security, efficiency, and latency. As engineers, you're always in tension. You want to be in the middle of this spot, because they all pull in different directions. It's really a balancing act; there is no magic, no silver bullet here. We always try to do the balancing act, and the only thing you cannot sacrifice is security. Sometimes you can sacrifice efficiency temporarily, for example, apply a security patch and then restore the efficiency afterwards. You cannot do this with security, so security always comes first. It's a balancing act, basically, and every engineer is doing this balancing act and trying to preserve all three.