During the first day of QCon San Francisco 2023, Mike Danilov, a senior principal engineer at AWS, presented a talk on what is under the hood of AWS Lambda. The talk was part of the "Architectures You’ve Always Wondered About" track.
Danilov's talk centered on three topics: invoke routing, compute infrastructure, and cold starts. Before diving into these topics, he introduced AWS Lambda with examples, covering its configuration, its synchronous and asynchronous invocation models, and its fundamentals: availability, efficiency, scale, security, and performance.
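For readers unfamiliar with the two invocation models, the boto3 sketch below illustrates the difference; the function name and payload are hypothetical, and running it requires AWS credentials and an existing function.

```python
import json
import boto3

client = boto3.client("lambda")

# Synchronous invoke: the caller blocks until the function returns a result.
sync_response = client.invoke(
    FunctionName="hello-world",  # hypothetical function name
    InvocationType="RequestResponse",
    Payload=json.dumps({"name": "QCon"}).encode(),
)
print(sync_response["Payload"].read())

# Asynchronous invoke: Lambda queues the event and returns immediately.
async_response = client.invoke(
    FunctionName="hello-world",
    InvocationType="Event",
    Payload=json.dumps({"name": "QCon"}).encode(),
)
print(async_response["StatusCode"])  # 202 indicates the event was queued
```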
From a 10,000-foot view, AWS Lambda is composed of multiple components. In the talk, however, Danilov dove specifically into invocation and infrastructure to mitigate the cold-start problem. The goal was to shorten the invoke latency distribution of an AWS Lambda function, spending less time on invocation, runtime, code, and sandbox initialization.
Before 2022, AWS Lambda's invoke routing system faced challenges regarding scaling and availability. The worker manager, responsible for handling incoming requests, struggled to manage the growing demand, scaling until it simply couldn't handle the load anymore. Sandboxes, which stored the execution state in memory, posed a significant risk. In the event of a zonal failure, the sandbox state could be lost, undermining system resiliency. AWS introduced the placement service to address these issues, although recovery from such failures remained challenging.
Next, AWS introduced durable storage with a leader-follower model, ensuring no sandboxes were lost during failures. This enhancement significantly improved the overall reliability and durability of AWS Lambda's invoke routing system. Danilov concluded this part of the talk by saying that the introduction of the Assignment Service and the adoption of the Rust programming language brought stability and performance to AWS Lambda's invoke routing system, yet that only solved part of the cold-start issue.
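The essence of the leader-follower model can be captured in a few lines. The sketch below is purely illustrative (the class and record names are assumptions, not AWS's implementation): a write is acknowledged only once the follower holds a copy, so a single node failure cannot lose sandbox state.

```python
class Replica:
    """Holds a copy of the sandbox assignment state."""
    def __init__(self):
        self.log = {}  # sandbox_id -> assignment record

    def apply(self, sandbox_id, record):
        self.log[sandbox_id] = record


class Leader(Replica):
    """Accepts writes and replicates them before acknowledging."""
    def __init__(self, follower):
        super().__init__()
        self.follower = follower

    def write(self, sandbox_id, record):
        self.apply(sandbox_id, record)           # local write
        self.follower.apply(sandbox_id, record)  # replicate before ack
        return "ack"                             # now durable on two nodes


follower = Replica()
leader = Leader(follower)
leader.write("sb-1", {"worker": "w-42"})
# If the leader's availability zone fails, the follower is promoted,
# and no sandbox assignments are lost.
```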
Danilov continued by describing the infrastructure underlying AWS Lambda: the earlier-mentioned placement service, the capacity manager, workers, and a data science team. Optimizing the AWS Lambda infrastructure involves a strategic approach to worker selection, driven by insights from the data science team. This process entails scrutinizing metrics, constructing models, and forecasting to ensure the right worker is chosen for each task. Concurrently, routing strategies must be continually reevaluated to achieve efficient resource allocation.
A deep dive into data isolation reveals that AWS Lambda is designed to provide robust security and separation: each runtime operates within a separate process on its worker, with virtual machine isolation for added protection.
Historically, utilizing a single EC2 VM instance per tenant often resulted in wasted resources, while deploying multiple invocations on the same worker utilized resources more efficiently. However, the latter approach could lead to overload when numerous functions were invoked simultaneously, a challenge in an environment with a single tenant per worker and, consequently, an excess of workers.
The adoption of Firecracker technology within AWS Lambda has been instrumental in addressing these challenges. Firecracker introduces "micro VMs", paving the way for serverless computing. Its integration into Lambda enables safe execution while multiplexing functions from multiple users on the same worker. Yet this multiplexing introduces latency-distribution and resource-allocation considerations for the functions sharing a worker, highlighting the ongoing complexity of optimizing Lambda's infrastructure.
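To make the utilization argument concrete, here is a simplified first-fit placement sketch (all names and numbers are assumptions for illustration): because every function runs in its own micro VM, functions from different accounts can safely share a worker instead of each tenant needing a dedicated instance.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    capacity_mb: int
    microvms: list = field(default_factory=list)  # (account, function, mb)

    def free_mb(self) -> int:
        return self.capacity_mb - sum(mb for _, _, mb in self.microvms)

def place(workers, account, function, mb):
    """First-fit placement; tenancy does not matter thanks to VM isolation."""
    for worker in workers:
        if worker.free_mb() >= mb:
            worker.microvms.append((account, function, mb))
            return worker
    raise RuntimeError("no capacity left; scale out the fleet")

fleet = [Worker(capacity_mb=8192), Worker(capacity_mb=8192)]
place(fleet, "account-a", "thumbnailer", 512)
place(fleet, "account-b", "etl-job", 1024)  # a different tenant, same worker
```

A real placement service would, of course, weigh the latency and overload considerations above rather than simply pick the first fit.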
Finally, Danilov discussed an idea that led to eliminating the cold start (the invocation part, not code or runtime initialization).
AWS Lambda employs a sophisticated approach to handling snapshots, beginning with resuming virtual machines (VMs). These VMs rely on memory mapping, which reaches deep into the file disks and memory, particularly the page cache. However, due to security concerns around side-channel attacks, Lambda cannot use shared memory for techniques such as copy-on-read.
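The on-demand nature of memory mapping is easy to demonstrate. In the toy Python sketch below (the snapshot filename is hypothetical), mapping a file reads no data up front; a page is faulted in through the page cache only when it is first touched, which is what lets a resumed VM start before its whole memory image has loaded.

```python
import mmap

with open("snapshot.mem", "rb") as f:
    # Mapping is cheap: no data is read from disk yet.
    mem = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Touching the first page faults it in from disk via the page cache.
    first_page = mem[:4096]
```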
VM Clones play an essential role in Lambda's snapshot management. Each clone is designed to be uniquely identifiable, and the mechanism behind this uniqueness is integral to the ability to resume from the same snapshot safely. Snapshot distribution in Lambda operates like streaming video, where snapshots are downloaded on demand, optimizing resource utilization and efficiency.
Lambda employs on-demand chunk loading and chunk-sharing mechanisms to enhance this approach further. These components work together, using convergent encryption to deduplicate data, ultimately reducing the load on local disk caches. This combination of resume capabilities and snapshot distribution methods ensures AWS Lambda's robustness and efficiency in managing snapshots.
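Convergent encryption is the ingredient that makes chunk sharing safe across customers: the key is derived from the chunk's own content, so identical chunks encrypt to identical ciphertext and need to be stored and cached only once. The following sketch shows the general technique (the chunking, key derivation, and nonce choice here are assumptions; AWS has not published its exact scheme):

```python
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_chunk(chunk: bytes):
    key = hashlib.sha256(chunk).digest()       # content-derived key
    nonce = hashlib.sha256(key).digest()[:12]  # deterministic: same chunk, same output
    ciphertext = AESGCM(key).encrypt(nonce, chunk, None)
    chunk_id = hashlib.sha256(ciphertext).hexdigest()
    return chunk_id, ciphertext

store = {}  # stands in for the shared chunk cache
chunks = [b"runtime page" * 256, b"runtime page" * 256, b"user code" * 256]
for chunk in chunks:
    chunk_id, ciphertext = encrypt_chunk(chunk)
    store.setdefault(chunk_id, ciphertext)  # duplicate chunks collapse

print(len(store))  # 2: the two identical runtime chunks deduplicated
```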
Danilov concluded the talk by saying that invoke routing deals with availability and scale, the compute infrastructure under AWS Lambda handles efficiency and security, and the cold start is all about performance. Lastly, resources around this talk are available on the Serverless Land page.