Key takeaways
When building a new system on AWS, we face three architectural choices, each defined by an application packaging format, a runtime service, and a load balancing service:
We can build Amazon Machine Images (AMIs) and run them as virtual machines in the Elastic Compute Cloud (EC2) behind an Elastic Load Balancer (ELB).
We can build Docker Images and run them as containers on the EC2 Container Service (ECS) behind an Application Load Balancer (ALB).
We can build Node, Python or Java project zip files and run them as Lambda Functions behind an API Gateway.
EC2, ECS and Lambda represent three generations of AWS services rolled out over the past decade. Instances, Tasks and Functions are the primary building blocks in these three generations respectively.
While each architecture can lead to success, and real-world systems will use some of each, ECS Tasks are the best architecture to target right now. ECS Tasks offer cost savings, security, and speed over manually configured EC2 Instances, with none of the tough constraints of Lambda Functions.
Why Lambda Functions?
Lambda Functions are the newest AWS technology, and get a lot of buzz right now as “the future of cloud computing.” The properties of real-time function invocation, per-request billing and zero server management are impossible to beat.
The simplest way to use Lambda Functions is to add custom logic around events that happen inside the AWS platform. The canonical killer Lambda “app” is a single JavaScript function that is configured to be automatically called on every new S3 object, say to resize images.
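As a rough sketch of that pattern (not from the original article; the bucket suffix and the resize helper are hypothetical placeholders), a Python handler wired to S3 object-created events might look like this:

```python
# Hypothetical Lambda handler for S3 "object created" events.
import boto3

s3 = boto3.client("s3")

def resize_image(data):
    # Placeholder: real code would use an image library such as Pillow.
    return data

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Download the new object, process it, and write the result
        # to a (hypothetical) sibling bucket for thumbnails.
        obj = s3.get_object(Bucket=bucket, Key=key)
        thumbnail = resize_image(obj["Body"].read())
        s3.put_object(Bucket=bucket + "-thumbnails", Key=key, Body=thumbnail)
```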
This pattern of evented programming with Lambda enables extremely sophisticated applications. When coupled with the AWS Simple Queue Service (SQS) and/or DynamoDB, you can build robust, dynamic and cost-effective producer and consumer systems without any servers.
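As one hedged illustration (the queue URL and field handling are assumptions, not from the article), the consumer side can be a Lambda function subscribed to a DynamoDB Stream that fans work out to SQS, with no polling servers anywhere:

```python
# Hypothetical consumer: a Lambda function invoked by a DynamoDB Stream.
# New items in the table arrive as event records; we forward them to an
# SQS queue for further, equally serverless, processing.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # hypothetical

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        new_item = record["dynamodb"]["NewImage"]  # DynamoDB attribute-value map
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(new_item))
```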
In 2015, AWS launched an API Gateway service which turns incoming HTTPS requests into events that trigger Lambda functions. This enables Lambda to power Internet-facing APIs with zero servers.
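Assuming the Lambda proxy-style integration, the function receives the HTTP request as an event and returns a dictionary that API Gateway turns into the response; a minimal sketch (names are illustrative only):

```python
# Minimal sketch of a Lambda function behind API Gateway,
# assuming the proxy integration: the handler returns an
# HTTP-shaped dictionary describing the response.
import json

def handler(event, context):
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": "hello " + name}),
    }
```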
There is now an explosion of tooling and application development patterns for building systems around Lambda Functions.
Why Not Lambda Functions?
However, the Function building block comes with strict constraints that pose tough challenges.
Right now, every Lambda Function call must complete within 5 minutes and cannot use more than 1.5 GB of memory. Calls that violate these constraints are simply terminated.
Similarly, the API Gateway service only supports HTTPS connections, and not HTTP or HTTP/2.
These constraints make Lambda a non-starter for many of the traditional business systems out there built on Java, PHP, and Rails.
Lambda does represent the future of the cloud. AWS will continue working over the next decade to remove server management and to drastically lower our bills.
But right now, Lambda can’t fully replace the need for Instances due to its constraints. And ECS offers similar reductions in server management and cost without the tough tradeoffs.
Why Not EC2 Instances?
EC2 Instances are the oldest technology, now 10 years old, and are considered the tried-and-true form of cloud computing. The properties of API-based provisioning and management, plus hourly pricing, are what killed the traditional hosting model.
But managing apps on EC2 Instances leaves a lot to be desired. You need to pick a server operating system, use a configuration management tool like Chef to install dependencies for your app, and use an image tool like Packer to build an AMI.
A deploy that needs to build AMIs and boot VMs takes at least a few minutes. In the age of continuous delivery we want deploys that take a few seconds.
Why EC2 Instances?
That said, we can’t use Lambda Functions for everything, so we still need to utilize Instances somehow…
EC2 Instances are extremely flexible, so numerous strategies for faster deployment times exist. A common technique is to use a tool like Ansible to SSH into every instance, pull a new version of the code, then restart the app server. But now we’re using bespoke scripts to mutate instances, which adds more failure scenarios.
Another strategy is “blue-green deploys”. We can boot an entirely new set of EC2 Instances with the new version of the software (call this “blue”), migrate traffic over to it, then terminate the old set (“green”). This reduces failure scenarios, but doesn’t necessarily increase speed. It also requires windows of double capacity, which adds cost and may not be available during service outages.
A cutting-edge technique is to install an agent on every instance that will coordinate starting and stopping processes to do a rolling deploy. Big companies like Google and Twitter have proven this model of scheduling work across a cluster of generic instances. There are now a few open-source projects, like Docker Swarm and Apache Mesos, that “orchestrate” these fast deployments across a cluster.
Because of the speed, the fixed capacity needs, the success at Google, and the mature open-source solutions that make it viable in any datacenter (on-prem or in the cloud), container orchestration is a modern best practice.
Therefore, running container orchestration on EC2 Instances is the modern AWS best practice.
This explains why there is heated competition in the orchestration software space, and why AWS launched their fully-managed EC2 Container Service (ECS).
If we install Swarm or Mesos we’re responsible for operating the software. We need to keep an orchestration service up 100% of the time, so Instances can always check in and see if they need to start or stop processes. If we use ECS, we delegate that responsibility to Amazon.
Why ECS Tasks?
ECS Tasks are a young technology, but container orchestration is a modern best practice of cloud computing. Packaging our apps, operating system included, into a standard Image format, then scheduling containers across a cluster of “dumb” instances, is a huge efficiency improvement over the older EC2 Instance strategies.
Containers are faster to build and faster to boot than Instances. A single Instance can run multiple container workloads, which means less operating-system overhead and fewer instances to maintain and pay for. Orchestration coordinates fast deploys and failure recovery alike.
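To illustrate how small a deploy becomes once orchestration is in place, a rolling deploy on ECS can be a single API call that points a Service at a new Task Definition revision (the cluster, service, and revision names below are hypothetical):

```python
# Hypothetical rolling deploy: point an ECS Service at a new
# Task Definition revision and let ECS replace containers for us.
import boto3

ecs = boto3.client("ecs")

ecs.update_service(
    cluster="web-cluster",    # hypothetical cluster name
    service="web",            # hypothetical service name
    taskDefinition="web:43",  # new revision of the task definition
)
# ECS drains the old Tasks and starts new ones across the cluster,
# honoring the Service's desired count and load balancer health checks.
```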
The most important distinction about ECS is that it’s a managed service.
Amazon provides an “ECS Optimized AMI” with the right server operating system and Docker pre-configured. AWS is responsible for keeping the ECS APIs up and running so instances can always connect and ask for more work. AWS is responsible for writing the open-source ecs-agent that runs on every instance.
Because AWS built the entire stack, we can trust its quality, and trust that AWS will support it through tickets if things don’t work right.
It’s also important to understand that AWS considers Tasks a first-class primitive of the entire AWS platform.
Every individual Task, meaning every single command we run in the cluster, has configuration options (see the sketch after this list) for:
- CPU and memory limits
- Security policies through an IAM role
- Logging to an external syslog or fluentd system, or to a CloudWatch Logs group
- Registering into a load balancer
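To make those options concrete, here is a sketch (all names, ARNs, and values are hypothetical placeholders, and many parameters are trimmed) of registering a Task Definition and a Service with boto3, touching each point above: resource limits, an IAM role, log shipping, and load balancer registration.

```python
# Sketch of a Task Definition and Service using the options listed above.
# All names, ARNs, and values are hypothetical placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="web",
    taskRoleArn="arn:aws:iam::123456789012:role/web-task",  # security policy via IAM
    containerDefinitions=[
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest",
            "cpu": 256,     # CPU units
            "memory": 512,  # hard memory limit in MiB
            "portMappings": [{"containerPort": 8080}],
            "logConfiguration": {  # ship stdout/stderr to CloudWatch Logs
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "web",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "web",
                },
            },
        }
    ],
)

# Run the Task as a long-lived Service registered into an ALB target group.
ecs.create_service(
    cluster="web-cluster",
    serviceName="web",
    taskDefinition="web",
    desiredCount=2,
    role="arn:aws:iam::123456789012:role/ecs-service",  # lets ECS register targets
    loadBalancers=[
        {
            "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123",
            "containerName": "web",
            "containerPort": 8080,
        }
    ],
)
```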
This year we saw two new services, the Elastic File System (EFS) and the Application Load Balancer (ALB), which are clearly designed for containerized workloads and fix major constraints of the Elastic Block Store (EBS) and Elastic Load Balancer (ELB) from the EC2 generation.
With all this, Tasks have more platform features out of the box than EC2 Instances ever did.
I expect continual platform improvements around Tasks over the coming years, such as improved auditing and billing, as well as reduced effort for cluster management.
So with ECS Tasks, we’re responsible for providing an arbitrary Docker Image and Amazon is responsible for everything else to keep it running forever.
In Conclusion
EC2 Instances are too raw, requiring lots of customization to support an app. Lambda Functions are too constrained, disallowing traditional apps. ECS Tasks are just right, offering a simple way to package and run any application, while still relying on AWS to operate everything for us.
If you’re building a modern real-world system on AWS, an architecture based around ECS Tasks is the best choice.