Amazon has released native support for automated rollbacks in Amazon ECS. The feature uses Amazon CloudWatch metric alarms to monitor an in-progress deployment and, if necessary, roll it back. It supports any system metric that CloudWatch Container Insights collects for Amazon ECS, as well as custom metrics.
When creating or updating services within Amazon ECS, one or more CloudWatch metric alarms can be configured to determine if the deployment was successful. During a rolling update, Amazon ECS starts monitoring the list of configured CloudWatch alarms as soon as any task is running on the updated service. The new tasks make up the primary deployment, whereas the previous tasks are known as the active deployment.
With no alarms configured, the rolling update is complete when the primary deployment is healthy and at the desired count and the active deployment's count is zero. With alarms configured, the deployment continues for an additional period known as the bake time. During this time, the primary deployment remains in the IN_PROGRESS state. The bake time is calculated automatically from a combination of the CloudWatch alarm properties. At the end of the bake time, if no alarms have been triggered and the services remain in the OK state, the deployment is considered a success.
If at least one alarm is triggered, an automated rollback starts. A notification is sent via the event bus and the current deployment's state is set to FAILED. The active deployment is promoted back to being the primary deployment and is scaled back up to the desired count. The failed deployment is scaled down and then deleted.
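The sequence described above can be observed from the command line by listing a service's deployments with describe-services. The following is a minimal sketch, not part of the original announcement: show_deployments is a hypothetical helper name, and the cluster and service names passed to it are placeholders. In the output, the status field distinguishes the PRIMARY and ACTIVE deployments, while rolloutState reports IN_PROGRESS, COMPLETED, or FAILED.

```shell
# Hypothetical helper: prints one row per deployment of a service,
# showing its id, status, rollout state, and the reason for that state.
# Usage: show_deployments <cluster> <service>
show_deployments() {
  aws ecs describe-services \
    --cluster "$1" \
    --services "$2" \
    --query 'services[0].deployments[].[id,status,rolloutState,rolloutStateReason]' \
    --output table
}

# Example (requires AWS credentials; names are placeholders):
# show_deployments MyCluster MyService
```

Running the helper repeatedly during a rollback should show the failed deployment disappearing as the previous one returns to PRIMARY.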
Alarms can be created via the AWS Console, AWS CloudFormation, or the CLI. In the following example, create-service is used to create a Linux service with deployment alarms:
aws ecs create-service \
--service-name MyService \
--deployment-controller type=ECS \
--desired-count 2 \
--deployment-configuration "alarms={alarmNames=[alarm1Name,alarm2Name],enable=true,rollback=true}" \
--task-definition sample-fargate:1 \
--launch-type FARGATE \
--platform-os LINUX \
--platform-version 1.4.0 \
--network-configuration "awsvpcConfiguration={subnets=[subnet-12344321],securityGroups=[sg-12344321],assignPublicIp=ENABLED}"
Note that the deploymentConfiguration request parameter now accepts an alarms structure. It specifies the alarm names, whether the alarms are enabled, and whether to initiate a rollback when an alarm is triggered.
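The same alarms structure can also be passed to update-service to attach deployment alarms to an existing service. The sketch below is illustrative only: build_alarm_config and enable_deployment_alarms are hypothetical helper names, and the cluster name in the usage comment is a placeholder that does not appear in the original example.

```shell
# Hypothetical helper: joins the given alarm names with commas and wraps
# them in the shorthand syntax accepted by --deployment-configuration.
build_alarm_config() {
  local IFS=,
  printf 'alarms={alarmNames=[%s],enable=true,rollback=true}' "$*"
}

# Hypothetical wrapper: enables deployment alarms on an existing service.
# Usage: enable_deployment_alarms <cluster> <service> <alarm>...
enable_deployment_alarms() {
  local cluster=$1 service=$2
  shift 2
  aws ecs update-service \
    --cluster "$cluster" \
    --service "$service" \
    --deployment-configuration "$(build_alarm_config "$@")"
}

# Example (requires AWS credentials; names are placeholders):
# enable_deployment_alarms MyCluster MyService alarm1Name alarm2Name
```

Building the string in one place keeps the alarm list consistent between create-service and update-service calls.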
All system metrics that CloudWatch Container Insights collects for Amazon ECS can be used as the basis for an alarm. Custom metrics can also be used via one of two methods. If the Prometheus client library is in use, Container Insights Prometheus metrics monitoring can automate the discovery of Prometheus metrics, which are then ingested into Amazon CloudWatch as custom metrics. If the OpenTelemetry SDK is in use, AWS Distro for OpenTelemetry can export application metrics to Amazon CloudWatch.
AWS has provided some recommendations for which alarm metrics to use, depending on the service in use. For Application Load Balancers, AWS recommends using the HTTPCode_ELB_5XX_Count and HTTPCode_ELB_4XX_Count metrics to check for HTTP error code spikes. For existing applications, the CPUUtilization and MemoryUtilization metrics could be used to monitor the consumption of CPU and memory.
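As an illustration of the Application Load Balancer recommendation, an alarm on HTTPCode_ELB_5XX_Count could be created with put-metric-alarm and then referenced by name in the service's deployment configuration. This is a sketch under stated assumptions: create_5xx_alarm is a hypothetical helper name, and the alarm name, load balancer dimension value, threshold, and 60-second period are placeholder choices that would need tuning for a real workload.

```shell
# Hypothetical helper: creates an alarm that fires when the load balancer
# returns more than <threshold> 5XX responses within a one-minute period.
# Missing data is treated as not breaching, since the metric is only
# reported when errors occur.
# Usage: create_5xx_alarm <alarm-name> <load-balancer-dimension> <threshold>
create_5xx_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1" \
    --namespace AWS/ApplicationELB \
    --metric-name HTTPCode_ELB_5XX_Count \
    --dimensions "Name=LoadBalancer,Value=$2" \
    --statistic Sum \
    --period 60 \
    --evaluation-periods 1 \
    --threshold "$3" \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching
}

# Example (requires AWS credentials; all values are placeholders):
# create_5xx_alarm alarm1Name app/my-alb/0123456789abcdef 10
```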
Deployment alarms are only supported for Amazon ECS services that use the rolling update deployment controller. The feature is generally available, and questions or feedback can be submitted to the container public roadmap on GitHub.