AWS recently announced enhanced error handling capabilities for AWS Step Functions workflows, which allow developers to identify errors more clearly and give them fine-grained control over their retry strategies.
AWS Step Functions is a serverless workflow service that allows developers to orchestrate multiple AWS services using visual workflows and state machines. The service now includes enhanced error handling that allows developers to construct detailed error messages in Fail states, including dynamic information about the error cause at runtime. In addition, developers can set a maximum limit on retry intervals, giving them greater control over their retry strategies so that exponentially growing retry intervals do not exceed the desired timeframe.
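As an illustration of the dynamic error messages in Fail states, the following minimal sketch builds a Fail state as a Python dictionary and prints it as an Amazon States Language fragment. It assumes the Fail state fields ErrorPath and CausePath and the States.Format intrinsic function as described in the Step Functions documentation; the state name OrderFailed and the input fields errorCode and orderId are hypothetical and not taken from the announcement.

```python
import json

# Sketch of a Fail state whose error name and cause are resolved at runtime.
# "ErrorPath" and "CausePath" are assumed here from the Step Functions
# documentation; "$.errorCode" and "$.orderId" are hypothetical input fields
# chosen only for this illustration.
fail_state = {
    "Type": "Fail",
    "ErrorPath": "$.errorCode",
    "CausePath": "States.Format('Order {} could not be processed', $.orderId)",
}


def to_asl(state_name, state):
    """Serialize a single named state as an Amazon States Language fragment."""
    return json.dumps({state_name: state}, indent=2)


print(to_asl("OrderFailed", fail_state))
```

The static Error and Cause fields are left out on purpose, since the sketch assumes each one is an alternative to its Path counterpart rather than something to combine with it.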
The addition of jitter also allows developers to randomize their retry intervals, which helps prevent overloading services called from a state machine while they recover.
Marc Brooker, a VP/distinguished engineer at AWS, explains in an Amazon Builders' Library article on timeouts, retries, and backoff with jitter:
When failures are caused by overload or contention, backing off often doesn't help as much as it seems like it should. This is because of correlation. If all the failed calls back off to the same time, they cause contention or overload again when they are retried. Our solution is jitter. Jitter adds some amount of randomness to the backoff to spread the retries around in time.
The documentation on error handling for Step Functions mentions that Task, Parallel, and Map states can have a field named Retry, whose value must be an array of objects known as retriers. An individual retrier represents a certain number of retries, usually at increasing time intervals. One of a retrier's optional fields is JitterStrategy, a string that determines whether to include jitter in the wait times between consecutive retry attempts. Another optional field is MaxDelaySeconds, a positive integer that sets the maximum value, in seconds, up to which a retry interval can increase.
An example retrier can look like this:
"Retry": [ {
    "ErrorEquals": [ "States.Timeout" ],
    "IntervalSeconds": 3,
    "MaxAttempts": 3,
    "BackoffRate": 2,
    "MaxDelaySeconds": 5,
    "JitterStrategy": "FULL"
} ]
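To make the interplay of these fields concrete, the following Python sketch computes the wait before each retry for a retrier like the one above. It is an approximation under stated assumptions, not the service's actual implementation: the function retry_delays is hypothetical, the cap is assumed to be applied before jitter, and FULL jitter is assumed to pick a value uniformly between zero and the capped interval.

```python
import random


def retry_delays(interval_seconds, max_attempts, backoff_rate,
                 max_delay_seconds=None, jitter_strategy="NONE", rng=random):
    """Return the wait in seconds before each retry attempt."""
    delays = []
    for attempt in range(max_attempts):
        # Exponential backoff: the interval grows by backoff_rate per attempt.
        delay = interval_seconds * backoff_rate ** attempt
        # MaxDelaySeconds caps how far the interval can grow.
        if max_delay_seconds is not None:
            delay = min(delay, max_delay_seconds)
        if jitter_strategy == "FULL":
            # Assumption: full jitter draws uniformly from [0, capped delay].
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


# Without jitter, the uncapped intervals 3, 6, 12 are limited to 5 seconds.
print(retry_delays(3, 3, 2, max_delay_seconds=5))  # [3, 5, 5]

# With full jitter, each wait is random but never exceeds the capped interval.
print(retry_delays(3, 3, 2, max_delay_seconds=5, jitter_strategy="FULL"))
```

Under these assumptions, concurrent executions that fail at the same moment no longer retry in lockstep, which is the correlation problem the quote above describes.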
The addition of enhanced error handling in Step Functions resonated with the community of AWS builders. Yan Cui, an AWS Serverless Hero, tweeted:
My favorite is probably the ability to add jitters to the retry delays, which is a common strategy in large systems to avoid retry storms.
In addition, Luc van Donkersgoed, a lead engineer at PostNL and AWS Serverless Hero, tweeted:
Oooohhh these are some great additions to AWS Step Functions:
- Custom error messages in fail states
- JitterStrategy for retries (!)
- MaxDelaySeconds for retries
And Rehan Haider, a cloud & infrastructure solution architect at DXC Technology, commented in a tweet:
Seems to be a step forward in workflow efficiency.
Developers can get started with enhanced error handling in the AWS console, AWS CloudFormation, the AWS Command Line Interface (CLI), or the AWS Cloud Development Kit (CDK). More details are available in the developer guide.
Lastly, the pricing details of Step Functions workflows are available on the pricing page.