Transcript
Pogosova: Imagine you have built this simple serverless architecture in the cloud. You pick the pieces you want to use. You connect them together, and everything just works like magic, because that's what serverless is. So much so that you decide to go to production with it. Of course, all of a sudden, all hell breaks loose, and everything just starts to fail in all possible shapes and forms. There's nothing that you have done wrong, at least that's what you think, except for maybe not reading all the extensive documentation about all the services and all the libraries that you are using. Who does that nowadays anyway? What's the next logical step? Maybe you just swear off using the architecture or services in question, never again, because they just don't work. Do they? What's wrong with the cloud anyway? Why doesn't it work? Maybe we should just go on-prem and build good old monoliths.
Purely statistically speaking, we might consider that the problem is not necessarily the cloud or the cloud provider, maybe not even the services or the architecture that you have built. What is it then? As Murphy's Law says, anything that can go wrong will go wrong. I personally prefer the more extended version of it that says, anything that can go wrong will go wrong, and at the worst possible time.
Background
My name is Anahit. I'm a Lead Cloud Software Engineer at a company called Solita. I'm also an AWS Data Hero. There was this funny thing that I noticed after becoming an AWS Hero, for some reason, people just started to come to me with this smirk on their faces, saying, so tell us now, what's wrong with the cloud? Why doesn't it just work? I'm finally here to answer that question. Maybe not exactly that question, but I want us to look together at something that we as humans usually don't feel too comfortable looking at, failures.
I hope that this talk helps you to become a little bit more aware and curious, maybe to spot patterns that others don't necessarily see. Also, to have the tools to ask questions and make conscious, critical, and informed decisions, rather than believing in magic, and to take control into your own hands. Finally, to become a little bit paranoid, but in a good way, because, to borrow the words of Martin Kleppmann, "In distributed systems, suspicion, pessimism, and paranoia pay off".
A Failing Serverless Architecture
Before we go any deeper into talking about distributed systems and all the failures, let's briefly get back to our story with the serverless architecture that was failing. It actually had a prequel to it. Once upon a time, you were a developer that started developing software that was supposed to run on a single machine, probably somewhere in an on-prem data center. Everything you needed to care about at that point was your so-called functional requirements, so that your code works and it does exactly what it's supposed to, and it has as few bugs as possible. That was your definition of reliability. Of course, there could have been some occasional hardware failures, but you didn't really care much about them, because things were mostly deterministic. Everything either worked or it didn't.
Next thing you know, you find yourself in the cloud, maybe building software that is supposed to run on virtual machines in the cloud. All of a sudden, you start developing software that requires you to think about the so-called non-functional requirements, so certain levels of availability, scalability. Also, reliability and resilience get a whole new meaning. You still have all your functional requirements. You still need to make sure that your code works and has as few bugs as possible. The complexity just went up a notch, and you need to worry about so much more now. Also, failures start to be somewhat more pronounced and somewhat less deterministic. Welcome to the dark side, a wonderful world of distributed systems where with great power comes great responsibility.
Things don't stop there, and before you know it, you jump over to the serverless world, where things just seem so easy again. You just pick the services. You connect them together. Pure magic. Of course, cloud providers take care of the most important ilities for you, things like reliability, availability, scalability, and you're once again back to caring about your own code and the fact that it works and has as few bugs as humanly possible. You are not looking after any machines anymore. Moreover, the word serverless suggests that there are no machines that you need to be looking after. Things are just easy and nice and wonderful. Though, we know that that's not exactly how the story goes, because anything that can go wrong will go wrong. What is it that can go wrong exactly?
Distributed Systems, Cloud, and Serverless
To set the stage, let's talk a little bit about serverless, cloud, and distributed systems in really simplified terms. A distributed system is just a bunch of machines connected with a network. While it provides a lot of new and exciting ways to build solutions and solve problems, it also comes with a lot of new and exciting ways for things to go wrong, because the resources we are using are not limited to a single machine anymore. They are distributed across multiple servers, server racks, data centers, maybe even geolocations. Failures can now happen in many different machines, instead of just one. Those failures can, of course, happen on many different levels. They can be software failures or hardware failures, so things like the operating system, hard drives, network adapters, anything can fail. All of those failures can happen completely independently of each other and in the most non-deterministic way possible.
The worst thing here is that all of those machines are talking to each other over a network. The network is known for one thing in particular: whenever there is any communication happening over the network, it will eventually fail. Any cloud is built on top of such distributed systems, that's where their superpowers come from. The cloud provider takes care of the most difficult part of managing the underlying distributed infrastructure, abstracting it away from us and giving us access to this really huge pool of shared resources that we can use, like compute, storage, network. They do that at a really massive scale that no individual user could ever achieve. Especially at that scale, if something has a tiny little chance of happening, it most certainly will.
Serverless and fully managed services are just a step up in this abstraction ladder. They make the underlying infrastructure seem almost invisible, almost magical, so much so that we sometimes forget that it's there. By using those serverless services, we didn't just magically teleport to a different reality. We are still living in the very same messy physical world, still using the very same underlying infrastructure with all its complexity. Of course, this higher level of abstraction does make a lot of things easier, just like a higher-level programming language would. It also comes with a certain danger.
Being seemingly simple to use, it might also give you this false sense of security, which might make spotting potential failures that much harder, because they are also abstracted away from you. The reality is, failures didn't really go anywhere. They are still there, embedded in the very same distributed system that you are using, waiting to show up. As Leslie Lamport said in 1987, "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable". We could rephrase this for serverless architectures. A serverless architecture is one in which the failure of a computer you didn't even know or care existed can render your entire architecture unusable. Of course, failures with serverless are a bit different from failures with distributed systems. You don't see them as blue screens or failures of the hardware, they manifest in a somewhat different way.
Distributed Architectures
Let's take one last step in this abstraction ladder. We started with distributed systems, then we had the cloud, then we had serverless. Now we are building distributed applications on top of serverless and distributed systems. In essence, what we do is we just split the problem at hand into smaller pieces. We pick resources or services for each piece. We connect them together with things like messages or HTTP requests or events, and just like that, we build ourselves a distributed application. All of those things are using the network in some shape or form. You might have noticed that, in essence, we are mirroring the underlying distributed system that we are building on.
Of course, distributed applications also give you this great power of building applications in a completely different way, but just like the underlying distributed architecture, they come with certain complexity and certain tradeoffs. The architectures that you are going to build are likely going to be complex. Every piece can fail at any given moment in the most non-deterministic way possible. Whenever there is any communication happening over the network, it will eventually fail. A special case of this kind of distributed architecture are the so-called data architectures, data applications. With data applications, we deal with collecting, storing, and processing large amounts of data. The data can be anything: log data, maybe you have website clickstreams, or IoT data. Whatever data you have, as long as the volumes are large.
On one hand, that large volume makes spotting potential issues somewhat easier, because if something has a tiny chance of happening at a bigger scale, it will. On the other hand, with data applications, failures are maybe not as obvious as with client-facing applications. If there was a failure while processing the incoming data, nobody is going to resend you that data. Once the data is gone, it's gone. We'll see an example of such a data architecture.
How do we make our distributed architectures more resilient in the face of all those impending failures? Since we are mirroring the underlying distributed system, let's take a look at how cloud providers are dealing with all those failures. They do have quite some experience with that. Of course, there are a lot of complex algorithms and mechanisms at play. Surprisingly, two of the most effective tools for making distributed architectures or distributed systems more resilient are also quite simple, or at least seemingly simple: timeouts and retries.
Those are the things that we absolutely need to be aware of when we are building our distributed applications. I call them superpowers, because just like superpowers, they can be extremely powerful, but we need to be very mindful and careful about how we use them, not to do more harm. You might have noticed that so far, I haven't mentioned any cloud providers, any services, nothing, because all of those things are pretty much universal to any of them. Now it's finally time to move on from our fictional story, and also time for me to confess that it probably wasn't as fictional as I wanted you to believe. In fact, it's something that happened to me to some degree. I was working with a customer where we were building this simple serverless, distributed architecture for near real-time data streaming at a pretty big scale.
On a quiet day, we would have over half a terabyte of data coming in. We wanted to collect and process and store that data. For that, we had our producer application that received the data. We connected it to a Kinesis Data Stream. On the other end, we connected an AWS Lambda to it, and just like that, we built ourselves a magical data processing pipeline. Things were just wonderful. We were very happy, until one day we realized we were actually losing data in many different places, and we had no idea it was happening. Thank you, higher level of abstraction. What exactly was going on there? Let's look at it step by step. There were several places where that was happening.
Kinesis Data Streams
First, what is Kinesis Data Streams? It's a fully managed, massively scalable service in AWS for streaming data. After you write the data to the stream, it appears in the stream within milliseconds, and it's stored in that stream for at least 24 hours, or up to a year if you configure it so. During that entire time, you can process and reprocess, read that data in any way that you want, as many times as you want, but you cannot delete the data from the stream. Once it gets to the stream, it stays there for at least 24 hours. Kinesis is an extremely powerful tool. It's mostly used for data applications, but also for event-driven architectures. The power comes from the fact that you don't need to manage any servers or any clusters.
Also, it scales pretty much massively. To achieve that massive scalability, Kinesis uses the concept of a shard. In this particular context, a shard just means an ordered queue within the stream, the stream being composed of multiple such queues. Each shard comes with a capacity limitation on how much data you can write to it: 1 megabyte or 1,000 records of incoming data per second. The number of shards you can have in a stream is pretty much unlimited. You can have as many shards as you want to stream as much data as you need. When you write the data to the stream, it will end up in one of the shards in your stream.
Speaking about writing the data, there are actually two ways in which you can write the data to the stream. In essence, you have to choose between two API calls. You either write individual records or you can batch up to 500 records in a single API call. Batching is usually a more effective and less resource intensive way to make API calls, especially in data intensive applications where the number of individual calls can get really high, really quickly. Once again, when something sounds too good, there are usually some things we need to consider, and we'll get back to that.
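As a hedged illustration, here is roughly what the batched write could look like with the AWS SDK for JavaScript v3. The region, stream name, and payload shape are placeholders, not something from the talk.

```typescript
import { KinesisClient, PutRecordsCommand } from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({ region: "eu-west-1" });

// Batch up to 500 records into a single PutRecords API call.
async function writeBatch(events: Array<{ id: string; payload: unknown }>) {
  const response = await kinesis.send(
    new PutRecordsCommand({
      StreamName: "my-data-stream",
      Records: events.map((event) => ({
        PartitionKey: event.id, // decides which shard the record lands in
        Data: Buffer.from(JSON.stringify(event.payload)),
      })),
    })
  );
  return response; // FailedRecordCount in the response matters, as we'll see later
}
```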
How Failures Manifest
We have established by now that the failures will happen. There's no way around that. How do those failures manifest with this higher level of abstraction, with serverless services, with services like Kinesis, for example? It's actually pretty straightforward, because when we interact with services, let's say, from our code, we are using API calls, and naturally, any of those API calls can fail. The good news is that if you are using the AWS SDK to make those API calls from your code, it handles most of those failures for you. After all, AWS does know that the failures will happen, so they have built into the SDK one of those essential tools for better resilience, or a superpower as we know it, the retries. The trouble with retries is that now we have the potential of turning a small intermittent problem, let's say a network glitch, into a really massive one, because retries can have a really unexpected blast radius.
They can spread this ripple effect of cascading failures through your entire system and ultimately bring the entire system down, because retries are ultimately selfish, just like when you're hitting the refresh button in a browser. We all know we shouldn't do it, but we do it anyway. Retrying implies that our request is more important, more valuable than anybody else's. We are ready to spend more resources, we are ready to add load, to add potential cost on the downstream system, just to make sure that our request goes through no matter what.
The reality is that retries are not always effective, neither are they safe. First and foremost, which failures do we even retry? Let's say the failure is caused by the downstream system, such as a database or an API being under a really heavy load. Then if you are retrying, you are probably making the matter worse. Or let's say the request failed because it took too much time and it timed out. Then retrying will take more time than you are actually prepared to wait. Let's say you have your own SLA requirements. Then that basically means that retrying is just selfishly using resources that you don't really need. It's just like pressing the refresh button and then closing the browser. What if the underlying systems also have their own retries implemented? Let's say they also have a distributed application. They have different components. Each of them has retries on a certain level.
In that case, our retries will just be multiplied, and they will just amplify all the potential bad things that can happen. This is the place where this ripple effect of cascading failures can happen really easily, especially if we start retrying without giving the underlying system a chance to recover. What if the operation that you're actually retrying has side effects? Let's say you're updating a database. Then retries can actually have unexpected results. The bottom line is, we need to be extremely careful about how we use this superpower. We don't want to bring down the system that we are trying to fix.
Luckily, in case of AWS SDK, retries already come with these built-in safety measures. If a request to a service such as Kinesis fails, AWS SDK will only handle the so-called retryable errors, so things like service unavailable, other 500 errors, or timeouts. For those retryable errors, it will retry them on your behalf, behind the scenes, but it will stop after a certain amount of attempts. Between those retry attempts, it will use the so-called exponential backoff, which means that delays between retry attempts will be increasing exponentially. These things might seem very simple, but they are actually crucial details that can either make it or break it. They can turn retries from being a very powerful tool for better resilience into the main cause of a system outage, because we only want to be retrying if it actually helps the situation, so only retryable failures.
When we do retry, we do want to stop if it doesn't help the situation anymore, to avoid that ripple effect of cascading failures as much as possible. Also, we want to spread the retry attempts as uniformly as possible, instead of just sending a burst of retries to a system that is already under a heavy load, to give the system a chance to recover. With the AWS SDK, you are given these safety measures, but you are also given the possibility to configure some of those retry parameters. Here's an example of how you would do this with the JavaScript SDK. Every language will have its own ways to configure them and its own default values, but all of them will give you the possibility to configure some of those values.
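A minimal sketch of what that configuration might look like with the JavaScript SDK v3; the values are illustrative, not recommendations from the talk.

```typescript
import { KinesisClient } from "@aws-sdk/client-kinesis";

// Tune the SDK's built-in retry behavior instead of relying on the defaults.
const kinesis = new KinesisClient({
  region: "eu-west-1",
  maxAttempts: 3,        // the initial call plus at most two retries
  retryMode: "adaptive", // "standard" is the default; "adaptive" adds client-side rate limiting
});
```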
The same way, they will give you the possibility to configure the second superpower that we have, the timeout related values. If timeouts don't sound like too much of a superpower to you, I have news for you. Timeouts are pretty much a given in distributed systems, and we absolutely need to be aware of them.
Once again, in simplified terms, let's talk about timeouts. When we are interacting with services, no matter whether they are serverless or not, we are using API calls. Usually, those API calls are abstracted away as SDK method calls, and those look exactly the same as any local method invocation. Let's not let that fool us, because we know already, the network is still there, it's just abstracted away from us. Any request sent over a network, like an API call to Kinesis, for example, can fail in many different stages.
Moreover, it's almost impossible to tell if the request actually failed or it didn't, because that failure can happen on many levels. Maybe sending the request actually failed, or maybe the processing of the request failed, or maybe your request is waiting in a queue because the downstream system is overwhelmed. Or, maybe the request was processed, but you just never got the response back, and you just don't know about it. There are plenty of options, but the end result is always the same. You are stuck waiting for something that might never happen. This can happen to any service. It can happen to Kinesis as well. Not to wait forever, AWS has built into the SDK this other tool for better resilience, this other superpower, timeouts. The ability to configure those timeouts for the API calls is our superpower that we can use.
Again, just like with retries, we need to be extremely careful how we use it, because picking the right timeout value is not an easy task at all. Just like any decision in your architecture, it will come with certain tradeoffs. Let's say you pick a timeout that is too long, then it's likely going to be ineffective, and it can consume resources and increase latencies. A timeout that is too short might mean that you will be retrying too early and not giving the original request a chance to finish, and this will inevitably mean that you will add load to the underlying system. This, as we know, will cause these cascading failures, and eventually can bring the entire system down.
On top of all that, the appropriate timeout value will be different depending on the service you are using or the operation that you are performing. For the longest time, I've been scaring people by saying that for all the services and all the requests, the AWS JavaScript SDK has this default timeout that it uses, and it's two minutes. Just think about it, we are probably dealing with low latency systems with services like Kinesis, in this case, or DynamoDB. Usually, we are really aware of every millisecond that we spend. Here we are just stuck for two minutes waiting for the SDK to decide that, yes, there was a timeout.
Since then, things have changed, things have evolved. The JavaScript SDK saw a change of version from 2 to 3, and the default timeout values have changed as well, so now the default timeout is infinite. Basically, this means that if you don't configure those timeouts, you are just stuck there. Who knows what happened? The bottom line here is: very powerful but very dangerous.
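As a hedged sketch, this is roughly how explicit timeouts can be set on a client in the JavaScript SDK v3. The exact option names have shifted between SDK versions (for example, socketTimeout versus requestTimeout), so treat the names and values below as illustrative and check the version you actually use.

```typescript
import { KinesisClient } from "@aws-sdk/client-kinesis";
import { NodeHttpHandler } from "@smithy/node-http-handler";

// Explicit timeouts instead of the effectively infinite defaults.
const kinesis = new KinesisClient({
  region: "eu-west-1",
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 1000, // ms to establish the connection
    requestTimeout: 2000,    // ms to wait for the response
  }),
});
```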
Finally, we get to the first reason for losing data in our story: having too long timeouts can actually exhaust the resources of your producer application and prevent it from consuming new incoming requests. That's exactly what we saw in our architecture. It was still in the two-minute timeout era. Still, even then, without configuring timeouts, we got our producer to be stuck waiting for something that would never happen, likely because of a small, intermittent problem. We got an entire system outage instead of masking and recovering from individual failures, because that's what a resilient system should do. Instead, we just gave up and started losing data.
This obviously doesn't sound too great, but if you think the solution is to have short timeouts, I have bad news for you as well, because in my experience, too short timeouts can be even more dangerous, especially if they are combined with retries. Because again, retrying requests too early, not giving the original request a chance to complete means that you are adding the load on the underlying system and pushing it over the edge. You start to see all of the fun things. It's not as obvious when you see it. You just start to see failures. You start to see latencies. You start to see cost going up. We will get this ripple effect of cascading failures in the end, and again, ultimately bring the entire system down.
Again, if our goal is to build a resilient system, we should mask and recover from individual failures. We should make sure that our system works as a whole, even if there were some individual failures. Here is where wrongly configured timeouts and retries can really become a match made in hell. Once again, even though retries and timeouts can be extremely powerful, we need to be very mindful of how we use them. We should never ever go with defaults. Defaults are really dangerous. Next time you see your own code or your own architecture, please go check all the libraries and all the places where you make any calls over the network, let's say the SDK, or maybe you have external APIs that you are calling.
Please make sure that you know what the default timeouts are, because there usually are default timeouts. Make sure that you are comfortable with that value. It's not too small, it's not too big. That you have control over that value. Especially if those timeouts are combined with retries.
So far, I've been talking about these failures that are inevitable in distributed systems, but there's actually one more type of failure, and those failures are caused by the cloud providers on purpose. Those are failures related to service limits and throttling. This can be extremely confusing, especially in the serverless world, because we are promised scalability. Somehow, we are very easily assuming infinite scalability. Of course, if something sounds too good to be true, there's probably a catch. Sooner or later, we better face the reality.
The reality is, of course, the cloud is not infinite, and moreover, we are sharing the underlying resources with everybody else. We don't have the entire cloud at our disposal. Sharing all those resources comes with tradeoffs. Of course, on one hand, we do have this huge amount of resources that we can share that we wouldn't have otherwise, but on the other hand, it also allows individual users, on purpose or by accident, to monopolize certain resources. Since resources are not infinite, that will inevitably cause degradation of service for all the other users. Service limits are there to ensure that that doesn't happen, and throttling is just a tool to enforce those service limits. For example, in the case of Kinesis, we had the shard-level limit on how much data we can write to a shard, and once we reach that limit, all our requests will be throttled. They will fail.
We get to the next reason for losing data in our story. I said at some point that if you're using the AWS SDK, you're lucky, because it handles most of the failures for you. The catch here is that in the case of batch operations, like we have with putRecords here, instead of just handling the failure of an entire request, we should also handle the so-called partial failures. The thing here is that those batch operations are not atomic. It's not that either all records succeeded or all records failed. It might happen that part of your batch goes through successfully while the other part fails, and you still get a success response back. It's your responsibility to detect those partial failures and to handle them.
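A hedged sketch of what that detection might look like with the JavaScript SDK v3: the overall call succeeds, but individual entries carry an error code, and only those entries are collected for a retry. The names and values are placeholders.

```typescript
import {
  KinesisClient,
  PutRecordsCommand,
  PutRecordsRequestEntry,
} from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({ region: "eu-west-1" });

// Returns the entries that failed, so the caller can retry just those.
async function putRecordsWithPartialFailureCheck(
  streamName: string,
  entries: PutRecordsRequestEntry[]
): Promise<PutRecordsRequestEntry[]> {
  const response = await kinesis.send(
    new PutRecordsCommand({ StreamName: streamName, Records: entries })
  );

  if (!response.FailedRecordCount) {
    return []; // every record made it into the stream
  }

  // The result array is in the same order as the request entries;
  // failed entries have ErrorCode set (for example, due to throttling).
  return entries.filter((_, i) => response.Records?.[i]?.ErrorCode);
}
```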
Moreover, every single record in a batch can fail, every single one of them, and you will still get a success response back. It's very important to handle those partial failures. The main reason for partial failures in this case is actually throttling, so occasionally exceeding service limits when there are spikes in traffic. Luckily, we already know that there is this fantastic tool that can help us when we are dealing with transient errors, and in the case of occasional spikes in traffic, we are dealing with something temporary that will probably pass. This wonderful tool is, of course, retries. When implementing retries, there are three key things that we need to keep in mind. Let's go through them one more time. We only want to retry the retryable failures. We want to set upper limits for our retries, not to retry forever, and stop retrying when it doesn't help. We want to spread the retry attempts as uniformly as possible, and have a proper backoff.
For that, exponential backoff with jitter is an extremely powerful tool. I actually lied a little bit earlier: the SDK uses exponential backoff and jitter. Jitter basically just means adding some randomization to your delay. Unsurprisingly, or maybe not so surprisingly, with just this small little change to how you handle the delays or the backoffs between your retry attempts, you can actually dramatically improve your chances of getting your request through. It actually reduces the number of retry attempts that you need and increases the chance of the overall request succeeding pretty drastically. A very small and simple tool, but also extremely powerful.
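For illustration, a tiny sketch of the "full jitter" variant of this idea: the ceiling grows exponentially with the attempt number, and the actual delay is a random value below that ceiling. The base and cap values are made up.

```typescript
// Full jitter: delay = random(0, min(cap, base * 2^attempt)).
function backoffWithJitter(attempt: number, baseMs = 100, capMs = 10_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage inside a retry loop, retrying only retryable failures
// and only up to a fixed number of attempts:
//   await sleep(backoffWithJitter(attempt));
```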
I always say that if you remember anything from my talks, let it be partial failures of batch operations, timeouts, and retries with exponential backoff and jitter. Because those things will save you tons of headache in many situations when you are dealing with distributed applications. They are not related to any particular service. They are just things that you absolutely need to be aware of. To borrow words of Gregor Hohpe, "Retries have brought more distributed systems down than all the other causes together". Of course, this doesn't mean that we shouldn't retry, but we know by now we need to be very mindful and very careful. We don't want to kill the system that we are trying to fix.
Lambda
Speaking of which, there were even more failures coming in our story. Let's see now what else can happen if we let the error handling slide and not really do anything, and just go with the defaults. This time, we are speaking about the other end of our architecture, where we had a lambda function reading from our Kinesis stream. Lambda itself is actually a prime representative of distributed applications, because it's composed of many different components that work together behind the scenes to make it one powerful service. One of those components that I personally love and adore is the event source mapping. It's probably unfamiliar to you, even if you are using lambda, because it's very well abstracted underneath the lambda abstraction layer.
It's a very important component when you're dealing with different event sources, in this case, Kinesis, for example. Because when you're reading data with your lambda from a Kinesis Data Stream, you're, in fact, attaching an event source mapping to that stream, and it polls records from the stream, batches them, and invokes your lambda function code for you. It will pick those records from all the shards in the stream in parallel. You can have up to 10 lambdas reading from each shard in your stream.
That's, once again, something that the event source mapping provides you, and it's just a feature called parallelization factor. You can set it to have up to 10 lambdas reading from each shard instead of just one. Here is where we see the true power of concurrent processing kicking in, because now we can actually parallelize that record processing, we can speed things up if we need to. We have 10 lambdas reading from each shard instead of just one lambda.
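As a hedged sketch of where that setting lives, here is how the mapping might be configured with the AWS CDK in TypeScript; the stream and the function are assumed to be defined elsewhere in the stack, and the values are illustrative.

```typescript
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as kinesis from "aws-cdk-lib/aws-kinesis";
import { KinesisEventSource } from "aws-cdk-lib/aws-lambda-event-sources";

declare const stream: kinesis.Stream;       // defined elsewhere in the stack
declare const consumerFn: lambda.Function;  // defined elsewhere in the stack

// The event source mapping is created behind this construct.
consumerFn.addEventSource(
  new KinesisEventSource(stream, {
    startingPosition: lambda.StartingPosition.LATEST,
    batchSize: 100,
    parallelizationFactor: 10, // up to 10 concurrent invocations per shard
  })
);
```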
Of course, there's always a catch behind every fantastic thing that you hear, and in this case, it's extremely easy to hit one of the service limits. Every service has limits, even lambda. One of those limits is very important. I always bring it up. It's the lambda concurrency limit. This one basically just means that you can have a limited number of concurrent lambda invocations in the same account in the same region. The number of lambda instances running in your account, in your region, at the same time, is always limited to a number. Usually that number is 1000, it's a soft limit. Nowadays, I've heard that you only get 100 lambdas. I haven't seen that. I've just heard rumors. It's a soft limit. You can increase it by creating a ticket to support. There is still going to be a limit.
Once you reach that limit, all the new lambda invocations in that account, in that region, will be throttled. They will fail. Let's say you have your Kinesis stream with 100 shards, and let's say you set the parallelization factor to 10, because you want to speed things up, because why not? Now all of a sudden you have 1000 lambdas, so 100 times 10, reading from your stream, and things are probably going to be ok. Then there is another very important lambda somewhere in your account, somewhere in your region, that does something completely unrelated to your stream, but that lambda starts to fail.
The reason is, you have consumed the entire lambda concurrency limit with your stream consumer. This is the limit that can have a really big blast radius, and it can spread this familiar ripple effect of cascading failures well outside of your own architecture. You can actually have failures in systems that have nothing to do with your architecture. That's why I always bring it up, and that's why it's a very important limit to be aware of and to monitor all the time.
Now let's get back to actually reading data from our stream. What happens if there's a failure? If there is a failure, again, there's a good news-bad news situation. The good news is that the event source mapping that I've been talking about actually comes with a lot of extensive error handling capabilities. To use them, you actually need to know that they are there. If you don't know that the entire event source mapping exists, the chances are high you don't know about its error handling capabilities either. The bad news is that if you don't know, you are likely just to go with the defaults. We should know by now that defaults can be really dangerous. What happens by default if there is a failure in processing a batch of records? Let's say there was a bad record with some corrupt data, and your lambda couldn't process it. You didn't implement proper error handling, because nothing bad can ever happen. Your entire lambda now fails as a result of it. What happens next?
By default, even though no amount of retries can help in this situation, because we just have some bad data there, lambda will be retrying that one batch of records over and over again until it either succeeds, which it never will, or until the data in the batch expires. In the case of Kinesis, data stays in the stream for at least 24 hours. This effectively means an entire day of useless lambda invocations that will be retrying that batch of records. They don't come for free. You're going to pay for them. All those useless lambda invocations have a lot of fun side effects, and one of them is that you are likely reprocessing the same data over and over again. Because, you see, from the perspective of the event source mapping, either your entire batch succeeded or your entire batch failed.
Whenever lambda encounters a record that fails, it fails the entire batch. In this example, records 1, 2, and 3 went through successfully, record 4 failed, but your entire batch will be retried. Records 1, 2, and 3 will be reprocessed over and over again. Here we come to idempotency, which Leo was telling us about, extremely important. Bad things don't really stop there, because while all this madness, at least it looks like madness to us, is happening, no other records are being picked from that shard. The other shards in your stream go on with their lives, and data is being processed, and everything is good, but that one shard is stuck. It's waiting for that one bad batch of records to be processed. That's why it's often referred to as a poison pill record, because there was just one bad record, but a lot of bad things happen.
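As a hedged sketch, one of the event source mapping's error handling options, reporting batch item failures, addresses exactly this: the handler reports only the records it could not process instead of failing the whole invocation. This assumes the option is enabled on the mapping, and processRecord is a placeholder for your own, ideally idempotent, processing logic.

```typescript
import type { KinesisStreamEvent, KinesisStreamBatchResponse } from "aws-lambda";

export const handler = async (
  event: KinesisStreamEvent
): Promise<KinesisStreamBatchResponse> => {
  const batchItemFailures: { itemIdentifier: string }[] = [];

  for (const record of event.Records) {
    try {
      const payload = Buffer.from(record.kinesis.data, "base64").toString("utf8");
      await processRecord(payload);
    } catch {
      // Report the failing record's sequence number; lambda will retry
      // from this record onwards instead of replaying the whole batch.
      batchItemFailures.push({ itemIdentifier: record.kinesis.sequenceNumber });
    }
  }

  return { batchItemFailures };
};

declare function processRecord(payload: string): Promise<void>;
```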
One of those bad things: especially in data applications, data loses its value pretty quickly, so the chances are really high that 24-hour-old data is pretty much useless to you. We lose data on many different levels when we don't process it right away. Speaking of losing data, let's say that after 24 hours the data finally expires. Your entire batch finally leaves the stream. Lambda can stop retrying. Of course, part of that batch wasn't processed, but ok, that's life. What will you do? At least your lambda can catch up and start picking new records from that shard. Bad things happen. It's ok.
The problem here is that your shard is potentially filled with records that were written around the same time as the already expired ones, which means that they will also expire around the same time. This might lead to a situation where your lambda will not even have a chance to process all those records. The records will just keep expiring and being deleted from the shard without you having a chance to process them. I bring up this overflowing sink analogy because you just don't have enough time to drain that sink, the water just keeps flowing. We started with just one bad record, and we ended up losing a lot of valid and valuable data. Again, something opposite from what a resilient system should be like.
Yes, this is exactly what we were seeing in our story. Just because of some random bad records, we would lose a lot of data. We would have a lot of reprocessing. We would have a lot of duplicates. We would have a lot of delays. We would consume a lot of resources and pay a lot of money for retries that would never succeed. All of those things happened because we just didn't know better and just went with the good old defaults when it comes to handling failures. We know by now that that's exactly the opposite of what we should do. I need to bring up the quote by Gregor because it's just so great.
Luckily, there are many easy ways in which we can be more mindful and smarter about retries when it comes to lambda, because, as I said, the event source mapping comes with an extensive set of error handling capabilities. We know by now that probably the most important things that we should set are timeout values and limits for the retries. We can do both of them with the event source mapping, but both of them are set to minus one by default, so no limits. That's exactly what we were seeing. It's not like AWS wants to be bad and evil to us. Actually, it makes sense, because Kinesis promises us ordering of records. I said that a shard is an ordered queue, so records that come into the shard should be ordered, which means that lambda needs to process them in order. If there is a failure, it can't just jump over and maybe process the failures in the background, because that would mess up the entire order.
This is done with the best intentions, but the end result is still not pretty. We need to configure that failure handling ourselves. There are also a lot of other very useful things that you can do to improve error handling with the event source mapping. We can use all of those options in any combination that we want. The most important thing: please, do not go with the defaults. If you are interested at all, I have written these two huge user manuals, I call them blog posts, about Kinesis and lambda, and how they work together and separately. There's a lot of detail about lambda and the event source mapping there as well. Please go ahead and read them if you feel so inclined.
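To make that concrete, here is a hedged sketch of what overriding those defaults might look like with the AWS CDK in TypeScript. The specific values, the SNS topic, and the stream and function constructs are placeholders, not recommendations from the talk.

```typescript
import { Duration } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as kinesis from "aws-cdk-lib/aws-kinesis";
import * as sns from "aws-cdk-lib/aws-sns";
import { KinesisEventSource, SnsDlq } from "aws-cdk-lib/aws-lambda-event-sources";

declare const stream: kinesis.Stream;       // defined elsewhere in the stack
declare const consumerFn: lambda.Function;  // defined elsewhere in the stack
declare const failureTopic: sns.Topic;      // defined elsewhere in the stack

consumerFn.addEventSource(
  new KinesisEventSource(stream, {
    startingPosition: lambda.StartingPosition.LATEST,
    retryAttempts: 5,                    // default -1: retry until the data expires
    maxRecordAge: Duration.hours(1),     // default -1: up to the stream's retention period
    bisectBatchOnError: true,            // split the batch to isolate a poison pill record
    reportBatchItemFailures: true,       // allow the handler to report partial batch failures
    onFailure: new SnsDlq(failureTopic), // send metadata about failed batches to a destination
  })
);
```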
Epilogue
We have seen that we can actually cause more trouble while trying to fix problems. This is especially true if we don't make conscious, critical, and informed decisions about error handling. Things like retries and timeouts can be very useful and very powerful, but we need to make conscious decisions about them. We need to take control, rather than just letting the matter slide. Because if we let the matter slide, things can backfire, and instead of making our architecture more resilient, we can actually achieve the exact opposite.
Next time you are building a distributed architecture, I encourage you to be brave, to face the messy reality of the real world, to take control into your own hands, rather than believing in magic. Because there's no magic, and that's a good thing. It means that we have control. Let's use that control. Distributed systems and architectures can be extremely powerful, but they are also complex, which makes them neither inherently good nor bad. The cloud, and serverless especially, abstracts away a lot of that complexity from us. That doesn't mean that the complexity doesn't exist anymore.
Again, not inherently good nor bad. We really don't need to know, or we can't even know every single detail about every single service that we are using. It's borderline impossible. There are these fundamental things that are inherent to distributed systems and the cloud in general, so things like service limits, timeouts, partial failures, retries, backoffs. All of those things are really fundamental if we are building distributed applications. We absolutely need to understand them. Otherwise, we are just moving in the dark with our eyes closed and hoping that everything will be fine.
Finally, on a more philosophical note, distributed systems and architectures are hard, but they can also teach us a very useful lesson: to embrace the chaos of the real world. Because every failure is an opportunity to make our architectures better, more resilient. While it's borderline impossible to build something that never fails, there's one thing that we can do: we can learn and grow from each individual failure. As Dr. Werner Vogels likes to say, "Everything fails, all the time". That's just the reality of things. Either in life in general or with AWS services in particular, the best thing that we can actually do is be prepared and stay calm when those failures happen, because they will.
Questions and Answers
Participant 1: How do you set those limits for throttling, for timeouts? Let's say that you know that you will have a certain load and you want to do certain performance tests or validate your hypothesis, do you have tooling or a framework for that? Also, how do you manage unexpected spikes on those systems? Let's say that you have a system that you are expecting to have 25k records per second, and suddenly you have triple that, because this happens in the cloud. How do you manage this scenario on these systems?
Pogosova: How do you set the timeouts and retries? It's not easy. I'm not saying it's easy. They do sound very simple, but you need to do a lot of load testing. You need to see how your system responds under heavy load to actually know what the appropriate numbers for you are going to be. For timeouts, for example, one practice is to try to estimate the p99 latency of the downstream system that you are calling, and then, based on that, maybe add some padding to that p99 number and set it as your timeout.
Then you probably are safe. It's not easy. I have seen this problem just a couple of weeks back. I know exactly what's going to happen, but then things just happen that you don't predict. There are libraries that have defaults that you haven't accounted for, and then all of a sudden, you just hit a brick wall, because latencies increase on the downstream system, and then you start retrying, and then things just escalate and get worse. Then basically the entire system shuts down. It's not easy. You need to do load tests. We don't have any specific tooling, per se. We have certain scripts that would just send a heavy load to our systems, and that's how we usually try to figure out what the appropriate numbers are. It's always a tradeoff.
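As a toy illustration of the rule of thumb mentioned above (take the observed p99 latency of the downstream call and add some padding on top), not something prescribed in the talk; the padding factor and sample data are made up.

```typescript
// Derive a timeout from observed latencies: p99 plus some padding.
function timeoutFromP99(latenciesMs: number[], paddingFactor = 1.5): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const p99Index = Math.min(sorted.length - 1, Math.floor(sorted.length * 0.99));
  return Math.ceil(sorted[p99Index] * paddingFactor);
}

// Example: p99 of these samples is 480 ms, so the timeout would be 720 ms.
const timeoutMs = timeoutFromP99([120, 135, 150, 160, 180, 210, 250, 320, 410, 480]);
```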
When it comes to spikes in traffic, again, really complex issue. Usually, using serverless actually is an easier way around it, like with lambda, being able to scale, for example, pretty instantaneously. It's very helpful. In case of Kinesis, if we speak about specific services, actually, the best thing you can do is just overprovision. That's the harsh reality of it. If you think that you are going to have certain spikes, and you want to be absolutely sure, especially if those spikes are unpredictable, and you want to be absolutely sure that you don't lose data. You don't really have many options. In case of Kinesis, there is this on-demand Kinesis option where you don't manage the shards, and you let AWS manage the shards, and you pay a bit differently, much more. What it does in the background, it actually overprovisions your stream, you just don't see it. That's the truth of it at the moment.
Participant 2: I was wondering whether it could be helpful to not fail the lambda in case of a batch with data that cannot be written, and instead use metrics and logs to just track that and then potentially retry on your own, separately. That way, basically do not come in this stuck situation. What do you think about that?
Pogosova: First of all, of course, you need to have proper observability and monitoring. Especially in distributed applications and serverless, it becomes extremely important to know what's happening in your system. Definitely, you need to have logging and such. Also, there are certain metrics nowadays that will tell you, in some of those scenarios, that there is something going on, but you need to know that they exist beforehand, because there are tons of metrics. You usually don't know what to look at. When it comes to retries in lambda, as I said, there are a lot of options that you can use, and one of them is sending the failed requests to so-called failure destinations, so basically to a queue or a topic, and then you can come back and reprocess those requests.