Transcript
Weiss: My name is Tal Weiss. Today, I'll be talking with you for a bit about advanced techniques for serverless debugging. I'm the founder at OverOps, where we help our customers deliver continuously reliable code across the entire software delivery lifecycle, all the way from testing into production. Before that, I was a co-founder at VisualTao, and before that, a software architect at IAI Aerospace, where I worked on real-time satellite tracking control arrays. For the past 20 years or so, I've been focusing on C++, Java, and .NET in real-time, high-scale, high-complexity applications and environments. I blog a lot at overops.com, where you can learn more about the information I'll be sharing with you guys today.
Serverless Debugging
Serverless debugging. We know that serverless architectures open up a whole new world of opportunity for us. We're not going to expound too much on that. With that ease of operation and of scale comes a loss of control. That control is needed badly, especially when things go bad. A bug, in the context I'm going to be talking about today, means whenever our code or third-party code is processing unexpected state. Meaning, it has to operate on information, provided to it or read from an external source, that it was not designed to handle, or it has to do so at an unexpected time: too late, too early. That can cause things like race conditions, deadlocks, performance issues. All these bugs will manifest as what I fondly call the four horsemen of serverless debugging.
The Four Horsemen of Serverless Debugging
The first one is that a function terminates with an unhandled exception or error. The second one, which is more subtle, is that our function encounters an internal logical error, meaning it's unable to proceed correctly with the information or data state that it has, so it exits or completes unsuccessfully. That really opens the door for number three, which is a future error. Our function may not have encountered a logical error or terminated with an exception, but its inability to process state correctly, advertently or inadvertently, corrupted the state of the system in a way that will create a future error for another function down the line, which we will have to backtrace to understand what the original sin was. That's a complex one. Number four is whenever a function does not meet its performance SLA, which could be a slowdown, or it completely times out, in which case it will actually be terminated by the environment itself. Again, that could be a result of its own internal failure or of information corrupted by a previous function's execution.
Debugging in a Serverless World
Traditionally, the tools that we would use as developers, as architects, to really troubleshoot these kinds of issues were things such as profilers, stack dumps, core memory dumps if needed, agents, just APM agents, or even debuggers. I'm using Java examples here, such as the JVMTI interface or the Java Debug Wire Protocol (JDWP). These are true regardless of whether you're using Java, .NET, Python, Go, or any other language or platform of your choice. Tools are available to us. Again, these tools may be activated in a serverless environment. It is possible, but it will require a lot more effort, and there are a lot fewer of them available. The tools that are more readily available to us are more manual ways of essentially capturing state from our application through either logging, metrics, some proprietary tools that may be built into the environment (with AWS Lambda, X-Ray is a good example), or third-party or commercial tracing frameworks.
Guardian Angel
What is it that we want to have at our disposal whenever critical issues take place? There are five things we want to have available to us. First of all, we want to know which other functions executed potentially before us. What was their internal state? How long did it take them to execute? Because the information provided to us, even at the point of entry into our function, may already be corrupt. We need to understand, who sent that to us and why? We want to have that chain. Then we also want to have essentially all the inputs provided to us. Were there any previous functions? Any debug variables or statements captured during execution that will help us understand exactly what was going on? Execution times. Also, a complete breakdown of all the stack traces and names of the threads that were executing within our function whenever we had a critical error, or, more relevant in that example, whenever we have a timeout or a bug.
What I want to show today is a very effective framework, we like to call it the Guardian Angel framework, that is very applicable to the world of serverless computing, that you can apply to any language or platform of your choice, and that will really make it easy for you and make you more effective in your ability to capture deep state from within your serverless function when it fails, to be able to really track down and understand the cause of each one of those four horsemen, if you will.
Example
We're going to jump into an example that I had prepared for you guys in advance. We have here a Lambda function that is pretty trivial. Meaning, we get a request to execute as an AWS Lambda function. This could be Java, this could be Go, this could be AWS, this could be Azure; I'm choosing a standard example, if you will, that's going to operate on this request. We're going to parse some inputs from it, pretty simple stuff. Then if those inputs are invalid and we can't move forward, we're going to log this as a failure, which is, on one hand, easy. On the other hand, we have no idea as to why we received bad inputs. That can potentially be an issue. That's one of the things we're going to show you how to tackle. Then we start processing data. Usually, what a function will do is read information from an external source, a queue, a data stream, a database, operate on it, and return or store those results for future consumption. In this case, we're just operating on our output and reading this information. Here, all we're doing is randomizing numbers, whereas, as I said, it would usually be a database or a queue, until we hit a certain result.
This is tricky. A lot of people can sometimes forget that random can be zero, which means this can actually terminate with a division by zero; it terminates the entire function. Of course, this is pretty trivial. In a real-life scenario, we could be reading an invalid value, an empty string, some bad, negative, whatever value from the database, we aren't prepared for it, and we crash. This leads to that first horseman of the apocalypse, which is the crash. That also leads us to number three and number four. Number three, I've already alluded to, which is what happens if the input that is negative here, like a minus one, was produced by a previous function's execution. Or also, what happens if it takes us a very long time to produce the result that we require. Essentially, that means we're either going to exceed our desired time to execute, which could be whatever we've committed to our users for our application, or, even worse, we're going to hang indefinitely to the point that our AWS Lambda or Azure environment would terminate us. I think AWS has a 15 minute max, but probably defaults to something like 5 minutes. If you're deadlocked, if you're waiting for something that's not yet ready from a timing standpoint, you're going to time out.
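To make that concrete, here is a minimal Java sketch of the kind of naive handler being described. The Request and Response classes, the field names, and the random-number loop are placeholders of mine, not the code from the talk; the point is only the unguarded division and the silent logging of bad inputs.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.util.Random;

// Hypothetical request/response shapes, just so the sketch compiles.
class Request {
    String input;
}

class Response {
    int result;
    boolean success;
}

public class NaiveFunction implements RequestHandler<Request, Response> {

    private static final Random RANDOM = new Random();

    @Override
    public Response handleRequest(Request request, Context context) {
        Response response = new Response();

        // Parse and validate the inputs. If they're bad we log a failure,
        // but we have no idea *why* they were bad or who produced them.
        if (request == null || request.input == null) {
            System.err.println("Bad inputs - failing");
            response.success = false;
            return response;
        }

        // Keep "reading" values until we hit the result we're looking for.
        // In real life this would be a database or a queue, not a Random.
        int total = 0;
        while (total < 100) {
            int value = RANDOM.nextInt(10);

            // Easy to forget: value can be zero (or, in real life, an empty
            // string or negative number we weren't prepared for), and this
            // division terminates the whole function with an unhandled
            // ArithmeticException. A loop like this that never reaches its
            // target is also exactly how the timeout horseman shows up.
            total += 100 / value;
        }

        response.result = total;
        response.success = true;
        return response;
    }
}
```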
At that point, either one of those can potentially be very hard for us to debug, to understand what exactly was going on, especially when we're talking about a potential activation of thousands and millions of functions, and asking why did this one fail? What was the state at that point within that distributed process that brought about this type of failure? As you see here, we don't have a lot going for us other than a debug statement. That also assumes that we can actually execute debug statements in production, which a lot of the time is not the case. This may not even be available to us. There's not a lot going on here for us. This is what I want to show today, which is that Guardian Angel framework.
First things first. This function that we have here, I've derived from another base class, a base Lambda function that I have. That base function is actually the one that gets executed by AWS, and it makes a call to the virtual method we override. Here, things start becoming a bit more interesting. The first thing that we do is catch all Throwables to make sure we log them as a failure, so that no matter what, if we're terminating with an exception, we have the ability to capture state. The same is true with what we saw here with bad inputs. We're going to dive in and see what it means to log it: what information are we actually going to be able to capture, more so than "something bad happened". We have that. Then here's something pretty interesting that happens: we see the instantiation of a parallel thread, what's called a Guardian Thread. We instantiate it, we let it run in the background, and only then do we actually go about doing the work that we need to do from a business logic standpoint.
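As a rough illustration of that structure, here is a minimal base-class sketch. The names (BaseLambdaFunction, doHandleRequest, logFailure) are assumptions of mine, not the framework's actual code, and the GuardianThread it starts is sketched after the next paragraph.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

// AWS invokes handleRequest on the base class, which starts the guardian
// thread, delegates to the subclass override, and catches every Throwable
// so deep state can be captured before the function dies.
public abstract class BaseLambdaFunction<I, O> implements RequestHandler<I, O> {

    @Override
    public final O handleRequest(I input, Context context) {
        // Let the guardian run in the background before any business logic.
        GuardianThread guardian = new GuardianThread(context);
        guardian.start();

        try {
            // The virtual method the concrete function overrides.
            return doHandleRequest(input, context);
        } catch (Throwable t) {
            // No matter what goes wrong, capture state before rethrowing.
            logFailure("Unhandled error", input, context, t);
            throw t;
        } finally {
            guardian.shutdown();
        }
    }

    // The actual business logic lives in the subclass.
    protected abstract O doHandleRequest(I input, Context context);

    // Where the state-capture sketch shown later would be invoked.
    protected void logFailure(String reason, I input, Context context, Throwable t) {
        System.err.println(reason + ": " + t);
    }
}
```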
Let's look at the thread and see what it does. This is the execution loop. What we see here is this: we're going to run in the background as long as we can, and we're going to check for the following conditions. Are we about to completely time out? Is the time allotted for the execution of this function less than a certain threshold? In this case, I take a factor of 5 seconds, which means if we're 5 seconds away from being terminated by our serverless operating system, we are flying too close to the sun. Meaning, we may complete in time, but there's a very high likelihood, and I'll bet my money on it every day, that if this function is finishing at 58 seconds out of 60 every time, if you're cutting it that close, we're about to time out. If that happens, or if we exceed the time that we've allotted ourselves to execute, and this is more of a business thing, like how much time are we willing to spend here, 10 seconds, 10 minutes? A lot of times, as developers, we don't have an answer to that. It's more like, is it fast enough? My advice to our customers dealing with those kinds of situations is to set just a default logical timeout, which is lower than what AWS or Azure will terminate you for, of a minute. Then see over time, does this function only need 5 seconds, in which case, with time, you can actually reduce this. Or it needs a bit more time, in which case you can go from 1 minute to 2 or 3 minutes, and you can read that from your configuration file. Have some form of definition, so you can know how you're doing against this baseline.
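Here is what such a guardian loop might look like. The 5-second safety margin comes from the talk; the 60-second logic budget, the polling interval, and the class and method names are assumptions of mine.

```java
import com.amazonaws.services.lambda.runtime.Context;

// A background watchdog: as long as the function is running, keep checking
// whether we are about to be killed by the platform or have blown our own
// logical time budget, and capture state before that happens.
public class GuardianThread extends Thread {

    private static final long SAFETY_MARGIN_MS = 5_000;   // how close to the platform kill we allow
    private static final long LOGIC_TIMEOUT_MS = 60_000;  // our own business-level time budget
    private static final long POLL_INTERVAL_MS = 500;

    private final Context context;
    private final long startTime = System.currentTimeMillis();
    private volatile boolean running = true;

    public GuardianThread(Context context) {
        this.context = context;
        setDaemon(true);
        setName("guardian-" + context.getAwsRequestId());
    }

    @Override
    public void run() {
        while (running) {
            long remaining = context.getRemainingTimeInMillis();
            long elapsed = System.currentTimeMillis() - startTime;

            // Objective timeout: the platform is about to terminate us.
            boolean aboutToBeKilled = remaining < SAFETY_MARGIN_MS;

            // Subjective timeout: we've exceeded our own logical budget.
            boolean overLogicBudget = elapsed > LOGIC_TIMEOUT_MS;

            if (aboutToBeKilled || overLogicBudget) {
                logState("Timeout approaching");
                break;
            }

            try {
                Thread.sleep(POLL_INTERVAL_MS);
            } catch (InterruptedException e) {
                break;
            }
        }
    }

    public void shutdown() {
        running = false;
        interrupt();
    }

    private void logState(String reason) {
        // Emit the inputs, debug variables, buffered debug statements, and
        // all thread stack traces - see the state-capture sketch below.
    }
}
```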
If we're going to time out because of either one of those conditions, we're going to break out of this loop, and we're going to log state. Another thing that we're going to do: there's another condition here that might cause us to log state, something to do with the sampling factor. What we've done here is we've seen that in the case of a potential timeout, we are going to log some state. Let's see, what is it that we're actually capturing? So far, we're going to do this if we're timing out, subjectively or objectively, meaning we're about to be terminated. We're going to do this if we declare a logical failure. We're going to do this also if we're essentially crashing with an exception. Let's see what state we can capture, more effectively.
You've seen me log some information previously here. This information might not be available to us. One of the biggest problems with debugging in production in general is that we want to run a lot of logging at a high verbosity level, but from a pure ops standpoint, we really can't, because we're just going to overwhelm our logging system. This is actually a pretty powerful mechanism that we have employed. It looks, from the outside, just like any standard Java logger, and you can do this with any logging framework. What we've done here is we've actually implemented our own debug logging appender, so that every time an event is logged, if that event is a debug event, we're going to store it in-memory. We're not going to emit it to disk straight away. If it's an info, or a warn, or an error, or a failure, it will, of course, go to disk. The debug events we're going to capture in-memory. This can be a limited cyclical buffer of the last 100 or 200 statements, so especially in a loop, if you're doing a lot of short-circuit debug logging, that's going to be captured in-memory. Only when there's an actual failure do we essentially emit that to our log stream; only when we're crashing, really failing, or timing out will we emit that to disk. There's a caveat to that, because you could say, but what if I don't know that I'm failing? I'm producing some valuable debug information, and essentially I'm going to be causing a failure down the road, but the following or subsequent function doesn't know about that.
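Here is a sketch of that appender idea using the log4j 1.x API the talk mentions. The class name, buffer size, and the use of System.out as a stand-in for the real log stream are my assumptions; an actual implementation would delegate to the Lambda log appender.

```java
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.spi.LoggingEvent;

import java.util.ArrayDeque;
import java.util.Deque;

// DEBUG events go into a bounded ring buffer in memory; anything INFO and
// above passes straight through; flush() is only called when the function
// crashes, fails logically, times out, or is picked by the sampling factor.
public class InMemoryDebugAppender extends AppenderSkeleton {

    private static final int MAX_BUFFERED = 200;

    // Cyclical buffer of the last N debug statements for this execution.
    private final Deque<String> buffer = new ArrayDeque<>();

    @Override
    protected synchronized void append(LoggingEvent event) {
        String line = getLayout() != null
                ? getLayout().format(event)
                : event.getRenderedMessage();

        if (event.getLevel().isGreaterOrEqual(Level.INFO)) {
            // INFO / WARN / ERROR go out as usual.
            System.out.println(line);
            return;
        }

        // DEBUG is held in memory, not written to the log stream.
        if (buffer.size() >= MAX_BUFFERED) {
            buffer.removeFirst();
        }
        buffer.addLast(line);
    }

    // Called by the guardian framework only on an actual failure.
    public synchronized void flush() {
        for (String line : buffer) {
            System.out.println("[replayed debug] " + line);
        }
        buffer.clear();
    }

    @Override
    public void close() {
        buffer.clear();
    }

    @Override
    public boolean requiresLayout() {
        return false;
    }
}
```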
First of all, we're going to log all the request information that was passed on to us. I like to Gson-ize data, but any form of serialization will do. Then we're going to capture debug state, which is another powerful mechanism, and then the logged debug statements. Then, last but far from least, we're going to capture all the threads that are running within our Lambda function. We're going to essentially emit their stack traces, and also their stateful thread names. I implore you, in production, to always make sure to provide meaningful naming to your threads and continuously update that state as the thread executes. It will show up great in any form of debugger or profiler that you use, and it will also be captured here. Thread names are mutable; they are meant to be consumed. We're going to emit that here. We have a pretty decent debugging payload: all the inputs that were passed on to us, all the debug statements, debug variables, and all the call stacks of all the threads that were executing, and the thread names. With that we can really go to town when it comes time to debug.
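Pulling those pieces together, the state captured on a failure might look something like this sketch. The parameter shapes and the plain-text formatting are assumptions of mine; the ingredients (Gson-serialized inputs, debug variables, buffered debug statements, and every thread's name and stack trace) are the ones listed above.

```java
import com.google.gson.Gson;

import java.util.List;
import java.util.Map;

// Builds the debugging payload emitted when the function crashes, fails
// logically, or times out.
public class StateCapture {

    private static final Gson GSON = new Gson();

    public static String captureState(Object request,
                                      Map<String, Object> debugVariables,
                                      List<String> debugStatements) {
        StringBuilder state = new StringBuilder();

        // 1. All the inputs that were passed to us, serialized as JSON.
        state.append("request: ").append(GSON.toJson(request)).append('\n');

        // 2. Debug variables recorded during execution (the take-with tuples).
        state.append("debug variables: ").append(GSON.toJson(debugVariables)).append('\n');

        // 3. The in-memory debug statements held by the debug appender.
        state.append("debug statements: ").append(GSON.toJson(debugStatements)).append('\n');

        // 4. Every thread's stateful name and its full stack trace.
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            state.append("thread: ").append(entry.getKey().getName()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                state.append("    at ").append(frame).append('\n');
            }
        }

        return state.toString();
    }
}
```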
Debug variables, I touched a bit on that. One of the biggest issues that you have with Lambda execution especially, or any form of asynchronous code execution, but Lambdas really exacerbate this, is what happens if you have some internal declared or undeclared failure in function A, which calls to B and C, and then it's only at D where we try to consume the results of function A. We don't get what we want, and we fail. At that point in time, we want to have state that tells us exactly what happened across A, B, and C, so we can understand: it wasn't B, it wasn't C that was the culprit, it was A, and this is what happened. We need historical state.
The Leave-behind Method (Classical Logging)
There are two methods by which you can capture that information. I call them take-with and leave-behind. Leave-behind is classical logging. You leave these crumbs behind you that tell you: this is what the state was, this is what was happening, it was executed here and here. You will need a transaction ID, something to tie it all together. You're going to have to put that in a place so that when function D fails, you can say, which specific thread of execution from A, B, C, and D, out of the millions that I have, is the one that's causing this failure? Then I need to hope that debugging verbosity was turned on. That's option number one. The good thing about this is that it has all the advantages of debugging. We know it. That's method number one.
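For the leave-behind crumbs themselves, one common way to stamp every log line with the transaction ID is log4j's mapped diagnostic context (MDC). The talk doesn't prescribe this particular mechanism, so treat it as one possible sketch rather than the framework's approach.

```java
import org.apache.log4j.Logger;
import org.apache.log4j.MDC;

// Every log line emitted inside process() carries the transaction ID,
// which is what lets you pull one thread of execution out of millions later.
public class LeaveBehindExample {

    private static final Logger LOG = Logger.getLogger(LeaveBehindExample.class);

    public static void process(String transactionId, String payload) {
        MDC.put("txId", transactionId);
        try {
            LOG.debug("starting work on payload of length " + payload.length());
            // ... business logic ...
            LOG.debug("finished work");
        } finally {
            MDC.remove("txId");
        }
    }
}
```

With a pattern layout that includes %X{txId}, the ID shows up on every emitted line, giving you the "crumb trail" across A, B, C, and D.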
To essentially overcome the situation where we did not have debug statement logging in A, B, or C, because we didn't even know we were failing, this is what you saw here within the Guardian Thread, which is called the sampling factor. We take the transaction ID, which is a prerequisite if you want to be able to trace anything. I assume here it's a string, but it can be an integer, anything, any persistent structure, and I hash it. I'm using simple Java hashing; it is pretty cookie cutter. In production, because this is a simplified example, I recommend MurmurHash by Google. It's great. It'll give you a 64-bit long. You can put any byte buffer in, and it will give you a really good spread. Then, essentially, I sample by a factor: say for 1 out of every 10, 100, or 1,000 executions of this function, for a specific transaction ID, I will capture state. That means that even if 1% of our functions are failing, if I'm able to find one transaction ID that was caught in this sampling factor, assuming a pretty good distribution across your transactions, which is usually what happens, I'll get the entire debug trace, the debug logging across that execution, across that distributed workflow or transaction. That can be at only 10%, or even 1%, of what would be the normal, untenable debug logging verbosity. You're able to activate debug logging just 1% or 10% of the time, yet guarantee that when a transaction fails, if it statistically hits within that bucket of failed transactions and you're able to find one incident, you get the entire trace. That's a great guarantee. That's one thing that the Guardian Thread does for us. Here, you're going to get essentially all the input params, the debug variables, [inaudible 00:19:51], debug statements, all the threads that were running, so you're going to get everything that was happening at each station along the way, 100 times more debug information. That is one method, the leave-behind.
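The sampling check itself can be as small as this sketch: hash the transaction ID and capture state for roughly 1 out of every N transactions. The factor of 100 and the names are assumptions of mine; the talk uses plain Java hashing in the simplified example and suggests MurmurHash for production.

```java
// Deterministic sampling: the same transaction ID always gives the same
// answer, so every function along that transaction makes the same decision.
public class SamplingFactor {

    private static final int SAMPLING_FACTOR = 100;   // capture ~1% of transactions

    public static boolean shouldCaptureState(String transactionId) {
        // Simple Java hashing, as in the simplified example. For a better
        // spread over arbitrary byte buffers, swap in a 64-bit MurmurHash.
        int hash = transactionId.hashCode();

        // Math.floorMod avoids the negative-modulo surprise with hashCode().
        return Math.floorMod(hash, SAMPLING_FACTOR) == 0;
    }
}
```

Because every station hashes the same ID, a sampled transaction gets its full distributed debug trace, which is what makes the 1% guarantee work.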
The advantage: you're using the debug logging that you like, and you don't have to change your code. The disadvantage: you're working with a sampling factor. One thing, though, about the sampling factor: here it's a constant, because, again, this is a very simplified example of the framework. In real life, you'd probably make that configurable. You can say, if something's going on, I want to increase that or reduce that. Or, for specific key transactions, I can operate at a higher frequency than for less important transactions. You're going to have a lot of tools at your disposal in order to be able to play with this thing.
The Take-with Method
I'm going to show you another strategy, which is also very effective. That's the take-with. If you saw here what I was doing with this counter: I want to know exactly how many times, and what the states of the count were. How many iterations did it take me to essentially get to the output that I was looking for? This is something that, from a business standpoint, may not be interesting to the user, or to the consumer of the information, but from a debug standpoint, it's something that I may need. You see I'm doing this response.debug. Essentially, all it does is log this tuple of name and value into the debug values, this synchronized list of values that I have, which essentially documents, from my function: this is where I started, this is my function name, here are all my debug values.
What's neat about this structure is that whenever I create my response, I automatically load into it all the previous states. What's going to happen is you're going to get a linked list of all the debug states at stations A, B, and C, so when you get to D, you'll already have, in-memory, this moving call stack with you that is going to show you exactly what the intermediate states of A, B, and C were, and you're not depending on a sampling factor. You're not depending on log verbosity. You don't have to search for logs. You have all that information. If you do time out, crash, or incur a logical failure, all of those states, when we look here, are going to be captured for you by the Guardian Thread, so you can see all those debug states. With the leave-behind method, we were logging and then only emitting that to disk at a specific sampling factor rate, which was derived from a transaction ID. There was a dependency there.
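A minimal sketch of that take-with structure, with class and field names that are assumptions of mine: each function records name/value tuples on its response, and a new response loads the previous function's debug values, so by station D you are carrying the intermediate states of A, B, and C with you.

```java
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

// A response object that accumulates debug state as the workflow moves
// from function to function across the serverless architecture.
public class FunctionResponse {

    public static class DebugValue {
        final String function;
        final String name;
        final Object value;

        DebugValue(String function, String name, Object value) {
            this.function = function;
            this.name = name;
            this.value = value;
        }

        @Override
        public String toString() {
            return function + "." + name + " = " + value;
        }
    }

    private final String functionName;

    // Synchronized list of debug tuples, carried along with the payload.
    private final List<DebugValue> debugValues =
            Collections.synchronizedList(new LinkedList<>());

    public FunctionResponse(String functionName, FunctionResponse previous) {
        this.functionName = functionName;
        if (previous != null) {
            // Take the debug state of the previous stations with us.
            debugValues.addAll(previous.debugValues);
        }
    }

    // The response.debug(...) call described in the talk.
    public void debug(String name, Object value) {
        debugValues.add(new DebugValue(functionName, name, value));
    }

    public List<DebugValue> getDebugValues() {
        return debugValues;
    }
}
```

Function B, for example, would create its response as new FunctionResponse("B", responseFromA) and then call response.debug("iterations", counter), so D's failure report contains every station's counters.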
With the take-with approach, essentially, we're carrying that information with us. We might even persist that into our database or into our queues, as that workflow is making its way across the serverless architecture. When we get to a point of failure in the future, and again, when loading debug values into a function, I may not even be failing at this point, and 99% of the time I'm not going to fail. When I do fail, at the terminal point, I will have that information available to me. That could be worth its weight in gold. That is a pretty powerful example. Those are the two methods. As for the debug log appender, the way by which I configure my AWS Lambda to receive that is with a simple log4j XML file. Again, every framework is going to have its own way of doing it. I'm overriding the standard log4j 1 appender with my debug log appender; then everything that's going into the main bus, I'm just going to take straight into the AWS Lambda log appender, but otherwise, keep it in-memory. That's the technique by which I perform this.
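The talk wires this up with a log4j XML file; to keep the sketch in Java, here is the equivalent programmatic wiring, reusing the InMemoryDebugAppender sketched earlier. The idea is the same: the root logger runs at DEBUG, and the appender decides what actually reaches the Lambda log stream.

```java
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

// Programmatic equivalent of the log4j 1.x XML configuration described:
// register the in-memory debug appender on the root logger so every DEBUG
// event is buffered while INFO and above flow through as usual.
public class LoggingSetup {

    public static void configure() {
        Logger root = Logger.getRootLogger();
        root.setLevel(Level.DEBUG);
        root.addAppender(new InMemoryDebugAppender());
    }
}
```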
Summary
We're talking about our ability to deal with the four horsemen: termination, logical failures, future errors, performance SLAs or timeouts. We're talking about the information that we would want to have when those essentially take place: inputs, debug variables, execution times, stack traces, threads.
We'll post the links to both the code and the presentation itself up on GitHub, so you can consume them later. Feel free to reach out to me with any questions about this, how to implement it with your own language, or any edge cases that are specific to you.