Transcript
Thwaites: My name is Martin Thwaites. I'm first and foremost an observability evangelist. My approach to building systems is about how we understand them, and how we understand them in production. I also work for a company called Honeycomb. A couple of years before I worked for Honeycomb, I was contracted to a bank. I was brought in to start looking at a brand-new system for the bank: core banking systems, ledgers, all of that stuff, writing them from scratch, which is a task. As we started building it, we came to understand that the thing that was really important to them was correctness at the system and service level. They didn't really care about code quality. They didn't care about database design. All they really cared about was that the system was correct. What that came down to was us realizing that the problem we were actually trying to solve was that production returned the right data, and that was a little bit liberating.
Testing (Feedback Loops)
Some people may recognize this diagram. This was a 2020 article that Monzo Bank put out. That wasn't the bank I was working with. This is a view of their microservices, all 1500 of them. When they put it out, it caused a little controversy. I think there were a lot of developers out in the world who looked at this with fear of, I struggle to keep the class that I'm developing in my head, never mind all 1500 microservices. I've been in this industry for a while, and the system I started working on had a box and a cylinder. There was a person, they hit my website, and that website loaded things from a database and put stuff back in a database, which was incredibly easy to reason about.
That was really hard for me to get my head around when we started thinking about large microservices systems, where we've got lots of independent systems that do independent little things. It's hard to equate them to a time when I could write a test against my production system, my blue box in this scenario, where it would test a couple of classes, and I would feel a lot of confidence that my system was going to work, because I could run the entire thing locally. Can you imagine trying to run that on your machine? I'm pretty sure that an M3 Max Pro thing won't do it either. You just can't.
The issue is that now our systems look a little bit more like this. We have some services. Users may hit multiple services. You may have an API gateway in there. This is not meant to be a system that I would design. This is meant to be a little bit of a pattern that we're going to talk about. We've got independent systems. Some of them are publicly accessible. Some of them aren't. We've got some kind of message bus that allows us to be able to send things between different systems asynchronously. We've got synchronous communication, asynchronous communication, all those things in there. A little bit of a poll.
How many people identify as an application developer, engineer, somebody who writes code that their customers would use and interact with, as opposed to maybe a DevOps engineer or an SRE? This is a talk about testing. It's really a talk about feedback loops, because testing is a feedback loop of how we understand our systems. If we were to take a developer, the fastest feedback loop that we have is a linter. It will tell us whether our code is formatted correctly, whether we're using the right conventions. It will happen really fast. It happens as you type. You don't really get much faster than that. You're getting really rapid feedback to know whether you're developing something that is correct.
The next time we have feedback is the compiler, if we're running in a compiled language. We will compile all of those different classes together, and it will tell us, no, that method doesn't exist on that class. Again, that's really fast, depending on what language you're using and depending on how bad the code is. It is a fast feedback loop. We're up to seconds at this point. The next feedback loop that we have is developer tests, the tests that you write as a developer against your application. I'm specifically not mentioning two types of testing that would fall into that bracket, because they cause controversy if you call one thing another thing, and another thing another thing. We'll come on to what those are.
Then we've got our developer tests. We're talking now maybe multiple seconds, maybe tens of seconds, depending on how big your tests are, how many tests you have running, and how fast those tests run. We're still on the local machine here. We're still able to get some really rapid feedback on whether our system is doing what we've asked it to do. Because, let's face it, computers only do what we ask them to do. There aren't bugs. There are just things that you implemented incorrectly, because computers don't have a mind of their own yet.
The next thing we have is, I'll call them end-to-end tests. There's lots of different words that you might use for these, but where we test a user flow across multiple different services or systems to ensure that things are running properly. This is where we add a lot of time. These are tests that can take minutes and hours to run, but we're still getting feedback in a fast cycle. From there, we have production telemetry. I told you I'll mention observability.
We have production telemetry, which is telling us how that system is being used in production, whether there are errors, whether things are running slow. Again, this is a longer feedback loop. Each of these goes longer. Then, finally, we have customer complaints, because that is a feedback loop really. Customers are very good at telling you when things don't go right, they just don't do it fast enough for me. We've got all of these feedback loops, but they all take longer. Each different one of these is going to take longer. It's not something you can optimize. It is something that's just going to take longer.
Now we get into something that's a little bit more controversial when we talk about developer tests. A lot of people would see developer tests as methods, testing each individual method that we've got in our system. We might call these unit tests, coded tests. There's lots of different words that you might use for them. Beyond that, we go a little bit further out, and we have classes, an amalgamation of multiple different methods. There are some people who call these integration tests. I do not agree, because integration test is not a thing. It's a category of a thing. If we go even a bit further out than that, we've then got things like controllers and handlers, if we're using messaging contracts or CQRS, that will bring together multiple classes and test them together as a bit of functionality. Then we can go even a further step beyond there.
Then we've got API endpoints, messages, events, the external interactions of our system. Essentially, the connection points into our system. Because, ultimately, on the outside there, that's the only thing that matters. We can write tests at the method level. I would imagine that all those people who put their hand up before have likely got a class inside of their code base, or a method inside of their code base, that has way more unit tests than it needs. They've spent hours building those tests. If you go and talk to them about it, it is the pinnacle of developer excellence, this class. With all of these unit tests, it's never going to fail. I come from a world where we've got things like dependency injection. What will happen is, you'll do that class, you'll make it amazing.
Then you'll try and run the application and realize, I didn't add that to dependency injection, so nothing works now. I normally ask the question of how many people have deployed an application and realized they hadn't added it to dependency injection. Now I just assume that everybody has, because everybody puts their hand up, because we've all done it. Because the reality is, when we're writing tests at that level, we're not really testing what's useful. It's my favorite meme that exists. I'm sure both of those drawers worked, but as soon as we try and pull them out together, they don't. It doesn't matter how many tests you write against your methods and your classes and even your handlers and your controllers, because the only thing that matters is those connection points at the top. No CEO is going to keep you in your job when you say, all of my unit tests passed, I didn't mean for us to lose £4 million.
Getting back to our system, if we look at this example system. We've got three services that a customer can hit. We've got one service which is essentially fully asynchronous. It has an internal API call. The red lines on here are API calls, the blue lines are messages, not wholly important for what we're doing. The thing that's important on this diagram is these things. These are our connection points. I'll say it again, these are the only things that matter. Inside those blue boxes, nobody cares. Nobody cares if you are writing 400 classes, one class. Nobody cares that you've got an interface for every single class that you've created. Nobody cares about your factories of factories that deliver factories that create factories. Because, ultimately, unless those connection points that we're looking at here work independently and together, nobody will care.
Small little anecdote from the bank: we based everything on requirements. If we got a requirement for something, we'd write the tests and we'd implement them. We get requirements like this, for instance: the product service's list products endpoint should not return products that have no stock. A very valid requirement. Because it's a valid requirement, it's something that you could write a test for. In this diagram, those two connection points on that service are the only two things that matter. You could spend hours building classes internally, or you could spend 20 minutes building one class that does it. You could spend ages building unit tests around that class or those classes, but unless those two endpoints work, you haven't done the work.
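As a rough sketch of what a test for that requirement might look like, assuming a hypothetical in-memory harness (ProductApiFixture) in the style of ASP.NET Core's WebApplicationFactory, with made-up routes and DTOs rather than the bank's real code:

    using System.Collections.Generic;
    using System.Net.Http;
    using System.Net.Http.Json;
    using System.Threading.Tasks;
    using Xunit;

    // ProductApiFixture and ProductDto are hypothetical stand-ins for your own
    // test harness and response type.
    public class ListProductsTests : IClassFixture<ProductApiFixture>
    {
        private readonly HttpClient _client;

        public ListProductsTests(ProductApiFixture fixture) => _client = fixture.CreateClient();

        [Fact]
        public async Task ListProducts_DoesNotReturnProductsWithNoStock()
        {
            // Arrange: drive state through the public contract, not the database.
            await _client.PostAsJsonAsync("/products", new { Id = "in-stock", Stock = 5 });
            await _client.PostAsJsonAsync("/products", new { Id = "no-stock", Stock = 0 });

            // Act: hit the connection point the requirement is written against.
            var products = await _client.GetFromJsonAsync<List<ProductDto>>("/products");

            // Assert: only the requirement, nothing about classes or schema.
            Assert.Contains(products!, p => p.Id == "in-stock");
            Assert.DoesNotContain(products!, p => p.Id == "no-stock");
        }
    }

The point is that the test reads like the requirement, and nothing in it would change if the internals were rewritten.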
Then there's another requirement, on a different service: the warehouse service should emit messages that contain the new stock level for a product whenever an order is placed. Again, we've got some contracts. These are two independent services. If we have to build them and deploy them together, they're not microservices. They are independent services. They emit messages. They receive messages. They take API calls, they emit messages. They're to be treated in isolation, because those are the contracts that we've agreed to. Those contracts can't change, because if they change, the consumers need to change.
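As a purely illustrative sketch, the contract being agreed might be nothing more than message shapes like these; the names and fields are assumptions, not the real warehouse schema:

    using System;

    // Hypothetical message contracts. The shape is the thing consumers depend on,
    // so it must never change unknowingly.
    public sealed record OrderPlaced(Guid OrderId, string ProductId, int Quantity);

    public sealed record StockChanged(string ProductId, int NewStockLevel, DateTimeOffset OccurredAt);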
Test Driven Development (TDD)
The thing is, ultimately, this is actually a talk about TDD. It's not a talk about testing in general. It's actually talking about the TDD workflow, because TDD is not about unit tests. It's not about testing methods. It's not about testing classes. It's not about whether you use JUnit or xUnit, or whatever it is that you're using for it. That isn't what makes it a unit test. That isn't what makes it TDD. It might make it a unit test depending on how you define a unit, which, between the UK and America, units are different, apparently. Ultimately, TDD is about a workflow. We write a test. We write a test that fails. We write an implementation for that test that makes it pass. Then we refactor, and then we just continue that cycle. If we go back to what I talked about with requirements, ultimately, we're getting requirements from the business, or from internal architects, or we've designed them ourselves to say, this is the contract of our system. What we've done is we've generated a requirement, which means we can write a test for that requirement.
At the bank, what was really interesting was we got really pedantic about it, which is what I love about it, because I'm a really pedantic person, really. We had this situation for around six months, I think it was, after we went live, obviously, where you could spend more than your balance, which apparently is a bad thing in banks. We had the BA come to us and say, "There's a bug in the platform". He said, "You can spend more than your balance". I'm like, "I'm so sorry. Can you point me to the requirement so I can put it on the bug report?" He was like, "There wasn't a requirement for it. We just assumed that you'd build that".
The reality was, because there was no requirement for it, we hadn't built a test. We hadn't built any tests around that. We'd built something because they said we need to get the balance. We'd built something because they said we need to be able to post a transaction. We hadn't built anything that said, throw an exception, throw a 500 error, throw a 409 error, throw some kind of error when they spend more than their balance. What that meant was they had to tell us exactly what error message should be displayed under what conditions, and we had to validate that that error message was returned under those conditions. I'd like to say they eventually came on board; they didn't. It was a constant battle.
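A minimal sketch of the kind of test that requirement forced, assuming a hypothetical in-memory ledger harness (LedgerApiFixture), and with an illustrative route, status code, and error code; the real ones were whatever the BA finally wrote down:

    using System.Net;
    using System.Net.Http;
    using System.Net.Http.Json;
    using System.Threading.Tasks;
    using Xunit;

    // LedgerApiFixture and ErrorDto are hypothetical stand-ins, not the bank's code.
    public class OverspendTests : IClassFixture<LedgerApiFixture>
    {
        private readonly HttpClient _client;

        public OverspendTests(LedgerApiFixture fixture) => _client = fixture.CreateClient();

        [Fact]
        public async Task SpendingMoreThanTheBalance_ReturnsTheAgreedError()
        {
            // Arrange: build up the balance the way production does, through the API.
            await _client.PostAsJsonAsync("/accounts/acc-1/transactions", new { Amount = 100m });

            // Act: try to spend more than the balance.
            var response = await _client.PostAsJsonAsync("/accounts/acc-1/transactions", new { Amount = -150m });

            // Assert: the exact behaviour the business specified, under those conditions.
            Assert.Equal(HttpStatusCode.Conflict, response.StatusCode);
            var error = await response.Content.ReadFromJsonAsync<ErrorDto>();
            Assert.Equal("INSUFFICIENT_FUNDS", error!.Code);
        }
    }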
Ultimately, they got this idea that they have to define the requirements, not the implementation. We had them coming to us a few times, saying, "We need you to add this to the database". We don't have a database. "But I need it in the database". We use an event store, it's not a database. "I need you to update the balance column". No, stop talking about implementation details, tell me what you need it to do, and we will build it. That's very important when we talk about legacy systems, because the way legacy systems were developed was by that kind of low-level detail, rather than working in a TDD way from the outside, where people define requirements, we write tests for those requirements, and then we write the implementation that makes those requirements work.
What's really cool about that is we don't develop code that isn't used. We don't develop a test, or 500 tests, that test that a class is bulletproof when none of those conditions can ever be met. We end up with a leaner code base, and a leaner code base is easier to support. It also means you can get 100% code coverage. Because the reality is, if that code isn't hit by any of your outside-in tests, your API-level tests, then one of two things has happened: you've either missed a requirement or that code can never be hit. What I'm saying here is not something that I would like anybody to go away and cargo cult and just say, "Martin said that I should just do this, therefore, if I lose my job, I can sue Martin". No, that's not the way this works.
These are just different gears. We call these gears. We've got low level gears, unit tests, methods, classes that allow us to maneuver, but you don't get very far fast, because you've got to do more work. On the outside when we're testing our APIs, we can run really fast. We can test a lot of detail really fast. Those are your higher gears. It's about switching gears when you need to. Don't feel like you need to write tests that aren't important.
The Warehouse System
If we take, for instance, the warehouse system, essentially what we're trying to do here is we're going to write tests that work at this level. We're going to write tests where what we do is we send in an order placed message. Before that, we're going to send in some stock received messages. Because there's a requirement I've not told you about that was part of the system, which is, when we receive a stock received message, that's when we increase the stock levels. Maybe we don't do an output at that point. Each one of these would be a requirement that would require us to write another test.
If somebody came to me with this requirement of saying, put in the order placed message and send out the new stock, I'd say it'll fail at the moment because there's no stock. What's the requirement? When I get new stock, I need to increase the stock. We start to build up this idea of requirements. We get traceability by somebody telling us what our system is supposed to do and why it's supposed to do it, which means we build up documentation. I know everybody loves writing documentation against their classes and all of that stuff, writing those big Confluence documents. Everybody loves doing that? That's pretty much my experience.
Ultimately, when we're building these things from requirements, the requirements become our documentation. We know who requested it, why they requested it, and ultimately, what they were asking it to do, not what it's doing. When we put these two things in, we then check to make sure that we've received a stock changed message with the new stock level. That's all that test should do. A lot of people would call this integration testing. I don't agree. I call these developer tests. The reason I call them developer tests is because we write these things in memory. We write these things against an API that we run locally against our service.
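A sketch of that developer test, assuming a hypothetical in-memory harness (WarehouseFixture) that runs the service's handlers in process and records every message it emits, with message shapes like the contract sketch earlier:

    using System;
    using System.Linq;
    using System.Threading.Tasks;
    using Xunit;

    // WarehouseFixture and the message records are hypothetical stand-ins.
    public class WarehouseStockTests : IClassFixture<WarehouseFixture>
    {
        private readonly WarehouseFixture _warehouse;

        public WarehouseStockTests(WarehouseFixture warehouse) => _warehouse = warehouse;

        [Fact]
        public async Task OrderPlaced_EmitsStockChanged_WithTheNewStockLevel()
        {
            // Arrange: get stock into the system the way production does, via messages.
            await _warehouse.PublishAsync(new StockReceived("prod-1", Quantity: 10));

            // Act: the behaviour under test.
            await _warehouse.PublishAsync(new OrderPlaced(Guid.NewGuid(), "prod-1", Quantity: 3));

            // Assert: only the outgoing contract matters, not the tables behind it.
            var stockChanged = _warehouse.EmittedMessages.OfType<StockChanged>().Single();
            Assert.Equal("prod-1", stockChanged.ProductId);
            Assert.Equal(7, stockChanged.NewStockLevel);
        }
    }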
The developers write these, not some external company, not some person who you've just hired to go and sit at a desk, not some intern. The person who's writing this code, the implementation, has first written these tests that will fail. Then once they've failed, they'll write the implementation to make them pass. It's that simple. If there's a requirement that you think is something that should be built, you go back to the business and say, "I think this should be a requirement". "Yes, I agree. What's the requirement? Write the requirement. Now go and write the test". You could call this RTDD, Requirements Test Driven Development, maybe.
Ultimately, what we're doing here is we're testing to make sure that the things that happen in production produce the right outcomes that should happen in production. One of the things I see as an antipattern when people start doing this is they do this: they'll go and inject a load of products into the SQL Server, into the database, into whatever it is that you're using as your datastore, as part of their test setup, as part of pre-seeded data. I'm going to take a wild guess that that's not the way it happens in production, that when you get stock, somebody doesn't craft some SQL queries and increase the stock.
Ultimately, that's not what you're expecting to happen. You might be expecting a state where the database already has some data, but that's not how the data gets there in production. You get into this state where your tests are tightly coupled to your database schema when they shouldn't be. Because if you have to change your test when you write some new code, you can no longer trust that test. You don't know whether that test is right. It was right, but now you've changed something in that test. Is it still right? Yes, because I don't write bad code. Unfortunately, that's not something that people believe. Ultimately, what we're trying to do here is mimic what a production system does in every single test that we do.
We had, when we built this, around 8000 of these tests per service, and they ran in under 10 seconds. For context, we were running .NET when we did this. .NET has something called a WebApplicationFactory, which runs up our entire API in memory, so there are no network calls, because as soon as you add a network call, you're adding at least a millisecond. You need to think about how you test this. It's not just, let's run some test containers and run some Newman tests against those test containers, and that'll do it. That isn't how we do it. You have to think about it. You have to spend the time to build the frameworks that allow you to test this way.
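WebApplicationFactory is real ASP.NET Core tooling (Microsoft.AspNetCore.Mvc.Testing); this is just a minimal sketch of the idea, where Program is the service's entry point and the route is illustrative:

    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Mvc.Testing;
    using Xunit;

    // "Program" here is assumed to be the API project's entry point; the route is illustrative.
    public class InMemoryApiSmokeTests : IClassFixture<WebApplicationFactory<Program>>
    {
        private readonly HttpClient _client;

        public InMemoryApiSmokeTests(WebApplicationFactory<Program> factory)
        {
            // CreateClient() wires an HttpClient directly into the in-memory host,
            // so requests never touch the network. That is what keeps thousands of
            // API-level tests inside a few seconds.
            _client = factory.CreateClient();
        }

        [Fact]
        public async Task ListProducts_RespondsSuccessfully()
        {
            var response = await _client.GetAsync("/products");
            Assert.Equal(HttpStatusCode.OK, response.StatusCode);
        }
    }

Because the whole request pipeline runs in process, routing, middleware, and serialization are all exercised without a single network hop.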
Ultimately, when you have those 8000 tests, you have a ton of confidence. It is a journey. We had a team that had built this and then decided that their entire internal implementation was wrong when a new requirement came in. The database design that they'd come up with just didn't work for this new thing. They rewrote everything, the entire blue box, rewrote all of it. The contracts were the same on the outside, because we're still going to get an order placed message, we're still going to get stock received messages, and we still need to output the stock changed messages. They did that, and then all the tests passed. They spent two days procrastinating because they didn't feel confident, because they'd rewritten the entire internal implementation, but the tests still passed.
I think that goes to show that when we're writing so much code, we don't have confidence that we can deploy things, because what we're doing is we're designing our tests, designing our confidence, around our local environment, or at too low a level. We're not going to the outside. We're not testing what production wants. We're not testing what the business wants. We're testing the code that we wrote. They eventually deployed it. It all worked. It was fine. I think that really goes to show that you should really think about these tests. Once you get that confidence, you can run really fast with a lot of confidence. This is how you get to these places where people are deploying per commit. Because if your pipeline can run all of these tests, and you can have 100% confidence that every requirement that you had to achieve before is achieved now, just deploy it. Your new stuff might not be right, but put that behind a feature flag. As long as all the old stuff still works, just deploy it, test in production.
Observability
Ultimately, though, there are things that we can't test from the outside, because from the outside, some of the things that we do may look the same. We may say, do a thing, and the outcome may be the same, but the internals of what we expected might be different. We can't do that from an outside-in testing perspective, but there is another way, because this is actually a talk about observability. What do we mean by observability? This is a quote from our founder. Observability actually comes from something called control theory from the 1960s, it's not new. It's not something that the logging and metrics vendors came up with about seven years ago. It's something that's ancient, because there is nothing new in IT, apparently. Ultimately, observability is about understanding and debugging unknown-unknowns.
The ability to understand the inner system state just from asking questions from the outside. That's the software definition, based on Kálmán's paper on controllability. I want to focus on one part of that, which is the ability to understand the inner system state, any inner system state. Because I said there are things that we can't test from the outside, but that are important for us to know went through the right approaches. When we're writing things for production, we really need to know how our software is going to work in a production environment. That's really important to us. That's where we do things like tracing, and we get things like this, which is a basic trace waterfall.
Ultimately, the business only cares about that top line, the outside. They care about how long it takes. They care about it returned the correct data. You as an engineer, someone who's supporting that system, may care about either the whole waterfall or just some individual sections of that waterfall. You may just care about your service, your checkout service, maybe that's the one that you're looking after, maybe the cart service, maybe you're looking after two of those services. This is what allows us to see inner system state, and this is what we see when we're in production.
What if we could use that locally? We call this observability driven development, which is how we look at the outside of a big service, the big system. How do we use that information to help drive our development flows, to help drive more requirements, and therefore more tests and more implementations and more refactoring? I've got an example. Don't try and work out the language. I deliberately take about six different languages and munge together the syntax so that nobody thinks I'm favoring one particular language. There may be a test afterwards to spot all the different design patterns. Let's say we've got a GetPrice for a product, and we use pricing strategies based on the product ID. Maybe the product ID has a different pricing strategy on there.
Those pricing strategies are something that we use to calculate the new price of a product. Two of those pricing strategies might somehow converge: maybe one of them is a 10% markup, and one of them is a £1 markup. If my product is £10, they're both going to come out with the same outcome. From the outside, I don't know whether the right pricing strategy has been used. Ultimately, it's not just something I want to know in my tests. It's something I want to know about my system when it's running as well, because if it's important enough for you to know whether it's correct in your tests, and if we're writing tests from the outside to understand our system, it's important enough for you to give to either the people supporting your system, or to yourself, if you're the one supporting that system.
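As one hedged rendering of that example in C#, with made-up names (the slide deliberately munged several languages):

    // Two illustrative pricing strategies. On a £10 product both return £11, so an
    // outside-in assertion on the response alone cannot tell you which one ran.
    public interface IPricingStrategy
    {
        string Id { get; }
        decimal GetPrice(decimal basePrice);
    }

    public sealed class PercentageMarkup : IPricingStrategy
    {
        public string Id => "markup-10-percent";
        public decimal GetPrice(decimal basePrice) => basePrice * 1.10m;
    }

    public sealed class FlatMarkup : IPricingStrategy
    {
        public string Id => "markup-1-pound";
        public decimal GetPrice(decimal basePrice) => basePrice + 1m;
    }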
Ultimately, I'd like to think that everybody treats the people supporting their system as if they were themselves anyway. We're not in a world anymore where we throw it over the wall and somebody else can sort that. Like, "I've paid my dues. I'm going home, I'm going to the pub". We're not in that world anymore. We need to be kind to ourselves about how we understand how the systems work in production.
We can use those same concepts in our local development. Let's say we were writing a test to say, let's make sure our GetProduct endpoint, when it's got a valid product, uses the right strategy. How would we do that? How would we call our ecomService API from the outside, our GetProduct endpoint? Imagine this is a wrapper around our API that we've built inside of our test. How do we test to make sure it's using the right strategy? This is where we use either something called tracing, or we use logs, or we use some kind of telemetry data that should be emitted in production, to test locally. I use tracing. I think it's superior, but that doesn't mean you have to. Logs are just as useful in this circumstance. What we can do is we can say, let's get the current span that's running, and then let's set our product strategy ID on our span.
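In .NET terms, the current span is the current Activity, so the implementation side might look something like this, building on the strategy sketch above; the service shape and the tag name are assumptions, not a standard:

    using System;
    using System.Diagnostics;

    // Illustrative service; "pricing.strategy_id" is a made-up tag name.
    public sealed class PricingService
    {
        private readonly Func<string, IPricingStrategy> _strategyForProduct;

        public PricingService(Func<string, IPricingStrategy> strategyForProduct)
            => _strategyForProduct = strategyForProduct;

        public decimal GetPrice(string productId, decimal basePrice)
        {
            var strategy = _strategyForProduct(productId);

            // Record which strategy was chosen on the current span, so both the
            // outside-in tests and production telemetry can see the inner decision.
            Activity.Current?.SetTag("pricing.strategy_id", strategy.Id);

            return strategy.GetPrice(basePrice);
        }
    }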
What that means is, when we then go and write the test, we can say: go and work out what spans were emitted for this particular test, then make sure that the strategy ID has come out, and then make sure it's the right strategy ID. It sounds simple, and that's because it is. It does, however, take work. This is not something where you're just going to be able to take it out of the box, open up a C# file, or a Java file, and just write something in. What we found when we were writing things at the bank, with the new people that we brought on, is that there was a big knee-jerk reaction of people going, no, that's not the way we build things in finance. It's like, there's a reason for that, because we've built things better when we're not in finance.
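And a hedged sketch of the test side, using OpenTelemetry's in-memory exporter to capture the spans a test run emits; in the real setup the span would come from driving the request through the in-memory API, and the source name, selection rule, and tag key here are illustrative:

    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;
    using OpenTelemetry;
    using OpenTelemetry.Trace;
    using Xunit;

    public class PricingStrategyTelemetryTests
    {
        [Fact]
        public void GetPrice_ForAStandardProduct_UsesThePercentageStrategy()
        {
            var exportedSpans = new List<Activity>();

            // "EcomService" is an assumed ActivitySource name, not a real convention.
            using var tracerProvider = Sdk.CreateTracerProviderBuilder()
                .AddSource("EcomService")
                .AddInMemoryExporter(exportedSpans)
                .Build();

            using var source = new ActivitySource("EcomService");
            using (source.StartActivity("GetPrice"))
            {
                // Illustrative selection rule: promo products get the flat markup.
                var pricing = new PricingService(id =>
                    id.StartsWith("promo") ? (IPricingStrategy)new FlatMarkup() : new PercentageMarkup());

                pricing.GetPrice("prod-1", 10m);
            }

            tracerProvider.ForceFlush();

            // Assert on the telemetry, not on the internals of the class.
            var span = exportedSpans.Single(a => a.OperationName == "GetPrice");
            Assert.Equal("markup-10-percent", span.GetTagItem("pricing.strategy_id"));
        }
    }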
Ultimately, there's a lot of legacy that comes in with big bank finance systems, that kind of stuff. These are new patterns, but they're different strings to your bow. They're not something that can be used in isolation. You're not going to get everything by doing these things, but you can do a lot of things.
When you do things like these, when you're writing your tests and you're running your tests, you can actually see what that tracing data is going to look like, what the log data is going to look like, if in your local environment you push to a system that can show it to you. Microsoft have just released Aspire for the .NET people, which allows you to push stuff and see all of your telemetry data locally. There's Jaeger, there's OpenSearch, there's lots of things that you can use, but you're essentially now able to see that data.
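A hedged sketch of that wiring, assuming a local collector or dashboard (the Aspire dashboard, Jaeger, and so on) listening on the default OTLP gRPC port; the service and source names are made up:

    using System;
    using OpenTelemetry;
    using OpenTelemetry.Resources;
    using OpenTelemetry.Trace;

    // Top-level program: send every span the service (or its tests) emits to a
    // local telemetry UI, assumed to be listening on the default OTLP port 4317.
    using var tracerProvider = Sdk.CreateTracerProviderBuilder()
        .ConfigureResource(r => r.AddService("warehouse-service"))
        .AddSource("Warehouse")
        .AddOtlpExporter(o => o.Endpoint = new Uri("http://localhost:4317"))
        .Build();

Run the service or the tests as normal, and the spans they emit show up in the local UI, which is the same data you would be looking at in production.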
The other thing that we found when we started writing software like this is that very few people actually run the software. Very few people actually run it locally. They just run the tests, because the tests were actually doing what they would do themselves. They don't have to craft a message, stick it on a service bus, stick it in SQS. I just put it in the test, and the test tells me it works. Imagine how much time you spend hitting play in your IDE, letting that thing run, going into your Swagger doc, putting in the information, and hitting play. Or even just pressing play, opening an HTTP file, and hitting send request on those. If you could get around all of that, and you could test all of them all the time, how much more efficient would you be? Because what you've done is you've built that system so it works in production, not locally, not your local classes, not your local methods. You've built it so it works in production. This talk is actually about writing fewer tests, because you can write fewer tests on the outside that test more things than you can on the inside by writing unit tests, by writing those method-level tests.
Key Takeaways
I want to leave you with a couple of thoughts. The first one is, don't cargo cult this. Don't take this away and say this is the thing that we're going to do now, and we're not going to do anything else. Think about which gear you're in. Think about whether you need to move down gears. Don't just move down gears, move up them as well. Can I write a test at the controller level? Can I write a test at the API level? Think about what extra things you're testing as you go to the outside. As an example, if you're testing at the API level, you're actually testing serialization and deserialization. Because, how many times have we seen somebody change the casing of a JSON object? Think about what gear you're in. Think about what you're trying to test. Think about what outcomes you're looking for.
Think about whether this is the right gear for the test that you need to write. When you're doing that, write the tests that matter. It does not matter that you have a test that sets a property on a class and makes sure you can get that property back. I can see a lot of people going, I've got loads of those. They're not that useful. Don't make your applications brittle. The more tests we write against methods and classes, the more brittle our applications become, because as soon as we try to make a change, we've got to change tests. Those tests cascade. Finally, think about whether you're covering your requirements, the requirements that the business have put on you. Think about that, because that's the most important thing. The requirements are what matter in production. If you're not testing the requirements, you're not building for production.
Questions and Answers
Participant 1: How do we know that these techniques will actually ensure that our applications are easier to debug, and that they're primed for use in production?
Thwaites: I think it's really about the observability side, because if you've been able to debug something locally, and you've been using the observability techniques to do that, you've essentially been debugging your production system anyway when you've been writing the software. Because while you've been debugging when a message comes in and when a message goes out, that's what you've been debugging, and that's all that's happening in production. That's the difference, because if you debug a unit test locally, you're debugging a unit test, you're not debugging an application, you're not debugging a service. You're not debugging a requirement or something that would happen in production.
Participant 2: If you have a use case or a requirement that also involves a different service doing its thing, how would you test that? If the answer is mocking that service, how would you handle changes to that service's requirements?
Thwaites: If we've got multiple services that are interacting together to achieve something, how do we test that scenario?
Ultimately, you shouldn't care about that other service. It's that service's job to maintain its contract. Each of those connection points that we're talking about is a contract that you've delivered to somebody. You can't change that contract, because that will break things. You might own both services, but we work at a service level. Your tests in your service should make sure that that contract never changes, or, more specifically, never unknowingly changes. You might want to change it, but as soon as your tests fail on those contracts, you know that your consumer is going to fail. If you've got to change your tests, you're going to have to tell consumers that they need to change their implementation. You shouldn't care about what the other service is doing. That's that service's job. What you should know about is, I know the contract that that service is going to give me. Let's take an API contract.
The example I gave before was the checkout service is going to check the stock live via an API call into the warehouse service. In that scenario, you know what the contract of the warehouse service is going to be for that call, so you put a mock, or more specifically a stub, into your service's tests that says, when I call for these specific details, I expect you to give me this specific response. Because that's the pre-agreed contract that you've got with that service. If that service changes, that's on that service; that service is now broken, but it's that service's job to keep honoring the same contract. If you wanted to run the real services instead, you'd have to do that with every single one of them, and I can tell you that running 1500 microservices locally is not the thing you want to do.
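A hedged sketch of that stub on the checkout side: a fake HttpMessageHandler that answers the warehouse stock-check call with the agreed response shape. The route and payload are illustrative, and in a WebApplicationFactory-style fixture this handler would be swapped in for the real warehouse client:

    using System.Net;
    using System.Net.Http;
    using System.Text;
    using System.Threading;
    using System.Threading.Tasks;

    // Illustrative stub; the route and JSON shape stand in for the agreed contract.
    public sealed class StubWarehouseHandler : HttpMessageHandler
    {
        protected override Task<HttpResponseMessage> SendAsync(
            HttpRequestMessage request, CancellationToken cancellationToken)
        {
            // Answer the stock-check call with the contract the two teams agreed.
            // If the warehouse team changes this shape, that is their breaking
            // change to announce; it is not something this test should discover.
            if (request.RequestUri!.AbsolutePath == "/stock/prod-1")
            {
                return Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK)
                {
                    Content = new StringContent(
                        "{\"productId\":\"prod-1\",\"stockLevel\":7}",
                        Encoding.UTF8,
                        "application/json")
                });
            }

            return Task.FromResult(new HttpResponseMessage(HttpStatusCode.NotFound));
        }
    }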
You've got to take each of these services individually. You've got to make sure that everybody buys into this approach, that we are not going to change our contracts, or that we are going to keep supporting that contract. If you do that as an organization, not only do you get the confidence that internal consumers are not going to break, you can also use this for external consumers as well. We had a discussion in the unconference about multiple different external consumers that you don't control: as soon as it's public, you're going to have to support it. API contracts are for life, not just a major release.