Key Takeaways
- The problem of testing a system is becoming harder as teams grow larger, as we have more processes and as we adopt a microservices architecture.
- The testing problem is fundamentally different moving forward. We have less capability for testing specific functional points than we do in the unit test environment.
- We are cursed by very costly network calls, and we can't escape this issue unless we homogenise our tech stack and abstract out the network calls.
- We need new tools for dealing with these state machines.
- The state machine is intrinsically beautiful. We should take advantage of it.
In the old days...
Life is about facing hurdles; we make one jump and we are then faced with another. This is especially true in IT where the industry has only really had two or three decades to mature. Of course, everyone is going "agile" these days, and a fundamental component of agile is the notion of a test, a key innovation in the current state of maturity of the industry.
When I started developing, I used to find myself sitting in front of the codebase, paralysed by the fear of introducing bugs. If I introduced one, I would lose reputation at the very least, never mind the material damage. On the other hand, if I managed to introduce a change correctly, I was merely capable of doing my job. The problem was that in order to make changes without any risk I had to have absolute knowledge of everything in the system. Of course working in a team, this absolute knowledge was very hard to achieve. A possible approach was to introduce as little change as possible to deliver my functionality; never-ever refactor! After all, developers may thank you for refactoring, but if something goes wrong, they will shun you like the plague as you explain to the product owner that although your change affected the bottom line negatively, nonetheless it improved the code.
How about some tests?
And now we have tests.
When we started writing tests, we were looking for a method to validate that our code worked. This gave us a framework or methodology that allowed us to make changes without complete knowledge. If the developers who had come before us had properly distilled their knowledge into tests, then by running the tests we could be sure that we had not broken anything. That was all well and good, but the utility of tests didn't stop there. Developers started to appreciate tests for other reasons; pedagogical reasons. A test not only tells you that the code works, but gives you an example of how to use it! We now had a natural place to document our requirements in code, and that documentation would live with the code and be repeatable; the tests could be run after every change. We would start from a stable point, introduce change, and finish at a stable point. This wasn't a new idea. When I studied computer science, the bright students quickly arrived at a development methodology called 'iterative development': start small, get something working and then iteratively expand; never write the entire solution in one go. But now we have tests, and our tests can tell us in an automated fashion that we are at a stable point. We could talk about having a ‘green’ build, meaning a version of the codebase that was acceptable and could be shared, deployed and used.
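To make that concrete, here is a minimal sketch of such a test in JUnit. The Basket class is invented purely for illustration and is embedded in the sketch so that it compiles on its own; in a real codebase it would of course live in production code.

import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.List;

import org.junit.Test;

public class BasketTest {

    // The test validates the behaviour, but it also documents the API: a reader
    // who has never seen Basket can learn how to construct and use it from here.
    @Test
    public void totalsTheItemsInTheBasket() {
        Basket basket = new Basket();
        basket.add("book", 999);   // prices in pence
        basket.add("pen", 150);
        assertEquals(1149, basket.total());
    }

    // A tiny Basket, included only so that the sketch is self-contained.
    static class Basket {
        private final List<Integer> pricesInPence = new ArrayList<>();

        void add(String name, int priceInPence) {
            pricesInPence.add(priceInPence);
        }

        int total() {
            int sum = 0;
            for (int price : pricesInPence) {
                sum += price;
            }
            return sum;
        }
    }
}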
But running tests was optional
Developing code was much better. We could make changes and refactor. The quality of the codebase improved dramatically. So inevitably we started running into the next set of problems; we could deal with the changes that developers had made in the past, but what about the changes that developers were currently making?
We have to work as a team. As long as we run the tests after every change we make and only commit good code into source control, everything is good. The problem is that developers are either eternal optimists or perhaps just lazy. That little change surely won't break anything, we say to ourselves. Let me skip the tests on my local machine and commit into source control. So, of course, I inevitably introduce changes that will not build; the tests will fail. Then other developers will see my changes and experience all my errors. Even worse, they will experience my errors without the benefit of knowing what I had just done. They will have to dig into source control and analyse the failing tests. And every developer who takes an update will have to do this, so the bug I introduced could be fixed many times. If the team communicates the issue, then one developer fixes the issue while the others wait.
Continuous integration changed all of this by setting up a system that would automatically run the tests whenever a developer had committed code. The code would be validated after every code change and we no longer depended on the discipline of developers. Much like multi-threaded software development, we now had a "coding semaphore" introduced into our development methodology. This semaphore coordinated the actions of many developers changing code in parallel. When the build is "green", developers can take an update of the codebase. When the build is "red", then the last change broke something, so best to avoid.
But the semaphore wasn’t really working. The problem was that there was a third state; the build could be "currently building". This was a gray area where developers didn’t know that something was wrong; they just knew that something could be wrong. The prohibition on commits into a red build was originally about reducing the complexity of fixing a broken build, not maximising developer productivity. Can you take an update? Sure you can, if you are willing to deal with any potential problems. Can you commit code? Sure you can; the build is not red, it’s currently building. The gradual formation of these policies wasn’t something that was designed; it was formulated by teams who were trying to figure out how to work together.
And the build times started taking longer and longer.
Idle hands break rules
So, here we are building our application, committed to the idea that we would keep our build times down. But six months later, our tests have crept up on us. The build times are half an hour to an hour now. We have a team of ten developers and we are lucky to get in one commit a day. Everyone is starting to realise that most of their time is consumed with just trying to get the code in, rather than developing the actual code. There is a race every time the build is green! Developers are throwing fate to the winds and committing on top of one another. The build is now red. No-one wants to dig through the last five commits, sorry, now six commits, to find the issue. Everyone realises there is a methodological problem here. The scrum master reluctantly agrees that commits on a red build are inevitable. The coding semaphore is suspended in favour of one of the developers spending the week trying to get the red build to go green. When a release is upcoming, the master code is branched and the developers and testers spend the next week trying to get some supposedly "delivered" code to even work.
Where did it all go wrong?
And that brings us up to date. This problem is currently being experienced by a great many teams. So what are the potential solutions? Well, let’s break the problem down into its constituent parts:
- Too many developers trying to access the same codebase.
- The code takes too long to validate, so the semaphore doesn't clear quickly enough.
If you alleviate either of those factors, you solve the problem.
The developers separated themselves into teams
So let’s discuss the first factor: too many developers working on the same codebase. Well there is a reason you hired so many developers in the first place. The idea was to ramp up development. Sending the developers home would not help you do this. It is, in a way, abandoning the problem. What we want is to keep productivity up, but have less contention on the codebase or the continuous integration pipeline. We want fewer developers waiting for each other.
There is another approach. You could divide the codebase into several codebases and have different teams work on each. In concurrent programming terms, we have removed the single exclusion lock in favour of multiple locks. We suffer less contention, and developers are waiting less. We have solved one problem, but we have introduced another. We now have different deliverables, whether they are microservices or libraries, which are tested independently. The deliverables share a contract. Our tests have lost sight of the global picture. We are no longer certain that these components interact correctly with each other, since they are now independent systems with independent sets of tests. Our tests are now less inclusive, less exhaustive, and ultimately of less use to the product owner and user as an Acceptance Test.
So we introduce a second-stage continuous integration pipeline. This pipeline composes the deliverables together, and tests are developed to exercise this composed system. We introduce a new team to manage these tests and we give the tests a new name. We call them end-to-end tests, smoke tests and acceptance tests.
The picture has changed considerably, though. The components we are trying to integrate are more like black boxes; we are starting to lose control over how much of their internals we can manipulate. Our tests now operate at a much higher level, and they are much more expensive than the unit tests we were using before. They take a very long time to run.
We have divided up the code base, and we find that different teams are productive again while developing components, but when we integrate the components together we run into the old problem again; the tests take a very long time to run. We have arguably just postponed the problem. Along the way, we have componentized our system, the hallmark of microservices. That has to be a good thing, but here we are a little more mature, yet running into the same problems once again.
Facing the slow elephant in the room
Just why are the builds so slow? The heart of the problem is that network calls (or disk access) are so much slower than method invocations. A network call can take as little as twenty milliseconds, but a method invocation is so cheap that you can make millions of method calls in the same time span. So many people are offering the idea that unit tests should be used in favour of integration tests (by which I mean any test that makes a network call). This certainly solves the problem, but there is an elephant in the room that we need to acknowledge. Unit tests are not integration tests, in that they leave much of the final system untested. They do, however, test what was actually developed by the developer. But 'Configuration is Code'. So when I configure Spring, or the database, or nginx, why am I happy not to test these configurations, while all the while insisting on testing anything that a developer writes in Java? Why is the language subject to special attention when configuration is not? If anything, there is a compiler that ensures type safety in the language. There is no compiler for configuration, so if anything needs the test, surely it must be the typeless configuration? The argument to substitute unit tests for integration tests is therefore an argument to test less. Unit tests have their own strengths, most notably as a development tool for good software development, but they are not a substitute in terms of value to the end user.
There is another aspect to this debate. In the unit test environment, you have the capability to set up the test in any arbitrary starting state. You have access to mocks and stubs and can manipulate the flow of the test to focus on the particular functional point you wish to test. This capability is afforded either by the use of IoC, or by overriding methods (e.g. the abstract factory method pattern). With integration tests, you have less control. You have access to mocks and stubs at the network level by using tools such as WireMock, but you can't inject these components, so you are forced to interact with the system as a state machine. You have to walk the system through multiple states to arrive at the particular stage where you can test your requirement, and you have to do this again and again, making network calls. It's no wonder these tests are so expensive!
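To make the contrast concrete, here is a minimal sketch, assuming Mockito for the in-process case and WireMock for the network-level case; the AccountService interface, the endpoint and the port are invented purely for illustration.

import static com.github.tomakehurst.wiremock.client.WireMock.aResponse;
import static com.github.tomakehurst.wiremock.client.WireMock.get;
import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;
import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import com.github.tomakehurst.wiremock.WireMockServer;
import org.junit.Test;

public class TestDoubleSketch {

    // An invented collaborator, standing in for any dependency in the codebase.
    interface AccountService {
        int balance(String username);
    }

    // Unit-test style: the collaborator is replaced in-process, so the test can
    // be dropped into any arbitrary starting state with a single method call.
    @Test
    public void unitLevelStub() {
        AccountService accounts = mock(AccountService.class);
        when(accounts.balance("anne")).thenReturn(100);

        // In a real test, 'accounts' would be injected (via IoC) into the class
        // under test, which would then be exercised directly.
        assertEquals(100, accounts.balance("anne"));
    }

    // Integration-test style: the collaborator sits behind a network call, so the
    // best we can do is stub the wire protocol and then walk the running system
    // through its states until it reaches the point we actually want to test.
    @Test
    public void networkLevelStub() {
        WireMockServer server = new WireMockServer(8089);
        server.start();
        try {
            server.stubFor(get(urlEqualTo("/accounts/anne/balance"))
                    .willReturn(aResponse().withStatus(200).withBody("100")));

            // ... drive the deployed application (e.g. through a browser) until it
            // makes this call, then assert on what the user ends up seeing.
        } finally {
            server.stop();
        }
    }
}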
We have come full circle. The development teams are now productive, but there is more development risk placed on the contracts that they share. The composition of their different deliverables now falls on a different team, one that doesn’t have the benefit of running in a unit test environment where it can make method calls. It has to make network calls, and it now has to test state machines. The blockage in developing software has moved down the pipeline, but it is still there.
But now we are going on journeys
Let’s look at these tests again. We have changed the picture quite a bit since we first started. We have built whatever unit tests we could, and they are running quickly, but now we have microservices and different teams. We have third-party components and off-the-shelf software, and we want to prove that all of it works correctly. And we know that proving all of this works well together isn't something we can do by testing a single functional point alone. We will have to walk this federated system through a number of different steps.
For example, let’s say we are developing a checkout process for an online shop. We have different microservices that handle different functionality: one service handles the basket, another handles the presentation of products in the shop itself, another handles advertising, and so on. All of these components share a contract and are being developed independently. We are now faced with the problem of testing that they all work together. So we write a test that performs the entire journey. In the unit test context, we can test one feature alone; for example, we can test whether we can put an item in the basket, or we can test that all the items in the basket sum to the proper amount. But now we are testing microservices and web portals. We need to put items in the basket before we can test whether the basket totals correctly. That means we need to browse the website and click on items in order to add them to the basket. All of these actions together form a new style of test that I will refer to as a "journey test". We have to go on an often expensive journey in order to treat these components as black boxes.
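As a rough sketch of what a single journey test might look like, assuming JUnit, Selenium WebDriver and a deployment of the composed shop on localhost; the URL and the test-* selectors are invented for illustration, in the same style as the examples later in this article.

import static org.junit.Assert.assertEquals;

import org.junit.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class BasketJourneyTest {

    @Test
    public void basketTotalsCorrectly() {
        WebDriver webDriver = new ChromeDriver();
        try {
            // Open the shop front of the composed, deployed system.
            webDriver.get("http://localhost:8080/shop");

            // Browse to a product and add it to the basket; unlike a unit test,
            // we cannot simply construct a basket in a known starting state.
            webDriver.findElement(By.cssSelector("[test-link-product-1]")).click();
            webDriver.findElement(By.cssSelector("[test-cta-add-to-basket]")).click();

            // Only now can we check the requirement we actually care about:
            // that the basket totals correctly.
            webDriver.findElement(By.cssSelector("[test-link-basket]")).click();
            assertEquals("£9.99",
                    webDriver.findElement(By.cssSelector("[test-field-total]")).getText());
        } finally {
            webDriver.quit();
        }
    }
}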
So let’s illustrate some journeys side by side. Each step is represented by a letter that would be roughly equivalent to visiting a web page and clicking on something. We could have five such journeys, and they all start on web page A, except for journey 5, and they all transition to web page C. And so on...
And they each take about 30 seconds to run.
Well, a close look at them will reveal some important characteristics.
- They are very repetitive. This is mostly brought about by the proliferation of steps at the end of all the journeys: F, G and H.
- Steps B and D start in different ways or are affected in different ways, but end in the same way with step F.
Clearly there is some inefficiency here. I can consolidate the journey tests, taking the five journeys down to three. So, in tabular form again:
So we have covered the same functionality as before, by being smart about what we test in which journeys. That is a 40% saving in time. That is potentially a 22 minute build compared to a 30 minute build if the same performance gains are achieved across the whole test set. In my experience the performance gains that can be achieved by intelligently composing journey tests are even more significant than this.
So now it becomes a tester's job to consolidate these tests into efficient test scripts and manage them. But there are still some points that should be considered:
- What do we call these tests? They no longer focus on a particular requirement, but rather meet multiple requirements in one go.
- How do we identify gaps? How do we know what we have missed?
- How do we identify what scripts are redundant? If we have a new requirement, how can we know where to insert these new assertions?
Managing this complexity is the purpose of a testing framework I wrote called Cascade.
New problems? New tools!
So how does Cascade go about doing this?
We define each step that makes up part of a journey as a separate class. The framework will then manage these steps and generate tests from them.
The source code for Cascade contains examples of how to use it. The first example covers how you might use Cascade to test an online banking website.
We have to start somewhere, so in this case we will open up a browser on the landing page. You can see the working code example on GitHub here.
@Step
public class OpenLandingPage {

    @Supplies
    private WebDriver webDriver;

    @Given
    public void given() {
        webDriver = new ChromeDriver();
    }

    @When
    public void when() {
        webDriver.get("http://localhost:8080");
    }

    @Then
    public void then() {
        assertEquals("Tabby Banking", webDriver.getTitle());
    }
}
Notice that annotations are used extensively in the framework. Steps are maintained in separate files, and each step file is annotated with @Step. Data is shared between steps using a form of IoC, via the @Supplies and @Demands annotations. Lifecycle methods are annotated with @Given, @When and @Then.
The lifecycle methods deserve some consideration. @Given is used to annotate a method that will supply data to the test. This test data is shared amongst all the step files. In this way, the step files collaborate to describe the data on which they will all operate. @When indicates the method that is the action of this step. This method should perform a state transition, such as clicking a button, or submitting a form. Finally, @Then marks the method that performs validation.
In this case, we are initialising Selenium’s WebDriver and opening the index page of the webserver listening on port 8080. We finally check that the web page has opened.
The OpenLandingPage step is followed by the Login step.
@Step(OpenLandingPage.class)
public interface Login {

    public class SuccessfulLogin implements Login {

        @Supplies
        private String username = "anne";

        @Supplies
        private String password = "other";

        @Demands
        private WebDriver webDriver;

        @When
        public void when() {
            enterText(webDriver, "[test-field-username]", username);
            enterText(webDriver, "[test-field-password]", password);
            click(webDriver, "[test-cta-signin]");
        }

        @Then
        public void then() {
            assertElementPresent(webDriver, "[test-form-challenge]");
        }
    }

    @Terminator
    public class FailedLogin implements Login {

        @Supplies
        private String username = "anne";

        @Supplies
        private String password = "mykey";

        @Demands
        private WebDriver webDriver;

        @When
        public void when() {
            enterText(webDriver, "[test-field-username]", username);
            enterText(webDriver, "[test-field-password]", "invalidpassword");
            click(webDriver, "[test-cta-signin]");
        }

        @Then
        public void then() {
            assertElementPresent(webDriver, "[test-form-login]");
            assertElementDisplayed(webDriver, "[test-dialog-authentication-failure]");
        }
    }
}
The first thing you may have noticed is that we now have two classes implementing an interface. In Cascade’s terminology, we have two Scenarios for the Login Step. We have a scenario for a successful login and for a failed login.
The second thing to draw your attention to is the @Step annotation. Here, we declare which steps precede this step. So from this bit of information, you can pretty much extrapolate what Cascade does. It scans the classpath for classes annotated with @Step, and then joins them together into tests.
Next, you can see here that we have a step that demands data. In this case, the login steps demand an instance of Selenium’s webdriver so that they can drive the browser.
Then, if you take a close look at the values each scenario supplies, you will notice that they are setting up data for the same user. If you read the definitions of the @When methods, you will see that the passwords supplied to the browser are relevant to the particular scenario being played. In other words, the FailedLogin scenario is inputting the wrong password.
Predictably, the @Then methods then check that we have arrived on the web page appropriate to the scenario, or that a failure dialog is being displayed.
Finally, you might have noticed the @Terminator annotation. This annotation tells Cascade that no steps should follow this step, so the journey that fails authentication ends at that point.
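To show how a chain of steps extends beyond the login, here is a sketch of a hypothetical follow-on step, written in the same snippet style as the examples above (imports omitted). The ViewAccounts class and its selectors are invented, as is the assumption that @Step can reference the Login interface in order to follow both of its scenarios; it is not part of the example on GitHub.

// A hypothetical step that follows Login; since FailedLogin is a @Terminator,
// Cascade would only append this step to journeys that logged in successfully.
@Step(Login.class)
public class ViewAccounts {

    @Demands
    private WebDriver webDriver;

    @When
    public void when() {
        // The same helper methods used in the examples above.
        click(webDriver, "[test-link-accounts]");
    }

    @Then
    public void then() {
        assertElementPresent(webDriver, "[test-page-accounts]");
    }
}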
The example on GitHub extends the journeys much further. Feel free to have a look.
Modelling the state machine
So by the time you have finished writing all your step files and linked them all together, you have done something really interesting. You no longer merely have a collection of disparate scripts that describe journeys. What you have is a model. You have described a state machine and described how to transition from one state to another.
And once we have created this model, which is quite beautiful by the way, we can do interesting things with it, because it’s a mathematical system rather than just a collection of scripts.
We can generate graphs in the test reports over this model
Cascade will generate a graphical representation of the state machine.
We can view a summary of all the data at any particular stage in the journey. By clicking on the appropriate links in the graph, we would see something like this.
We can selectively run only specific tests or group tests
Cascade allows you to specify a filter for the test generator. It will then filter out tests that don’t match the specified predicate.
@FilterTests
Predicate filter = and(
        withStep(Portfolio.CurrentAccountOnly.class),
        withStep(SetupStandingOrder.SetupStandingOrderForLater.class)
);
Cascade can use algorithms to minimise the set of tests
This is the real gem of Cascade. This is the reason I wrote the entire thing. We can use an algorithm that calculates the relative rarity of each scenario, and we can use that information to generate a minimal set of journey tests, so that each scenario is included at least once. Even better, the algorithms used can calculate a relative value for each scenario. And we can use that information to balance the scenarios across the tests. Testers no longer need to analyse the test set in order to minimise test execution times. Cascade does this automatically.
Cascade can name your journey tests
Cascade will generate a name for each journey, using the relative rarity of each step within the test suite as a whole to decide which scenarios to highlight in the name.
Cascade can minimise tests in different ways
Cascade can use a slightly altered algorithm to minimise tests in different ways. Instead of asking for a minimal set of tests covering all the Scenarios, you can ask for one covering all the steps, or all the transitions. Each set is complete in a different sense and generates a different number of tests. This is most applicable to smoke testing and the like.
Cascade can keep an eye on coverage
Cascade can calculate the level of coverage that a generated test set has over the complete set of possible journeys as a whole. This is not the traditional code coverage report, however.
So what are we losing?
We are losing the straightforwardness of explicitly defining the test scripts ourselves, and exchanging it for a system that tells us what the scripts will be. Whether this is too high a price to pay will depend greatly on how understandable and accommodating Cascade is in practice. Much of what Cascade does might appear at first to be magic or voodoo, but the test reports that Cascade generates do more than mitigate this issue. The test reports graphically model the state machine. Consequently, they offer a way of viewing and analysing your system that I don't believe was possible before.
So let’s summarise
- The problem of testing a system is becoming harder as teams grow larger, as we have more processes and as we adopt a microservices architecture.
- The testing problem is fundamentally different moving forward. We have less capability for testing specific functional points than we do in the unit test environment.
- We are cursed by very costly network calls, and we can't escape this issue unless we homogenise our tech stack and abstract out the network calls.
- We need new tools for dealing with these state machines.
- The state machine is intrinsically beautiful. We should take advantage of it.
About the Author
Robin de Villiers is a Java Contractor and associate of Fluidity Solutions based in London. With 17 years of professional experience in developing applications at banking institutions, gaming companies, telecoms and government agencies, Robin has spent quite a few years in the trenches. Robin is passionate about solving problems; the tools and methodologies that you use matter.