Key Takeaways
- Programmable infrastructure is complicated, carries risk, and still needs testing.
- Tools alone won’t solve the problem, especially because cloud infrastructure is fast-moving.
- Don’t only test basic assertions. Your testing strategy should mitigate risk and investigate quality aspects that matter to you.
- Exploratory testing is rarely applied to cloud infrastructure, yet it is a powerful approach that complements automated regression suites.
- Testers need training and experience to carry out effective testing, as well as collaboration and help from infrastructure experts.
At QCon London 2017, I presented the talk Testing Programmable Infrastructure With Ruby. Daniel Bryant, an InfoQ editor, has written a summary of it here.
In my talk, I relayed my experience at OpenCredo, where I was testing a cloud broker we were developing for a client. The cloud broker provisioned and managed cloud infrastructure across multiple clouds. It targeted two end-user groups, both within the client organisation:
- Internal development teams, who want to deploy their applications on cloud infrastructure.
- Management, who want to track team resources and spending.
We found we could apply standard techniques to test the application's web pages and API. But testing the provisioned cloud infrastructure itself proved difficult. We had to adopt new processes, techniques, and tools to get the same kind of coverage and assurance. We also came up against technical limitations intrinsic to cloud infrastructure:
- Many operations are slow and asynchronous.
- Resources are expensive to deploy.
- Tooling is still maturing.
Worse than the technical challenges, we faced cultural challenges too. Sysadmins and testers aren't used to working with one another!
The project made it very clear to me that programmable infrastructure is becoming widespread. There are very specific domain issues that make testing it tricky. But it felt like nobody had the answers.
Infrastructure resources are critical to successful software. A problem with your database or your load balancer can now be caused by committed code. That code is production code, so we should test it!
Over a year has passed since I first presented that talk, and even longer since the project that inspired it. I have been on a number of other projects since, and my thinking has changed too.
Tooling is getting better, but it’s not everything
One of the themes of my talk was that tooling has to improve. I’m reasonably confident that it will continue to. As more engineers carry out the same tasks, they’ll want tools to streamline their work.
One tool I’m excited about is Terratest. At OpenCredo, we already have a major project using it. Terratest has common helper functions for Terraform, Packer, Docker, SSH and AWS APIs. You can use it to test Terraform code, Packer templates, and Docker images.
A simple test might involve the following steps (sketched in Go after the list):
- Invoke your Terraform scripts to create some AWS resources
- Use the AWS API to verify the creation and configuration of each resource
- SSH into any provisioned EC2 instances and verify their configuration from the inside
- Destroy the resources afterwards
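To make this concrete, here is a minimal sketch of what such a test can look like. The module path, output name, and region are placeholder assumptions, and you should check the helper names against the current Terratest documentation, as the library evolves quickly:

```go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestEc2Instance(t *testing.T) {
	// Point Terratest at the Terraform module under test.
	// "../examples/ec2-instance" is a placeholder path.
	terraformOptions := &terraform.Options{
		TerraformDir: "../examples/ec2-instance",
	}

	// Destroy the resources at the end of the test, even if it fails.
	defer terraform.Destroy(t, terraformOptions)

	// Run `terraform init` and `terraform apply`.
	terraform.InitAndApply(t, terraformOptions)

	// Read an output variable from the Terraform state.
	// Assumes the module declares an "instance_id" output.
	instanceID := terraform.Output(t, terraformOptions, "instance_id")

	// Verify the instance via the AWS API - here, simply that it was
	// assigned a public IP. SSH checks would follow the same pattern
	// using Terratest's ssh module.
	publicIP := aws.GetPublicIpOfEc2Instance(t, instanceID, "eu-west-1")
	assert.NotEmpty(t, publicIP)
}
```

The `defer` on the destroy step matters: it ensures teardown happens whatever the test outcome, which is important when every leaked resource costs money.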
These aren’t the only technologies that need testing, though. In my experience, changing a Jenkins Groovy pipeline is risky and complicated, largely because pipelines are awkward to test. I’ve yet to see a regression testing tool for them that I’m happy with.
When I first wrote my talk, my primary focus was on virtual machines. I'd want to ensure our cloud broker correctly provisioned and configured servers. ServerSpec was perfectly suited to this task.
In hindsight, this was a very limited view. My last major project was serverless. The application's purpose was to maintain an inventory of cloud resources. This posed two problems. The first was ensuring the application discovered and catalogued external cloud resources. The other was ensuring our own infrastructure was operating as required! We used a variety of AWS services - DynamoDB, Lambda, SQS, Kinesis, and more - all provisioned from infrastructure code. ServerSpec is great, but it didn't help at all, because there were no servers we had access to!
I can't recommend tools for the long term. Tools target specific technologies, and the industry is too fast-moving. The number of AWS services alone rises year on year, and the growth of serverless and containerisation technologies only adds to this. You could learn an amazing Kubernetes or AWS Lambda testing framework, but it might be redundant in another couple of years.
So if tools can’t help us, what can?
Revisiting testing fundamentals
When testing anything new, it’s important to revisit fundamentals. When I first gave my talk, I focused a lot on how we tested, but not much on why. I cannot tell you what your cloud infrastructure landscape looks like; the topic is broad and fast-changing. Testing fundamentals, on the other hand, don’t change. We need to think about how to apply them to a new domain.
One of my favourite questions to ask in interviews is "What is testing? And why do we do it?" It is a curiously difficult question. Here’s my answer:
Software testing is an investigation to discover information about software quality.
It is not simply an exercise in re-verifying what we intended to build - although that is an important aspect of it. You can misinterpret what users want or need. Your application can have use cases you didn't think of.
So how do we start generating ideas about unknown aspects of our system?
Test heuristics
A test heuristic is an experience-based technique used to generate ideas for tests. Every time you write tests, you consult your heuristics. Elisabeth Hendrickson has a cheat sheet of examples. Some heuristics are generic, some are domain specific. Security and performance have their own heuristics - this is why there are often specialists in those domains. For further reading, I’d strongly recommend this excellent article by Katrina Clokie.
Heuristics are a great place to start when deciding what kinds of tests to carry out on programmable infrastructure. These might relate to common infrastructure risks, like data loss or loss of availability. Here is a sample (one question is turned into an executable check after the list):
- Are my volumes deleted when I terminate my instances?
- Is my database sharding algorithm distributing load to each database shard evenly?
- Are my AWS S3 buckets public, and do they need to be?
- Have I examined my IAM roles and who has access to what?
- Can I redeploy resources whilst the application is in use, and will it cause data errors?
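To show how a heuristic can become an executable check, here is a sketch answering the S3 question with the AWS SDK for Go (v1). The region is a placeholder, and a real audit would also inspect bucket policies and public access block settings; this only flags ACL grants to the global AllUsers group:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Placeholder region; credentials come from the usual AWS sources.
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String("eu-west-1"),
	}))
	svc := s3.New(sess)

	buckets, err := svc.ListBuckets(&s3.ListBucketsInput{})
	if err != nil {
		log.Fatal(err)
	}

	// The predefined ACL group that represents "anyone on the internet".
	const allUsers = "http://acs.amazonaws.com/groups/global/AllUsers"

	for _, b := range buckets.Buckets {
		acl, err := svc.GetBucketAcl(&s3.GetBucketAclInput{Bucket: b.Name})
		if err != nil {
			log.Printf("could not read ACL for %s: %v", *b.Name, err)
			continue
		}
		for _, grant := range acl.Grants {
			if grant.Grantee != nil && grant.Grantee.URI != nil &&
				*grant.Grantee.URI == allUsers {
				fmt.Printf("bucket %s is publicly accessible - is that intended?\n", *b.Name)
			}
		}
	}
}
```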
One property of heuristics is that they require domain knowledge and experience. Asking these kinds of questions involves detailed knowledge of the application architecture. Judging whether they are worth asking requires an ability to analyse and assess risk. Lastly, we need the experience and knowledge to interpret the answers we get and recognise problems in them. The mechanisms by which we recognise problems are called test oracles. To read more about oracles, Michael Bolton has a great description here.
Going off the beaten track
Now that we have all these great ideas for testing different aspects of our infrastructure, we need to be able to execute them. In all likelihood, you’ll struggle to automate every test idea you can think up. And I’d argue you don’t need to.
The best testing is often a mix of well-designed automation suites and exploratory testing. Exploratory testing is a powerful technique that is technology and domain agnostic. There is no reason why we can’t start using it for infrastructure now.
Exploration sessions dedicate time to investigating risks, without the restriction of scripts. Exploration is not to be confused with ad hoc testing! Sessions are time-boxed to an hour or so and targeted at a specific aspect or risk of an application.
The only guidance is a test charter targeted at uncovering specific information. For example: "I want to discover potential security issues in our application by examining our IAM roles and what they have access to".
This charter gives a broad overview of what kinds of information might be expected, whilst giving freedom to find unexpected issues. In your sessions you think of a test, execute it, observe the results, and use that to inform your next test. Tests can be user focused, technically focused, or both. As you go, you document everything you do and observe, so you can report your findings to others. Using creativity and intuition is actively encouraged. The idea is to find as much new and interesting information as possible.
This activity is not intended as a replacement for whatever test automation you may have; it is a complement. It uncovers lots of bugs, and can feed back into what you want to assert in your automated tests.
It also clarifies ambiguous functionality in software. Sometimes software works in an unusual way and it's not clear how it should work. You hear a lot of "huh, I didn’t think of that!" when debriefing others after an exploration session. This kind of analysis is not easily automated and doesn’t have clear pass/fail criteria.
There is a lot of information out there about how to do exploratory testing well. I’d recommend reading Explore It! by Elisabeth Hendrickson. It’s a quick and enjoyable read, and it would be amongst the first books I recommend to any new tester.
It may seem odd that I’ve brought up a technology-agnostic test technique in an infrastructure article. But exploratory techniques are rarely applied to programmable infrastructure. Cloud experts tend to come from an operational or development background, while testers are often non-technical, or leave programmable infrastructure to others. I think there’s a lot to gain if we start cross-pollinating ideas and techniques.
Testers need to get better at cloud infrastructure, ops need to get better at testing
A key theme of my original talk was that, as a tester, I was unfamiliar with such a massive new domain. This made it very hard for me to know what to test, and whether the things I observed were good or bad.
The most important thing is to raise awareness. We need to value testers with infrastructure skills. Currently, the most valued skill in a tester is the ability to automate. This has produced many more automation engineers: testers have learned new skills, and developers help out with technical testing activities. I see no reason why we cannot make the same shift with programmable infrastructure.
Another is to skill up and involve our existing testers. This is not too different from existing approaches to teaching testers technical skills. Some activities that can help are:
- Training / certifications: I’m currently studying for the AWS Solutions Architect Associate certification. It covers the basics of the different AWS services and how they fit together. The training available for such certifications helps any tester understand cloud infrastructure; providers include A Cloud Guru and Linux Academy. If you want to learn programmable infrastructure hands-on, you could use something like Katacoda.
- Pairing: Testers and devs often pair together, and they can also pair on infrastructure code. Even if the tester initially only shadows to learn the domain, this still helps them test the system.
- Assigning infrastructure work: Often the best way to learn is by doing. A lot of testing activity can involve infrastructure-related work, such as setting up pipelines and test infrastructure, or designing infrastructure tests. This may work more easily with technically focused testers, who may already be doing these tasks.
- Collaborative activities: There are many collaborative testing activities that apply to infrastructure. Risk storming could highlight risks with the pipeline or environment deployment. You can apply Three Amigos to infrastructure stories. For example: "As a developer, to run my Java applications, I need the JDK 1.8 runtime installed on our containers". How could you test this? Why does it need to be JDK 1.8? (One possible automated check is sketched after this list.)
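For that last story, a first automated answer might be as simple as running `java -version` inside the container image and asserting on the output. A minimal sketch, assuming Docker is available on the test machine and using a placeholder image name:

```go
package test

import (
	"os/exec"
	"strings"
	"testing"
)

func TestContainerHasJdk18(t *testing.T) {
	// "my-base-image:latest" is a placeholder for your team's image.
	cmd := exec.Command("docker", "run", "--rm", "my-base-image:latest", "java", "-version")

	// `java -version` prints to stderr, so capture both streams.
	out, err := cmd.CombinedOutput()
	if err != nil {
		t.Fatalf("failed to run java -version: %v\n%s", err, out)
	}

	if !strings.Contains(string(out), "1.8") {
		t.Errorf("expected a JDK 1.8 runtime, got:\n%s", out)
	}
}
```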
Everything changes, and nothing stands still
Our industry is still adapting to life without silos. Many of the issues I’ve raised here have already been solved - these are not new techniques - but not everyone knows about them.
The fact is, we are all engineers. And we all have our specialisms. But we can all benefit from learning from one another. I’ve been really encouraged by people reaching out to me saying they’ve been facing the same problems, and want to do something about it.
If we want to start testing infrastructure properly, we need two things.
First, testers need to learn more about the domain. Learn about cloud infrastructure, and learn how programmable infrastructure allows us to deploy and maintain it. Learn what could go wrong, and where the risks lie.
Secondly, the rest of us need to value what testers can bring to the table, and recognise that infrastructure is not a domain just for ops (and now devs). The best testing strategies are context specific: they target and mitigate risks, and stop problems reaching your production environment. Risks exist in infrastructure too.
About the Author
Matt Long is a Senior QA Consultant for OpenCredo, an open source consultancy based in London.
He has worked as a consultant for many years, and built test automation frameworks in half a dozen languages. His particular areas of expertise include cloud infrastructure, serverless architectures, and API and web testing. He has presented talks at QCon, muCon, TestBash, and many smaller meetups across London.
In his spare time, Matt enjoys indie pop music and videogames, and has an encyclopedic knowledge of Simpsons quotes. He also builds and maintains a machine learning foosball bot.