
Reaching Your Automatic Testing Goals by Enhancing Your Test Architecture

Key Takeaways

  • Test architecture integrates technical elements (like code, frameworks, and tools) and human factors (team organization and processes) to create reliable and efficient testing systems that deliver real value: development teams can move more quickly without breaking things.
  • Prioritizing the analysis of test trends over individual test failures allows teams to focus on critical issues, enhancing the reliability of defect detection while maintaining productivity.
  • Developing a collaborative testing framework encourages feature teams to actively write tests, leveraging their domain knowledge and ensuring quick and effective progress through provided tools and mentorship.
  • Using machine learning for auto-triaging test failures enhances situational awareness and optimizes responsiveness to software quality issues, promoting a data-driven approach to evaluating test results.
  • Targeting the reduction of "microfailures" that frustrate end users fosters a smoother customer experience, aligning testing strategies with company values and emphasizing long-term quality improvements.

If you have automatic end-to-end tests, you have test architecture, even if you’ve never given it a thought. How can you ensure your test architecture achieves the goals you have for your automatic testing effort?

In this article, I will dive into the choices my employer, Neat, a video conferencing solution company, has made in our test architecture to preserve and create value for our company, customers, and software development teams.

Test architecture encompasses everything from code to more theoretical concerns like enterprise architecture, but with concrete, immediate consequences. While our specifics (embedded software and end-to-end tests written in Python) may not match your situation, many of the questions we asked to determine the choices we made in our implementation are likely relevant for any team doing automatic testing. These questions have led to the measures we have intentionally chosen to catch regressions, reduce false positives, write maintainable tests, understand test results, and enable development teams to do their job faster and with increased confidence.

What is test architecture?

Test architecture is a mixture of built artifacts such as test code, frameworks, tools, and human elements such as team organization, processes, and practices. It’s situated at the intersection of software architecture and enterprise architecture. I see its overarching goal as reliably creating and preserving value.

I think many teams don’t connect testing enough with value creation. They don’t tailor their testing efforts enough to give development teams the confidence to move fast and decisively by providing those teams with a safety net they can actively - and immediately - use to validate their changes.

How testing reliably preserves value comes down to a company’s reputation. The sleepless nights in testing come from worrying about what you let slip through. A well-thought-out test architecture helps reduce the number of defects you don’t catch. And that is incredibly important, because the ones you don’t catch can weaken a company’s reputation and ultimately hurt sales.

Core principles of test architecture

When thinking about test architecture, I see a couple of core guiding principles:

  • Trade-offs: Every decision in test architecture requires balancing priorities.
  • "Why" over "How": Understanding and clearly communicating the purpose behind choices, especially when deviating from established practices.

A good illustration of the first principle is the way our tooling lets my team focus on test trends. We often down-prioritize investigating individual failing end-to-end tests if our reporting tools show they have a previous clear passing trend. This means developers may get feedback about a defect they have introduced into the nightly build in a couple of days rather than immediately. However, the trade-off is that by choosing not to analyze every test failure deeply and instead focusing on trends, we face fewer distractions and less context switching, thus maintaining a laser focus on what is important. We, therefore, catch defects more reliably, ultimately letting fewer defects slip through to the customer.

Concerning "why" is more important than "how", I think a (test) architect always has to have a clear view of what they are trying to accomplish when, for example, they advocate the use of a specific code construct. Suppose that construct brings, for example, type safety. In that case, they should be open to not using it in particular contexts where type safety problems are not likely to manifest themselves, and that construct just increases complexity.

People sometimes get upset when it looks like an architect is providing seemingly inconsistent advice in a code review by advocating a solution that does not correspond to "how we usually do things". In that case, it is all the more important for them to be clear about the "why" - why they advocate the things they do.

Overview of the test architecture at Neat

I’ll break the discussion of our test architecture at Neat into three points:

  • Organizational structure: Dedicated test automation team and its collaboration with development teams.
  • Evolution of test automation: Transition from basic coverage to a collaborative framework where feature teams write tests using provided tools.
  • Framework and CI infrastructure: Using Python Behave for reusable test steps and building a streamlined CI pipeline for testing and reporting.

Organizationally, we have a dedicated test automation team that owns the test automation discipline. We are a maturing startup, creating video conferencing solutions, and we started our test automation drive about three years ago. We began by building up coverage of our products’ existing functionality. At the same time, we focused on creating frameworks, tooling, and processes that our discipline development teams (audio, video, platform, etc.) have seen added value in adopting and using as they develop new features. This has led to discipline team members using their domain knowledge to write new tests using the tools we provide, while the test automation team provides mentorship, mainly through code reviews of the test code these teams write.

Our primary test framework is Python Behave. In the test automation team’s initial phase, we created a lot of reusable test step implementations that cover most of the basic operations you need to perform when running tests on our video conferencing devices. So when developers in the feature teams start writing a test for a new feature under development, most of the test code required for that test already exists. They can just include the existing steps in the Gherkin (given-when-then) test case. Then, they only need to focus on developing the few remaining steps that correspond to the specifics of the feature they are developing. This allows them to be effective when creating new tests.
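
To make that concrete, here is a minimal sketch of what such step reuse can look like in Behave. The step names and the device helper object are purely illustrative assumptions, not a description of our actual framework; the point is that only the last step would be new code for the feature team.

    # steps/call_steps.py -- illustrative Behave step implementations.
    # A Gherkin scenario reusing existing steps and adding one new one:
    #
    #   Scenario: Noise suppression during a call
    #     Given the device is paired with the room controller   # existing step
    #     When I start a meeting                                 # existing step
    #     Then background noise is suppressed                    # new, feature-specific step
    from behave import given, when, then

    @given("the device is paired with the room controller")
    def step_pair_device(context):
        # "context.device" is a hypothetical test-harness object set up in environment.py.
        context.device.pair_with_controller()

    @when("I start a meeting")
    def step_start_meeting(context):
        context.device.start_meeting()

    @then("background noise is suppressed")
    def step_check_noise_suppression(context):
        # The only new code the feature team would typically need to write.
        assert context.device.noise_suppression_active()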

At the same time, we have built a simple-to-use CI infrastructure for running the tests, reporting, and following up on the results. Again, our focus has been on creating easy-to-use, reusable components that create value by making it easy and quick to develop new test jobs.

We use GitLab CI for running our tests, and when you want to create a new job, you can just inherit one of the configurations we created to handle flashing your devices with fresh firmware, running your tests, and reporting the results. And those are just some of the most important things this reusable base configuration does.

On the reporting side, we use Elasticsearch, and teams can take a copy of the Kibana dashboards we have made and add just one filter so that their version of the dashboard only tracks the test results and test trends the team is interested in.
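
In query terms, that single team-specific filter boils down to something like the sketch below. The index name, field names, and team value are hypothetical, and the snippet assumes the elasticsearch-py 8.x client; it is only meant to show how little needs to change per team.

    # Rough query-level equivalent of the one extra Kibana filter a team adds.
    # Index and field names are hypothetical; assumes elasticsearch-py 8.x.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    response = es.search(
        index="test-results",
        query={
            "bool": {
                "filter": [
                    {"term": {"team": "audio"}},                     # the team-specific filter
                    {"range": {"timestamp": {"gte": "now-30d"}}},    # shared 30-day window
                ]
            }
        },
    )
    print(response["hits"]["total"])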

Strategic approach to testing

While we focus most of our efforts on nightly builds, we have chosen to have a minimal set of gating tests, a subset we call "essentials".

We then run hundreds of feature tests that are not gating and do not stop the software build in the CI since end-to-end tests are, in spite of our best efforts, affected by an incredible number of variables.

Therefore, we have focused on running as many tests as possible every night, recovering devices quickly when failures occur, and moving on to the next test promptly so that one failure does not cascade into failures in the tests that follow.
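
That kind of recovery can be wired directly into the test run. Below is a minimal sketch using Behave’s environment hooks, with a hypothetical device-recovery helper; it is an illustration of the idea, not our production setup.

    # environment.py -- illustrative Behave hooks; the recovery helper is hypothetical.
    def after_scenario(context, scenario):
        # If a scenario fails, reset the device so the next test starts from a clean
        # state and one failure does not cascade into the tests that follow.
        if scenario.status == "failed":
            context.device.reboot_and_wait_until_ready()

    def before_scenario(context, scenario):
        # Sanity-check the device before each test; recover it rather than letting the
        # scenario fail for reasons unrelated to the feature under test.
        if not context.device.is_responsive():
            context.device.reboot_and_wait_until_ready()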

We then provide tools focused on qualifying trends to effectively see through the noise and identify actual problems. We have built a framework that qualifies the trend for each test case and stores the 30-day trend for that test case as an attribute of each test result we store in our Elasticsearch document database. This qualification is both graphical, with a red and green (passing-failing) trend band, and textual, with a classification such as "intermittently failing", "consistently passing", or "flaky". The latter classification is made by a functional programming-inspired framework written in Python, where one function is responsible for identifying each category. The test result gets its category when one of the functions returns a match. Both the trend band and the classification are shown on our dashboard.
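
A simplified sketch of that approach might look like the following. The category names match the ones above, but the thresholds and function names are illustrative assumptions only.

    # Illustrative per-category classifier functions; thresholds are assumptions.
    from typing import Callable, Optional, Sequence

    # Each function inspects the last 30 days of results (True = pass, False = fail)
    # and returns a label if it recognizes the trend, otherwise None.
    def consistently_passing(results: Sequence[bool]) -> Optional[str]:
        return "consistently passing" if all(results) else None

    def intermittently_failing(results: Sequence[bool]) -> Optional[str]:
        failure_rate = results.count(False) / len(results)
        return "intermittently failing" if 0 < failure_rate <= 0.2 else None

    def flaky(results: Sequence[bool]) -> Optional[str]:
        # Frequent alternation between pass and fail suggests flakiness.
        flips = sum(1 for a, b in zip(results, results[1:]) if a != b)
        return "flaky" if flips >= len(results) // 3 else None

    CLASSIFIERS: list[Callable[[Sequence[bool]], Optional[str]]] = [
        consistently_passing,
        intermittently_failing,
        flaky,
    ]

    def classify_trend(results: Sequence[bool]) -> str:
        # The first classifier that returns a match decides the category.
        for classifier in CLASSIFIERS:
            label = classifier(results)
            if label:
                return label
        return "unclassified"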

Actively creating value for the development teams

I have already mentioned how we provide our feature teams with a framework to make it easy to write new tests and get those new tests quickly up and running in the CI. This is just one example of the value that we, as a test automation team, provide to our feature development teams. Another example is a chat agent tool that lets both QA team members and, above all, developers order tailored test runs that only target specific domains or feature sets. These test runs can be ordered to run against any build, including feature team developers’ personal builds. We then provide reporting on all of our test runs that auto-triages known issues so the developer or tester can quickly distinguish between expected failures and issues related to the changes that are about to be merged into the main branch.

Machine learning in test automation

When a developer orders a custom test run to validate changes they are thinking of integrating, they get a report that clearly differentiates between expected failures and issues their changes may have caused. You can even tell the machine learning model behind this to auto-triage on the next occurrence of newly identified issues stemming from the code changes being validated. You can do this from within the same reporting tool developers use to see the results from the custom test run they ordered. This way, as a test automation team, we don’t have to opt for the deeply unsatisfying solution of turning off tests that we expect to fail. But we still provide readable, actionable results. And since we’re also tracking trends, developers are automatically alerted when regularly failing tests and regressions are fixed.

If we had started our machine learning initiatives six months later, we would have had the option of letting an LLM do the heavy lifting for us. But when we started, we had to build our own models, albeit using mature and easy-to-use libraries. Our first initiative, which involved classifying types of test failures, was moderately successful but did not change how we worked. However, that experience laid the foundation for understanding how to use machine learning to auto-triage test failures. That indeed changed the way we work. We could allow tests we expected to fail to continue to fail without wasting time or losing clarity when distinguishing between expected and new failures. We would not have been able to make that leap if we had just passively used a black box LLM to solve the classification problem without really understanding how these models work. That understanding has been crucial to seeing the possible further applications of the technology.
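
To give a flavour of the idea, the sketch below matches new failure messages against known-issue signatures using TF-IDF and cosine similarity. The library choice (scikit-learn), the issue ids, and the threshold are assumptions for illustration, not a description of our production model.

    # Minimal sketch: match a new failure message against known-issue signatures.
    # Library choice and threshold are assumptions made for illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    known_issues = {
        "ISSUE-1234": "Timeout waiting for camera stream to start",
        "ISSUE-5678": "Audio device enumeration failed after firmware flash",
    }

    vectorizer = TfidfVectorizer()
    issue_vectors = vectorizer.fit_transform(list(known_issues.values()))

    def auto_triage(failure_message: str, threshold: float = 0.7):
        """Return a known-issue id if the failure looks like one we already track."""
        failure_vector = vectorizer.transform([failure_message])
        similarities = cosine_similarity(failure_vector, issue_vectors)[0]
        best = similarities.argmax()
        if similarities[best] >= threshold:
            return list(known_issues)[best]
        return None  # A new kind of failure: surface it to the team instead.

    print(auto_triage("Timed out waiting for the camera stream to start"))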

Auto-triaging known-issue-related failures has been an unqualified success for my team. Our machine learning model doesn’t have the biases and tendency to jump to false conclusions that we humans do. We continually see cases where our machine learning model stops auto-triaging a regularly failing test because that test has suddenly started to fail differently. This allows my team to extract value even from failing tests, because being alerted to changes in how tests fail contributes to my team’s situational awareness regarding the software we are testing. Our machine-learning models can also hold far more failure history in memory than we humans can when performing classification tasks for us. So, using these models is an instance of letting the machine use its advantages to compensate for our innate weaknesses and thus complement our skills as a QA team.

Cross-team trust and mentorship

My team is in a unique position compared to test teams at other companies: we provide code review and mentoring to the software developers building new features in our video conferencing solutions, specifically for the new end-to-end tests those feature teams write to cover the features they are implementing in our products.

The developers in our feature teams recognize that we, as a dedicated test automation team, have special competence in making tests readable, maintainable, and, above all, reliable. We also help them find and take advantage of the test tooling we have created. This trust and collaboration go both ways. We, in the test automation team, rely on and collaborate with the feature teams to ensure the necessary test instrumentation is built into our products, making many of the test automation scenarios we need to cover possible to automate in the first place.

Aligning testing strategy with company values

We at Neat have specifically focused our test automation strategy on eliminating something called "microfailures". We can describe a microfailure as an instance of friction in a system that manifests itself when an end user encounters difficulties and must, for example, retry the same operation several times before it finally succeeds. As such, "microfailures" are the opposite of complete failures, or "macrofailures", where something just doesn’t work. For a scholarly treatment, see the article Introducing the concept of microfailures.

Microfailures are pernicious because customers rarely think they are worth reporting when they occur. But if one customer hits several of them, it can be very frustrating and may drive them to switch suppliers without us ever knowing why.

The way to mitigate them is to identify and track them over time, then work on each one until you can see it has been reduced or eliminated. This means creating test tooling that doesn’t fail the test at the first difficulty but instead tries again, reporting how many attempts it took and whether the operation eventually succeeded. We have created a special framework for this that collects data for every microfailure we track. It allows us to track the trend for each microfailure over time. We can thus see if the issue is improving or worsening and use that to continually evaluate the success of efforts to mitigate each microfailure.
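
A stripped-down sketch of that kind of tooling is shown below; the helper names and the retry policy are illustrative assumptions rather than our actual framework.

    # Illustrative retry helper that records microfailures instead of failing on
    # the first attempt; names and retry policy are assumptions.
    import time

    def attempt_with_tracking(operation, name, max_attempts=5, delay_seconds=2.0):
        """Retry an operation, recording how many attempts it took and whether it succeeded."""
        for attempt in range(1, max_attempts + 1):
            if operation():
                record_microfailure_metric(name, attempts=attempt, succeeded=True)
                return True
            time.sleep(delay_seconds)
        record_microfailure_metric(name, attempts=max_attempts, succeeded=False)
        return False

    def record_microfailure_metric(name, attempts, succeeded):
        # In a real setup this would be written to the results store (e.g. Elasticsearch)
        # so the 30-day trend for each tracked microfailure can be charted.
        print(f"{name}: attempts={attempts}, succeeded={succeeded}")

    # Usage (hypothetical): any call that needs more than one attempt is a microfailure.
    # attempt_with_tracking(lambda: join_meeting(), name="join_meeting")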

Eliminating microfailures is essential for my employer since we strive to deliver a friction-free experience to customers using our video conferencing solutions. This is a case of aligning our test architecture with who we strive to be as a company.

Creating value through testing

Test automation is a challenging endeavor. It’s not a discipline for people who can’t cut it as "real" programmers. Without good test automation resources, companies may be tempted to give up entirely on end-to-end testing and focus on unit tests on one side of the spectrum and manual testing only on the other.

But that is a significant loss, because automated end-to-end testing can provide value by consistently catching regressions that get past unit tests and would otherwise make it to customers. Human-based testing is too irregular to catch every regression every time.

The key to consistently delivering value is a well-designed test architecture that frees up manual testing resources for what we humans are especially good at: exploratory testing.
