Key Takeaways
- We saw that motivated agile teams struggle to reach their software quality goals, leading to deteriorating software quality over time. We wanted to assist our freshly staffed team in reaching and maintaining its high software quality goals.
- We identified that preventing the broken window effect, keeping the effort for quantitative and qualitative analysis low, creating incentives, and regularly paying attention to software quality helped our team.
- To address those aspects, we created the lightweight practice "quality report," which consists of a role, ceremony, and artifact.
- At the center of the practice is the quality champion, the software quality equivalent of the Scrum product owner. The role is mainly designed to incentivize people to constantly invest in long-term software quality.
- With this practice in place, we have achieved and maintained high software quality for years while coping with the daily pressure of short-term achievements.
Even skilled and motivated agile teams sometimes fail to achieve their software quality goals. In this article, we present a simple yet effective practice we use to assist agile teams in reaching their quality goals and share our experience.
The practice is about paying constant attention to specific metrics. It combines quantitative and qualitative analysis and encourages people to continuously improve. We did this through specific roles, a ceremony, teamwork, and automation.
Environment and Context
We started in 2018 as solution architects at Swiss Post. We were tasked with creating a new software system for parcel sorting. The goal was to ship fast using continuous deployment while guaranteeing high reliability and availability. From the DevOps Report and our experience, we knew that software quality was critical and that we had to achieve it from the beginning.
Our team had a product owner, a Scrum master, three solution architects, thirteen developers, and three business analysts. The agile team handled the development and operations of thirteen applications for parcel sorting. We started with a small team and had permission to recruit new team members. We focused on quality from the start and hired with a focus on quality culture.
The software system responsible for parcel sorting determined who would receive a parcel and ensured it flowed through the parcel sorting centers to reach the responsible parcel deliverer on time. This critical system for Swiss Post needed to be available 24/7, reliable, and functional. The applications were written in C#, ran in Linux containers, and were operated on Amazon Elastic Kubernetes Service (Amazon EKS). We used a mixture of Postgres and MongoDB for persistence and Kafka, MQTT, and REST for integration.
Difficulties with Quality Goals
As solution architects, we were responsible for shaping the software development process. We wanted to enable our agile team to reach its quality goals. From past software projects, we identified the following aspects that hindered teams from achieving their goals.
Missing Incentive: Often, there was no incentive for adequate internal software quality but rather pressure to ship features as fast as possible. Furthermore, organizations often believe they can trade off quality for speed. Yet, high quality is the factor that leads to fewer defects and faster shipment of new features.
Missing Regular Attention: Often, there were a lot of minor issues, like missing tests, logged errors, or warnings. They did not stem from significant flaws, but they caused harm as they slowly accumulated. Soon, there was so much noise that nobody looked into the details anymore. Regular attention was necessary to catch and fix such issues. This meant looking at metrics and acting on them: for example, looking at the number of logged warnings, investigating each one, and fixing the root cause, or, if it was not worth fixing, dismissing it with a meaningful explanation.
Broken Window Effect: Another problem was the broken window effect, which states that deterioration attracts more deterioration. That meant the number of issues grew faster when there were existing issues. When the team paid attention and acted less frequently, the number of problems and the time required to see positive effects grew exponentially.
High Effort: Developers prefer to avoid repetitive tasks. Checking metrics in several different tools was cumbersome. Developers eventually skipped the tools that needed more effort and ignored essential metrics. It had to be a low-effort task to jump from looking at metrics and trends (quantitative analysis) to reasoning about why they show a specific value (qualitative analysis).
Addressing the Difficulties
Before developing the first new application, we wanted to address the difficulties with a lightweight practice. We came up with the following ideas. First, we aimed to solve a straightforward problem: high effort. Instead of using multiple tools to check specific metrics, we created a wiki page template to collect all the metrics in one place. At this stage, we did not invest in tooling. However, we cut the effort for developers by delegating the manual collection and template entry to a single person. We added hyperlinks to the source of each metric so that reasoning about the details remained a low-effort task.
The second problem to solve was the missing regular attention. We decided to have a recurring meeting every week. At the meeting, the developers would look at the metrics from the wiki page and explain any that exceeded a threshold or had a negative trend. For example, we would not accept errors logged in production. Thus, we would examine every error logged in the last seven days and define actions.
We were happy we found simple solutions for half of the problems. But how would we motivate individuals to attend the meeting and rigorously perform the tasks? The daily whirlwind would also affect this team, causing them to postpone non-urgent activities.
Incentive theory suggests that a drive for rewards motivates people. While many think money may be a good incentive or reward, research shows otherwise. A more effective and sustainable way is to provide meaning and make the work visible to respected people. In Scrum, the team works on tasks it gets from the product owner and demonstrates the result in the sprint review. We wanted a similar role but with a focus on the quality of the product. We asked the product owner to state that product quality is a crucial goal and that he delegates the responsibility for achieving it to the "quality champion." (We will describe the role and relevant details later.)
Finally, to resolve the broken window effect, it was critical to act on issues as they appeared so nobody got used to them. This quick reaction gave a sense of urgency and signaled that specific issues were unacceptable. Part of the deal with the product owner was to authorize the quality champion to immediately assign technical tasks of up to one day per person with the highest priority. Another important aspect was to set very low, even challenging, thresholds and not accept sloppy excuses for exceeding them. This was especially important when symptoms appeared regularly and the reporter jumped to a particular conclusion rather than investigating.
Introducing the Quality Report Practice
We defined an artifact, roles, and a ceremony for the practice. The artifact was simply called the "quality report" - a generated Confluence wiki page containing a set of metrics. A tool collected the metrics once a week from different sources and created this page. We ensured the product owner and all developers received edit notifications for this page. That meant they would see the quality champion and the product owner sign off on the report, which underlined the importance of the practice.
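To give an idea of what such a generator can look like, here is a minimal sketch in C#. It is not the tool we use at Swiss Post; the names (Metric, IMetricSource, QualityReportGenerator) are hypothetical, and the rendered wiki markup is only indicative. The sketch merely assumes that every tool (static analysis, logging backend, issue tracker, and so on) can be wrapped in an adapter that returns the current metric values.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

// One collected data point, e.g. "parcel-router: logged errors = 3" (illustrative model).
public record Metric(string Application, string Name, double Value, string SourceUrl);

// Each source tool (static analysis, logging backend, issue tracker, ...) gets its own adapter.
public interface IMetricSource
{
    Task<IReadOnlyList<Metric>> CollectAsync();
}

public class QualityReportGenerator
{
    private readonly IEnumerable<IMetricSource> _sources;

    public QualityReportGenerator(IEnumerable<IMetricSource> sources) => _sources = sources;

    // Collects all metrics and renders one report page grouped by application.
    public async Task<string> GenerateAsync()
    {
        var metrics = new List<Metric>();
        foreach (var source in _sources)
            metrics.AddRange(await source.CollectAsync());

        var page = new StringBuilder("h1. Quality Report\n");
        foreach (var app in metrics.GroupBy(m => m.Application))
        {
            page.AppendLine($"h2. {app.Key}");
            foreach (var metric in app)
                // Deep link back to the metric's source so readers can jump to the details.
                page.AppendLine($"| {metric.Name} | {metric.Value} | [{metric.SourceUrl}] |");
        }
        return page.ToString(); // publishing the page to the wiki would be a separate step
    }
}
```

The important design point is the single aggregation step: developers look at one page instead of many tools, and each row still links back to its source for qualitative analysis.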
The roles were lead developer and quality champion. For every application, a lead developer responsible for investigating and defining actions was chosen. A developer could hold the lead developer role for multiple applications, and the person in the role could change over time.
The quality champion facilitated the weekly meeting, asked questions, challenged the reasoning, and requested improvements. In addition to the tasks during the meeting, the person in this role also wrote the management summary and made suggestions to the product owner. To guarantee the desired effects, it was essential to fill this role with a highly skilled and respected senior developer.
The ceremony was an online meeting hosted by the quality champion every Monday morning. We started the meeting with the preparation phase, in which every lead developer looked at the metrics, did the necessary investigations, and filled in the wiki page. Then, the lead developers reported their findings to the quality champion during the reporting phase. In this part, the quality champion challenged reasons and actions, especially by asking for elaboration when someone made an assumption, such as "there was a network problem." Discussions between the lead developers occurred in the last part of the meeting.
The Report in Detail
The report’s "state" was shown at the top. The quality champion set the state to "approved" after writing the free text part of the management summary and signing the report. The product owner received a notification, read at least the summary, and signed the report. The application summary was automatically created, and the overall status of every application was summarized in the report.
The application section followed the management summary section. Each application had a chapter; the header contained the lead developer and deep links to the metric sources. The table with the metrics collected from different sources always showed a reference number, the metric's name, the value from last week, the current value, and a state. The state was set depending on a configurable threshold and colored to catch the attention of the person analyzing the report.
Analyzing the report means the responsible person looks at every metric with an "action" state. Either the finding can be dismissed (for example, code coverage is low, but a pull request that will fix this is already waiting to be merged), or action is required (for example, we need to investigate and resolve an unexplained logged error in production). The explanation is then written in the chapter below the table, using the number in the table as a reference.
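The following sketch, in C#, illustrates how such a table row and its state could be modeled. Again, the names (ReportRow, MetricState, ReportRowFactory) and the parameter choices are ours for illustration; the only behavior taken from the practice is that a configurable threshold decides whether a metric is fine or needs an explanation or action.

```csharp
public enum MetricState { Ok, Action }

// One row of the report table: reference number, metric name, last week's value, current value, state.
public record ReportRow(int Number, string Name, double LastWeek, double Current, MetricState State);

public static class ReportRowFactory
{
    // The state depends on a configurable threshold; trends are judged by the reporter.
    // "higherIsBetter" distinguishes metrics like code coverage (>= threshold is fine)
    // from metrics like logged errors (<= threshold is fine).
    public static ReportRow Create(
        int number, string name, double lastWeek, double current,
        double threshold, bool higherIsBetter = false)
    {
        bool withinThreshold = higherIsBetter ? current >= threshold : current <= threshold;
        var state = withinThreshold ? MetricState.Ok : MetricState.Action;
        return new ReportRow(number, name, lastWeek, current, state);
    }
}
```

For example, `ReportRowFactory.Create(3, "Logged errors", lastWeek: 0, current: 2, threshold: 0)` would produce a row in the "action" state that the lead developer has to explain below the table.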
Teams and applications may differ in which metrics and thresholds they use and how they interpret them. Our team uses the following set of metrics for every application.
Static code analysis: We use Sonar and all code analysis rules from the .NET SDK (aka Roslyn analyzers). The team decided not to accept any issues; hence, the threshold was zero.
Automated testing: We practice continuous deployment and automate all our tests. The foundation of our quality assurance is our component tests, consisting of unit tests and component integration tests. As there is no other quality gate and no way to distinguish between code that is important to test and code that is not, the desired coverage is 100%, and we do not accept any ignored tests.
Code Duplication: There is no threshold for code duplication, and it is up to the reporter’s judgment whether it is okay. Here, it is vital to keep an eye on the trend. We also track the lines of code; they are interesting because they correlate with build time and give a hint about application size.
Logging: Our goal is to discover every issue in production before we get a call from our customers. Hence, we have a rigorous exception-handling and logging strategy (illustrated in the sketch after this list of metrics). The heuristic is: every logged error needs immediate attention, and every warning needs to be examined within a day. We do not accept any warnings or errors in production, and we must investigate every log message of this type.
Bugs: It is essential to keep track of open bugs. They always have a high priority. The team should fix them or accept and close them. This metric counts the Jira issues of type "bug" for an application to prevent bugs from slipping through.
Pull requests: We apply trunk-based development with feature branches. With this strategy, it is vital to merge pull requests quickly. A tool automatically updates dependencies and merges the pull request when the CI build is green. Pull requests left open for a few days indicate that action is needed.
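The logging heuristic mentioned above can be illustrated with a small C# sketch using Microsoft.Extensions.Logging. It is our own simplified illustration, not the team's actual code, and the domain types (ParcelSortingHandler, ParcelEvent, TransientSortingException) are hypothetical. The point is the level policy: errors are reserved for conditions needing immediate attention, warnings for conditions that must be examined within a day, so the logs stay free of noise.

```csharp
using System;
using Microsoft.Extensions.Logging;

public record ParcelEvent(string ParcelId);

public class TransientSortingException : Exception { }

public class ParcelSortingHandler
{
    private readonly ILogger<ParcelSortingHandler> _logger;

    public ParcelSortingHandler(ILogger<ParcelSortingHandler> logger) => _logger = logger;

    public void Handle(ParcelEvent parcelEvent)
    {
        try
        {
            Route(parcelEvent);
        }
        catch (TransientSortingException ex)
        {
            // Recoverable and retried automatically: logged as a warning,
            // which still has to be examined within a day per the heuristic.
            _logger.LogWarning(ex, "Retrying routing of parcel {ParcelId}", parcelEvent.ParcelId);
        }
        catch (Exception ex)
        {
            // Unexpected: logged as an error that needs immediate attention
            // and will surface in the next quality report.
            _logger.LogError(ex, "Routing of parcel {ParcelId} failed", parcelEvent.ParcelId);
            throw;
        }
    }

    private void Route(ParcelEvent parcelEvent) { /* domain logic omitted */ }
}
```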
The Learnings
After four years of using the quality report, the team size and the number of production applications tripled, and the code base expanded significantly. In an internal survey, developers agreed that the quality report practice was a valuable tool for maintaining quality in development and operations. The consistent focus and incentives made it possible to adhere to ambitious goals. Achievements included 100% code coverage, the absence of static code analysis issues, and noise-free logs with no known errors in production. This commitment to high quality enabled frequent deployment of mission-critical applications.
We had some additional learnings and essential elements to highlight. A notable observation was how quickly individuals discerned patterns and hastily drew conclusions. For instance, when encountering database health check errors, some immediately attributed them to ongoing maintenance despite the absence of such activities. This highlights the indispensable role of the quality champion, whose responsibility lies in challenging explanations from a technical standpoint. We also learned that the possibility of addressing specific issues immediately is vital. It creates a sense of urgency, which makes doing such tasks rewarding and prevents low-priority issues from piling up in the backlog.
We saw that the initial phase, i.e., the ceremony preparation and analysis, is essential. We tried individual preparation in a blocked time slot, but it did not work. We found that the collaboration aspect and "think out loud" analysis in the preparation or later in the ceremony have positive effects. At some point, we tried to involve non-developer roles in the quality report, but that did not work out. Even though they showed up for the sessions at first, they did not contribute to the report and, after a while, stopped attending altogether. Active participation in the quality report requires knowledge about the application’s inner workings and a general understanding of the metrics.
How to Start
The team must believe that a high-quality standard makes software development faster, cheaper, and more fun. Often, people only pay lip service to it and omit essential practices. A team needs to think about metrics that reflect the quality of the software development process and the software itself. Software quality is challenging to measure objectively, but this is unnecessary as the team defines the metrics for itself.
Still, the challenge of daily business remains. Urgent day-to-day operations create distractions, which keep a team busy and prevent it from focusing on critical long-term goals. To tackle this, a team should adopt a practice like the quality report we presented. It is crucial to consider and solve the difficulties mentioned above: missing incentives, missing regular attention, the broken window effect, and high effort. Deploy a practice of your choice to address those difficulties, start small, don’t focus on tooling first, and iteratively inspect and adapt it.
In addition to solving problems, a team must deeply understand technology and software practices. This is important because the development process creates quality, and the team needs to be able to act on the metrics. Establish a healthy balance of seniority and education levels in the team.
Lastly, ensure that the team owns the software development process. That means it is the team’s responsibility to decide what practices to use and how to work. Agile is widespread, yet we still notice stakeholders dictating parts of the software process even though they lack the knowledge to judge which practices will work.