Key Takeaways
- Context is decisive in the creation of an organization's repeatable production-readiness review (PRR) process.
- Understanding the current PRR process and its particulars calls for psychological safety across teams.
- A robust retrospective process creates a lot of reusable components for an effective PRR process.
- Cognitive interviews and conversations with relevant teams reveal insights that drive the PRR process.
- The PRR process works if there is a shared understanding of what it means to be production-ready and individuals feel comfortable enough to speak up and ask about the specifics of PRR.
At QCon Plus November 2021, Nora Jones, CEO and founder of Jeli, talked about how to build production readiness reviews (PRR) with emphasis on context and psychological safety.
Her talk focused on the particulars of a PRR process that relate to incidents.
Few services begin their lifecycle enjoying the support of site-reliability engineering (SRE). Your organization likely is spending time to create this process or to modify the existing process because you had an incident or an untoward outcome, and you might end up with a reactive approach — but that's okay. PRRs are usually fueled by an incident or surprise that prompts an organization to examine what it means to release a piece of software and to become a little more aligned on that. Part of that process is to figure out how those surprises happen by first doing a post-incident review and by looking at past incidents that may have contributed to untoward outcomes.
You can bring a lot of what you learn in incident reviews to PRRs, checklists, and processes. Organizations need to cultivate a PRR process and share it with everyone there. This article will present tangible ways to do that.
Context Is Important
Before you start creating a PRR process, you need to consider the context that applies to your organization. The processes outlined in Google's SRE books are not going to be relevant for a startup, and the production-readiness process that a startup uses isn't recommended for a larger organization.
Assume there are two types of organizations: pre-IPO and post-IPO. A pre-IPO organization might care a lot about releasing stuff quickly, whether or not people are using the product. What it means to be production-ready there is going to be quite different from a business with a service that needs to stay stable all the time.
So, don't take PRRs from other organizations as a blueprint; yours should be unique to you and your context.
The tech industry adopted the idea of PRRs from other industries. PRRs started in the aviation industry, where they worked really well. The concept then moved to healthcare, where organizations decided that since checklists were working very well in aviation, they should also use them. Robert Wears and Kathleen Sutcliffe in their book Still Not Safe looked at system safety in hospitals and wrote that “Many patient safety programs today are naive transfers of an intervention of something successful elsewhere, like a checklist. It’s lacking the important contextual understanding to reproduce the mechanisms that led to its success in the first place.”
This quote is relevant to the software industry because we may dangerously follow this path too. Google’s SRE checklist may work perfectly at Google, but context matters. You need to make sure that your checklists fit your own context. You need to know the reasons behind each item. You need to make sure your coworkers understand that as well.
Incidents Are Catalysts
Once you understand your organizational context, you can look at why you want a PRR process. Most likely, recent incidents have pushed you to it.
Incidents are catalysts to understanding the difference between how your organization is structured in theory versus how it operates in practice. An incident is one of the only times when rules go out the window, because everyone's doing what they can to stop the bleeding as quickly as possible.
Incidents reveal what your organization is good at and what needs improvement in your PRR processes. That difference between what you thought was going to happen and what actually happened can and should inform your PRR process in a beneficial way. Look at how people behaved unexpectedly in the situation, especially between teams. Talk with your colleagues. Collaborate. If everyone is remote, this is even more important.
Study previous incidents and releases to inform your PRR process, and repeat your examination each release. Consistency is important. You might not iterate your PRR process or start anew every time, but you should at least look at it a bit.
Building Your PRR Process
Psychological Safety and Multiple Perspectives
Your organization may have a PRR or similar process already in place but it is important to look at it objectively. You should know the history of your current PRR process and when it was introduced. And you should have a process that teaches this history to new hires and a culture that encourages people to question it.
Psychological safety is a work environment where employees feel free to express their questions, concerns, mistakes, and ideas. If employees don’t have psychological safety, they will continue to perform in the way that they are told is how it’s always been done. They will not question or revisit any reasoning, even long after that reason ceases to exist.
A good measurement of psychological safety in the organization is an analysis of how often people blindly follow lists and procedures despite feeling the need to question the context behind them. New hires at an organization sometimes think that something doesn’t quite make sense, but don’t yet feel comfortable enough to speak up — they do not feel psychologically safe. Do colleagues ask each other questions about why certain things are happening? That’s an important measurement for leaders in organizations as well, who should examine if folks ask them questions.
Asking questions is the only way to tease out what might no longer be relevant. People should feel safe to question rules and procedures that feel normal in your organization. Not every organization has that. It starts with cultivating that relationship with your colleagues.
You have high psychological safety when the people in your organization uniformly feel competent and comfortable enough to say what they actually think about a release. Ultimately, there are a lot more individual contributors to a release than there are managers. Managers can’t ignore some of those questions. The communication doesn’t need to be combative; it can just be inquisitive.
For example, you are aware that your company requires three people to review code before each release. As an engineer, say, you’ve never really questioned this before. But now someone has asked why you use three code reviewers and why they all come from this particular team. As the engineer, you should have the history of that particular nuance, which will help you in the future as well as during that current release. The process increases collaboration with your colleagues and understanding of your organizational context.
You will get many different answers about what’s important about a release, or any defined part of the business, from different people in different contexts with different opinions, from marketing to engineering to public relations. A senior engineer, CEO, and head of marketing have different mental models, and they're all partial and incomplete. Every person should feel safe to speak up. All should know why the PRR is happening and who is driving it and why.
You could even give your current PRR process to other teams and ask them why they think the organization is doing each thing. Don’t ask if they understand it, but ask them why they think each line item exists. If they give you a lot of different answers, it becomes your homework to find out why and to add more context to promote better alignment.
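One way to keep track of that homework is to record each team's stated reason per line item and flag the items where the reasons diverge. The following is only a minimal sketch under stated assumptions: the function name, the line items, the team names, and the short reason labels are all hypothetical, and the labels are assumed to be summaries that the PRR writer assigns by hand after talking to each team.

```python
def items_needing_context(answers):
    """Return PRR line items whose stated reasons differ across teams.

    answers: dict mapping a line item to a dict of {team: reason label}.
    Reason labels are hypothetical hand-written summaries of what each
    team said when asked why the line item exists.
    """
    flagged = {}
    for item, by_team in answers.items():
        distinct = sorted(set(by_team.values()))
        if len(distinct) > 1:
            # Teams disagree on why this item exists: more context is needed.
            flagged[item] = distinct
    return flagged


# Illustrative data only; not taken from any real PRR.
answers = {
    "three code reviewers": {
        "platform": "compliance requirement",
        "product": "past outage",
        "support": "compliance requirement",
    },
    "load test before launch": {
        "platform": "past outage",
        "product": "past outage",
    },
}

for item, reasons in items_needing_context(answers).items():
    print(f"{item}: {len(reasons)} different explanations -> {reasons}")
```

The output is a prompt for conversation, not a score. An item with several explanations is a candidate for a follow-up interview and for added context in the PRR document.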
A junior engineer and a senior engineer may have different responses to a process or circumstance. I would encourage the senior engineer in that situation to ask more questions of the junior engineer on why they feel that way, and help give them the context, because that’s going to be useful data. It is revealing. If this junior engineer is going to be pushing code and doesn't quite understand some of the release processes and the context behind them, it might indicate that the senior engineer needs to partner with them a little bit more. It helps level up the junior engineer in the team. I think involving the junior engineer in the process without necessarily dictating the process to them is going to grow them and grow the organization in general.
Although you should encourage these multiple perspectives, at some point, there is going to need to be a directly responsible individual. Often it’s the SRE or the most-used software team. They’re going to have to draw a line in the sand.
All mental models are incomplete. I think your job, as the person putting together that PRR document, is to integrate the data points between all these mental models so that you can derive a process that all can understand and use as a way to collaborate.
Not everyone’s going to be happy with the result. I think focusing less on people’s opinions and more on people understanding each other is helpful.
Components of a Strong PRR Process
The care that you put into the PRR process should match the importance of that particular release. The law of requisite variety holds, informally and practically, that in order to deal properly with the diversity of problems that the world throws at you, you need to have a repertoire of responses that is at least as nuanced as the problems you face.
Fig. 1: The variety of your responses must at least equal the variety of problems.
Figure 1 comes from a consultancy that does some of this work. Imagine this left column contains potential problems for a given release and the right column is a collection of responses that equals the PRR process. This is what we mean by creating a PRR process. Your control system must be at least as complex as the system that it governs.
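As a loose illustration of the two columns, and not a formal statement of the law of requisite variety, the sketch below maps each response in a PRR to the problems it addresses and lists the problems that no response covers. The function name and the example problems and responses are hypothetical.

```python
def uncovered_problems(problems, responses):
    """List the problems that no response in the PRR addresses.

    problems: iterable of problem names (the left column).
    responses: dict mapping a response name to the set of problems
               it addresses (the right column).
    """
    covered = set().union(*responses.values())
    return sorted(set(problems) - covered)


# Illustrative data only.
problems = ["schema migration fails", "traffic spike", "dependency outage"]
responses = {
    "rollback runbook": {"schema migration fails"},
    "load test before launch": {"traffic spike"},
}

print(uncovered_problems(problems, responses))
```

A non-empty result suggests the repertoire of responses is narrower than the set of known problems. An empty result only says that the known problems are covered; it says nothing about the surprises nobody has listed yet.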
We know context is important, but disseminating that context is also really important. The software industry spends a lot of time reacting to errors. The equation in figure 2 comes from cognitive psychologist Gary Klein. Improving performance is a combination of both error reduction and insight generation. Insight generation is the context: it’s how much you’re talking to everyone and how much they’re contributing. Yet the software industry primarily focuses on error reduction and is missing the mark on maximizing performance of releases and teams.
Fig. 2: Improving performance combines error reduction and insight generation.
My colleagues at Jeli and I have spent our careers studying introspection and incident reviews. We’ve spent a lot of time creating a strong retrospective process. You can lift and shift many components from a strong retrospective process to a PRR process.
You want to identify the data to analyze and where to find it. You want to talk to and have cognitive interviews with the folks participating in the release in some way. You want to share what you find and collaborate, involve them in the review process, design it together. You want to meet and share different nuanced perspectives of the event and different takes. Then you want to report on it and disseminate it.
Every organization has different needs, strengths, and limitations that apply to a PRR process. Take what steps make sense to you.
People that are working on the PRR process probably have constraints on how much time they can spend on it. They probably have other responsibilities, especially in a smaller organization. Keep that in mind as you're trying to introduce this process. It's important to take the time and space to actually put some care into this process. Don’t rush it.
Different releases are going to have differing requirements for the time and care spent on their PRRs. Fundamentally, you can learn from all PRRs, but realistically, organizations are balancing time spent in development and operations against time spent in learning and sharing context.
Preferred Tooling for Implementing PRRs
The most important consideration in tooling is to choose a tool that is inherently collaborative. You should be able to see who is writing things in there. You should be able to highlight things from your colleagues.
I would strongly advocate that your PRR not be in a platform like GitHub or GitLab, or anything that is inherently a readme file that might not get updated that much. Use a format that lets you keep a living, updating document. Use that platform to encourage people to ask questions after certain releases and to ponder whether or not the PRR process worked for you.
I’ve certainly been in organizations where we've been a bit too reactive to incidents. With every incident that went awry, we would add a line item to our PRR process. Definitely do not do that, because you end up with this PRR process that’s just monstrous and you cannot understand the underlying mechanisms that led to each individual item you’ve added.
A lot of this is about asking your colleagues questions, and having those questions and answers be public in a living document. It can be Google Docs. It can be Notion. There are a lot of tools. The most important considerations are that it’s collaborative and that it’s living.
Earlier, I mentioned checklists, which can become dangerous, rote box-checking exercises without context. At Jeli, we create individual Slack channels for our releases so that we can analyze the chat afterwards to learn who was involved: who was observing the channel and who was participating in it. This way, we can capture every different perspective about each release. This is a significant amount of work.
You’re going into these events trying to understand what happened beyond your knowledge. That's why we incorporate the different perspectives of colleagues. If the impacted groups aren't talking, that’s something to dig into. Talk to everyone that was impacted by this in some way, even if they were just lurking or observing. Make explicitly clear what you expect from the people you talk to and what their role is in informing your PRR.
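The split between people who posted in a release channel and people who only observed it can be sketched in a few lines. This is a minimal sketch under assumptions, not how Jeli does it: it assumes you have already exported the channel's member list and its messages as `(author, text)` pairs, and the function name and sample data are hypothetical.

```python
from collections import Counter


def split_participants(members, messages):
    """Split channel members into those who posted and those who only observed.

    members: iterable of user names present in the release channel.
    messages: iterable of (author, text) tuples exported from the channel.
    """
    post_counts = Counter(author for author, _ in messages)
    participants = sorted(m for m in members if post_counts[m] > 0)
    observers = sorted(m for m in members if post_counts[m] == 0)
    return participants, observers


# Illustrative data only.
members = ["ana", "bo", "chen", "dee"]
messages = [
    ("ana", "deploy started"),
    ("bo", "seeing errors on checkout"),
    ("ana", "rolling back"),
]

participants, observers = split_participants(members, messages)
print("posted:", participants)
print("observed only:", observers)
```

The list of observers is a starting point for who to talk to, since someone who watched a release without posting may still have been affected by it.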
Fig. 3: Colleagues discuss a release in group chats.
Figure 3 is a screenshot from Jeli. We’re an incident-analysis platform, but some of this stuff can be applied to PRRs. This is a list of different people that may have participated in an incident. Hearing different perspectives when creating the PRR process is really important. We want to make sure it’s not just the person who pushes the button who judges that a release is ready to go. We want to make sure that we're capturing all the perspectives.
Creating Each PRR
Get introspective before writing a PRR. Look at incidents and talk to the necessary parties and teams impacted by the release in question. I recommend holding retrospectives with those parties for previous incidents or releases that surprised you, or that went well in various ways, so that you are not doing this PRR in isolation. Consider the context that inspired this PRR process, such as specific incidents or releases that didn't go as planned.
The person in charge of writing this PRR should acknowledge their role — doing so indicates psychological safety. Even if you’re not a manager or other organizational leader, there's a power dynamic associated with the role of PRR writer. Acknowledging that will help other participants to feel safer.
It's important to ask teams outside of SRE what they think “production ready” means. What does marketing think? What does that mean for product teams? For leadership? You should already have spaces to capture feedback about this particular release.
It’s very rare that one team owns an entire release while it affects no one else. When multiple teams own a specific piece of software, the team with the subject-matter expertise should own the PRR process. This is a mindset shift for SREs. Often, a team will claim their service is impacted the most, so they should own the PRR process for this. But impact has many definitions. If an incident has hit the news, the PR team has an interest. Customer support people have their own interest. I don’t quite know what it means to be the team that’s most directly impacted. This is why I advocate that we co-design this PRR process with any colleagues who might be impacted.
You want to use a cognitive style of interviewing when speaking with colleagues. You want to uncover their understanding of what the PRR is. Find out what's important to them, and what feels confusing, ambiguous, or scary to them. Ask what they believe they know about the PRR that their colleagues don’t. Talking to them individually will help reveal this. These explorations can even help you catch something before it has a failure or unintended consequences.
One of the ways you can check for impact is to check who’s lurking in your Slack channels about the incident. They might not be saying anything, but they’re there. Chat with them. It ultimately benefits you as an SRE to know what to prepare for when you're writing the software, when you're releasing it, when you're designing it.
Fig. 4: Teams discuss an incident at Jeli.
You may want to look at multiple elements. You might be able to identify previous release incidents and relevant people to inform the current PRR. Figure 4 shows a group discussion of three related incidents and group annotation. If you can compare the different ways that you discuss these incidents and examine the different ways these releases affected various parties, you're going to get a lot more value out of your PRR. You can see in this figure that we’re exploring the nuances of this particular release and how it affected the organization. People are tagging when stuff is happening, and people are posting questions that they have about the particular release and how it might have gone awry. This communal discussion can then inform the PRR process in the future, which might be an awesome action item of an incident.
There should be an overarching release process that benefits the organization as a whole but not every service in an organization requires a PRR. Individual services might want to own their own release processes, like before integrating with other systems or before deploying to production. That's up to each individual team.
But even if it’s just your team deploying a service to production, make sure you include all your team members in that, including and especially the junior engineers. In a psychologically safe organization, have them share their context first so that their opinions are not colored by the perspective of more senior engineers. It really requires a safe space for them to do that.
Look at some of your downstream services, too, to learn how your past releases have impacted them and to incorporate some of those thoughts in your release processes. The same goes for any internal incidents. There are many different nuances to find in many different situations.
Spreading the Word
Once the PRR is complete, you must bring the findings back to the teams. It may feel difficult but it’s so necessary.
So often, people in organizations write incident reviews or release documents to check a box rather than to develop understanding. The goal is not to satisfyingly complete the task but to collaborate and improve. And to do that, you need to disseminate the PRR.
Every organization is a little bit different. Some love Zoom meetings, especially in this period of remote work. Other organizations might shy away from meetings and dive more into documentation. I’ve been in both and more kinds of organization, even those that hold a Zoom meeting to get people to even read the document.
Think about how your organization leans, and do whatever you can do to get people to discuss how they feel about some of the line items and how those items impact them. Too often, people write these reviews in a vacuum. You assume people are reading them, but are you actually tracking that? You have to track their engagement with it. Talk to them about the review after it comes out. Don’t just blindly follow the process.
The PRR conclusions themselves may not even be a single document. What holds meaning for developers may not hold the same meaning for marketing and vice versa. Not everyone can extract value from one sole source. You can have a higher-level master document but derive more nuanced versions from it for specific teams or departments.
But take care to avoid silos. Silos start by feeding on separation, and remote work has produced even more work silos than normal.
This comes back to tools. If someone has a question about something they saw in a readme on GitHub, they have no place to comment on it. They can ping a friend or colleague on Slack to ask, but that’s not a public channel and a silo has begun to grow. And the person asked might have an incomplete view and response. They might think that that’s just the way things are done. It’s important to encourage folks in an organization to ask these questions in public and to provide a place to do it. Senior engineers and leaders need to demonstrate this safety, so that other people in the organization follow.
We tend to think about systems, but people are part of these systems. It’s not simply pieces of software. We miss a whole piece of the system if we don’t take the time to understand the people component. Software engineers usually don’t get that training. It might seem like this is the easy part, but it’s just as complex as any other part of the system. Understanding both sides and combining them is so important for an organization’s success in onboarding new hires and keeping employees.
How to Know That There’s Improvement
How do you know if your PRR process is effective? You’re not going to get to zero incidents, obviously, but maybe your releases feel smoother. More folks might be participating in the PRR process. Maybe you feel more alignment with and among colleagues. Maybe people are more comfortable and more frequently speaking up or questioning the why behind different pieces of the review.
You know things are working when people know where and how to get context on the PRR. Folks feel more confident about releases. Teams collaborate in the PRR process so that SRE is not working in isolation. All colleagues have a better understanding of any incident and share a definition of “production ready”.
Not seeing that usually indicates that there’s more to dig into.