In this podcast Shane Hastie, Lead Editor for Culture & Methods, spoke to Vladyslav Ukis about his new book, Establishing SRE Foundations.
Key Takeaways
- Most of the existing publications that look at implementing SRE assume a mature existing infrastructure
- Vladyslav approaches the topic from the perspective of an organisation that does not have the infrastructure and processes in place and shows how to build them from scratch
- Adopting SRE requires a profound socio-technical change in a socio-technical system
- SRE is a methodology that is about bringing alignment on operational concerns across the entire organization
- SLOs, error budgets and incident response processes are part of a sustainable SRE approach
Transcript
Shane Hastie: Hey folks, QCon London is just around the corner. We'll be back in person in London from March 27 to 29. Join senior software leaders at early adopter companies as they share how they've implemented emerging trends and best practices.
You'll learn from their experiences, practical techniques and pitfalls to avoid, so you get assurance you're adopting the right patterns and practices. Learn more at qconlondon.com. We hope to see you there.
Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. It seems like a year ago, the last time Vlad and I sat down, and we were just checking the records: it was. We are talking today about Vladyslav Ukis's new book, Establishing SRE Foundations. Vlad, welcome back. Nice to see you again.
Vladyslav Ukis: Thanks Shane. It's great to be here again on the show and it's incredible that exactly a year ago we were recording the first podcast, so now it's the second one.
Shane Hastie: Indeed. For the benefit of our listeners who haven't come across you and your work, give us the five-minute background. Who's Vlad?
Introductions [01:13]
Vladyslav Ukis: I've been working in the healthcare industry for a long time and in recent years I've been working on cloud computing. We've got a big platform for digital services in the healthcare domain, and this is where I'm running the R&D which includes development and operations.
That also leads us to the content of the book, which is about how do you introduce operations in an organization that has never done that before, that only did development before. Overall, as part of that work of the last decade, we have introduced lots of new processes in an organization that again, has never done those things before.
Like continuous delivery, releasing faster, operating the product, measuring the value of the services that are running in production and so on and so forth. It's been a real transformation journey over the last decade.
Shane Hastie: Interesting stuff. Why do we need another SRE book?
Why another SRE book? [02:12]
Vladyslav Ukis: That was actually the question that I asked myself when I first got in touch with SRE because all the existing SRE books back in the day, they were written either by Googlers or ex-Googlers. Basically, the whole idea of SRE comes from Google and therefore also the publications are coming from the Google employees.
Therefore, also those publications, books and so on, they focus on, I'd say, advanced SRE. If you are a new organization, new to the topic of operating software, new to the topic of software as a service which requires you to operate the software, then you first need to build some foundations until you are ready for the advanced stuff.
So that's what the book is about. It's about helping the organizations to make the first couple of steps on the operations ladder before they are ready for the advanced stuff that is described in the original SRE books by Googlers.
Shane Hastie: What are some of those foundations? What are the things that organizations don't have in place typically that they struggle with at the beginning?
Areas where organisations struggle at the beginning of an SRE journey [03:20]
Vladyslav Ukis: I think at the very beginning, there will be a lot of questions why development needs to do operations at all. In the traditional organizations, the typical setup is that there is the development department and then there is the operations department.
The development department does the development and the operations department does the operations, and that's the way it's always been. Therefore, it's very difficult to get to a point where it becomes clear that you need a totally different approach if you are serious about running digital services in production.
As your traffic grows, as your frequency of delivery to production grows, you will see the need to operate the services by the people who are actually developing them because otherwise the handovers are too cumbersome. The handover from development to operations cannot happen frequently enough for the frequency of production updates.
Also, the troubleshooting, the recovery from failures in production, can only be done fast if you are able to get those alerts that report on something wrong in production to the people who can actually fix the problems fast enough.
Shane Hastie: This is a culture shift. How do we encourage that culture shift?
Making the challenges visible in order to address them [04:49]
Vladyslav Ukis: How it typically would happen is that the organization operating in a traditional way would come to a point where the limitations of the current approach will become evident. You'll have, for example, lots of production outages or you'll have long time to recover from outages and things like that.
With the current approaches, the organization cannot improve the situation so that will be a natural push to look for something new. Here the SRE as a methodology comes in handy because it's about bringing alignment on operational concerns across the entire organization.
For the product owners, for the operations engineers, for the developers, there are things within the SRE framework that make them all think about the reliability that the services in production need to have and how much they're willing to pay for it. How much they're willing to pay for the level of reliability that they want to get.
Over time, I think it'll need to lead to an organization-wide initiative to introduce Site Reliability Engineering because it literally affects all aspects of the product delivery organization. It affects product management, it affects development, it affects operations.
Therefore, it'll need to be something that the organization puts on top of the list of organizational initiatives that they undertake. And then from there, once there is an agreement that you put SRE onto the list of initiatives, then at some point there is endorsement for that to happen.
And then the activities in different teams can start and slowly over time, team by team, that transition can happen until the organization reaches a point where the foundations are established and the optimization happens in all teams.
Shane Hastie: If I'm a middle level team leader, I'm being asked to do more and more of that, getting to the culture of you build it, you run it, bringing these things in, how do I encourage my leadership to invest in this?
How to influence change from the middle [07:02]
Vladyslav Ukis: I think there are a couple of things that a middle manager can do. On the one hand, they first of all need to be convinced that this is the right approach themselves. They need to educate themselves and know that this is the way to go from their point of view.
So then with that conviction, they can start talking to the leaders at the organization to basically gauge their temperature to see where they stand in their understanding. And then you can find probably a leader that would support this.
And then with that leader, you can then broaden the alignment because they would then talk to the others and so on. That way you create the understanding at the leadership level. Then once you go further, you need to sell the benefits that that would bring to the individual function, right?
What are the benefits, for example, to the development function? What are the benefits to the operations function? What are the benefits to the product management function? Basically, over time you build up the momentum until there is enough understanding in the organization to try that thing.
I think you definitely need to frame this as this is another experiment that we will run and see how that goes because this requires a profound socio-technical change in a socio-technical system.
Therefore, it all very much depends on people that are there and it might work or it might not work depending on the circumstances and the attitudes of people and the willingness to change and the necessity and so on.
Shane Hastie: Stepping back a bit, what are some of the key concepts in the book? I see the acronym SLO popping up over and over again. What is an SLO and why do I care?
Introducing Service Level Objectives [08:40]
Vladyslav Ukis: Typically, if you take an organization that has never done operations before, then they are not aware of the level of reliability that they provide to their users, to their customers. Typically, this reliability is not a quantified thing.
Obviously, everybody talks about our system needs to be stable and reliable, but this is typically not quantified and this is what those SLOs do. They let you quantify your reliability. That forces the organization to first of all come together and think about the level of reliability that you want to provide.
Therefore, the SLOs, that's an acronym of Service Level Objective. What is the objective of the service level that we provide for this service and that service? If you aggregate the services to bigger digital services that you sell, then what is the level of reliability that you provide at that level?
If you break that down, then you start thinking about or you are forced to think about, okay, so this Service Level Objective is then for which dimension? Is it for availability? Would be then availability SLO. Is it then latency? Would be then latency SLO. Is it then say durability? Then durability SLO and so on.
Basically, it forces you to think about the dimensions that are important in terms of reliability and then it sets a numeric number for the reliability that you want to achieve in that dimension and this is the SLO then.
Typically, a service would have a set of SLOs across several dimensions, say availability, latency, and then in the SRE jargon, these dimensions, they are called Service Level Indicators. A typical SLI is then availability, another typical one is latency, and then for each of those SLIs that apply to a particular service, you set a goal and this is the objective or the SLO.
That's why you would care about SLOs and that's why SLOs can be seen as the alignment unit, so to speak, for the software delivery organization to align on the level of reliability that you want to achieve and then also what it would take to get there in terms of investment in reliability.
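To make the SLI and SLO distinction above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a model taken from the book: the `Slo` class, the `is_met` method, the `checkout` service name, and all targets and measurements are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """One objective for one indicator (SLI) of one service. Hypothetical model."""
    service: str
    sli: str                        # the dimension, e.g. "availability" or "latency"
    target: float                   # the numeric goal set for that dimension
    higher_is_better: bool = True   # False for latency-style indicators

    def is_met(self, measured: float) -> bool:
        # Availability-style SLIs must reach the target;
        # latency-style SLIs must stay at or below it.
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


# Illustrative values only: one service, two SLIs, one SLO per SLI.
slos = [
    Slo("checkout", "availability", target=99.0),                      # percent
    Slo("checkout", "latency", target=300.0, higher_is_better=False),  # milliseconds
]

measurements = {"availability": 99.4, "latency": 350.0}  # made-up numbers
report = {slo.sli: slo.is_met(measurements[slo.sli]) for slo in slos}
print(report)  # {'availability': True, 'latency': False}
```

In this sketch the service meets its availability objective but misses its latency objective, which is the kind of quantified statement the teams can then align on.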
Shane Hastie: How do you avoid the, “but of course we want 100% reliability”, conversation?
Error budgets make improvement possible [11:05]
Vladyslav Ukis: That's right. Yeah. That can come up very easily because of course we want to be reliable all the time. There is a nice concept for this within SRE and that concept is called error budget. Error budget is a calculation based on the SLO.
For example, if you've got your availability SLO set for 99%, then your error budget is calculated automatically by taking 100 and subtracting 99. That means your error budget is then 1%. So then let's take this for a particular time period and then we would say, okay, so your availability SLO is to be 99% available within a month.
That means that your error budget is 1% non-availability, so unavailability within that month. That means you are allowed to be unavailable for 1% of the calls, for example, within that month. And then you can use that error budget but only what you've got inside the error budget.
So you're not allowed to exceed the error budget. You can use the error budget then to do anything you want in terms of, for example, feature experimentation, some technical work that requires a little bit of downtime. Basically, anything you want you can do within the window of that error budget.
Now, imagine the SLO is hundred percent, that means your error budget is what? Hundred minus hundred is zero. You don't have any error budget. That means then if you take this seriously, if you are strict about it, you actually don't want to do anything to your service.
Because every time you try to deploy, every time you try to do something with the service, there is of course some likelihood that you will cause some downtime and that will chip away from your error budget. If you've got none, then theoretically speaking you don't want to do anything to your service.
You are basically back to the old days of never touch your running system because if the system is running and it's running fine, then you don't consume any error budget. The moment you try to do something to the service, try to update it, there is a possibility that you chip away from the error budget causing some downtime and you've got none.
Basically, you end up then in a paralyzed state so to speak. On the one hand you've got such a high target, a hundred percent availability. On the other hand, you don't want to do anything to the service because that might then cause you to slip from your goal of hundred percent.
This is of course not realistic because if you talk to product management then they will want to push features all the time and every time you push a feature, you deploy, you again are under risk of causing some downtime so basically it's a nonsensical situation.
Therefore, to bring things to a point where you can on the one hand have some reasonable level of reliability, say 99.5 or so, and also have some reasonable level of error budget, that means being allowed to make some mistakes while you are working on the service, there is this concept of error budget where the organization is forced to agree on the SLO.
With that then automatically the error budget is granted and you then operate within the error budget and you then track your error budget consumption. Whenever you hit the zero error budget, there needs to be also some consequences which are also then titled in SRE as Error Budget Policy where those consequences are defined then team by team.
That way you've got a self-regulating system where on the one hand you've agreed on reliability, on the other hand you've got still some error budget and therefore you can deploy new features frequently and so on and you've got controls. You don't want to exceed the error budget. If you do, then there is a policy that also was agreed before that you execute then.
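The arithmetic described above (error budget equals 100 minus the SLO, tracked over a period) can be sketched in a few lines of Python. The function names, the call counts, and the wording of the policy action are assumptions made for illustration. The conversation only says that the consequences are defined team by team in an Error Budget Policy.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget is 100 minus the SLO, e.g. 99% availability leaves 1%."""
    return 100.0 - slo_percent


def remaining_budget_calls(slo_percent: float, total_calls: int, failed_calls: int) -> float:
    """Failed calls still allowed in the period; zero or negative means the budget is used up."""
    allowed_failures = total_calls * error_budget_percent(slo_percent) / 100.0
    return allowed_failures - failed_calls


def policy_action(remaining: float) -> str:
    # Hypothetical policy; real consequences are agreed team by team.
    if remaining <= 0:
        return "budget exhausted: apply the agreed error budget policy"
    return "budget available: keep deploying"


# Made-up month: 1,000,000 calls, 7,500 of them failed, SLO of 99% availability.
remaining = remaining_budget_calls(99.0, total_calls=1_000_000, failed_calls=7_500)
print(error_budget_percent(99.0))   # 1.0
print(remaining)                    # 2500.0
print(policy_action(remaining))     # budget available: keep deploying

# With a 100% SLO there is no budget at all, so any single failure exhausts it.
print(error_budget_percent(100.0))  # 0.0
```

Measuring the budget in failed calls is one option. A team could equally track it in minutes of downtime, depending on how the availability SLI is defined.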
Shane Hastie: When something goes wrong, what do you do? The incident response process.
Having a clear incident response process [15:04]
Vladyslav Ukis: When something goes wrong, then you need to have the ability to mobilize the organization in an efficient and effective manner. Efficient means you don't want to page all the developers, you don't want to go too broad.
But on the other hand, you also want to page a developer that can actually fix the problem instead of their neighbor because then the neighbor will get in the end to the developer who can actually fix the problem. You want to be fast because you want to have a short time to recovery from incidents.
For that you set up a process. You need to be able to classify your incidents, one way or another. That's another thing that you need to put in place. You need to be able to say, okay, that incident, okay fine, that's a priority one and therefore we are on our incident response process say in the full-blown way.
Or that one, okay, that's say priority two, therefore we are on the incident response process still, but it's not the priority one level of mobilization of the people and so on. Of course, once you fix the incidents, then you also need to have some effective postmortem process where people will come together timely after the incident was fixed and then discuss what happened.
How you could improve in future on tech stuff, on culture stuff, on process stuff and so on so that you also have got then a tracking system for those action items that come out of the postmortems and then ensure that they are dealt with in a timely manner so that the whole process is then actually working.
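The idea of classifying incidents and mobilizing only the people who are needed could look roughly like the following Python sketch. Everything in it is hypothetical: the `Priority` levels mirror the priority one and priority two examples above, while the `ON_CALL` roster, the service names, and the rule that only a priority one incident pulls in an incident coordinator are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    P1 = 1  # full-blown incident response
    P2 = 2  # incident response with a smaller mobilization


@dataclass(frozen=True)
class Incident:
    service: str
    priority: Priority


# Hypothetical on-call roster: the developer who can actually fix each service.
ON_CALL = {"checkout": "dev-alice", "search": "dev-bob"}


def responders(incident: Incident) -> list[str]:
    """Page the owning on-call developer, and go broader only for priority one."""
    people = [ON_CALL[incident.service]]
    if incident.priority is Priority.P1:
        people.append("incident-coordinator")  # assumed extra role for P1
    return people


print(responders(Incident("checkout", Priority.P1)))  # ['dev-alice', 'incident-coordinator']
print(responders(Incident("search", Priority.P2)))    # ['dev-bob']
```

The point of the sketch is the routing rule: the page goes straight to the person who owns the service, not to everyone, and the priority decides how far the mobilization extends.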
Shane Hastie: How do you prevent the postmortem from degrading to blame?
Making postmortems safe and avoiding blamestorming [16:41]
Vladyslav Ukis: This is something that needs to be set up by the people who are running the postmortems. That means that the people who are designated to do say incident coordinators or incident commanders, they need to establish this as the table stakes for taking part in the postmortem.
They need to tell the people that we are here not to blame each other, but there is a retrospective prime directive that you can look at and say this is what we want actually to do. We assume that everybody acted to the best of their abilities.
We are not here to blame each other, but we are here to find out the root causes and therefore let's be open about sharing information and nobody will be punished in the end for this. That will not affect your performance review and so on.
That said, there are of course thorny interpersonal issues that might have come in the way during the incident resolution and these things they need to be dealt with before the postmortem meeting.
Before the postmortem meeting, the incident coordinator needs to ensure that they talk to the individual people who are affected by some conflict and then clarify things before the meeting takes place with more people. Therefore, in the book, the process is divided in three steps.
What do you do before postmortem meeting? What do you do during postmortem meeting and what do you do after the postmortem meeting? Astonishingly, in each of these three phases, there is a whole lot of things to do in order to make it work.
I think a typical misconception in the industry is that you just have a meeting and that's the postmortem meeting and that's it. This is just a tiny bit of the process. There is a whole lot to be done before, during and after to make it work.
Shane Hastie: Organization structures often get in the way of these types of improvements. What are the patterns? What are the common things that we need to look at to allow that change and what are those changes going to come up with?
There are a variety of team topologies that enable SRE [18:34]
Vladyslav Ukis: Thanks for that question because that points to a particular chapter in the book, which I think is becoming the most popular chapter in the book, which is organizational structure, how to organize for SRE. I just gave a talk at the DevOps Enterprise Summit US on this.
SRE team topologies and there were a lot of questions around this. I think there are lots of organizations that are grappling with this, how to organize for SRE, what is the right organizational structure.
I think this is also coming from the fact that the bigger internet companies like Google, Facebook, Amazon, they are organized differently and they all do either SRE directly like Google or something similar to SRE like Facebook and Amazon, but they are organized differently.
That means that there are different ways to organize well for SRE. In the book, I've got an entire chapter on this where I detail through all the different SRE team topologies that seem to be working in the industry.
You can have an organization that does SRE but doesn't even have the SRE role because everything is done by the developers on rotation and everything is within the team so that's one point on the spectrum.
Then at the opposite end of the spectrum, you can have an entire SRE organization, which is a function with its own line management and everything. That organization then runs services in production on condition: if the services have got a certain level of reliability, that means they fulfill the SLOs and so on, then the organization runs the services.
If the services then fall below those reliability levels, then the SRE organization hands back the services for operations to the development teams until they improve the reliability and then the services fulfill higher SLOs at which point then the services can be operated by the dedicated SRE organization again.
And then there are different setups in between. I think to approach this, you need to have a multi-step approach. First of all, you need to clarify the question of who runs the services. On the scale of developers run the services themselves to there is a dedicated SRE organization that runs the services.
There are also in between shared ways of running the services. For example, the SRE organization really lends people into the development teams and they are there for a long time effectively becoming the team members or also a couple of other setups.
Once that is clarified, then the next step is to say, okay, with the setup that I've chosen or that I'm considering right now, what kind of incentives are still there with the developers to implement reliability as they implement features?
Because you want to maximize those incentives. Of course with you build it, you run it, you've got then the maximization of the incentives, and with other setups the incentives for the developers are a bit less.
But still depending on the setup, you can find a balance between having a dedicated SRE organization if you want to do this, if that's important enough for the business and still provide enough reliability incentives to the developers so that they implement it during the feature implementation and not as an afterthought.
And then there are also a couple of other considerations. What kind of knowledge sharing is required between different parties, between different teams? Again, if you are doing everything within the development team, well the knowledge sharing is just natural so they're just working together as a team.
If it's still another organization, a dedicated SRE organization for example, you still require some more synchronization. What is the cost of the synchronization that that causes and so on?
Basically, there is a set of considerations that you need to undertake until you make a decision and the decision is basically two-dimensional. One dimension is the organizational structure itself, which is difficult to change, therefore you should put a lot of thought before you declare a structure because it's difficult to change.
And then another dimension is the dimension of basically who runs the services because who runs the services, then it can change more easily. As I mentioned before in the setup where the dedicated SRE organization runs the services, they've got an option to actually say, no, I don't run those services.
Therefore, you go back to you build it, you run it automatically because the services fell below a certain reliability level. Yeah, this is a question that many organizations are grappling with at the moment and I hope that the detailed account of the options in the book will help different organizations with their decision making.
Shane Hastie: What is the role of the Site Reliability Engineer and what's the career path?
The role and career path for a Site Reliability Engineer [23:29]
Vladyslav Ukis: That's a very good question. Yeah, I think the typical starting point for the Site Reliability Engineer is a very technical role at the center of which is being on call. You are on call for services and you know how to fix the services when there is something wrong.
You know how to set up SLOs in the right way, you know how to agree error budget policies, you know how to discipline the team to follow the SRE practices and things like that. I think in order to be more effective, that role needs to extend more.
That role needs to also take part in the product management activities a little bit, bring the reliability perspective in there. That role also needs to be in things like user story mappings for example, where people are discussing the stories individually and the workflows individually, again, bringing their reliability perspective in there.
So then during the development of course, that role also needs to be there to coach the team in all things reliability, because as they implement the features, this is where you actually decide on the actual reliability by either implementing certain reliability features in as you go.
Or they're not there, and therefore you then discover later once the services are running that something is missing and so on so therefore being involved in the development, although not necessarily implementing the reliability features themselves.
But being involved there and having an influence on how the team thinks about reliability and does the implementation and so on, implementing stability patterns, circuit breakers where appropriate and so on. This is important.
Then of course during operations, depending on the setup, then if the team does for example, you build it, you run it and you are an SRE from an SRE organization, then you can of course bring a wealth of experience to teach the team how to do this, especially for teams that have never done something like this before.
I think it's a very wide range of activities that you need to span in order to be really effective at implementing reliability in the organization. I think also there is a lot of misconception now, especially also if you look at the job adverts.
There is a high demand for SREs and you'll find that those job adverts are mostly focused on the very technical bits of the job and as often in the industry, there is also a bit of confusion with terminology.
You'll find job adverts that are not labeled as SRE job adverts; they are talking about DevOps positions and just Operations Engineers and so on. But by and large, I think in order to be effective at installing reliability, you need to be active in all those areas of product life cycle.
Shane Hastie: An interesting and wide-ranging conversation. We'll include the link to the book in the show notes, but if people want to continue the conversation, where do they find you?
Vladyslav Ukis: They'll find me on LinkedIn. There are actually several conversation threads about the book going on right now. I'm looking forward to further folks joining those conversations and connecting and talking about this because I think this is really important because the industry is going towards software as a service more and more.
With software as a service comes the obligation for the organizations to operate the software and the question is how you do it. It's how to establish operations effectively as an organization and this is what the book is about. I think this is an important topic for the industry at large.
Shane Hastie: Vlad, thank you so much.
Vladyslav Ukis: Thank you very much, Shane.
Mentioned:
- The book: Establishing SRE Foundations
- DevOps Enterprise Summit
- Vlad on LinkedIn