
Hiring and Growing Great Site Reliability Engineers

In this podcast, Shane Hastie, Lead Editor for Culture & Methods, spoke to Narayanan Raghavan, Senior Director for Site Reliability Engineering for Managed Services at Red Hat, about hiring and growing

Key Takeaways

  • Site reliability starts with good engineers - software engineers with a systems mindset or systems engineers who have a software development background
  • Site reliability engineering is about making platform systems boring so application developers can focus on adding business value
  • An effective team needs a culture of trust, curiosity, collaboration and commitment
  • Retaining people needs an environment where they feel they can grow and learn
  • Trust is built when the leader is vulnerable

 

Transcript

Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today I'm sitting down with Narayanan Raghavan. Narayanan is the Senior Director for Site Reliability Engineering for Managed Services at Red Hat. Narayanan, welcome. Thanks for taking the time to talk to us today.

Narayanan Raghavan: Thank you for having me.

Shane Hastie: The way I typically start with my guests is who's Narayanan?

Introductions [00:40]

Narayanan Raghavan: Like you mentioned, my current role is Senior Director of Site Reliability Engineering for Managed Services here at Red Hat. I've been with Red Hat 15 plus years. It's a long time. Had different roles at Red Hat. Ended up in this space and this role as Red Hat was starting the Managed Services effort. And this was one of my initial challenges to say, "Come on in and build an SRE organization at Red Hat." And that's what got me interested in this space to say it's an opportunity to build a new space, get into a new space, take on a new challenge, and been in this space for seven plus years and still enjoying it every day. Been at Red Hat 15 plus years like I mentioned, and every day is a new day, new challenge, a lot of learning. So it's been good.

Shane Hastie: SRE, relatively new, probably seven years plus. Red Hat would've been fairly early in that. Why does it matter? Why did this emerge? Why wasn't it just something we did?

Defining Site Reliability Engineering [01:38]

Narayanan Raghavan: First off, let's take a step back and go, what really is SRE, site reliability engineering? Now you look at the Google definition, Google came out with their SRE book about seven plus years ago. They really talk about it as a team that balances risk of unavailability with rapid innovation, ensuring that you're focused on building systems that are automatic or, put differently, when you take a bunch of software engineers, put them in an infrastructure specific role, how do they think about it? How do they approach it? At a high level, in my mind, it really boils down to how do you focus on reliability, scalability, security, performance of systems at scale? And for us, from an OpenShift perspective or managed services perspective, we have to think about scale, cutting across different cloud providers, different hyperscalers and fundamentally, how do we make platform systems boring so your application developers can focus on what they do best, adding business value, focusing on their applications, et cetera. More so than thinking about the underlying system.

Shane Hastie: What does it take to be a good site reliability engineer?

What makes a good site reliability engineer [02:48]

Narayanan Raghavan: That's a great question. I want to answer that two ways. First and foremost, I'll say people. You need good engineers, you need good people to make site reliability engineering come to life. The second big piece is culture, because you're not building a site reliability engineering organization with one person, you're building it with a group of people. And when a group of people come together, the culture that you put in place, that's what makes it tick. The individuals are the key, the culture is the fodder, so to speak, to get that team going. And the type of hiring you do, the type of engineer that you hire goes a long way. Then I generally go, I'm looking for people that are software engineers with a systems mindset or systems engineers who have a software development background as well, because that mix is critical for an SRE, as you think about scale, as you think about reliability, continuity, et cetera.

But I'm also not looking for that perfect fit, not looking for someone who checks every single box on my list of things because that person, they might be great to start off with, but that person is going to get bored. That person is going to, two months down the road, probably look for other opportunities. So I'm looking for people that are not the perfect fit in many ways. I'm looking for people that are eager to learn, that have that curiosity to pick up new things and jump into a challenge. You might have an escalation, an incident or what have you. I'm going to explore and try and understand and debug what's happening. So being willing and able to dig into the weeds, being willing and able to have a conversation about it without blaming people, that's important. And the cultural aspect plays a big role in that.

The importance of trust and assuming positive intent [04:48]

So giving people the confidence, for example, to say it's okay to fail. And one of the first things I did when I started building this group was set some team principles to say, "First and foremost, it is okay to fail. It's important we learn from our failures and we don't repeat the failures, but it's okay to fail." I can't tell you how many times I've apologized to stakeholders to say, "Whoop, sorry, we've made a mistake. We learned from it. It's not going to happen again." But being able to show the human side is important. It's important because SRE is a high stress role. You're dealing with incidents and escalations, crisis management essentially. So being okay to fail was our first principle. Our second principle was assuming positive intent. People are trying to do what's right for the organization, the company, et cetera. It's okay to ask questions and sometimes you might have a question where the answer might be confidential.

I can't always share, but assume that I am doing what's best for the team, that my manager and his manager, et cetera, all the way up to our CEO, are doing what's right for the company. Assuming that positive intent absolutely matters, especially in the space that SRE plays. Because SREs are going to be interacting with our customers, partners, cloud providers, internal teams. So you get to play in a space where there are multiple stakeholders, multiple people that are impacted, et cetera. The third principle is starting with trust and extending that trust. Very similar to assuming positive intent, except this is all about encouraging curiosity, asking questions. Let's not make assumptions because when you're trying to manage systems at scale, the last thing I want to do is make some assumptions that somebody did something. I'm going to ask that question, but also trust that somebody else is doing the job the right way. So we're all swimming in the same direction.

Disagree and commit [06:50]

The fourth principle for me is disagree and commit. And this is true with any software development. I can build something a thousand different ways, always argue about one way or another. Let's not get caught up in analysis paralysis. Give people the opportunity to pick an option, make the changes. If it works, great. If it doesn't work, remember that it is okay to fail. And remember that we're all doing this with positive intent. So we trust each other, we do it with positive intent. If we fail, these are just bits and bytes, let's rearrange them. Let's acknowledge that we tried something, we learned from it, we're going to take a different approach. So disagreeing and committing versus analysis paralysis is so important because that's what allows us to keep the pace that's important when development teams are pushing out change after change. How do you support that while also keeping systems reliable?

And then the last thing that bleeds out of that is communication. Communicate, communicate, communicate. I'd rather we over communicate, step on each other's toes versus not. I'd rather we build a culture of feedback where people are open to giving and receiving that feedback for our own development too. I tell my team this: we spend more time with each other, either virtually or physically, than we spend with our significant others. So why erect those barriers? Why not lower those barriers and learn from each other, give each other that feedback. Because a team, in my mind, is a set of puzzle pieces coming together and if they don't fit right, you're not going to enjoy that space.

So getting that team together, getting those puzzle pieces, in this case, people to come together through culture is absolutely important. You're not expecting perfection, you're expecting things will happen, incidents will happen, things will break. That's okay. Learning from it, making mistakes, learning from it. Again, that's life. And it's important from an SRE perspective. Like I said, crisis management is important, so is culture and communication. So the soft skill pieces start to engage and that starts to matter as well.

Shane Hastie: How do we help the people on these teams build the resilience that they need?

Resilience comes from failure [09:11]

Narayanan Raghavan: I'm going to answer that in, I don't want to say contradictory way, but resilience comes from failure. You're not going to build that resilience if you haven't failed before. And this is why that culture matters. This is why communication skills matter. This is why being calm during a crisis, that's a skill and that's a hard skill to find. And building that resilience starts from the top. When the leader shows that it's okay to fail, I am human after all, and I am going to acknowledge that and I'm going to show that when I fail, I'm going to celebrate it with the team to say, "I'm not going to ding anybody just because they failed." We're not going to learn otherwise. And resilience is part of that learning that happens as a team comes together.

Shane Hastie: And where do you find these people?

Good people come from anywhere [10:04]

Narayanan Raghavan: Good people from my perspective come from anywhere. I've had people that have had a background in music, I've had people that have PhDs, and so good people can come from anywhere. Like I said, I'm not looking for the perfect fit. I am looking for potential. I'm looking for behaviors that show a learning mindset. I am looking for skills too. But many of the skills are teachable when somebody has a learning mindset and shows curiosity versus not. And part of that is the conversations we have. Part of that is the engagement we have with that individual. And some of these people are internal. Many of these people are internal and others are external hires. There are people from different organizations, like I think I mentioned earlier, as an SRE function, we're exposed to customers, partners, cloud providers, the infrastructure providers, communities, open source communities. So because we are exposed to so many different groups that communication becomes important.

But you also start to look at are there other groups that are exposed to different groups like the SRE organization is? And sometimes there are people in the support organizations that have the skills with customers and understand the technology. You can find good people there. Sometimes there are good people in the engineering organization where they're engaged, they understand the stack and they think about success of the entire stack versus success for just a layer of the stack. And when somebody's invested in, I am in it for the success, the business outcome, versus it being very easy to say it's a networking problem. When somebody's exhibiting those types of behaviors, usually for me that's a good indicator that these are people that have the curiosity, a learning mindset and take accountability for business outcomes, for the success of the product. And they're usually the ones that I'm looking for and trying to pull into the SRE space itself, because I can then invest in them. They may not be a perfect fit, but I invest in them and then they start to invest back into the company itself.

Shane Hastie: And that segues a bit into how do you keep these people?

Retaining people needs an environment where they can learn and grow [12:25]

Narayanan Raghavan: So hiring people is hard enough, keeping people is harder still. A lot of this again comes down to culture. It also comes down to giving people the opportunity to grow. Engineers are a curious bunch. Feeding that curiosity is important. Not just important, it's vital. And making sure that we're giving people opportunities to grow and learn, that becomes fodder for people to stay interested in wanting to learn and grow. From my perspective, even within my own teams, for example, growth is not just vertical growth. It's not just promotions. It's also horizontal growth. So if I can wake somebody up at 2:00 AM and they can, one eye closed, solve a problem, I start to question, "What are you learning? You know everything like the back of your hand, you're not learning anything." So my ask, my challenge to them is go pick a different space.

And sometimes that means letting go of somebody who's really good because that's what's right for that individual. But when people see that you're actually caring for their growth, for their development, they want to stay because they see that as a leader, as a manager, that you are invested in their success, they want to stay. And for me, that desire to learn translates to what sort of opportunities can I give them. What sort of projects can I put them on? Or, what sort of training can I have them take? And oftentimes engineers say, "Already busy. I can't afford to take time off and my job is keeping me plenty busy." My question back to the engineers is, "Does this mean you don't take time off? Does this mean you don't take paid time off and be with your family for a week and enjoy that time? Are you constantly working 24/7, 365?" The answer should be no.

And then the follow-up question I ask them is, "If you are taking time off, what happens when you are off? Your teammates step in, they cover for you, they help, they engage, et cetera. Why aren't you doing the same thing with development? Why aren't you treating development as a, I am going to take development time. I'm going to declare it upfront to my team, to my manager, et cetera, to say, the month of May, I'm going to be off for a week." Great, take the time, focus on your development and then come back and put that knowledge that you've gained to use, because your team has supported you while you were out in training or what have you. Now you are coming back with additional knowledge to come and contribute back to the team.

So as a manager, that's the investment. I am promising and committing to making sure that my team is growing and learning and has interesting work. And as a result, the team is committing back to saying, "I'm bringing this knowledge back into the team to think about automation, to think about scale, to think about reliability, think about building systems that can sustain failures, et cetera."

Shane Hastie: Quite a lot of our audience are new to leadership responsibilities. We take the best engineer and we promote them and we give them no management training and they become the worst leader. What advice would you have for people who are new in this leadership responsibility to avoid becoming the worst leader?

Advice for new leaders [15:47]

Narayanan Raghavan: That's a great question. I think first and foremost is acknowledging your humanity. Just because you're a manager, a newly promoted manager or what have you, or an experienced manager, doesn't make you right all the time. In fact, it doesn't make you right most of the time. Acknowledging that upfront to say, "This is my first time stepping into a management role. I'm going to be making mistakes. I need your help, the team's help to learn because this is a new space for me. Help me learn. And I'm here to support you as well. So I am going to make sure that you are given the opportunities to learn and grow yourself, but without you helping me learn, I'm going to stumble. And as a new manager, I want to do what's right for the team." So I think calling that out upfront to say I am human after all, I'm going to make mistakes, that is important, if not vital in my mind, because you need to gain your team's trust.

And trust also starts with showing vulnerability, and being vulnerable to your team, with your team, for your team, is also important. The second part I'll share here is as a new manager, know that you're not going to have all the answers. That is okay. I tell my teams this when we do reorgs. I'm going to make a mistake in a reorg, we'll make a reorg. Sometimes a reorg might make sense. Sometimes a reorg might not make sense. That is okay because one of the team principles is it is okay to fail. The analogy I give to engineers is, I wish I had a development environment to make a reorg, try it out before I roll it out to production. But I don't. So when I make changes, I have to make them in production. I have to make them with real people, with feelings, with desires and aspirations, et cetera.

I don't have a development environment, so give me some slack to say if I make a mistake or if something does not make sense, talk to me, call it out to me. But my commitment to you is when I make those changes, I'll tell you why I'm making those changes. Because the why matters to people to say, "This has been my thinking, this is why I'm making the change." The what then follows on, the how then follows on. I'm less concerned about that. My focus is a lot more on why am I making this change? What was my thinking? What were the drivers? And being able to do that as a manager I think is fundamental.

Shane Hastie: Some great advice and some really deep points for people to ponder. So Narayanan, if our audience want to continue the conversation, where do they find you?

Narayanan Raghavan: I'm not as active in social media, but I'd say LinkedIn is probably the best way to have a conversation. Twitter handle is "bign_". Like I said, not very active in social media. So LinkedIn is probably the best option.

Shane Hastie: Narayanan, thank you so much for taking the time to talk to us today.

Narayanan Raghavan: Thank you for having me again. This has been great.


More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
