
High Performing Teams Act Like Owners


Summary

Katharina Probst talks about what it means to act like an owner and why teams need ownership to be high-performing. When team members, regardless of whether they have a formal leadership role or not, act like owners, magical things can happen. She shares ideas that we can apply to our own work, and talks about how to recognize when we don’t live up to our own expectations of acting like an owner.

Bio

Katharina Probst is a Senior Engineering Leader, Kubernetes & SaaS at Google. Before this, she led engineering teams at Netflix, where she was responsible for the Netflix API, which helps bring Netflix streaming to millions of people around the world. Prior to joining Netflix, she was on the cloud computing team at Google, where she saw cloud computing from the provider side.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Probst: I am really excited about sharing with you a different way of thinking about what ownership means and hopefully sending you home with some food for thought.

Let's start with something that seems completely unrelated. Let's start with this. You go to a website, as probably happens once in a while, and it's not working. You, as the customer, have no idea what's going on behind the scenes there. What is going on behind the scenes, in this example, is what I would call the classic incident in three acts, and many of us have seen this kind of incident, I bet.

It starts with the system, and for whatever reason, that system catches on fire. Probably not physically, but it catches on fire. Maybe somebody pushed some bad code, or it got overloaded, or somebody cut a cable in the data center, or something bad happened. For whatever reason, it catches on fire. Of course, because we live in a world of microservices, this system does not live on its own. There's another system that really needs this system that just caught on fire. What happens in this other system, as I'm sure many of you have seen, is that some metric that indicates the health of that system goes through the roof. Requests stack up, latencies go through the roof, it runs out of memory. Then, here's your third act. This is you, as the customer, wanting to get to that webpage, and it's not working.

We've all seen this. Why is this relevant? I'll tell you why this is relevant to ownership after I introduce myself. My name is Katharina Probst. Confusingly, I go by Katharine. You can find me on LinkedIn and various other social media.

Ownership in Distributed Systems

I want to start by getting you to think a little bit about how we define ownership in distributed systems before we talk about people and teams. Let's think about this a little bit. Many of us live in this world, where we have many different systems that interact with each other. There are probably services that, together, form a larger system that provides some kind of functionality to the user. How do we actually define ownership in such systems, and what are the expectations one system has on another? Let's explore that a little bit.

If you have two systems that need to talk to each other, you need to be incredibly explicit about how they talk to each other. One system needs to say, "Ok, if you need something from me, if you need, in this case, a list of phone numbers from me, you need to send me exactly this kind of request with exactly this kind of data and input, and I will return you exactly that. I will give you back a list of phone numbers, and each phone number has a, say, number in it and a name in it." The owners of these two systems get together, and they talk in great depth about what exactly the API between those two systems is.
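To make that contract concrete, here is a minimal sketch in code; the endpoint path, field names, and types are hypothetical illustrations, not anything specified in the talk.

```python
# A minimal sketch of the kind of explicit contract described above.
# The endpoint path and field names are illustrative, not from the talk.
from dataclasses import dataclass

@dataclass
class PhoneNumber:
    name: str     # e.g. "Alice"
    number: str   # e.g. "+1-555-0100"

@dataclass
class PhoneNumbersResponse:
    phone_numbers: list[PhoneNumber]

# Agreed-upon request shape (hypothetical):
#   GET /v1/phone-numbers?user_id=<id>
# Agreed-upon response shape: PhoneNumbersResponse, serialized as JSON.
```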

Then, when these two systems are actually running together, there are agreed-upon status codes that they exchange. Hopefully, when this first system sends a request to this other system, that other system returns a "200 OK" along with the response. That's hopefully the vast majority of cases. It returns a "200 OK." The system that was calling knows exactly what that means. It knows, "Ok, things went well," the response is not garbage, and I can expect to find the list of phone numbers, in this case. That's great.

There are other status codes as well. There is the "You've come to the wrong place." That's the 301. There is the "I have no idea what you're talking about." That's the 404. There is, "I am a teapot." That's 418, I looked it up. There are various other very clearly agreed-upon status codes that these two systems have, and they each know what it means. Another status code that I put up here is 503 - Service Unavailable. Another thing that happens that is best practice when two systems talk to each other is that, if my system up here calls this other system and this other system is having a bad time or is going through a rough patch, the best practice is for it to fail quickly. Fail fast. Then, the caller knows, "Ok, I cannot expect an answer right now." Then there are best practices around what to do with that. Maybe you retry with backoff, maybe serve a fallback, maybe do something, maybe make your users more or less happy even if you can't get them exactly what they need. There are very clear protocols that we follow.
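A minimal sketch of what that protocol can look like from the caller's side; the URL, timeout, retry count, backoff values, and fallback below are assumptions for illustration only.

```python
# Illustrative caller-side handling of the status codes described above.
# The URL, timeout, retry count, and fallback are all hypothetical choices.
import time
import urllib.error
import urllib.request

def get_phone_numbers(url: str, max_retries: int = 3) -> bytes:
    backoff = 0.5  # seconds
    for _ in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.read()          # 200 OK: the response is usable
        except urllib.error.HTTPError as err:
            if err.code == 503:             # the service failed fast,
                time.sleep(backoff)         # so retry with backoff
                backoff *= 2
                continue
            raise                           # 404, 418, ...: retrying won't help
        except (urllib.error.URLError, TimeoutError):
            time.sleep(backoff)             # timed out or couldn't connect
            backoff *= 2
    return b'{"phone_numbers": []}'         # fallback: degrade gracefully
```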

What are one service's expectations from another? We talked about clean APIs: articulate what you can and cannot do. We talked about status codes: articulate exactly what your caller should do next. It should be clear what to do next. Also, fail fast. Those are just some of the ideas that we, as a community, as an industry, have developed around how systems talk to each other.

We also have developed a lot of best practices around what to do with failure modes. We're very careful not to have a single point of failure. If you heard Aaron's [Blohowiak] talk this morning, he talked about how Netflix runs in three regions so that there isn't a single point of failure. We always want to make sure that there isn't just one server or one zone or one region that handles any one specific thing, ideally. We have best practices around what to do when a lot of requests come in. A system cannot always say yes to a request. At some point, it needs to throttle or it needs to say, "I can't do any more." Then we have very good practices, again, around communications. Fail fast, let your caller know what's going on. Communicate clearly where you are.
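As one illustration of a system that cannot always say yes, here is a minimal load-shedding sketch; the concurrency limit and the handler wiring are assumed for illustration.

```python
# A minimal load-shedding sketch: when too many requests are in flight,
# say "I can't do any more" right away instead of silently queueing work.
# The limit and the handler wiring are illustrative, not from the talk.
import threading

MAX_IN_FLIGHT = 100
_in_flight = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle_request(do_work) -> tuple[int, str]:
    if not _in_flight.acquire(blocking=False):
        # Fail fast and tell the caller where things stand.
        return 503, "over capacity, please retry with backoff"
    try:
        return 200, do_work()
    finally:
        _in_flight.release()
```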

Those are some of the things that we, again, as a community, have developed around how systems interact with each other. It's not all about what to do when a request comes in. There's sort of this meta thing here, which is that we, as the owners of these services, spend a lot of time thinking about how this is going to scale. How much traffic is my system going to have to handle in one or two years? What's going to happen when we have that kind of traffic? Probably, we need to do some work to make it able to handle that kind of traffic. What do we need to do to set ourselves up for success?

The key insight here is that if we think ahead, and we actually spend the time preparing ourselves for the future, we can actually prevent some of the reactive work from happening. One of the examples here is precompute some stuff. "There is this request that always takes me forever, and I need to do a bunch of work. How about I precompute it and cache it somewhere?" That then protects your system from having to do all that work repeatedly. Those are some of the patterns that we've developed.
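A minimal sketch of that precompute-and-cache pattern, with hypothetical names and refresh interval; a real system would likely refresh an external cache on a schedule rather than use an in-process dict and a background thread.

```python
# A sketch of the precompute-and-cache idea: refresh the expensive answer
# on a schedule so that serving a request is just a cheap lookup.
# The names, interval, and in-process dict are illustrative only.
import threading
import time

_cache: dict[str, dict] = {}

def expensive_report(key: str) -> dict:
    time.sleep(2)  # stands in for the request that "always takes forever"
    return {"key": key, "computed_at": time.time()}

def start_precompute(keys: list[str], interval_s: float = 300.0) -> None:
    def loop() -> None:
        while True:
            for key in keys:
                _cache[key] = expensive_report(key)  # refresh ahead of demand
            time.sleep(interval_s)
    threading.Thread(target=loop, daemon=True).start()

def handle_request(key: str) -> dict:
    return _cache.get(key) or expensive_report(key)  # cached result is the fast path
```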

Ownership in Human Systems

The thesis that I bring to this talk is that humans are very different from machines. Ok, that's not the thesis, we all know that. Human systems actually work in surprisingly similar ways sometimes. I want to draw an analogy between all the best practices and patterns that we're seeing in distributed systems and our human systems, how we, as humans, interact and how we, as teams, can interact. If there's nothing else you take away from this talk, I want you to think about, when you go back to your daily jobs, what best practices we've learned from systems that we can actually apply to our teams and to our interactions as people. Let's explore those ideas a little bit.

This is a blurred image of my inbox. I have like 100,000 unread messages here. All of them are interesting to me. I'm never going to be able to read them all, but all of them are interesting to me. You see, up here, I have a bunch of updates, that's usually something like people leaving me comments in docs and things like that, and somebody's pinging me on chat, some personal message. Then, helpfully, Gmail tells me, "You received this message five days ago. Are you going to do anything about it?" Then I have a system that doesn't always scale, but I have a system where I star emails that I really need to do something about, like pay a bill. This is my personal inbox. My work inbox looks different. Not any better.

The point here is that I'm sure I'm not the only one who sometimes feels like a system that is getting a lot of requests, with a lot of stuff being put on my plate. I need to be able to answer the question, "How can I function in this environment and still provide to my co-workers what they need, so they're not blocked and can do their jobs? What does my team need to do in order to make sure that other teams aren't blocked, that we, as a company, can actually move forward productively?" Our human systems look something like this. They look very much like our distributed systems in a lot of cases. Basically, there are a lot of people all talking to each other. This is probably not news to you, and your inbox probably doesn't look very different from mine. Again, the question is, what do we do about it?

Option number one. Some days, I feel like all I'm capable of doing is saying, "418 I'm a teapot." I like that status code, I really do. That's not actually what I say. What I actually say sometimes, and I don't know if you do this too, but sometimes I actually say, "I'm being DDoSed." I have so many people wanting so much stuff from me that I sit down at my desk, and just the act of prioritizing it makes me throw an "out of memory" error. I'm literally DDoSed. That's not a good state of being, and so I started thinking a lot about how I can apply some of the things in this analogy to distributed systems to my own work, so that I can show great ownership and so my team can show great ownership. My 10-year-old son really loves Mondays. I don't know why, but he will get up any day of the week and say it's Monday. Sometimes he'll get up on Thursday and say, "Today is the fourth Monday of the week." Sometimes I feel that way too. Let's do something about it.

If I think about this analogy again, what can I do to behave like a well-behaved system? When we have computer systems that misbehave, what do we do afterwards? We get together, we have a blameless postmortem meeting, and we talk about what we have learned and where we need to put better APIs and things like that in place. Let's do the same about our work. Let's reflect a little bit on what we do. As we go through this analogy, I'll lay out a few goals that I have set for myself and for my team. Let's talk a little bit about clean APIs.

One of the things that we want to be able to do is articulate what we can do and what we cannot do. Just like a system's responsibility is to get whatever it needs from its downstream systems, it is my responsibility to do the leg work to be able to handle a request. If somebody comes to me and asks me a question, and I'm actually the one who is going to answer this question, it's my job to go and figure out who to talk to and what systems to dig into, and so forth. That is a clean interaction. The API is: you ask me a question, I do the leg work, I figure it out, and I come back with the answer. Just like in a system, I would expect that system to go call its downstream systems rather than me having to call them myself.

Similarly, status codes - what's the equivalent of status codes with human interaction? Again, if you draw that analogy, my responses should give you a pretty clear indication of what to do next. I should be clear about, "Can I do it? Can I not do it? Can my team do it? What is the next step here?" Then, third, fail fast applies very much as well. You don't want me to fail, I hope, but if I do, you probably want me to know right away.

Let's dive in a little bit. Let's talk a little bit more about clean APIs. What are your focus areas? What are my focus areas? What are the things that I want to focus on this quarter or this year, and what is it that I'm not going to focus on? If I have a clear articulation of what my team's charter is and what my charter is, it makes it so much easier for people to know when to come to me for something. Being clear about what our priorities are, and also, which leads to the second thing, what our priorities aren't right now in this quarter, is actually super helpful to the rest of the organization. Courtney [Hemphill] talked earlier a little bit about communication, how communication is key. If nobody knows what I'm doing, then that really doesn't help anybody. Me disappearing into a corner for six months and then reappearing makes it very difficult for the rest of the organization to fit around that model.

Let's talk a little bit about status codes. Let's take a hypothetical example. My colleague comes to me and asks me, "Can you figure out why X is failing?" If all goes well, and for instance, it's a quick thing, my colleague gets back a 200 OK, not really, but you get the idea, and I will say, "Definitely, here's the problem. Here's why X is failing," and I'm able to answer their question. Hopefully, all goes well. That's the 99.99% case. Everybody knows where they are and gets what they need. In reality, what happens sometimes is, my colleague comes to me or sends me an email saying, "Can you help figure out why X is failing?" What do they get from me? First of all, nothing right away, because I'm probably behind on my email; like today, I was kind of busy with this track. So I'm behind on my email, and right now they don't know anything.

Now, there are two scenarios. Scenario one is I'm actually working on it. I received their message, I'm doing my leg work, I'm looking at some dashboards, I'm looking at some systems, I'm talking to a bunch of people, but I haven't told my colleague that. I'm working on it, but my colleague doesn't know that right now. The other option is I'm not working on it, because they're not the only one who wants something from me at that point in time. To my colleague, obviously, that looks the same, and I'm sure you've been in the same boat before with your colleagues and you have done that as well.

How does my colleague react, or how does my colleague even tell the difference? What would a system do? A system would retry after it times out. What's the human equivalent of retrying? Ping. If you've ever gotten a ping, this is what this means. My colleague comes back and says, "Ping. Hi. Anybody there? Can you figure out why X is failing?" Of course, all these other people retry too. Then, we're back in this DDoS situation, because everything stacks up. I catch on fire, hopefully not literally.

What do I take away from this? One of my goals is to be clear about where things are. I told you a little while ago, I have a few goals that I set that I learned from drawing this analogy. I'm nowhere near perfect on these goals, but this is something that I'm working on and that I talk with other people about.

The first goal is to return 202 Accepted when I can do the work. I wouldn't say, "202 Accepted," although maybe I should. I will say, "Ok, I'll do that, and I will get it to you by the end of the week." That analogy actually works pretty nicely for teams as well. I don't know about your teams, but my teams get a lot of requests saying, "We really need this feature. The world is going to end if we don't get this feature. We really need all of this." When they come to us and they say, "Can you build this feature?" or "Can you do this integration?" or whatever it is, we have to make a decision, and we have to be very explicit about it. The other team needs to know where they stand. We should send them a 202 Accepted, "We'll do it, and we'll do it by the end of the quarter," or we have to say we can't do it.

Fail fast. I talked about 202 Accepted with an asterisk of "when I can do it," and that's actually really key, because I think all of us have a lot more things that we could be doing than we have time for. As owners of our areas, I think we need to be honest about what we can and cannot do and why. Getting back to my colleague and saying, "I get it, it's important. I really do understand you need this, but I simply cannot humanly do it," is an important skill. You have to be able to say, "I have these other priorities right now that are just more important than this specific request." Being honest about that and saying, "No. Right now, I'm throttling and I cannot do this right now. Let's talk about it in a week."

I think one of the worst things that systems can do, and I think humans can do, is time out and then return a permanent redirect. Let's say my colleague sends me an email and says, "Can you figure out why X is failing?" I do nothing. Maybe I haven't even really digested the email. I do nothing. Eventually, they ping me because I timed out. They ping me, and then I come back and I say, "I finally read your email. I'm not the right person to talk to. Go talk to this person over there." That's probably not great. Not a great experience. If that happens to me, I feel like I haven't shown great ownership of my area, but also of the company. I haven't unblocked my colleague, and they just lost a whole bunch of time when, really, the right person to talk to is somewhere over there.

When we talked about systems, we also talked about failure modes and how to avoid them. The same thing happens in teams, and some of this is probably very familiar to you, and it applies. You want to make sure there's no single point of failure. If all this stuff is on your plate and you're the only one who can do it, you have work to do in making sure you're not the single point of failure here. You need a deeper bench on your team. You need to make sure that other people can answer these questions and do this work as well.

What do systems do? They scale horizontally. What do teams do? They scale horizontally. Many of us spend a lot of time thinking about how we can get our teams to a place where multiple people can do any given task. If that one person is out on vacation or whatever, other people are there, and the business continues, and nobody's blocked. The same thing applies.

Don't always say yes. We already talked about that a little bit. You can't commit to everything, probably, so be clear about when you cannot say yes. If everything's a P0, nothing is. Many of you feel that too. Goal number four is to be very explicit and say yes when you can do it, or when I can do it. That means my utilization is down here in the lower levels, not up here. Be very clear about when you can and cannot do it. Then, communicate. The more time I spend in this industry, the more I've become a really firm believer in communication, up, and down, and sideways, and everywhere. In some of the talks earlier today, this was touched upon as well: how it's very important, not just to have good communication, but to be really thoughtful about what that communication is.

For instance, you might send status reports, but be very careful about how you send those status reports, because they need to be usable by the readers of those status reports. Otherwise, they're not going to get anything out of it. You wasted time and they wasted time, and still, nobody knows what's going on. I've read many status reports where I just read a page and had no idea what it was talking about. It made perfect sense to the people who wrote it, but they're kind of not the target audience. Let's not write status reports that are not readable by people outside the team. We think a lot about that, and about how we can be very crisp and clear in our communication, and I think that's really important, especially as you work on projects that involve multiple teams.

Then, the corollary is that you really kind of need to have a system. Develop that system, whatever works for you, for your internal tracking, making sure that you know what's going on. Also, figure out what works with other teams and what works with your management chain, so that everybody is on the same page and in the loop and there are no surprises.

When I talked about systems, I talked a little bit about how we think proactively. We want to think about where we are going to be in a year or two. Ideally, the same thing applies. It's not all about reactive work. It's not all about requests coming in, you handling them, and the work being done. It's very much about, "What is the charter of my team a year from now? What do I need to do to scale myself to a year from now? What are the skills I need to develop to be able to do my job a year from now, or two years from now?" That's really important, and I feel like we often don't spend enough time on that, especially on the personal growth piece.

My theory is that we all say we're so busy and we just don't have any resources left to think about that. Whereas, when we design our systems, sometimes we do have time to think about what that system will look like. I really think it's important to think about that also for our own personal development and personal growth. I do think that when we think ahead, and we spend some time planning and developing ourselves for what lies ahead a year or two from now, just like in systems, we can avoid some of the reactive work further down the road. If we develop tools so that people can answer questions more easily, or if we scale out our org in a way that more people can handle the work, and so forth, then that really helps us scale and prepare for the future.

Take-Aways

What are some of the takeaways here? The thing on the left, if there's nothing else you take away from this talk, the one thing I want you to think about, is this analogy. If you know me, I'm all about analogies, and sometimes I really take them too far. I feel like this one can go pretty far. Think about that in your daily work, how some of the best practices and patterns we've developed in distributed systems apply to us. Then do less, but do it well. Just like systems, don't do everything, don't sign up for everything, but what you do sign up for, make an effort to do it really well and have a high SLO. If you say no, explain why, and be explicit about what else is more important right now and why it's more important for the business. Communicate. I'm just going to bring home that point one more time. Communication is key. It's important especially as we scale. If you don't know what's going on in some other team, you're just going to drown them in more work, and it turns out, it's probably not even them who need to be doing it. Then, minimize timeouts, which is part of communication. If you can't do something, tell them fast.

Finally, as we head into the evening and close this talk, one of the things that I really want to take away for myself, and that I want to share with you as well, is that, just like distributed systems will eventually fail, and we've learned that, we need to set ourselves up for failures. Eventually, distributed systems will fail. The same thing will happen with us. We can have all the systems and processes and good intentions in place; eventually, we will drop something. Apologies if you're waiting for an email from me. Hopefully, it'll happen at some point soon. I don't expect perfection from myself in that regard. I strive for it, but I don't expect it. It's also very important that we apply that to others. We understand they're under the same pressures as we are. We understand that they are also just trying their best, and they're trying to be a good citizen and a good owner. Sometimes that means taking stuff off their plate. Sometimes that means saying, "I could ask this person to do something for me, but you know what, they're overloaded, I know they are. It'll take me 10 minutes longer than it would take them, I'll just do it myself." It can also mean that. That also helps us work together better as a company, I think.

Questions and Answers

Participant 1: The question that I had is that, usually, what bothers me a lot is the context switching. Whenever someone asks me a question, that takes me off my work. Even if I say no, that I don't have time, it takes me off my work. Switching back to the original context takes a lot of work. Do you have a way of preventing them from asking the question in the first place? "I'm busy right now, don't bother me. After 3 p.m., you can call me," or whatever.

Probst: Do you want me to take my distributed systems analogy to the limit?

Participant 1: Please do.

Probst: I'm going to need some time with that. The question was, how can I avoid distractions? I think there are two aspects to this. One is avoiding distractions in the moment. I actually now completely turn off notifications sometimes when I'm heads-down on something, because I know somebody will ping me or some email will come in. I get distracted super easily. People have been saying this for years, but it actually makes a huge difference when you turn off notifications and you're just, "I'm just going to read this document now for the next half hour." That's number one, tuning out the distractions.

That doesn't solve the problem that you were posing, which is how do I prevent people from asking me in the first place. The way that I think about this is, I want to be clear about what people should come to me for and what is not something that people should come to me for, what is already documented, or what is another team's charter or responsibility. Being very clear about that helps a little bit. Sometimes you still get lots of questions, and you redirect, and that's fine. I think just being clear about that really helps.

Participant 1: How do you make that clear, just by having a list of topics that people can ask you about?

Probst: I do, actually. We have team charters, and we can always do better, but we can say, "This team does these things and is responsible for this system," and so forth.

Participant 2: What happens when you have so much mail or so many requests? You want to fail fast or respond to every request, but sometimes it takes a while just to read a request.

Probst: You have to triage, yes.

Participant 2: You really have to understand what is being requested before you can respond with, "I can do this," or whether you have to ask someone else. Just processing these requests takes a significant amount of time. How do you deal with situations like this?

Probst: I don't have all the answers. I can tell you what I do, or a few things that I do. Number one is I do spend time every day scanning through my emails and kind of triaging. The ones I can respond to quickly, I respond to quickly. The ones that take more work, I get back to later. They get a star. There are threads that are just very involved. I have actually, for instance, forwarded a thread to somebody who was already on it and said, "Look, this thread, I don't have time to read and digest it, but it looks like you're active on it, so just pull me in if you need me." Because maybe there's some deep thing that's being discussed, and I fully trust that person to figure it out, and I don't need to be in it. I try not to let really important things stay on my plate for too long for that reason. It does take a lot of reactive time. It's very true. I always have a list of things that I know I want to accomplish this week, like the more proactive stuff, and then email is separate. Email is the more reactive stuff, typically.

Participant 3: As developers, we help each other, and I suppose we want to grow together with our team members. What if one of your team members asks you for help, and after you provide your thoughts or pseudocode or a solution, he or she asks for even more? Basically, they expect you to provide the exact code.

Probst: I think there are also two aspects to this. One is, where do you draw the line in terms of how much time you're willing to give to your teammates? We always talk about how we help each other, and together we're a more proactive team. Maybe one thing that can help is to think about where that person is on their journey of learning. Aaron [Blohowiak], this morning, talked about these levels of ownership. Is this person in a place where you just have to teach them and kind of bring them up through these levels of ownership? Then they require more oversight or more time commitment, I guess, from the rest of the team. If that's leaning too heavily on one person, then that could be a problem. I think it's the manager's or the team lead's job to recognize this, with input, obviously, and load balance better.

The second aspect of my thought is that something I see happen quite a bit is that the people who show the greatest ownership, who care a lot and are very involved, get all the stuff handed to them. We're, "Oh my gosh, we have a problem. Who can handle it? This person over here can handle it, because they handle everything else. We know they're going to do it really well." To me, that's like a single point of failure problem. Then, it's up to the team and the team lead and the manager to make sure that we can scale horizontally. We're not always perfect with that, and people emerge as these single points of failure because they're so great. This is an ongoing process: you recognize the problem, and then you scale out, and you recognize the problem again, and you scale out again. If you're having that problem personally, I would certainly talk to the manager about it. I wouldn't consider that an escalation. It's just, "How do I do this? You expect me to do all this stuff, and now I'm spending all this time on this other person." That's what I would do.

Participant 4: You mentioned identifying single points of failure within teams. I've heard of some exercises teams will apply to identify single points of failure. An example is they'll identify a team member and have them be gone for the day and find out what things fail. Are there approaches like that that you have applied and found effective? As you mentioned, it was kind of reactive, you may identify the problem once it happens, but is there anything you apply to identify that single point of failure beforehand?

Probst: That's an excellent point, and it will just further my analogy, I love it. Everything I talked about here is reactive. What we do with systems is failure testing and failure injection testing. Sending people away is a great example. Just send them away and see what happens, just like we do with chaos testing. To be honest, I haven't done too much of that, like sending people away to see what happens. I do think about the hypothetical of what would happen if this person weren't here. What systems do we have that don't have clear owners? It's a huge red flag to me if some bug comes in and everybody's, "Nobody really knows this code." That's a big red flag, so we need to do something about that. Just paying attention to these things before they blow up and become problems is the approach I've taken, although I really like your idea of failure injection testing. I might have to do that.

 


 

Recorded at:

Feb 20, 2020
