Lessons Learned from Remote-First SRE

Duration: 38:47

Summary

James McNeil discusses how they have made remote working sustainable at Netlify, practices which can improve hybrid and in-person incident management.

Bio

James McNeil is a lapsed strategy consultant who did a coding bootcamp on a whim and hasn't looked back. He is interested in equal measure by human factors in incident response and computer networking. He has plenty of time to think about both as a Site Reliability Engineer at Netlify. He works remotely in London.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

McNeil: My name is James McNeil. I'm going to talk to you about my lessons learned from remote-first SRE. I work for Netlify. Netlify is a Jamstack company. Jamstack stands for JavaScript, APIs and markup. We provide customers with a Git based workflow to host pre-rendered assets in JavaScript on our CDN, while also integrating with functions and external APIs. As an SRE team, we manage the infrastructure of our CDN origin server and build system, while also developing our incident response practices with our other application teams. I am based remotely in the UK, and I've been at Netlify for about a year now.

Netlify's Distributed Culture

At Netlify, we care deeply about remote work. We like to think of it as building a distributed culture. Before the pandemic, 65% of our staff were fully remote. I have colleagues in the U.S. obviously, but also in Canada, Brazil, Argentina, all over Europe, Kenya, Israel, India, Japan. Strictly speaking, though, we don't follow the sun. What I mean by that is that we don't distribute teams and work geographically. We don't hand over during the day at different points to the next set of engineers who are awake, although we do staff support teams more broadly, geographically, so that we do have in-person spot cover throughout the day. My personal team, the SRE team spans 5 countries and 10 time zones. While I want to be clear that I'm by no means an expert on remote work, and we, I'm sure, get things wrong, we do try to iterate. My point, if anything, is that remote work was not forced on us by the pandemic, it was a conscious choice that we've made over many years, about how we wanted to grow as an organization and who we wanted to hire.

Back to Work Is Becoming a Divisive Topic

In the wider world of work, though, there's a, I think, divisive topic of back to work, or at least back to the office, which is coming to the fore. In one respect, I think that's something that we're here to talk about. A few weeks ago, the chairman of the Conservative Party here in the UK, said that, "People need to get off their Pelotons and back to their desks." He was responding to comments from one of his own civil servants saying she preferred working from home because it gave her more time to exercise on her Peloton. While the tech industry is very different from the UK civil service, I think there's a lot in this soundbite that we can reflect on and that parts of it may resonate for some of your experiences.

First, there's a sense in this statement that there's maybe something missing from remote work compared to office work. This might be productivity, or community, or coordination, or oversight. On the other hand, and in opposition to that from certain employees' perspectives, those who found remote working at least in part fulfilling, there's a concern about losing that time that they gained back from not having to commute. The autonomy over your schedule that you get from being a remote worker. I know personally, as an on-call engineer, there is a peace of mind to being in my own space, and having at least some control over my setup, and how I am particularly when I'm on call.

I fully acknowledge that remote working is not for everyone. People have a number of reasons why they either can't or don't want to work remotely, many people prefer working in an office, and it is very much a privilege to have the space, the infrastructure, and a willing employer to allow for remote work at least some of the time. I am concerned that we might focus too much on a dichotomy between remote work and hybrid or in-person working. We treat these things in opposition instead of potentially looking at how one can improve the other.

The Practices We Employ to Make Remote Work Sustainable

My feeling is that remote work can be incredibly rewarding, but as many engineers and many people in the world have learned over the past year, it's not just about replacing meetings with Zoom calls. I'd like to speak to you about the ways in which adopting practices to encourage effective remote working can benefit all modes of working, hybrid as well as in-person. At Netlify, we spend a lot of time thinking about how we want to work together, how we can work sustainably, how we can be productive. I'd like to share some of those practices with you. My hope is that you'll see the benefits of them regardless of where you choose to work.

Some of these practices for us are setting manageable work-life expectations, building cohesion in our remote environment, prioritizing asynchronous communication, and establishing well-understood norms for incident management. I'll first talk about our general practices, then about the practices that I and my team use in SRE and in incident management. Finally, I'll have a bit of a reflection on where I hope that other people might take this in their own working environments.

It Is Easier to Collaborate In-Person

I know full well that it is easier to collaborate in-person, because I haven't always worked remotely. I was a software engineer at Pivotal. Pivotal was founded explicitly on the principles of extreme programming. Our offices were literally designed for collaboration. We had creative spaces, whiteboards at every turn, quiet spaces. We would arrive at the office before 9:00 every day, have a wonderful, catered breakfast ready for us. Then at 9:06, have our company-wide standup and then go to our desks. The reason that this was structured in this way, was because work for us was pair programming all day, every day, from 9:06 to 6 p.m., with Ping-Pong breaks. The image you see here was a very typical workstation. One computer with two separate setups for two engineers so that they didn't have to crane to look at the screen, and they would have essentially a conversation about code all day. We would switch pairs daily. The way that the office was set up was so that this mode of working was how we would do our jobs.

Remote Work Requires Discipline

Obviously, a lot of that is not possible in a remote environment, or at least some of it is very difficult. The benefits that you have of in-person collaboration and interaction can't be taken for granted. This is why I feel that remote work requires, if not more discipline, a certain maybe different kind of discipline. At the beginning of the pandemic, Mitchell Hashimoto, the founder of HashiCorp, tweeted a thread which has really stuck with me since, and resonated with me at the time about essentially the fact that for all of those people coming in to remote work, who were new to it, there were some things that they needed to take into account. That it wouldn't just be a question of having your home office mimic your in-person office. I'll be covering a lot of the same topics as this thread. I would encourage everyone to go and read them because I think they're very valuable and it'd be interesting to see and hear if and how they've resonated with you the past 18 months.

How We Work

How we work at Netlify. This is a combination of my own lived experiences and observations as well as practices that we endorse at Netlify, some of which are written down, some of which are picked up through habit and through culture really.

Make Yourself Almost Unreachable

This first point, one of the biggest pitfalls I've discovered from remote work is actually that you have more time. There's no commute. I am a few steps away from my bedroom to my desk, so I can get up and make my coffee and then sit down and get started. Likewise, I'm a few steps away from my living room. After work, if I'm reading or watching TV, I can also very easily get up and get to my desk and check my messages. Also, I can just check my phone. One thing that has been a striking realization for me is that when there's no office, Slack is your office, and that this fundamentally changes the dynamic of just checking your messages. If you check Slack after hours, you might as well be walking back into your office and in fact you essentially are walking back into your office. That's why it's really important to set your own boundaries and guard them quite fiercely because this is ultimately a question of your own mental health. For an employer, it's a question of retaining their workforce and making sure that they don't burn out.

At Netlify, we put our working hours in our calendar, and we use the Clockwise app to optimize focus time. Clockwise also adds a widget to our Slack avatars, which indicates whether or not a person is in or out of their working hours. I turn off all notifications out of hours. No human being other than our people team has my phone number. We don't share phone numbers. We don't text. We don't call each other or do anything like that, for work. At the same time, I am an SRE. I'm an on-call engineer. Obviously, I need to be very easily reached under certain circumstances, which is why this is a bit of a baiting headline. In general, we want paging an on-call engineer to be an incredibly easy process, both in terms of it being streamlined, but we also tell new joiners: when in doubt, page. That is our baseline. There should be no misgivings or questions about whether or not someone's available or awake. That is what on-call is for: they are ready to be paged, they're prepared to be paged.

However, if you are not on-call, there are still circumstances where someone might need to reach you. You might be an SME, and there could be something very bad happening, or you could be the last person to have touched some code, and likewise there's a problem that we're going to need you for. If a particular engineer who's not in office hours or on-call is required during an incident, we want that to be a conscious and non-trivial act. We find that PagerDuty is actually a very good way of setting this up. I am reachable via PagerDuty, but people cannot reach me. Usually, I shouldn't have to be paged, but it's possible that I could, so someone will create an incident and tag my escalation policy in it. That will come to me and I will respond.
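The reachability rules described above (page whoever is on-call without hesitation, use Slack during someone's working hours, and make reaching anyone else a deliberate escalation) can be sketched as a small decision function. This is only an illustration: the `Engineer` type, the fixed UTC offsets, and the three return values are assumptions made for the example, not Netlify's tooling or the PagerDuty API, and weekends and overnight shifts are ignored.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone


@dataclass
class Engineer:
    """Hypothetical record; a real setup would read this from a calendar."""
    name: str
    utc_offset_hours: float  # fixed offset for simplicity, no daylight saving
    work_start: time         # local start of working hours
    work_end: time           # local end of working hours
    on_call: bool = False


def in_working_hours(eng: Engineer, now_utc: datetime) -> bool:
    """True if now_utc falls inside the engineer's local working hours."""
    local = now_utc.astimezone(timezone(timedelta(hours=eng.utc_offset_hours)))
    return eng.work_start <= local.time() < eng.work_end


def how_to_reach(eng: Engineer, now_utc: datetime) -> str:
    """Pick a contact route following the rules from the talk."""
    if eng.on_call:
        return "page"      # when in doubt, page: on-call is ready to be paged
    if in_working_hours(eng, now_utc):
        return "slack"     # they are "in the office"
    return "escalate"      # conscious act: open an incident, tag their escalation policy
```

For example, at 20:00 UTC an engineer in London working 09:00 to 17:00 would come back as `"escalate"`, while a colleague at UTC-7 with the same hours would come back as `"slack"`.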

Communication

Our communication stack and infrastructure I don't think is particularly special. We like to have clear boundaries between them. We, in all of them try to prioritize both asynchronous and open communication. Slack is our office. It's where we spend most of our day. It's where we coordinate efforts and socialize. It's also the central hub for our incident management. We have a very loosely enforced no DMs, no private channels policy. Obviously, people do DM each other and there are private channels. We try to set as our default that everything should happen in the open. This is actually very important for the way that we work. One of the first steps I'll do when trying to get context on an issue or any problem I might be facing, or even just to figure out if it's something that we've thought of before would be to do keyword searches in Slack. Or if I ask for help in a Slack channel, very often an engineer will go find a conversation that could be years old even, and paste that thread as the initial context.

I Signpost My Day

One thing that we do very often is signpost our day. My own team lives across 10 time zones. We don't have a set lunchtime. We can't take our coworkers' schedules for granted. If you were to look at our Slack channel, you would see people signing on, checking in, going for coffee breaks, picking up their kids from the school run, and logging off. Being explicit about these things, saying that they're doing them, and getting appropriate reactions from other people or presenting reactions like the back emoji that you see there to indicate that you've returned from whatever it was that you were doing. This isn't about keeping tabs on people. My manager lives in Portland. I hope she has better things to do when she gets to her desk than check how long I went for my lunch break. In particular, in my team, when at any point an alarm could go off and we need to coordinate a response, you do need to know who's in the room. This is a really important and helpful way of getting situational awareness of who you can call on.

Emojis Are a Critical Part of Our Communal Infrastructure

There were a lot of emoji reactions on those past comments of mine, and emojis are actually a fundamental part of our communal infrastructure. This is actually also something that Mitchell Hashimoto pointed out in his points about remote working. You can see a tweet from one of my colleagues about how many emojis we have. I just did a search for netli in our emoji search bar and you get dozens of emojis in our Colorway. We also have an old man screaming at the cloud emoji for probably every SaaS provider for when there's an outage and we want to moan. It serves a few functions. It's a means of condensing information, so you can express in one emoji what might otherwise take several sentences. It also lets you stack it, so in a lot of our channels you'll see group reactions from various people. It lets everyone express emotions and have a communal reaction to things without involving long threads.

Ceremonies Are an Important Way of Maintaining a Distributed Culture

Ceremonies are also an important way that we maintain our distributed culture. We have a large number of channels that are very actively contributed to. We have celebration channels like #internal-thanks, where people thank other people for things that they've helped out on, or ways that they've been really great. We have loads of fun channels for various games. The #we-moji channel is where people will post new emojis that they've found or designed as they add them to our library. We have loads of shared interest channels, where we talk about things that we've made, or we do actually go outside. It seems like every weekend someone's climbed a mountain. We have a #we-talk-mental-health channel where we talk about our own issues with our mental health, ask our colleagues how they've dealt with stress, or share resources. We have an awww-together channel, whose name belies what it is: it's where we post images of our pets. We also have a Pets of Netlify website. If you have a pet when you join Netlify, you share your pictures and it gets added to the site. Then, learning, so we have book clubs. We write a lot of Rust so we have a Rust Learning Channel. Also, we have a Monday Kick Off meeting, and when possible, local meetings where people can actually meet up in person if they're in the same geography.

We Go From Conversation to Documentation as Quickly as Possible

Good documentation is absolutely fundamental to the way that we operate as a team and as an organization. One thing you'll see repeated very often in our Slack channel is let's move this to a doc. While most insights and issues and conversations start in Slack, we are encouraged in as short order as possible to research and translate those thoughts into a longer, more reflected on thought-out document. If it's a bug or a feature request, or it relates to a specific repository or part of our code, then obviously it goes into a GitHub Issue. Any broader architectural topic, or discussion, or explanation will go into a Notion doc. The benefits are twofold. It forces us to think and put more time into structuring our thoughts and arguments and also evidence. It also provides a platform for the wider team to add comments and questions asynchronously. The documents that we produce on Notion will have anywhere from a couple to a couple dozen comments and questions for clarification from colleagues from all around the world who will get to that document in their own time on their own day's schedule.

Sometimes You Need a Meeting

We don't entirely shun meetings, we use Zoom for meetings. We have company updates which happen through Zoom. We have team check-ins and ad hoc debugging sessions. Sometimes if two people are online, they'll just ask for a Zoom call because that is easier. Also, our incident response in our incident channels, we will open a Zoom call for all participants to join and coordinate in there. It is often not possible to get everyone in a meeting without requiring them to wake up early or sign off late. We also don't expect people to attend anything out of hours necessarily. Instead, all company meetings are recorded to ensure that everyone has a chance to view them on their own time. We also regularly record team meetings in case stakeholders are not currently online.

In terms of meeting hygiene, all meetings are expected to have an agenda. We take notes during meetings. It's very important to rotate the note taker because often the same person gets by default put into that role. That's something that we try to look out for. Finally, when we schedule our meetings, we have an automatic setting in our calendars which will take off five minutes at the end of every meeting, so that people have a little bit of time if they do have meetings stacked. Those meeting recordings are searchable by the entire company.

Mute Yourself

There's a lot that I could say about the way that I structure my environment and how I do my setup. I think there's fundamentally one thing that's probably important and transferable for the sake of this talk, and that is to mute yourself. You've probably all encountered this or heard it or been requested maybe to mute yourself in a call if you'd forgotten to. This headset here in the picture is a Sennheiser gaming headset. This is a throwback to my time at Pivotal, where every desk had one of these. The interesting part of it is that the boom when it's lifted, would mute the speaker. Every workstation would have one of these. This is something that I've brought and promoted to Netlify as a differentiator, because I've found that having a physical act which mutes yourself is very different from remembering to go to the Zoom panel and click mute. The headset that I'm wearing is not the one that you see, but it does have a button behind my ear, and that has become an automatic reaction for me to mute myself once I've stopped talking. This is very important, particularly in an incident setting where we need to reduce the background noise to as little as possible.

Incident Management - Leverage Our Existing Tools

Let's talk about incidents. The first thing to say is that our incident management builds on these common habits. Given how much discipline I've said it takes just to work remotely, never mind being paged, we don't want to add the burden of some new system that only gets considered or activated when we're in an incident. We really try as much as possible to leverage the tools that we have, and just extend them. We have Slack bots to allow people to page particular teams or service escalations. We also have a Slack bot which handles our incident management, so it declares our incident, opens our incident channel, and invites people to the channel.
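As a rough sketch of the three steps just described (declare the incident, open a channel, invite people), the bookkeeping such a bot performs might look like the following. Everything here is hypothetical: the `Incident` fields, the `#incident-<number>-<slug>` naming scheme, and the function names are assumptions for illustration, and no real Slack API call is made.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    """Hypothetical record of what the bot tracks for one incident."""
    number: int
    title: str
    commander: str
    declared_at: datetime
    channel: str
    invited: list[str] = field(default_factory=list)


def slugify(title: str, max_len: int = 40) -> str:
    """Lowercase the title, keep letters and digits, join words with hyphens."""
    words = "".join(c.lower() if c.isalnum() else " " for c in title).split()
    return "-".join(words)[:max_len].rstrip("-")


def declare_incident(number: int, title: str, declared_by: str,
                     responders: list[str],
                     now: Optional[datetime] = None) -> Incident:
    """Build the incident record; the declarer starts as incident commander."""
    declared_at = now or datetime.now(timezone.utc)  # always stored in UTC
    channel = f"#incident-{number}-{slugify(title)}"
    # Invite the declarer plus responders, dropping duplicates but keeping order.
    invited = list(dict.fromkeys([declared_by, *responders]))
    return Incident(number, title, declared_by, declared_at, channel, invited)
```

Keeping the record as plain data means the same structure can feed the channel topic, the announcement summary, and the review document later on.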

Once we are in an incident, we open a Zoom call. That is where we join and we do discuss the incident in person for those people who are there. We also maintain a library of CLIs which we built out of feedback from incidents, in order to be able to get the common answers that we're going to want faster. I've built a wrapper to the Humio CLI, which will get me the common queries that I know I'm going to want, if we have certain kinds of outages. We have testing tools which can target specific subsets or combinations of our CDN nodes to get answers from them quickly. Another interesting thing that we found, potentially recently, Datadog is our metrics aggregator, where we have our dashboards. If you Control-C on a Datadog time chart or any graph, you can then paste that into the channel and it will paste with all of the settings, so the time specificity and any of the attributes that you've set in your own Datadog window will then be in the incident channel.
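A wrapper of the kind mentioned above, which maps short names to the queries you know you will want during an outage, could be sketched like this. The `logsearch` command, its `--since` and `--query` flags, and the query strings are all placeholders invented for the example; they are not the Humio CLI's actual interface or query syntax, and nothing is executed.

```python
import shlex

# Hypothetical saved queries; the query text is placeholder syntax only.
SAVED_QUERIES = {
    "errors-for-service": "status>=500 service={service}",
    "errors-for-node": "status>=500 node={node}",
}


def build_command(name: str, window: str = "15m", **params: str) -> list[str]:
    """Return the argv for a saved query, filling in its parameters."""
    try:
        template = SAVED_QUERIES[name]
    except KeyError:
        known = ", ".join(sorted(SAVED_QUERIES))
        raise ValueError(f"unknown query {name!r}; known queries: {known}") from None
    query = template.format(**params)
    # 'logsearch' stands in for whatever log search CLI is being wrapped.
    return ["logsearch", "--since", window, "--query", query]


def render(argv: list[str]) -> str:
    """Shell-quoted form, handy for pasting into the incident channel."""
    return shlex.join(argv)
```

Returning an argument list rather than a shell string keeps quoting problems out of the wrapper, and `render` gives responders something they can paste so that others can rerun exactly the same query.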

Incident Management - Over-communicate in a Shared Language

Like everywhere else at Netlify, we over-communicate. We try to over-communicate in a shared language during incidents. The first thing is that we highly encourage engineers to have their reporting settings, so their logging and their metrics, set in UTC. Definitely, too, when they're sharing that information. We repeat each other constantly. We want to confirm that we know what we think we heard. Again, this is also very important in a setting like this where we are potentially on calls. There may be some static on particular internet connections, so repeating things back to people becomes very important to be sure that you are actually getting the correct information. Mute yourself. In the incident channel topic, we have a field for the handle of the current incident commander, and we will do an explicit IC handover, either verbal or typed, and then change the incident commander name in the channel topic.
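Two of the habits above lend themselves to a small sketch: sharing timestamps in UTC, and updating the incident commander field in the channel topic on handover. The topic layout (`Field: value` pairs separated by `|`) and the function names are assumptions for illustration; the talk does not describe the actual format.

```python
from datetime import datetime, timedelta, timezone


def to_utc_stamp(local: datetime) -> str:
    """Format a timezone-aware datetime as a UTC string for the channel."""
    if local.tzinfo is None:
        # A naive timestamp is ambiguous across a distributed team.
        raise ValueError("timestamp needs a timezone before it can be shared")
    return local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


def hand_over(topic: str, new_ic: str) -> str:
    """Replace the 'IC: @handle' field in the topic, or append it if missing."""
    fields = [f.strip() for f in topic.split("|") if f.strip()]
    replaced = False
    updated = []
    for f in fields:
        if f.startswith("IC:"):
            updated.append(f"IC: @{new_ic}")
            replaced = True
        else:
            updated.append(f)
    if not replaced:
        updated.append(f"IC: @{new_ic}")
    return " | ".join(updated)
```

Rejecting naive timestamps outright is a deliberate choice in this sketch: it is better to fail loudly than to let "14:30" circulate in a channel that spans ten time zones.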

Incident Management - Document Everything

Again, like with the rest of the company, we document everything. The incident commander is the guardian of the timeline in the channel. This includes conveying anything which is discussed in Zoom into the incident channel. One thing that we are considering and investigating is also potentially to have transcripts from our Zoom conversations to add to that documentation. Maintaining runbooks is the responsibility of the entire organization. We have touchpoints in our incident review templates checking that our runbooks are up to date, and whether or not as a result of things that we've learned in an incident, we need to make any changes there. We summarize the incident in our announcements channel and include links to the incident channel, to the incident review document that will then once the incident review has happened also have a link to the recording so that anyone in the company can see what the incident was, understand what it was. This helps us convey the types of incidents that were happening, potentially their frequency, and helps us be open and share information about what happens at the company in a way, and how we learn from the incidents that do happen.

The incident review document is in Notion again, which means it's a collaborative process. It is spearheaded by the IC, but you will find questions for clarification and comments from people while we're preparing it ahead of the review. There is no one timeline, so it's very important to get people's perspectives on the timeline. You will get people saying that they thought certain things happened differently. That all needs to be reflected in the document. We record our incident reviews.

Implications for Other Ways of Working

Those are my main themes. I hope this has been constructive for those of you maybe even who don't work remotely at all, but especially those of you who work in a hybrid setting. What I would like to get across is that the discipline required to build a sustainable distributed culture can benefit all kinds of work. That covers the things we've gone over: maintaining rigorous documentation practices, investing in an online culture, and prioritizing asynchronous communication, both for promoting coordination, but also because it makes your employees who are in time zones which aren't PST or EST feel more engaged and involved with the work that you're doing. Finally, protecting your people's and your own time, which helps make the work more sustainable and helps everyone just do better work, I think.

Questions and Answers

Verma: I'm going to start with something that spans every kind of work, which is human connection. Maybe you can answer for your particular organization, how do people create a human connection, because it's such an important part of work? You being in a remote company, in a hybrid company for so long or in various companies actually, what have you seen are the best ways to do that?

McNeil: I think that Mitchell Hashimoto thread, if you couldn't read it, I think the first point is you won't make friends. He qualifies that a little bit. What he means specifically is like, you won't go for drinks after work, or your kids won't go to the same school, that kind of stuff. I don't know if I would be that strong. I think that there's a headline grabbing bit there. There is a part where it is true that you don't have the same outlets to socialize. For us, in a lot of ways, that hasn't been as strong a difference as I thought it might be. At Netlify, a big part of that ends up being how engaged the entire company is online, on Slack in particular. I think some of that ends up being genuine interest in different things. There's all these different channels where people talk about the plants that they're growing, or the tables that they've made over the weekend, or the mountains that they've climbed.

A lot of it, I think, is not particularly special when you boil it down, which is, you make connections or you make friends. You make connections with people by finding commonality. I suppose that everyone at Netlify is very invested in building those bonds as much as we can, and therefore participating almost. I think there was a question about participation in those channels.

There's definitely a sense that some people are more active than others. Because this is our office, I've been surprised to find that there is a very high level of participation in the fun channels, as well as in the work channels. Because all channels are open, anyone can either join a conversation or can be pulled into a conversation, whether professional or just something fun that's going on. There's no barrier that way.

Verma: I also remember you mentioning about some meetups, like employees tend to get together, plan meeting on their own. Can you maybe touch on that a little bit? That's another way of forming connection that is driven by the employees.

McNeil: We can't replace in-person meetups. One thing I saw recently was someone who worked remotely, just saying something along the lines of just one time meet your coworkers in person. It's important. Before the pandemic, and hopefully soon enough Netlify would do yearly offsites. Also, individual teams would do offsites at some point during the year. More locally, at least for the UK, we try to meet up, I think we're at about once a month right now, maybe once every six to eight weeks. That is just something that we organize amongst ourselves. We have a Slack channel for people who are in the UK, and try to find a date and get together. I wouldn't want to suggest that we basically force remote doctrine in all situations. If you can meet up with your coworkers, please meet up with your coworkers, absolutely. I think the point ends up being more about work for one thing, and how you have access to that, and how you share those professional ideas.

Verma: It's a shared responsibility. What you're saying is, you should not feel limited by what the company's explicitly saying. There's always freedom for employees to organize around these things. It is true that connection is pretty important.

McNeil: If there was a guiding tenet behind why we do things the way that we do, because it is not easy. It takes more work than it would to call a meeting and have a whiteboard, for instance. It ends up being about inclusion. We are very serious about being able to hire and work with people from all around the world. If half of the work was being done in a conference room, and I'm sure you've all been in the situation where you've had something on the whiteboard and the last person to leave the room is like, "Hold on, let me just get a picture of this." They upload grainy screencaps of what you've been talking about. That just wouldn't be good enough. That wouldn't cut it for us.

Even in the case of in-person meetups, that is open to everyone in the UK. If my colleagues in Spain want to come over to the UK for a drink, then by all means. All of these circles are essentially open to as many people as can join them.

Verma: Have you found people standing up voice channels, some always-on voice or scheduled voice where people can get together and exchange ideas, or talk, or sync?

McNeil: We do have Zoom syncs. I think that there have been experiments with always-on channels previously. For instance, when there was a San Francisco office. I don't think it went over particularly well. My experience of those in the past, because we did have one at my previous company, is that it's very much not the same. For the people who are there in-person, you actually don't necessarily remember that there are people online. I don't necessarily know that that's a solution for what I think the problem is, which is more about the ways that you communicate than necessarily being present.

Verma: Do you guys use some integration, which allows communication to happen seamlessly between GitHub Issue comments and Slack? Because you're switching context, and people are generally paying attention to maybe one channel at a time. Any tips there?

McNeil: It's probably going to be a bit more low-tech than people were hoping. We use the GitHub integration in Slack, which will print out the issue and give you the summary, or if you copy it from a particular comment, you'll actually see that comment. That is, in many ways, enough. You get enough context while you're in your Slack conversation, and then you can click over and read it. Another thing that is an informal policy going the other way, in GitHub for instance, probably because you don't get that same preview there, is never to paste a Slack conversation without context. You don't just say see here. You summarize, and you provide the link almost as a reference, wherever possible. That's the benefit of having that GitHub integration with Slack: if you are reading something, you want to be reading it where it is. If it is not possible to do that, then you want to summarize and provide the context as a link.

 


Recorded at:

Jun 22, 2022
