
Less Mess, Less Stress: the Reliability Benefits of Custom Tools


Summary

Daniel Hochman discusses how an overreliance on vendor tooling leads to worse reliability outcomes, how Lyft lowered MTTR for its most common alerts using custom tooling, and how Clutch can help.

Bio

Daniel Hochman is the tech lead of the platform tools team at Lyft and the creator of Clutch, the open-source platform for infrastructure management. As an early engineer at Lyft, Daniel successfully guided the platform through the explosion of product and organizational growth.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Hochman: Hello, my name is Daniel Hochman. I'm a software engineer at Lyft, where I've been for about seven years. First, I just wanted to thank you for attending my session. Please, reach out to me afterwards. My favorite part of conferences is getting to network and talk with people, and get feedback on the topics and ideas that I present. With this virtual format, that's going to be a little more difficult. Back to the topic at hand. In my time at Lyft, I've seen a lot of scaling challenges. Reliability has always been top of mind, of course. It's very important for modern apps and software. Tooling, I think, plays a big part in reliability. Recently, we decided to open source our own solution for custom tooling, known as Clutch. I'm going to talk through our discovery process: why we decided to build custom tools in the first place, and what the benefits and drawbacks of custom tooling are. I also want to compare that to the tooling that we get out of the box with the infrastructure and software that we use a lot these days.

What is Reliability? The RASM Model

First, I want to define reliability, or at least some ways that we can measure it. As engineers, we're problem solvers, and most problems have been studied before in some capacity. This model is known as the RAS model, and it was introduced in the field of hardware engineering for IBM's System/360 mainframe launch in the 1960s. I think it still applies to today's software. People aren't going to use your hardware, your mainframe, or your app if it's not reliable. Reliability is defined as the system functioning as expected: performing at the required level of performance, with the required level of correctness, without any failure. Availability is continuing to perform the overall function even when there is some failure. Distributed systems, of course, cover this. Serviceability is the ease and speed at which a failed system can be repaired. Manageability, which was not part of the original model but I think is still important, is: can we [inaudible 00:02:16] APIs necessary to monitor and control the system?

Mean Time To Repair (MTTR) & Mean Time Between Failure (MTBF)

We're going to talk a lot about serviceability. First, let's establish why it's important. Failures are going to occur. We rely on a cloud which is made up of virtual machines and shared infrastructure; we rely on networks, which are not reliable, on third-party APIs, and on third-party libraries. We ship code constantly to multiple systems at the same time, which can interact with each other in unforeseen ways. We know that there's going to be failure. MTBF, which you've probably heard of, is mean time between failure. That's how frequently failures are occurring. We don't focus too heavily on the mean time between failure, though it is important. What's really important is, when a failure does occur, first, can we remain available? Are our systems auto-healing to an extent? Second, when there's a larger failure, how fast can we repair it? How fast can we roll back the code? How fast can we update the configuration, change the infrastructure, or something like that? That's measured by MTTR, or what's known as mean time to repair.
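To see why MTTR gets so much attention, it helps to write down the standard steady-state relationship between these measures and availability. This formula isn't stated in the talk, but it is the usual way the terms are tied together:

```latex
\text{Availability} \approx \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
```

For a fixed failure rate, shrinking MTTR is what pushes availability up, which is why the rest of the talk focuses on repair time.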

When I looked at tools and was studying MTTR, I started to notice two overriding themes. There's the stress of fixing the problem while, at the same time, trying to avoid making it worse by accidentally doing the wrong thing. Then there's the mess: you're either dealing with a large number of tools, or the tool is very complicated, which makes it difficult to get things done quickly.

MTTR: The Four Factors

With those in mind, the stress and the mess, I tried to kind of formalize that into some factors, these four factors here, so that we could look for solutions in order to improve the MTTR. We've got the complexity of the system, specialization of the tools, the number of tools, and finally the ability to improve the tools.

Factor I: Complexity of System

The first factor is the complexity of the system. I'm going to gloss over this very quickly because there are several talks' worth of content here. There are lots of different techniques, some of which I've shown here, such as separation of concerns or aspect-oriented programming, that help you build a system in a way that lends itself to being serviceable, to not accidentally introducing bugs, to being testable, and so on. Complexity of the system is the most important aspect in having a reliable system. As I mentioned, there's a lot to unpack there, and it's probably best done in another talk.

Factor II: Specialization of Tools

The second factor is the specialization of the tools. Here are the consoles of each of the three major cloud providers. They present a lot of information, a superset of functionality. It's not about what you're running; it's about providing, again, everything that you can do, helping you discover new functionality. I like to compare that to the console of a space shuttle, which we see here. There are lots of different knobs and screens and information, and it can be overwhelming, for sure. Shuttle operators at least have the luxury of Mission Control telling them, "You should pay attention to this screen," or, "You should click this button." We don't get that as operators in the cloud.

On the other end of the spectrum, when thinking about some of the simpler devices, whenever I think about modern, simple UX, user experience, I think of the PalmPilot, which is a precursor to cell phones that we all carry today. It wasn't the first handheld device, but it was successful because they were able to distill the device down to three major factors.

The first one being how many taps it took to complete a task. They would actually sit there and count, for every single piece of functionality on the device, how many taps it takes. How many taps to add a calendar item, to look up a contact, to edit the phone number of a contact. And they had a threshold: if it took more than, say, five taps, they would redesign or reconsider the feature altogether. Is it actually important? Are people going to go through that many taps to get to it? Are they going to remember how to get to it?

Second, what features really matter in people's daily life, or in their business? This ranged from actual device features and software capabilities to the battery life. Why are people going to actually use this thing? How does it make their lives better?

Finally, how can we display data efficiently? At this time, of course, we're dealing with a low resolution display. It's difficult to see a lot of information and parse through it. If it's not relevant, and you're just looking for one thing, it's just going to slow down your use of the device.

I want to take those concepts and apply them to some of the tools that we use today. This was the most common alert, or most common remediation action, I would say, performed at Lyft for a very long time, before we moved to Kubernetes. We were running on VMs, we were using auto scaling groups, and we were cycling through VMs a lot. We would get bad hardware quite frequently, actually, at the rate that we were introducing new instances. When someone was paged for high CPU, first, they would ask, "Did I deploy any code? Is every instance, for example, showing high CPU?" If they found the culprit, they would go and terminate the hardware, terminate the instance. That took seven taps: select EC2, go to the instances, find the instance, select the specific instance that you're looking for, find the button that lets you choose the new state, click the state, click the button. Seven taps. Again, each tap actually matters.

Not only are there seven taps, but look at the larger picture of what each of those taps represents. There's lots of different information being presented to the user. When you come to the homepage, there are 175 AWS services that you see. Of course, you know you want EC2, but you may have to parse through that list and click EC2 to get there. At Lyft we run tens of thousands of instances in some cases. That slows things down: not only do I have to look through that list and try to figure out which one I want, but the page loads very slowly. In some cases it would even time out, until they fixed that bug for us.

Then when you get to each instance, you're presented with a lot of data. When I'm terminating the instance, I don't care about its storage, or what's going on on its hard drive. I just want to get rid of it. I want to make sure that it's the right one. I may want to see the tags that are on the instance to confirm. That's not the default view that's presented to you, so I have to go through all this other information to find it. Finally, when I do click to terminate the instance, I'm presented with this dialog which tells me, in some cases, you may want to do more. This isn't relevant at Lyft at all. It's relevant to the general user of the tool, but not when I'm trying to perform remediation on our services at Lyft.

When we talk about specialization of tools, what we're talking about is that the tools become slow or confusing to use due to the lack of specialization. There's too much information, there's too many steps. That just slows us down.

Factor III: Number of Tools

On the opposite side, if we talk about tools not being specialized enough, imagine every tool being hyper-specialized and just doing one thing. Then you end up with a lot of tools, which is its own problem. Here we look, again at Lyft, at a potential incident remediation and all of the different tools that you could possibly use to get that done. This is not even all of them; these are the major ones. We even have runbooks to help you try to figure out which tool you need to use. Trying to decipher all this, sort through all this information, and cut and paste between tools during an incident, when you're in that time-to-repair window, is not good. Customers are experiencing downtime and an outage, the wrong information is being presented to them, they're seeing an error. We want to remediate that as fast as possible. That's just not possible when you have to look in a lot of different places to find information.

Also, when you have a lot of different tools, operators generally just become unfamiliar with them. At Lyft, we're on call maybe once every six to eight weeks. Not every system is going to have a problem while you're on call, so you may not even touch a system for several months, or understand exactly how to perform remediation actions on it, because it's just not part of your day-to-day work. Then the systems themselves are changing. We're introducing new infrastructure, we have large infrastructure teams, and they're trying to improve things. People are starting at different times; there are different cohorts. We don't have the onboarding and continuous education necessary to familiarize people with all of that. It'd be very difficult to even formulate a curriculum that would help people understand these tools.

Second, the diagnostic information, of course, is spread across many systems. That just delays the decision. If we're cutting and pasting, if we're looking at multiple tools, logging into them, it just takes more time.

Factor IV: Ability to Improve Tools

The fourth factor is the ability to improve the tools. These days, it's very easy to get started with what other people would call, I guess, cloud native infrastructure. There are hundreds of articles out there like this: I can launch a new Kubernetes cluster in five minutes on EKS, or one of the other hosted options, and I can launch new databases. There are lots of different projects out there, and many of them have the goal of making it very easy to get started, because that's how you start to get users and build a community and gain traction.

With Kubernetes, we actually get this nice open source dashboard that we can use. For Lyft, it's not applicable. For availability reasons and to separate the blast radius, we run an individual Kubernetes cluster in each availability zone at Lyft in Amazon. The dashboard doesn't support that type of context. It is open source, so I guess we could fork it and modify it, but again, it's a very large tool and getting it to support all these different functionalities would be difficult. We did write a command line wrapper that would iterate over the clusters when you were using it, and help you perform these multi-cluster actions.
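As a rough sketch of what such a wrapper can look like (the cluster context names below are made up, and Lyft's actual tool surely differs), it can be as simple as fanning the same kubectl invocation out to one cluster per availability zone:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// One Kubernetes cluster per availability zone, addressed by kubeconfig
	// context. These context names are hypothetical.
	contexts := []string{"prod-us-east-1a", "prod-us-east-1b", "prod-us-east-1c"}

	// Whatever the operator typed, e.g. "get pods -n my-service".
	args := os.Args[1:]

	for _, ctx := range contexts {
		fmt.Printf("==> %s\n", ctx)
		cmd := exec.Command("kubectl", append([]string{"--context", ctx}, args...)...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", ctx, err)
		}
	}
}
```

Invoked as, say, `multi-kubectl get pods -n my-service`, it would print each cluster's output in turn.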

I want to talk about Google's postmortem philosophy, and in general, postmortems. We want to understand what the problem was, but more than that, we want to then actually take action. The next time it happens, we don't want there to be as large of an impact, or we want to prevent it from happening altogether.

With most things, we own the software. We find a bug, we open a pull request, it's fixed. We get user feedback, we open a pull request, user's happy. When you're dealing with vendor tools, that's just not possible. Most of these are closed source tools. They have a lot of different functionality. If we find a bug, we can report it, but you're just one of many people. They have lots of customers reporting bugs, so it can take time to fix it. Maybe it's just you want the tool to work differently, but your use case is different than the general use case. The general use case wins out and you end up, understandably, with this large complex tool that, again, has a superset of the functionality, not just the specific things that we need to do.

Command Line Tools vs UI

I want to talk briefly about command line tools versus UIs. Obviously, as engineers, we like to reach for command line tooling; we like to automate things. Command line tools are really only great, actually, when you know exactly what you want. If I want to find a bunch of files, then find the ones that are a certain size, and then pipe those to grep to look for ones that contain a certain word, that's the command line; that's where we're experts. When I don't know exactly what I want, a UI is actually better. We can give more context, we can give more signaling. For a rarely performed task, you can also get a nice display with colors. Of course, you can do some of these things in a CLI, but then you're generally just trying to emulate a UI.

With a CLI, let's say I want to increase the size of my cluster so that the memory usage per host hits 40%, whereas it's currently at 75%. Trying to incorporate all of that into a CLI is going to be difficult. In a UI, we can easily have a graph, and we can have a form, and we can validate the form. If I want to change the value, I don't have to go back in my history and edit; none of that happens. UIs do win out in some cases. Another benefit we found was that our command line tools would often go out of date. People would pull them, then we would ship an update, and then they would have to pull the latest version after their initial install. Of course, you could build auto-updating. When you're dealing with a web UI, it's just not a problem: when you navigate to the address, you get the latest software, no problem.
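The arithmetic behind that resize example is simple enough that a form can do it for the operator. A minimal sketch, assuming the current fleet size is known (the 20-host figure below is made up for illustration):

```go
package main

import (
	"fmt"
	"math"
)

// targetSize returns how many hosts are needed so that per-host memory
// utilization drops from currentUtil to targetUtil, assuming the total
// memory demand stays constant and is spread evenly across hosts.
func targetSize(currentHosts int, currentUtil, targetUtil float64) int {
	return int(math.Ceil(float64(currentHosts) * currentUtil / targetUtil))
}

func main() {
	// The example from the talk: memory is at 75% per host, and we want 40%.
	fmt.Println(targetSize(20, 0.75, 0.40)) // prints 38 (20 * 0.75 / 0.40 = 37.5, rounded up)
}
```

The UI can then pre-fill the suggested size and just ask for confirmation, instead of making the operator do the division in their head mid-incident.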

Tooling is a Product

Tooling, I like to think of as a product itself. When you think about Lyft, I don't type into a text box, "I want to go home, and I want to do it for less than $10." I get a UI that displays this information. It's very action-oriented, it is very clear. Hopefully, I don't have to look at Google Maps and figure out what the traffic is. I can go to one place, I can see the information I need, and I can request that ride. We want to think about infrastructure tooling in the same way.

Techniques and Benefits of Custom Tools

I want to talk now about the techniques and benefits of custom tools. This is going to be in the context of our open source project, Clutch, which we open sourced in July. When you see screenshots, they are from Clutch.

One of the things we tried to do in Clutch, again, was reduce the number of taps and be very action-oriented, and not present a superset of information. Compare this to logging into the cloud console, trying to filter through and find the instance, and then being shown all the capabilities of what you can do with the instance. Here, I say I want to terminate an instance. I click that button. Now I'm presented with a lookup, where I can provide the instance ID. Unlike the Amazon console, we search across all regions. We also allow you to input partial IDs. Normally, Amazon IDs have an i- prefix in front of them. If you put one in on the command line without the i- prefix, it just fails. Here, in the UI, we can be a little bit more liberal with the input that we accept.
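Being liberal with input is only a few lines of code. A minimal sketch of that kind of normalization (not Clutch's actual implementation) might look like this:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// EC2 instance IDs are "i-" followed by 8 or 17 hex characters. Accept the
// bare hex portion too, so an operator can paste a partial ID.
var bareInstanceID = regexp.MustCompile(`^[0-9a-f]{8,17}$`)

func normalizeInstanceID(input string) (string, error) {
	id := strings.ToLower(strings.TrimSpace(input))
	if strings.HasPrefix(id, "i-") {
		return id, nil
	}
	if bareInstanceID.MatchString(id) {
		return "i-" + id, nil
	}
	return "", fmt.Errorf("%q does not look like an EC2 instance ID", input)
}

func main() {
	fmt.Println(normalizeInstanceID(" 0abc1234def567890 ")) // "i-0abc1234def567890"
	fmt.Println(normalizeInstanceID("i-0abc1234def567890")) // unchanged
}
```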

Finally, we just present the relevant confirmation information that you would need. We can present tags, we can present the IP address, maybe something that you're cross-referencing somewhere else. We're not presenting unnecessary information here. The destructive button is always colored red in our tool, which signals to you, "It's a red button, I may want to be a little bit more careful when I click it," or, "I know that when I click this red button, it's going to take a destructive action." Finally, we show a confirmation page. We can show any relevant information for our needs, like, "This instance might take several minutes to shut down." Not stuff about how, "If you're running an auto-scaling group, and you meant to do this, you should probably not..." That's not what we're getting at here. We're just displaying clear, concise, relevant information at every step.

In this case, it's three taps to perform this action versus seven taps. Seven doesn't seem like a large number, but these things are actually important to people. Just rolling out kind of this basic tool for doing instance termination, again, a very common task for people at Lyft, we got great feedback.

People were like, "This just improves my quality of life. I'm less stressed during incident management. This is a useful tool. It loads much faster." Or in some cases where the console just won't even load at all, it loads. We didn't even force people to use a tool. We had a lot of just organic usage. People were just very happy with this alternative. Kind of like the PalmPilot, "I'm going to use this instead of a paper organizer." People really like to use it.

Safeguards

There are other techniques that we can adopt too. Safeguards are one. We can pop a dialog, for example. If you've ever used GitHub and tried to delete a repository, I'm sure you've seen this screen. If we know that the action is going to be risky, we can confirm with you that you want to do this, and make you re-type it out rather than just blindly click that button. In this example, we can even check other metrics and see, "There is no traffic going to this cluster, so it is probably safe." We could block the action altogether if there was traffic going to the cluster. Then let's say that we're resizing online infrastructure, and we don't want people to enter zero. In the Amazon console, there's no way you can write an IAM policy or anything else that would prevent people from entering the number zero in a field. If we're running a custom tool, we can write a couple of lines of code. If you enter a bad value, we can do some basic input validation. We can say, "In our case, at our company, a value of zero is not acceptable. It would cause problems."
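A minimal sketch of that kind of check, with the minimum of 1 chosen purely for illustration:

```go
package main

import "fmt"

// The minimum of 1 is an assumption for illustration; a real tool would make
// the floor configurable per service.
const minDesiredCapacity = 1

// validateResize rejects a requested capacity that would take the service
// offline, before the change is ever sent to the cloud provider.
func validateResize(desired int) error {
	if desired < minDesiredCapacity {
		return fmt.Errorf("desired capacity %d is below the minimum of %d; scaling to zero would take the service offline", desired, minDesiredCapacity)
	}
	return nil
}

func main() {
	if err := validateResize(0); err != nil {
		fmt.Println("rejected:", err)
	}
}
```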

Canonical Resource Names

Another thing we do in the tool, which I touched on earlier with not making you enter the i- prefix every time you enter an instance ID: we allow you to enter canonical resource names. At Lyft, in logs and in many different systems, we have the hostname. We take that hostname and we can actually decipher what the underlying resource is, without having you go to multiple systems, maybe SSH into the host and look for the instance ID, or search in some other system to figure out what the instance ID is. You can put in the IP address, you can put in the resource, the hostname, you can put in all this different information. We can take that and figure out exactly what you were looking for, without having to put you through different systems or other layers of indirection.
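A sketch of what that kind of resolution can look like, with rules that are illustrative rather than Clutch's actual ones:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// ResourceKind is what a free-form identifier resolves to.
type ResourceKind string

const (
	KindInstanceID ResourceKind = "ec2-instance-id"
	KindIPAddress  ResourceKind = "ip-address"
	KindHostname   ResourceKind = "hostname"
)

// resolve guesses what kind of identifier the operator pasted in, so the tool
// can look up the underlying instance itself instead of sending the operator
// to other systems first.
func resolve(input string) (ResourceKind, string) {
	id := strings.TrimSpace(input)
	switch {
	case strings.HasPrefix(id, "i-"):
		return KindInstanceID, id
	case net.ParseIP(id) != nil:
		return KindIPAddress, id
	default:
		return KindHostname, id
	}
}

func main() {
	fmt.Println(resolve("10.0.12.34"))
	fmt.Println(resolve("i-0abc1234def567890"))
	fmt.Println(resolve("my-service-staging-7f4c9"))
}
```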

Finally, with custom tools, you want to think about contributions. People see this, now it takes three taps, and it's much simpler, it's much more straightforward. They have ideas for, "What functionality can I build for my team?" I operate the streaming platform at Lyft. I want to let people [inaudible 00:22:05] their streams using something other than the command line. Again, that's one of these tasks that we don't perform often. It's a nice fit here.

Clutch Architecture

Our architecture in Clutch really supports that. It's pluggable, basically, at every layer. In the front end and in the back end, we've got abstractions that separate things and make them substitutable. That allows you to reuse a lot of the code, to write new functionality, and to write additional input validation. We allow role-based access control via middleware. Having this very pluggable architecture in the tool is pretty important.
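To illustrate the idea of pluggability (this is not Clutch's actual API, just a hypothetical registry in the same spirit): components register themselves by name, and the rest of the tool resolves them through an interface, so a company can swap in its own implementation without touching anything else.

```go
package main

import "fmt"

// Service is a hypothetical plugin interface: any backend the tool talks to
// registers itself by name.
type Service interface {
	Name() string
}

var registry = map[string]Service{}

// Register adds a backend implementation to the registry.
func Register(s Service) { registry[s.Name()] = s }

// Lookup resolves a backend by name so endpoints never depend on a concrete type.
func Lookup(name string) (Service, bool) {
	s, ok := registry[name]
	return s, ok
}

// ec2Service is a made-up example plugin.
type ec2Service struct{}

func (ec2Service) Name() string { return "aws.ec2" }

func main() {
	Register(ec2Service{})
	if s, ok := Lookup("aws.ec2"); ok {
		fmt.Println("resolved service:", s.Name())
	}
}
```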

If you want to know more about Clutch, you can go to clutch.sh. We've got a Slack channel. We are obviously on GitHub, so you can go and check out the code. We have office hours. We've got a lot of documentation, so if you want to learn more about Clutch and that specific custom tool, you can do so.

Pitfalls of Custom Tooling

What are some of the pitfalls of custom tooling? It's not all easy when you decide to introduce something else. First of all, you're going to have new types of expertise that are required, particularly if you're talking about infrastructure tooling and infrastructure engineers. We're accustomed to working on systems, often on the back end, but now we're talking about building front ends, about UX design, about making a tool that's better than the alternative, better than whatever the vendor tool is. It's not necessarily easy. You have to think about that when you're building teams and shipping functionality. There's also going to be maintenance required. As the infrastructure changes, whether the cloud provider ships an update or your internal teams are making changes, you're going to have to keep things up to date. It's not just something that you write once and let sit there. It will stop working if you do that, and then people won't trust it, they won't think it's reliable, and they'll stop using it. They'll prefer something else in its place.

Frameworks are Necessary

We talked about Clutch, the pluggable framework, but in general frameworks are necessary because we need to provide a consistent experience. Clicking next becomes rote, so if you flip the order of buttons on different forms, someone may accidentally click the wrong button. We want to use color where possible, and we want to keep a consistent order. You need a framework to do that, to manage the cognitive load of the system. Not only have we reduced the amount of information, not only have we made it fewer taps, but we need an overall consistency for the tool so that, again, people aren't overwhelmed when they come to it. They're confident that, "When I click here, that's what's actually going to happen."

What is Clutch?

Clutch is a framework. Really, it's two frameworks in one, and you can use them independently or together. You've got a UI framework, which is what we call the wizard, and then you've got an infrastructure control plane. The wizard allows you to build those multi-step flows. The infrastructure control plane allows you to unify all these different tools that you have behind a single API, so that you can easily maintain that API and access them from the front end, or maybe even another tool. We've talked about adding a Slack bot, basically, that would also interact with the same APIs.

Finally, you're going to be dealing with an explosion of scope as soon as you ship one of these tools, if you're not used to having custom tools internally. We're shipping tools for the operation phase. As soon as we do that, people ask, "How can I launch services through this tool? What are some other ways I can operate or view information about my service?" Maybe we'll even introduce the decommissioning flow for a service. Finally, you have to consider your customers. Who's going to be using the tool? At every company, you're going to have a different mix of people who have different familiarity with the underlying infrastructure, templates that [inaudible 00:26:13] dedicated infrastructure engineers who talk to the infrastructure.

At Lyft, for example, you build it, you run it, so everyone is using the infrastructure. You're going to have to tailor the tool for your audience. Not only for your audience, but again, different information is relevant to different people. Just having the same custom tool rolled out in a different place won't necessarily work. The safeguards we talked about may also be different at different companies. This is the last thing that you need to consider when you take on this undertaking.

 


 

Recorded at:

Jul 27, 2021
