The role of a system administrator is changing from being the one who performs the work to being the one who keeps the robots on the assembly line running smoothly. Teams that build and run distributed systems embrace component failure and must work closely together to ensure long-term success. These are just two of the opinions found in the new book, The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems by Thomas Limoncelli, Strata Chalup, and Christina Hogan. While the authors claim that system administrators are their intended audience, the comprehensive, practical content is relevant to anyone on a technology-focused team.
The first part of the book, consisting of six chapters, addresses distributed system design considerations. Chapter One lays the groundwork by describing the core considerations of distributed system design: composing smaller systems into larger ones, interacting with distributed state, understanding the CAP theorem, thinking in terms of loose coupling, and paying close attention to design choices that impact response time. The second chapter starts to demonstrate the authors’ impressive depth of experience managing distributed systems. Here, the authors look at how to take operations into account when designing distributed systems. Instead of treating operations as an afterthought, developers should design systems with manageable configurations, straightforward startup/shutdown routines, clear upgrade procedures, support for scale-out of each service, rigorous instrumentation, and more.
How do you choose the right infrastructure hosting platform? Chapter Three explains the decision-making criteria for choosing between cloud delivery models. The authors look at when to run a system on virtual machines, physical machines, and containers. Application architecture is the focus of Chapter Four. This chapter reviews multi-tier architecture strategies and dives deep into load balancing techniques and algorithms. The second half of the chapter digs into message buses and the role they play in a distributed system. Scalability is one of the most important aspects of a distributed system, and Chapter Five is packed with excellent advice for design teams. The authors start by outlining strategies for identifying bottlenecks and scaling solutions up or out. Then, they move on to an invaluable assessment of caching, data sharding, queuing, and content delivery networks. Chapter Six concludes the “design” section of the book with a vigorous discussion of resilience. The authors believe that software resilience should always trump hardware resilience because of cost and flexibility. They then explain how spare capacity, failure domains, load balancers, and automation can help systems survive a wide range of attacks and human errors.
Part Two of this book focuses on operating distributed systems. Chapter Seven introduces the Site Reliability Engineer (SRE) and how that role’s attitude and scope of responsibility differ from those of a centralized enterprise IT team. The rest of the chapter tees up the broad topics of distributed computing operations, including the approach to automation, the service life cycle, and how to organize operational teams. DevOps is the topic of Chapter Eight. The authors speak from a position of experience and explain the “three ways of DevOps” for improving operations, the core values of DevOps, and how to convert to a DevOps culture that adheres to continuous delivery. Chapters Nine and Ten look at the phases of service delivery and how teams should create an efficient pipeline. Chapter Eleven includes a fantastic discussion of how to update running services in production. The authors explain the differences between rolling updates, canary deployments, phased rollouts, proportional shedding, and blue/green deployment. Automation is critical for management at scale, and Chapter Twelve takes a realistic approach to automation and the human factors at play. The chapter explains the goals of automation, what to automate, and how to automate aspects of the system. Chapter Thirteen looks at design documentation and how to create useful artifacts that describe complex systems.
Chapter Fourteen describes the value of oncall resources and how to set up a structure that is both responsive and efficient. Disasters happen, and Chapter Fifteen explains how to plan, implement, and regularly test a disaster preparedness plan. Chapters Sixteen and Seventeen investigate the use of monitoring in a distributed system and provide actionable advice on what system data to monitor and store. How can you ensure that you have the resources you need, when you need them? Chapter Eighteen includes an excellent look at capacity planning and what factors to consider when launching new services. The authors say that “measurements affect behavior,” and Chapter Nineteen explains how to set goals and create useful Key Performance Indicators (KPIs). The final chapter addresses the overall quality of service operations. What does excellence look like, and how do you achieve it? The authors provide tips for self-assessment and acting on the results.
This book is well-written, easy to follow, and bursting with relevant information about distributed computing. Readers will find realistic examples, practical advice, and an impressive range of topics. Whether you are a CIO, architect, developer, or system administrator, you will find something that improves your ability to deliver systems that scale.
InfoQ spoke to the authors to find out a bit more about this book.
InfoQ: You say that the target audience of the book is system administrators and their managers, but you pack in a lot of information that is useful to developers and architects as well. How should they approach this book?
Author Team: Modern developers and architects need to understand operations. This is very different from the days of shrink-wrapped software, when operations became someone else’s problem once the code was burned to CD-ROMs. The book has two parts: Part I explains system architecture and design with a focus on the elements important to operations. Part II explains operational practices for large systems such as service delivery, monitoring, and oncall. Developers and architects will find both parts enlightening because they expose operational realities the reader may not have considered before.
InfoQ: Who fights the DevOps changes the most? Developers or Operations?
Author Team: Management. You can’t “adopt DevOps” since DevOps isn’t just one thing. It is a toolbox of practices that have a common thread: improving business by breaking down the barriers that prevent communication and cooperation between silos. When there is a grassroots movement to adopt DevOps practices, it is management that becomes the blocker. DevOps is scary to any executive whose power comes from hiding information, pitting teams against each other, or being “right” by virtue of their title. Sadly that’s more common than you’d realize.
Everyone says they’re transparent until they are asked to be transparent.
That’s not to say DevOps has to be a grassroots effort within a company. I’ve heard horror stories of management pushing DevOps in ways that were either heavy-handed or embarrassingly naive. One person I spoke with compared it to a parent trying to look cool to their child by listening to One Direction. That’s not a pretty picture.
We prefer to adopt DevOps practices one at a time, solving problems that are causing immediate and obvious pain to members of multiple teams. For example, when Tom worked at Google he would set up meetings between teams that were partnering but not communicating, with an agenda of simply discussing each other’s “pain points.” This developed immediate empathy between the teams and led to collaboration on big and small projects to relieve each other’s pain. Sometimes it was as simple as one team feeding data to the other alphabetized rather than sorted chronologically. Sometimes it was larger, like exposing an API to a service so the other team could automate their process. Sometimes it led to the elimination of entire subsystems that were still being maintained even though nobody used them any more.
Nobody would reject a meeting whose goal is to eliminate areas of common pain. Once people are talking, more opportunities present themselves. Eventually more and more DevOps practices become self-evident.
InfoQ: You mention that understanding OS internals is key to designing and operating distributed systems. What other lower-level concepts should teams not gloss over, but truly understand in a distributed system?
Author Team: We would add a deep knowledge of networking and storage internals. Networking has many facets, from how packets are routed, to how TCP works, to higher level protocols such as HTTP. The other day I was debugging a network problem and it required me to know how Linux routes packets (as opposed to how people think it does), how TCP window sizes work, and how SSL certificates work (and when they don’t).
Operational projects related to performance improvements require one to understand how all the elements work. Meanwhile to design a system that is reliable, one must understand how all these elements can fail. That’s not often something documented in the manual. One must understand the internals of how things work well enough to understand all the implications.
Data storage is similar. Disk storage isn’t just disks any more. Data storage involves RAID systems, controllers, connectors, volume managers, and file systems; and the technologies range from actual disk drives, to SSDs emulating disk drives, to raw SSDs, to SANs, NAS, and DAS. Memory isn’t just RAM any more; it is L1/L2 caches and virtual memory systems.
Recently I was working with a developer who couldn’t understand why data being copied over a 10G Ethernet connection wasn’t transmitting very quickly. The bottleneck wasn’t in the network stack, but in the fact that the data was being read from a slow disk drive. When the source data was moved to an SSD system, the bottleneck became the memory bus. The destination had a bottleneck related to the maximum write speed of its SSD. Developers should be aware of these issues, but partnering with a system administrator who has deep system knowledge was required to engineer a solution that met all the design requirements.
InfoQ: How do you see containers affecting mainstream perception of building and maintaining distributed systems?
Author Team: It is my hope that containers standardize deployment processes similarly to the way that shipping containers standardized logistics. Containers should become the universal format for delivering applications whether you are deploying the service on a developer’s desktop, in a small cluster, or by the thousands in large distributed systems.
All new technologies go through phases. In the first phase the advocates proclaim it is useful for all situations and will fix all problems. In the second phase reality sets in and, often through painful experience, the industry develops an understanding of which use cases the technology meaningfully solves and which ones it does not benefit at all.
Containers are in the first phase. Over time people will understand that containers are good for specific situations. I find them useful for fixing the problem where there is one procedure for installing software in a development or testing environment and a different procedure for setting it up in production. When there are two different procedures, one is always ahead of the other. A person with one watch knows what time it is; a person with two watches is never sure. Containers make it possible to unify these duplicate deployment processes, avoiding duplication of effort, stopping you from re-inventing the wheel, and limiting version skew.
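To make that concrete, here is a minimal sketch of what a single, shared deployment procedure can look like (the image name, port, and environment variable are hypothetical illustrations, not anything prescribed by the authors): the same container image is built once and run unchanged on a laptop or in production, with only externally supplied configuration differing.

```python
import subprocess

IMAGE = "example-app:1.0"  # hypothetical image name and tag

def build_image() -> None:
    # One build step, shared by every environment.
    subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)

def run_image(environment: str) -> None:
    # The only difference between dev and production is configuration
    # passed in from outside; the image itself never changes.
    subprocess.run(
        ["docker", "run", "--rm",
         "-e", f"APP_ENV={environment}",
         "-p", "8080:8080",
         IMAGE],
        check=True,
    )

if __name__ == "__main__":
    build_image()
    run_image("development")  # the same command runs with "production" in prod
```

The point is not the specific tooling; it is that development and production exercise the same artifact and the same procedure, so neither can drift ahead of the other.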
The industry is also reaching consensus on applications where containers are not useful. For example, there is growing agreement that containers are not good for securely isolating a process from others (at least as the technology stands now). Over time other use cases will be proven effective or disproven.
InfoQ: You discuss a host of capabilities that applications use for scaling and resiliency, including caching, data sharding, and queuing. If someone has a monolithic system they are starting to tease apart, what are some of the first (and last!) technologies they should be considering?
Author Team: Turning a monolithic system into individual components (“service-oriented architecture,” or SOA) can be a very long journey. It is best to start with one simple “quick win.” This lets you burn in the SOA infrastructure you are building. There will be many growing pains and new operational duties, and it is best to learn about them by starting small.
Often there is a subservice that is the obvious candidate to tease out first. Maybe it is a discrete feature that is well-bounded and isolated.
Another strategy is to add a caching layer that is its own service. For example, Memcached and Redis are high-performance caching solutions that run as their own services. The monolithic system can be modified to direct queries through those caches, which helps the system scale.
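As a rough sketch of what “directing queries through the cache” can look like (the key format, TTL, and the db_lookup callable are illustrative assumptions, not code from the book), a cache-aside lookup against Redis might be structured like this:

```python
import json
import redis  # redis-py client

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300  # assumed expiry; tune for your workload

def fetch_user(user_id, db_lookup):
    """Cache-aside: try the cache first, fall back to the database."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)        # cache hit: no database work
    record = db_lookup(user_id)          # cache miss: ask the database
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(record))
    return record
```

On a hit the monolith never touches the database; on a miss the result is stored with a TTL so the cache stays bounded and eventually catches up with the source of truth.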
Tom works at Stack Exchange, Inc. (home of Stack Overflow), which used to run a fairly monolithic system. Redis was added as a caching layer and HAProxy was added for load balancing. By separating these functions out, scaling and resiliency could be improved independently of the main system.
Often monolithic systems can only be scaled in bulk: to serve more users, the entire system must be replicated. As subsystems are broken out into individual services, each can be scaled individually. This is often more efficient because some subsystems are used less than others and do not need to be scaled at all. For example, if a system has an integrated chat feature that isn’t used very much, that subsystem doesn’t need to be scaled out. On the other hand, if it suddenly becomes more popular, it can be scaled out without requiring the massive resources of scaling the rest of the system.
InfoQ: You claim that the "most productive use of time for operational staff is time spent automating and optimizing processes." Do you think that's a controversial opinion, or a common sense one?
Author Team: It’s common sense but many organizations get stuck doing the opposite because they are in “fire fighting” mode.
Doing something manually scales linearly: the more often you need to do it, the more people you must hire. Instead, if you allocate your people to improving the process (via automation, restructuring, or otherwise), that effort benefits the company every time the process happens. The payoff is superlinear.
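A toy calculation with made-up numbers illustrates the point: a task that takes two hours by hand versus a one-time 40-hour automation effort that cuts the per-run cost to five minutes.

```python
# Back-of-the-envelope comparison with assumed (not measured) numbers.
MANUAL_HOURS_PER_RUN = 2.0
AUTOMATION_BUILD_HOURS = 40.0
AUTOMATED_HOURS_PER_RUN = 5 / 60  # five minutes per run once automated

def total_hours(runs: int, automated: bool) -> float:
    if automated:
        return AUTOMATION_BUILD_HOURS + runs * AUTOMATED_HOURS_PER_RUN
    return runs * MANUAL_HOURS_PER_RUN

for runs in (10, 50, 500):
    print(runs, total_hours(runs, False), round(total_hours(runs, True), 1))
# At 10 runs the manual approach still wins (20 vs. 40.8 hours);
# by 50 runs automation is ahead (100 vs. 44.2), and the gap keeps widening.
```

The manual cost grows without bound with the number of runs, while the automated cost is dominated by the one-time investment.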
Companies like Google don’t automate tasks because it’s fun or a “nice to have.” They literally could not exist at their scale without being obsessed with automating tasks. They don’t just automate; they create self-service systems so that non-sysadmins are empowered to act in an entirely self-sufficient way. An engineer can launch a new web property without needing to bother the front-end team, networking team, or other parts of the company. Those teams are hard at work maintaining the self-service systems rather than doing the tasks themselves.
On the other hand, you have companies that haven’t gotten the memo. Recently a friend of ours left a company because he was forbidden from automating the process they used to install and configure the OS on their Linux servers. He was informed that management had identified the problem (needing to configure a lot of machines) and had hired him specifically to click click click click click all day to configure them. If he was on vacation, no new machines would be installed, and that was OK. Automation would have made the process self-service, and he could have been reallocated to other projects (all of which seemed to be starved for new hires).
The fact that such companies are able to survive is proof that there is no justice in this universe.
InfoQ: You say that "continuous deployment should be the goal of most companies." How do you explain that to someone who has a "traditional" system that historically only updates once or twice a year? Why would they care about continuous deployment?
Author Team: If you do updates once a year, you have 12 months of changes being deployed. If there is a bug, you have difficulty finding which change caused the bug. Debugging takes longer because the bug is in code your developers haven’t looked at in months. Meanwhile deploying the update is likely to fail as there are 12 months of environmental changes waiting to foil your project, like alligators hiding in a swamp.
From a financial perspective, you have invested capital (developer time) and won’t begin to see the payback for as much as 12 months. Would you invest in a stock that forbade you from accessing the profit for a year after you sold it? That would be silly.
If you do continuous deployment and, as a result, do weekly or daily upgrades, you benefit from the “small batches” effect. Each release has fewer changes, so if there is a bug you can isolate the bad code more quickly. Debugging is more efficient because developers are more familiar with the code. If changes happened in the production environment, the ramifications are felt right away rather than surfacing as a surprise months later. More frequent releases make it more economical to fully automate the testing process, which results in more and better testing.
However, continuous deployment isn’t about speed. More critically, it is about the confidence to make change. The additional testing and smaller batch sizes improve confidence that the next release will be a success, making it easier to make changes. When change is easier, it becomes easier to innovate.
If your organization has become calcified and unable to innovate, ask yourself, “How hard is it to produce and deploy a new software release?” If the end-to-end process is full of pitfalls because of a lack of testing, or because of paranoia about the potential for a failed upgrade, maybe the solution is to release more often, not less often.
There are case studies of SAP deployments using DevOps principles to go from scary yearly upgrades to reasonable quarterly upgrades. If people can do it with SAP, nobody has an excuse.
InfoQ: You make a big point of highlighting the human component of automation. Can you explain what you mean by that?
Author Team: Commonly, people automate using something called “the left-over principle”: we automate everything that can reasonably be automated, and what’s left over (situations that are too rare or too complicated to automate) is handled by people. This view makes the unrealistic assumption that people are infinitely versatile and adaptable and have no capability limitations. Eventually so much is automated that operations is blind to what happens behind the scenes and cannot effectively debug the system.
Another model is the compensatory principle, which proposes a set of attributes to use when deciding what to automate. Using the compensatory principle, we would determine that a machine is better suited than a person to collecting monitoring data at 5-minute intervals from thousands of machines. Therefore we would automate monitoring. We also might determine that a person cannot survive walking into a highly toxic nuclear accident site, but that properly designed robots can.
However, the newest thinking centers on a different model entirely. The complementarity principle looks at automation from the human perspective. It aims to help people perform efficiently in the long term, rather than just looking at short-term effects. It considers how people’s behavior will change as a result of the automation, as well as how it would evolve without the automation.
In this approach, one would consider what people learn over time by doing the task manually, and how that would change with the automation. One might build automation that allows for both manual and automated versions of the same task. This allows functions to be redistributed as needed. People are viewed as taking an active part in the system and are adaptive, resourceful learning partners who are essential to the functioning of the system as a whole.
Unfortunately, there is no silver bullet or easy-to-follow formula for automation using the complementarity principle. But if you remember to consider the human factor and the long-term effects of the automation on the people running the system, you have a better chance of designing good automation.
Here’s an example of the left-over principle resulting in a system that is highly automated at the expense of requiring an extraordinarily high level of skill to fix it when there is a problem. IBM’s version of Unix, called AIX, models the configuration of the system in a database. As a result, everything can be automated easily, as most operations are database operations followed by an update to the actual system to reflect the desired change. The problem, however, is that if the database gets out of sync with reality, everything breaks down. You can’t edit any of the usual Unix files you’d find in /etc, because those changes aren’t reflected in the database and everything goes to hell. While this makes automation easier, it makes the system more complex, and when something goes wrong a super expert is required to fix it. This is only a good model if you think nothing will ever go wrong.
Contrast this with a system like Puppet, which manages systems by editing the configuration files in place. Operations read from the actual Unix configuration files and update them directly. Yes, it must be able to parse the myriad Unix configuration file formats, but because it does so, it works with the operator. It is closer to the ideal put forth in the complementarity principle.
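Here is a minimal Python sketch of that idea (an illustration of the principle, not how Puppet is implemented): a tool that idempotently edits the real configuration file in place leaves any lines a human operator added by hand untouched, so the automation and the operator can work on the same file.

```python
from pathlib import Path

def ensure_line(path: str, line: str) -> None:
    """Idempotently ensure `line` is present in a config file, editing it in place.

    All other lines, including ones an operator added by hand, are left
    untouched, so the tool and the human can share the same file.
    """
    config = Path(path)
    lines = config.read_text().splitlines() if config.exists() else []
    if line not in lines:
        lines.append(line)
        config.write_text("\n".join(lines) + "\n")

# Example (hypothetical usage): ensure_line("/etc/ssh/sshd_config", "PermitRootLogin no")
```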
Another example of the complementarity principle is a management framework used in the Ganeti open source project. While at Google, Tom co-developed a tool called Euripides for managing large farms of Ganeti servers. The tool was more like a bionic arm that made the operators more powerful, and it could actually complete most tasks independently. However, it knew to stop in certain situations and ask for human help. The tool eliminated 80% of the manual work required of the oncall staff. Rather than being paged 3-4 times a day, people were paged once or twice a week, often less. Yet, because it worked in partnership with the technical staff, their skill levels did not atrophy. Instead, it enabled a small group of people to get as much work done as a small army.
For more information about this book, please visit the-cloud-book.com
About the Book Authors
Thomas A. Limoncelli is an internationally recognized author, speaker, and system administrator with 20+ years of experience at companies like Google, Bell Labs and StackExchange.com.
Strata R. Chalup has 25+ years of experience in Silicon Valley focusing on IT strategy, best practices, and scalable infrastructures at firms including Apple, Sun, Cisco, McAfee, and Palm.
Christina J. Hogan has 20+ years of experience in system administration and network engineering, from Silicon Valley to Italy and Switzerland. She has a master’s in CS and a PhD in aeronautical engineering, and has been part of a Formula 1 racing team.