We've been hearing about agile operations quite a bit lately. There have been some good talks, articles and a few lively debates. It has even been called the "secret sauce for startups". What about those of us who aren't in a startup or a Web 2.0 company? Is agile operations something that can really work inside a large, established enterprise?
I think the answer is "yes, but it won't be easy." In this article, I'll look at some of the impediments to adopting agile operations and some ways you might be able to work the system.
Summarizing Agile Operations
Agile operations emerged from two distinctly different camps of people. First, agile and lean software developers realized that their nice, tight iterations produced releases faster than they could be deployed. In the quest for throughput, these teams observed that the value stream isn't completed when they check code into source control, but when it gets deployed to the Web and starts generating revenue. That is, when it is "done, done, done" and not just "done, done". Short iterations followed by long delays for deployment just don't make sense. Worse, manual deployment and configuration invites human error. True agile teams want to automate every repeatable task, so that includes deployment.
The second thread leading to agile operations comes from fast-growing Web 2.0 companies. These outfits sometimes provision a thousand servers every fortnight. If they tried to do this by hand, half the population of the US would be Facebook users and the other half would be Facebook administrators. Obviously, they need some way to enlist their infrastructure in its own management. These companies achieve scalable operations by shifting from imperative (manual) actions to declarations about the desired end state of a server.
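To make the imperative-versus-declarative distinction concrete, here is a minimal sketch in Python. It does not reflect any particular tool's API: the package-to-version dictionaries, the version numbers, and the `plan_changes` and `apply_changes` functions are all hypothetical names made up for illustration. The point is only that the operator states the end state and the tooling works out the steps.

```python
def plan_changes(desired, actual):
    """Return the actions needed to move a server from `actual` to `desired`.

    Both arguments map a package name to a version string (a hypothetical
    data model, used here for illustration only).
    """
    actions = []
    for name, version in sorted(desired.items()):
        if name not in actual:
            actions.append(("install", name, version))
        elif actual[name] != version:
            actions.append(("upgrade", name, version))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("remove", name, actual[name]))
    return actions


def apply_changes(actual, actions):
    """Apply planned actions to an in-memory copy of the server state."""
    state = dict(actual)
    for verb, name, version in actions:
        if verb == "remove":
            del state[name]
        else:
            state[name] = version
    return state


# Declared end state versus what is currently on the (imaginary) server.
desired = {"nginx": "1.24", "openssl": "3.0"}
actual = {"nginx": "1.18", "telnetd": "0.17"}

actions = plan_changes(desired, actual)
converged = apply_changes(actual, actions)
```

Running the plan a second time against `converged` yields no actions, which is the property that lets the same declaration be applied to a thousand servers without anyone tracking what was done by hand on each one.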
While much of the talk around agile operations centers on tools, the tools come last. Starting from the tools would be like learning agile software development by learning JUnit and Hudson. You might be able to imitate the practices, but when the context changes you won't be able to respond effectively. Instead, tools should support agile principles. In terms of operations, agile principles include:
- Communication
- Short feedback cycles
- Simplicity
- Courage
- Transparency
- Sustainability
- Reflection
- Continuous attention to technical excellence
These principles manifest themselves in several ways:
- Fully automated system builds (not just kickstarting a server, but rebuilding everything)
- Configuration management via a version control system
- Widespread access to monitoring and metrics data
- Willingness to rip-and-replace mechanisms
- Preference for automated documentation extracted from systems
- Fluid attitude toward hardware. Adding or replacing a server is no big deal.
Like agile software development, agile operations emphatically does not equate to cowboy administrators running amok on the systems, without plan or documentation. Quite the opposite: agile operations requires great self-discipline. Operators must commit to putting everything into version control. They must accept nothing less than 100% automation. Manual actions must never be permitted.
Hindrances and Conflicts
Agile operations sounds pretty sweet. Just check in your source code, make sure it builds on the CI server, then update a recipe and watch your changes roll out to the world. Anyone who can't see the obvious benefits must be hopelessly behind the times, or just protecting their jobs, right?
Well, it's always easy to look at a group from outside and scoff at their work. Like Scott Adams says, "Anything you don't understand must be easy to do." The truth is generally more complex. People don't sustain practices for silly or arbitrary reasons, but because they resolve some tension that may not be evident from outside the group.
The operations group always sits at the focus for stakeholders' diverse - sometimes contradictory - needs. Attempting to introduce change without being aware of these forces will frustrate you and probably result in failure. Here are some of the real concerns that Operations deals with, agile or not.
Audit and Compliance
Operations plays an important role in the struggle for regulatory compliance. Any company that is publicly traded must be able to prove that their financial results are accurate. In the U.S., since the Sarbanes-Oxley Act was passed in 2002, company officers can be held criminally liable if the company's results are found to be tampered with. The key provision is section 404, which addresses the company's Internal Control over Financial Reporting (ICFR). That's right, if the auditor finds that you've got a lousy sysadmin and poor ICFR, the CFO could go to jail for up to 20 years.
SOX is notoriously vague. The Securities and Exchange Commission and the Public Company Accounting Oversight Board have issued multiple packets of guidance. Even so, much is left to interpretation by the auditors. Cases can be made, argued, accepted, and overturned every year. The resulting uncertainty, combined with the problem of proving that tampering did not occur, makes public companies very edgy about their controls.
I could tell a similar story about the Payment Card Industry Data Security Standard. The details differ, but the overall flavor is the same. Well, except that VISA can't send anyone to jail. Yet.
The key issue here is the definition of "controls". In auditing lingo, controls are not technological measures, but are the processes by which technological measures are assessed and reviewed on a periodic basis. One of the main controls that your auditor will look for is separation of roles. That is, people who can write code should not be able to promote their own code. Furthermore, once code has been promoted, it should be impossible for admins to change it. Again, "impossible" doesn't mean that the files literally cannot be edited, but it does mean that they cannot be edited without anyone knowing.
In practice, this means that agile operations - and particularly devops - will raise all kinds of red flags with your auditors. Let me assure you, if development gets into an argument with audit, audit will win. Your best bet, if you are in a public or pre-IPO company, is to meet with audit before you start introducing agile operations.
Agile operations can supply compensating controls. For example, if all changes to production are fully automated, then the version control system keeps a record of every individual's modifications. It's easy to pull a report from version control so audit can verify that changes were authorized. If you keep SHA hashes of all deployed packages, then it's possible to verify that nothing has changed outside of the automated deployment process. These mechanisms support sound ICFR.
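As a rough illustration of the hash check, the following Python sketch records a digest for each deployed file and later reports anything that no longer matches. The choice of SHA-256, the function names, and the file names are my assumptions for the example. A real setup would store the manifest somewhere the administrators of the server cannot quietly rewrite it.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Record the digest of each deployed file at deployment time."""
    return {path: sha256_of(path) for path in paths}


def find_drift(manifest):
    """Return the paths whose current contents no longer match the manifest."""
    drifted = []
    for path, expected in sorted(manifest.items()):
        if not os.path.exists(path) or sha256_of(path) != expected:
            drifted.append(path)
    return drifted


# Demonstration against a temporary directory standing in for a server.
workdir = tempfile.mkdtemp()
app = os.path.join(workdir, "app.war")
conf = os.path.join(workdir, "app.conf")
for path, body in ((app, b"release 42"), (conf, b"pool_size=10\n")):
    with open(path, "wb") as handle:
        handle.write(body)

manifest = build_manifest([app, conf])
clean = find_drift(manifest)

# Simulate an out-of-band manual edit on the server.
with open(conf, "ab") as handle:
    handle.write(b"pool_size=50\n")
drift = find_drift(manifest)
```

Here `clean` is empty immediately after deployment, while `drift` names the configuration file that was edited by hand. A scheduled run of a check like this is the sort of evidence an auditor can review.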
Still, you'll get farther if you have the conversation before switching to agile operations.
ITIL
The IT Infrastructure Library is a configurable framework for IT Operations processes. That's a long string of words that doesn't communicate a whole lot, but it's something you should be aware of.
ITIL provides IT organizations with a blueprint for high-quality operations. It's a template, derived from the U.K. Government's mainframe operations in the 1980s. Yes, really. Now in version 3, ITIL is being adopted by many companies as a way to standardize processes around a common set of challenges: "What do we have?", "Is it working?", "Why does it break?", "Who is responsible?" and so on.
ITIL is a hard thing to love. My personal reaction on first being trained was incredulity. There's a very elaborate Change Management process, but it doesn't involve actually changing systems. That happens under the Release Management process.
Once you dig a bit deeper, though, the pieces of ITIL start to come together in a sensible way. Most companies don't implement all of ITIL, but they do generally implement the Big 3: Incident Management, Change Management, and Release Management.
As we saw with the audit and compliance concerns, agile operations actually provides a stronger set of supporting practices. For example, one of the key issues in Change Management is to identify every configuration item (CI) that will be affected by a change. With all system configurations under configuration management, it's easy to satisfy that requirement. In fact, the answer will probably be more accurate than ever before.
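One way this could work, sketched in Python: if the repository records which configuration items each managed file is deployed to, the list of affected CIs falls out of the list of changed files. The mapping, host names, and file paths below are invented for the example, and `affected_config_items` is a hypothetical helper rather than part of any configuration management product.

```python
# Hypothetical mapping kept in version control next to the configurations:
# each managed file lists the configuration items (CIs) it is deployed to.
FILE_TO_CIS = {
    "web/httpd.conf": ["web01", "web02"],
    "db/my.cnf": ["db01"],
    "common/ntp.conf": ["web01", "web02", "db01"],
}


def affected_config_items(changed_files, file_to_cis):
    """Return every CI touched by a change, given the files it modifies."""
    affected = set()
    unmapped = []
    for path in changed_files:
        if path in file_to_cis:
            affected.update(file_to_cis[path])
        else:
            # A file with no known CI is worth flagging, not ignoring.
            unmapped.append(path)
    return sorted(affected), unmapped


# In practice the changed files would come from the version control diff
# for the change being reviewed; here they are listed by hand.
cis, unmapped = affected_config_items(
    ["web/httpd.conf", "common/ntp.conf", "scripts/new.sh"], FILE_TO_CIS
)
```

Returning the unmapped files separately is a deliberate choice: a change that touches something the mapping doesn't know about is exactly what a change reviewer would want to hear about.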
The conflict with ITIL will not be over fundamentals, however, but tools. Groups that implement ITIL usually buy a suite of tools from a software vendor. Implementing ITIL often becomes synonymous with implementing the toolset. So you may find a high level of resistance if you propose to fulfill the ITIL processes through your own toolset instead of the corporate standard.
My advice on this is, "Don't fight city hall." ITIL toolsets are expensive and take a long time to implement. That means some executive has visibly committed to them. Circumventing the toolset is equivalent to challenging that person. Maybe that's a fight you want to pick, but let's consider another approach first: APIs. All of these toolsets have APIs to submit change tickets, update CIs in the configuration management database (CMDB), open tickets, and so on. Just call the APIs to automate actions against the tools the same way you're automating actions against the servers. You'll still need to watch out for lead times on change review board meetings, but at least you can automate most of the pain away.
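The shape of such a call might look like the Python sketch below. Everything vendor-specific here is an assumption: the URL, the JSON field names, and the commit identifier are placeholders, since each toolset defines its own API. The sketch only builds the request and stops short of sending it, so it runs without a server.

```python
import json
import urllib.request

# Hypothetical endpoint; substitute whatever your toolset actually exposes.
CHANGE_API_URL = "https://itsm.example.com/api/changes"


def build_change_request(summary, config_items, commit_id):
    """Build (but do not send) an HTTP request that opens a change ticket."""
    payload = {
        "summary": summary,
        "config_items": sorted(config_items),
        "source_commit": commit_id,
    }
    return urllib.request.Request(
        CHANGE_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_change_request(
    summary="Roll out web tier configuration",
    config_items=["web02", "web01"],
    commit_id="abc1234",
)
body = json.loads(request.data.decode("utf-8"))
# A deployment script would pass `request` to urllib.request.urlopen()
# before touching any server, and record the ticket id from the response.
```

Tying the ticket to a version control commit gives the change review board something precise to look at, and it means the ticket gets opened by the same automation that performs the change.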
One place where agile operations and ITIL should play well together is in the Problem Management process. Once an Incident has recurred, it should be designated as a Problem. This triggers a whole separate process. Whereas Incident Management is about restoring normal operations, Problem Management seeks to correct the root cause of the incidents. For example, a web server crash would be an incident. The goal is to get the server back up and running as quickly as possible, which probably just means restarting the process. If it happens again, it would be considered a problem, which would trigger debugging, looking for memory leaks, reproducing the bug in QA, and so on.
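The promotion rule is simple enough to automate. As a minimal sketch in Python, assuming incidents can be grouped by a text signature and that the second occurrence is what triggers Problem Management (both are my simplifications; `IncidentLog` is a hypothetical class, not part of any ITIL tool):

```python
from collections import Counter


class IncidentLog:
    """Track incidents and flag recurring ones as Problems."""

    def __init__(self, threshold=2):
        # Hypothetical rule: the second occurrence promotes to a Problem.
        self.threshold = threshold
        self.counts = Counter()

    def record(self, signature):
        """Record one incident; return "problem" once it has recurred."""
        self.counts[signature] += 1
        if self.counts[signature] >= self.threshold:
            return "problem"
        return "incident"

    def problems(self):
        """Return the signatures that have reached the threshold."""
        return sorted(s for s, n in self.counts.items() if n >= self.threshold)


log = IncidentLog()
first = log.record("web server crash")
other = log.record("disk full on db01")
second = log.record("web server crash")
```

The first crash is handled as an incident and the server is restarted. The second comes back as a problem, which is the cue to start root cause analysis instead of restarting again.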
Both ITIL Problem Management and agile operations favor a "5 whys" approach to root cause analysis. That is, not just solving the immediate problem, but also understanding how it was possible for the problem to occur in the first place. I see a natural alignment here.
History and Culture
Sometimes I think a startup's biggest advantage is its lack of history. Development, operations, and the business all evolve a culture together. In an established company, it can be difficult to generate simultaneous culture change in disparate groups. As a result, it's likely that one group or the other is out in front of this change. Leadership within your own group is welcome, but leadership from another group rarely is. That's especially true when you don't particularly like the other group.
Yes, it's true. Development and operations sometimes don't like each other much.
In fact, they might be openly hostile.
Like a family feud in the hills of West Virginia, these things get started lots of different ways. Usually, it begins with a launch failure or outage, followed by a fierce round of blamestorming. The budding feud gets fueled by antagonism between the director of operations and the director of software development. Attempts to bridge the gap get stymied by the different languages spoken in operations and development. After a few years, finger-pointing and throw-it-over-the-wall processes become ingrained habits.
Agile operations - like agile development - requires a high degree of trust among all the parties. How can you build up that trust in a historically hostile environment? Your strategy depends on your position in the organization. There are really only three generally true statements I can make:
- Don't bother with technological solutions to cultural problems. Culture trumps process, every time.
- You can't be agile while you're in a defensive crouch.
- Strong leadership is vital. Leadership doesn't require management authority. You can be a leader at any level in your company.
Summary
We learned many things introducing agile software development over the last fifteen years. A team with the agile principles can reinvent the practices from scratch if they need to. On the other hand, a team without the agile principles can emulate the practices but will not derive the benefits. (And, in truth, they won't stick with the practices for long, either!)
Adopting agile operations will require "unlearning" some assumptions about how to create reliable processes. People involved in this transition must be led to realize that it is possible to be fast and disciplined, that quality doesn't contradict expediency, and that it's worth spending their time to eliminate those last 5% of manual activities.
At the same time, we must keep in mind that Operations serves stakeholders other than Development. The corporation relies on Operations to ensure the sanctity of its financial results and protect its reputation with customers and investors alike. Any move to agile operations must also preserve these important responsibilities.
Resources
- ITIL: Reference site for the IT Infrastructure Library.
- Agile Web Operations blog: Ongoing posts and discussions about Agile Operations for web sites.
- Andrew Schafer's blog: Andrew frequently discusses the intersection of development, operations, and the web.
- Automated Infrastructure enables Agile Operations: A useful article about infrastructure as code.
- Defining Agile Operations and DevOps: Discussing the similarities and differences between Agile Operations and its more extreme cousin DevOps.
- 10 Deploys Per Day: Dev and Ops Cooperation at Flickr: A great from-the-trenches presentation from John Allspaw, one of the key figures in re-uniting development and operations.
- Agile Infrastructure: A presentation from Agile 2009 about some tools and techniques to support agile operations. Principles come first, but strong tool support is essential.