Key Takeaways
- The DevOps movement is a positive step for unifying development and operations tools and processes, but rarely emphasizes the point that systems exist to deliver value.
- Architectural thinking and design can help when embracing DevOps by clarifying who the stakeholders are, what concerns they have, and how those concerns are being met.
- Beyond the developers and operations staff, many key stakeholders are interested in a system’s operation, including testers, communicators and assessors, and their concerns must be addressed. Viewpoints and views are tools to help capture those concerns.
- The operational view describes how a system will be installed, operated and supported in a complex or critical production environment.
- Dozens of potential questions about operations can be asked, ranging from how installation, upgrades and migrations will occur, to what monitoring and alerting is necessary. Following software architecture practices and principles can help identify the right questions to ensure a successful approach to DevOps.
This article first appeared in IEEE Software magazine. IEEE Software offers solid, peer-reviewed information about today's strategic technology issues. To meet the challenges of running reliable, flexible enterprises, IT managers and technical leads rely on IT Pro for state-of-the-art solutions.
Most software architecture books focus on building new systems. However, successful systems spend much more time running in their production environment than being initially developed. That's why the DevOps movement's recent emergence is so heartening [1]. It emphasizes development and operations staff working together as early as possible, sharing tools, processes, and practices to smooth the path to production.
DevOps requires embracing new, often unfamiliar technologies and ideas. Architectural thinking and design can help clarify who the stakeholders are, what concerns they have, and how those concerns are being met.
Len Bass and his colleagues recently published a book on DevOps technology and practice for software architects [2]. Beyond that, little guidance exists for architects dealing with their systems' operational environments. To help fill this gap, I outline here an architectural viewpoint, one that updates and reworks the operational viewpoint Nick Rozanski and I described in Software Systems Architecture [3].
What Is Production?
"Production" is a widely used term whose meaning varies. In this context, I define a production environment as any environment in which valuable work is being performed. This usually implies a controlled environment that can be altered only through a change control process, rather than directly by developers. Consider what would happen if you mistakenly remove a database's contents: if this is considered a disaster, you're probably working in a production environment.
Production environments have these characteristics:
- They involve a wide range of stakeholders: business management, auditors and risk managers, and infrastructure and operational staff, as well as the regular users and development staff who are also interested in the development environment.
- They require a high degree of control in an attempt to maintain the environment's reliability, but this means that formal processes must be followed to make changes.
- They're highly visible, particularly when things go wrong. Few people express interest in your development environment's status, but when production environments malfunction, you quickly discover just how wide your stakeholder community is.
- They're vulnerable to external events, particularly when connected to the Internet. Your development and test environments might be hidden, but your production environment is visible to everyone.
- They're unpredictable because they're part of a much more complex environment than most development and test systems. So, unforeseen external events sometimes affect them.
Taken together, these factors mean that working in the production environment presents many challenges that we, as software architects, must address.
The Operational Viewpoint
An operational view describes how a system will be installed, operated, and supported in its production environment. A different view, the deployment view, addresses the deployment environment the system needs (servers, software, networks, and so on) [4]. The operational view usually applies to any system being deployed in a complex or critical operational environment.
Key Stakeholders
Although we often focus on operations staff, a wide range of stakeholders are interested in a system's operation.
- Operations staff accept and operate new and changed software in production and are responsible for its service levels.
- Infrastructure engineers provide the infrastructure services the system relies on.
- Developers are responsible for the software, its smooth transition to production, and, ultimately, its success.
- Testers verify that both the software and the production environment as a whole will operate correctly.
- Communicators explain the system's operation to clients when the product is developed for installation on client premises.
- Assessors must be satisfied that the risks of operating the system in production are acceptable and managed.
The needs of all these stakeholders must be addressed, not just those of the developers and operations staff.
Concerns
Stakeholders have the following key concerns about the operational environment.
Installation and upgrade. How will software changes get to the production environment reliably? How will you know they've been applied successfully? How will failures be rolled back?
Migration. Of related concern is the migration of the system's workload to the new software and the process of making changes to the system's stored data (both the storage schemas and the data itself). Will multiple versions of the system exist in parallel, or will everything migrate at once?
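The installation questions above (apply the change, confirm it worked, roll back on failure) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a deployment tool: `apply_release`, `install`, and `health_check` are hypothetical names, and the sketch deliberately ignores data migration, which is usually the hard part of a rollback.

```python
from typing import Callable


def apply_release(
    current_version: str,
    new_version: str,
    install: Callable[[str], None],   # hypothetical: installs the named version
    health_check: Callable[[], bool], # hypothetical: True if the system is healthy
) -> str:
    """Install new_version, verify it, and roll back if verification fails.

    Returns the version that is left running.
    """
    try:
        install(new_version)
        healthy = health_check()
    except Exception:
        # A failed install or a crashing check both count as "not applied".
        healthy = False

    if healthy:
        return new_version

    # Roll back by reinstalling the previous version.
    install(current_version)
    return current_version
```

Passing the install and verification steps in as functions keeps the rollback decision separate from the deployment technology, so the same logic can be exercised in a test environment before it is trusted in production.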
Operational monitoring. Once the system is running in the production environment, how will you know it's operating correctly? How will you control its operation? What vital signs will you need to monitor: standard metrics, such as CPU use, or metrics more specific to your environment, such as the message volume received on a particular interface? Business measurements, such as the average and total transaction value per hour, will likely be as important as technical metrics.
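One way to make the "vital signs" idea concrete is to list technical and business measurements side by side, each with an acceptable range. The sketch below is illustrative only: the metric names and thresholds are invented for the example and would come from your own environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VitalSign:
    name: str
    kind: str        # "technical" or "business"
    minimum: float
    maximum: float

    def is_healthy(self, value: float) -> bool:
        return self.minimum <= value <= self.maximum


# Hypothetical vital signs; names and limits are assumptions for illustration.
VITAL_SIGNS = [
    VitalSign("cpu_use_percent", "technical", 0.0, 85.0),
    VitalSign("interface_messages_per_min", "technical", 100.0, 50_000.0),
    VitalSign("transaction_value_per_hour", "business", 10_000.0, float("inf")),
]


def out_of_range(readings: dict[str, float]) -> list[str]:
    """Return the names of vital signs that are missing or outside their range."""
    problems = []
    for sign in VITAL_SIGNS:
        value = readings.get(sign.name)
        if value is None or not sign.is_healthy(value):
            problems.append(sign.name)
    return problems
```

Treating a missing reading as a problem is a deliberate choice here: a business metric that silently stops arriving is often the first sign that something is wrong.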
Operational control. What tools will you need to control the system? Can you use third-party tools, or will you need to create your own?
Alerting. If the monitoring mechanisms identify an unexpected condition, what happens next? How will this event get recorded and propagated, and to whom? What will this person do? How will you use the monitoring and alerting history to continually improve key metrics?
Configuration management. Most modern systems comprise dozens or hundreds of infrastructure elements, with the largest containing millions. It's also becoming common to add and remove virtualized infrastructure elements on demand. How will you manage these elements' configuration? Modern tools such as Puppet and SaltStack can help simplify the process. But they're only part of the solution and could affect how you design and deploy the application.
Performance monitoring. There's well-documented evidence that as Internet applications slow down, users leave [5]. Although in-house users are more tolerant, performance remains one of the biggest factors that affect their satisfaction with a system. How will you monitor the application's performance? What metrics are important to measure? How will you spot degradation?
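As one possible answer to "How will you spot degradation?", the sketch below compares a high percentile of current response times against a recorded baseline. The choice of the 95th percentile and a 25 percent tolerance are assumptions for illustration, and the function names are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list, for pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def is_degraded(
    baseline_ms: list[float],
    current_ms: list[float],
    pct: float = 95.0,       # assumed percentile of interest
    tolerance: float = 1.25, # assumed: flag if 25% slower than baseline
) -> bool:
    """True if the current percentile latency exceeds the baseline by the tolerance."""
    return percentile(current_ms, pct) > tolerance * percentile(baseline_ms, pct)
```

A percentile is used rather than the mean because a small number of very slow requests can hurt users badly while barely moving the average.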
Support. Things inevitably go wrong in production environments, usually in highly unexpected ways. Who will handle such incidents? What tools and processes will they need? How can you design the application to be easy to support?
Data availability. Backup and restore is a concern as old as computing — and one that needs more thought than it often receives. With today's huge databases, backup and restore can become a mammoth operation if very smart strategies aren't applied.
Models
Understanding and solving problems is the essence of architectural work. This often involves creating models that help us understand systems and support good decision making. Here, I briefly describe some models I've found useful. Note that these models aren't the typical software architecture or design models—they're pragmatic models focused on particular concerns. So, you'll usually need your own "boxes and lines" or "text and tables" approach to creating them. You'll also need to develop them in conjunction with the development and operations groups to ensure a common understanding of how the operational environment will work.
Release. Release models describe the path to production from the development environment (in terms of stages, technologies, and approval checks). In effect, they act as a model of the route from the end of your continuous-integration pipeline to the production environment. This allows clear communication and identification of risks and weaknesses.
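A release model does not have to be a diagram. As a rough sketch, the path to production can be written down as data (stages, technologies, and approval checks) and then queried for obvious weak points. The stage names, technologies, and the two rules in `weak_points` are assumptions made for this example, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    technology: str
    approval_checks: tuple[str, ...] = ()
    automated: bool = True


# Hypothetical path from the end of continuous integration to production.
RELEASE_PATH = [
    Stage("package", "CI server artifact build"),
    Stage("system-test", "test environment deploy", ("test lead sign-off",)),
    Stage("pre-production", "scripted deploy", ("operations review",), automated=False),
    Stage("production", "scripted deploy", ("change board approval",)),
]


def weak_points(path: list[Stage]) -> list[str]:
    """Flag manual stages and a final stage that has no approval check."""
    findings = [f"{s.name}: manual step" for s in path if not s.automated]
    if path and not path[-1].approval_checks:
        findings.append(f"{path[-1].name}: no approval check before go-live")
    return findings
```

Even a model this small gives development and operations something specific to review together, which is the main purpose of the model.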
Configuration management. Configuration management models can help you capture and analyze the configurations you'll need across your operational environment and determine how to manage and control them. Today, we can easily end up with configuration in properties files for Java software, application settings in databases, Zookeeper for distributed systems, and Puppet for infrastructure. So, understanding, coordinating, and validating change across all configurations is a major challenge that focused models can help you meet.
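To illustrate why coordinating configuration across several stores is hard, the following sketch looks for settings that are defined with different values in more than one place. The source names and keys are invented for the example, and a real check would need to read from the actual stores.

```python
def find_conflicts(
    sources: dict[str, dict[str, str]],
) -> dict[str, dict[str, str]]:
    """Return keys that have different values in different sources.

    `sources` maps a source name (for example a properties file or a
    database settings table) to its key/value settings.
    """
    seen: dict[str, dict[str, str]] = {}
    for source_name, settings in sources.items():
        for key, value in settings.items():
            seen.setdefault(key, {})[source_name] = value
    return {
        key: by_source
        for key, by_source in seen.items()
        if len(set(by_source.values())) > 1
    }


# Hypothetical configuration gathered from three places.
example_sources = {
    "properties_file": {"db.host": "db1", "pool.size": "10"},
    "database_settings": {"pool.size": "20"},
    "zookeeper": {"db.host": "db1"},
}
conflicts = find_conflicts(example_sources)
```

Here `conflicts` contains only `pool.size`, because `db.host` has the same value wherever it appears.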
Administration. Administration models demonstrate how the administrative environment works and relates to the system. Administrative environments are often a complex mix of standard tools, local utilities, people, and processes. To avoid problems and misunderstandings, creating a model of the end-to-end environment will be valuable when you're determining how the operations group will run the system.
Support. Support models show how an incident is recognized, handled, managed, and resolved. Such models are usually process descriptions rather than technical designs, but they're useful for considering various problem scenarios and planning how each will be handled.
Problems and Pitfalls
Many things can go wrong when you're working on your architecture's production aspects. For example, late or poor engagement with the operational staff might leave you with a lot of DevOps tooling that has little effect because the operations group hasn't bought into it.
Or, back-out planning might be lacking. Much of the discussion about continuous delivery and DevOps addresses getting code to production safely, but not what to do if things go wrong and you need to roll back. Blue-green deployment and canary releases are powerful approaches [6] but aren't magical. It can be difficult to roll back quickly with a large database after a high-impact database change.
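A canary release needs an explicit rule for when to promote and when to back out. The sketch below shows one simple rule based on error rates. The thresholds and names are assumptions for illustration, and the rule only covers the code path: it does nothing to undo a database change, which is exactly the gap described above.

```python
def canary_decision(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    min_requests: int = 500,  # assumed minimum traffic before judging
    max_ratio: float = 1.5,   # assumed: canary may be up to 1.5x baseline
    floor_rate: float = 0.001 # assumed: always tolerate 0.1% errors
) -> str:
    """Return "wait", "promote", or "rollback" for a canary release."""
    if canary_requests < min_requests:
        return "wait"
    baseline_rate = (
        baseline_errors / baseline_requests if baseline_requests else 0.0
    )
    canary_rate = canary_errors / canary_requests
    allowed_rate = max(baseline_rate * max_ratio, floor_rate)
    return "promote" if canary_rate <= allowed_rate else "rollback"
```

The "wait" outcome matters as much as the other two: deciding on too little traffic makes both promotion and rollback unreliable.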
Another problem is lack of migration planning, which is often as much about the people as the technology. Is everyone moving on one day? Do you have a pilot phase? Can you blue-green deploy? How do you know whether processes are working for particular user groups? How will data get migrated to new databases or schemas? Is there a realistic time frame for all this?
Not engaging early and often with operations can also lead to missing management tools or processes. What are you assuming that operations staff will be doing? Do they have the tools they need?
Poor alerting is often caused by wanting to ensure visibility of problems but not considering all the failure scenarios. Creating a tsunami of alerts is easy to do by mistake but can make it impossible to understand and address the underlying problem. Is your alerting smart enough to let operations pinpoint the underlying problem quickly?
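One common way to reduce an alert flood is to collapse repeated alerts for the same component and condition into a single summary with a count. The sketch below assumes alerts carry a timestamp, a component, and a condition, and uses a five-minute window. All of these are illustrative choices, not a description of any particular alerting product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some reference point
    component: str
    condition: str


def collapse(alerts: list[Alert], window_seconds: float = 300.0) -> list[dict]:
    """Collapse repeats of (component, condition) into one summary per window.

    A repeat joins an existing summary if it arrives within window_seconds
    of that summary's first alert; otherwise it starts a new summary.
    """
    summaries: list[dict] = []
    latest: dict[tuple[str, str], int] = {}  # key -> index of newest summary
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.component, alert.condition)
        index = latest.get(key)
        if (
            index is not None
            and alert.timestamp - summaries[index]["first_seen"] <= window_seconds
        ):
            summaries[index]["count"] += 1
        else:
            latest[key] = len(summaries)
            summaries.append(
                {
                    "component": alert.component,
                    "condition": alert.condition,
                    "first_seen": alert.timestamp,
                    "count": 1,
                }
            )
    return summaries
```

Collapsing repeats does not identify the root cause by itself, but it shortens the list operations staff have to read before they can start looking for it.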
Late collaboration with operations might also lead to a lack of integration into the production environment. What do operations staff expect in terms of tools, documentation, processes, and so on? If you don't get this right, you're probably not going live, but these factors are often difficult to judge in preproduction environments.
Finally, an inadequate backup-and-restore strategy can cause problems. If a catastrophic networking failure or major datacenter problem occurs and you must operate from another location, is all the data you need at this alternate location? Will you need to perform any restore operations before commencing? If so, how long will they take?
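The "how long will they take?" question can often be answered, at least roughly, with simple arithmetic before any disaster occurs. The sketch below estimates restore time from data volume and sustained restore throughput. The function name and the figures in the example are assumptions, and real restores usually add overheads (index rebuilds, verification, replaying logs) that should go into the fixed overhead term.

```python
def restore_hours(
    data_gb: float,
    throughput_mb_per_s: float,
    fixed_overhead_hours: float = 0.0,
) -> float:
    """Estimate hours needed to restore data_gb at a sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    transfer_seconds = data_gb * 1024 / throughput_mb_per_s
    return fixed_overhead_hours + transfer_seconds / 3600


# Hypothetical example: a 20 TB database restored at 200 MB/s.
estimate = restore_hours(20 * 1024, 200)
```

For the example figures, the estimate comes to a little over 29 hours of pure data transfer, which is the kind of number worth knowing before you depend on the alternate location.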
Systems exist to run in production and deliver value. But you'd be forgiven for missing that point, because the software architecture literature doesn't often emphasize it. The DevOps movement—with its emphasis on the operations group's importance and the need to unite development and operations tools and processes—is a terrific step forward. But software architecture still has its part to play in making DevOps approaches successful. I hope that the architectural viewpoint presented here will guide you in making that contribution to successful projects.
References
1. G. Kim, K. Behr, and G. Spafford, The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win, IT Revolution Press, 2014.
2. L. Bass, I. Weber, and L. Zhu, DevOps: A Software Architect's Perspective, Addison-Wesley, 2015.
3. N. Rozanski and E. Woods, Software Systems Architecture, Addison-Wesley, 2011.
4. N. Rozanski and E. Woods, "The Deployment Viewpoint", Software Systems Architecture, 2016.
5. B. Forrest, "Bing and Google Agree: Slow Pages Lose Users", O’Reilly, 23 June 2009.
6. J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation, Addison-Wesley, 2010.
About the Author
Eoin Woods is the chief technology officer at Endava.