Key Takeaways
- Resilient software design is mandatory for today’s distributed system landscapes
- The key challenges are not in the coding domain, but in the “periphery”
- Understanding the consequences of going distributed is extremely hard and most people underestimate it heavily
- Appropriate functional design is key for robust distributed systems, but still poorly understood
- The key challenges of introducing resilient software design in your company are building awareness and sustainability
The trigger for this article was a presentation about the challenges of resilient software design that I gave at GOTO Berlin 2018. I will write briefly about the “why” and “what” of resilient software design. The middle part covers the challenges that I have met most often in recent years. At the end, I add a few thoughts on how best to implement resilient software design in your organisation. After reading the article, you will hopefully have a better feeling for which challenges await you on your way towards resilient software design, plus some ideas on how to tackle them.
What is resilient software design and why is it important?
Resilient software design (RSD for short) is a topic that cannot be explained in a sentence or two, which sometimes makes it hard to explain to a person who is used to “elevator pitch” explanations. Still, I will try to give you the shortest explanation of the “why” and “what” of RSD that I know:
Let me start with the “why”: usually, we try to achieve some kind of business value with our systems: earning money, making customers happy, and the like. We obviously can only realize this value with systems in production that are available, i.e., work according to their specifications. This has been a given for a long time.
What is different these days is the fact that almost every system is a distributed system. Systems talk to each other all the time, and usually the systems themselves are split up into remote parts that do the same. Developments like microservices, mobile computing and (I)IoT multiply the connections between collaborating system parts, i.e., take that development to the next level.
The remote communication needed to let the systems and their parts talk to each other implies failure modes that only exist across process boundaries, not inside a process. These failure modes, e.g., non-responsiveness, latency, incomplete or out-of-order messages, will cause all kinds of undesired failures on the application level if we ignore their existence. In other words, ignoring the effects of distribution is not an option if you need a robust, highly available system.
This leads me to the “what” of RSD: I tend to define resilient software design as “designing an application in a way that ideally a user does not notice at all if an unexpected failure occurs (due to the non-deterministic behavior introduced by distribution) or that the user at least can continue to use the application with a defined reduced functional scope (a.k.a. graceful degradation of service)”.
Note the word “defined” in the definition. It means that you have a plan for what to do if something unexpected happens, which is very different from the accidental system behavior that you typically experience without resilient software design.
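To make the idea of a “defined reduced functional scope” a bit more tangible, here is a minimal Python sketch. It is only an illustration under assumptions: the recommendation scenario, the function names and the default list are hypothetical and not taken from any real system.

```python
from typing import Callable, List

# Hypothetical static fallback; in practice, the business side would help define it.
DEFAULT_RECOMMENDATIONS: List[str] = ["bestseller-1", "bestseller-2"]


def recommendations_for(user_id: str,
                        fetch_remote: Callable[[str], List[str]]) -> List[str]:
    """Return personalised recommendations, or a defined default if the remote call fails."""
    try:
        return fetch_remote(user_id)
    except (TimeoutError, ConnectionError):
        # Defined degraded behaviour: the page still renders, only less personalised.
        return DEFAULT_RECOMMENDATIONS


def failing_fetch(user_id: str) -> List[str]:
    """Stand-in for a remote service that does not answer in time."""
    raise TimeoutError("recommendation service did not answer in time")


def healthy_fetch(user_id: str) -> List[str]:
    """Stand-in for a remote service that answers normally."""
    return [f"personal-pick-for-{user_id}"]
```

The important part is not the `try`/`except` itself, but that the fallback is a conscious decision made up front instead of whatever the system happens to do when the call fails.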
The quests of resilient software design
As so often in our domain, I started my journey with advice on how to design and implement resilience measures on the code level. But over time I realized that understanding how to actually implement that stuff is the easiest part. Also as so often in our domain, the actual challenges were in different areas.
While in hindsight this is not surprising (the biggest project issues are never on the coding level, are they?), I have to admit that I was surprised at first. Many of those challenges only become clear if you deal with the topic for a longer time. Thus, I decided to condense some of my learnings into a presentation that I called “The 7 quests of resilient software design”, where each “quest” described a challenge I’ve met. The challenges I picked for the presentation were:
- Understanding the “business case” for RSD. Often, if you try to establish something new as a software engineer, you will be asked for its business case in one way or the other. While it is a valid question, it leads in the wrong direction if applied to RSD, as RSD is not about making money, but rather about not losing money. Still, it makes sense to define a resilience budget by looking at the immediate losses if your system becomes unavailable and the long-term losses caused by more frequent non-availabilities due to cumulative effects like annoyed customers who will increase your churn rate. This way, we can embed our RSD activities in a sane economic framework.
- Understanding the non-deterministic behavior of distributed systems and its consequences. The problem is that distributed systems are not only hard to master, but also hard to understand. Remote communication adds a probabilistic element to our system’s behavior. Unfortunately, our brains are not wired to deal with probabilistic behavior easily. Additionally, virtually all our IT education is based on in-process settings where we face deterministic behavior, which makes it even harder for most people to wrap their heads around the consequences of going distributed. I don’t know a simple solution for that problem (and I think that none exists). Maybe adding more of the distributed systems literature to our IT education could help. There are hundreds of computer science papers out there that clearly explain how things that are simple inside a process become really hard or even impossible if you go distributed. But hardly anyone reads those papers. Some people then say we should simply tackle the effects of remote communication on the infrastructure level and leave the application engineers undisturbed. Many distributed frameworks of the past have followed this reasoning and have proven that this is not a viable solution for anything but very small systems ...
- Avoiding the “100% available” trap. This is related to the previous challenge, but at a much smaller scope. Due to the typical deterministic thinking, people (no matter if in IT or non-IT) tend to assume implicitly that all systems they connect to are available 100% of the time. You find this implicit assumption all over the place - in requirements, in the designs and in the code. This is neither ill intent nor carelessness of the people. It just slips their mind. But in distributed systems the question is not if a system is going to fail, but only when it is going to fail. Availability always is a number smaller than 1 (or smaller than 100% if you prefer that notation). Thus, we always need to remind ourselves of the 100% available trap and check our requirements, designs and code to see whether we fell for it.
- Establishing the Ops-Dev feedback loop. In many companies, we still see that big wall between development and operations that often reaches up to the C-level. While there are reasons why these walls were built in the first place, they cause a big problem regarding RSD. The resilience measures need to be implemented on the application level, i.e., in development. But you only can measure in operations how effective the measures actually are. Also in operations you will detect new robustness shortcomings of the application that need to be treated in development. But if you have a big wall between Dev and Ops, you destroy this vital feedback cycle. Developers shoot in the dark with their resilience measures and operations won’t get their robustness shortcomings fixed. Therefore, you need to establish this feedback loop, no matter if you use an established method like DevOps or SRE, or if you prefer your own home-grown approach.
- Getting the functional design right. If you distribute your functionality the wrong way, no additional resilience measure will give you a more robust system. Here is the problem: assume you create a tight functional dependency between services (a “strong coupling”). If the used service (the service that the functional dependency points to) is not available in this setting, all using services also become unavailable. This happens because the using services need the part of the business logic in the used service to fulfil their tasks. Unfortunately, almost all the techniques that we learn on how to design systems, i.e., how to divide functionality, lead to strong couplings as they focus on in-process design where this kind of coupling does not affect availability. Therefore, we basically need to relearn functional design for distributed systems with a focus on low functional coupling - which is different from low coupling inside a process. I will dive a bit deeper into this topic in the next section.
- Understanding which patterns to use and how to combine them in a reasonable way. When you learn new patterns it is tempting to use them. The problem is that resilience patterns come at a price. They usually increase implementation and operation costs. And even more important, they increase the complexity of the solution. This is critical as complexity is the enemy of robustness. The more complex a solution becomes, the less understandable it becomes and the more likely it becomes that unexpected effects will emerge that are going to affect robustness in a negative way. Therefore, it is important not to implement as many patterns as you can, but to find the robustness sweet spot between too few resilience measures and brimming complexity.
- Not losing our collective community knowledge every few years as a result of our technological youthism. This is not specific to RSD, but applies to IT as a whole. As a community, we tend to lose our collective wisdom every few years and start from scratch. Instead of building and maintaining a proven body of knowledge like other engineering disciplines do, we tend to neglect everything we already know and only look for the next hyped silver bullet on the horizon to solve our problems. Again, this is not limited to RSD, but we can observe it here very well, as most of the RSD concepts are not new at all. Some of them are already several decades old. To be honest, I do not have a good idea how to tackle this issue. Thus, the best thing that has come to my mind is to remind the people in my environment of that problem and hope that if enough people become aware of it, we eventually will grow into an actual engineering discipline.
Of course, there are more challenges, but these were the most surprising ones from my point of view.
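The “100% available” trap from the list above can be made concrete with a little arithmetic. The following Python sketch assumes, as a simplification, that services fail independently of each other and that a request needs all of them to succeed. The function names and the figure of ten services at 99.9% are purely illustrative.

```python
from math import prod
from typing import Iterable


def combined_availability(availabilities: Iterable[float]) -> float:
    """Availability of a request that needs ALL given services (independent failures assumed)."""
    return prod(availabilities)


def expected_downtime_hours_per_year(availability: float,
                                     hours_per_year: float = 24 * 365) -> float:
    """Expected unavailable hours per year for a given availability (a number between 0 and 1)."""
    return (1.0 - availability) * hours_per_year


# Illustrative numbers only: a request that touches ten services, each at 99.9%.
chain = combined_availability([0.999] * 10)
single_service_downtime = expected_downtime_hours_per_year(0.999)
chain_downtime = expected_downtime_hours_per_year(chain)
```

Under these assumptions, the chain ends up at about 99.0% availability, i.e., roughly 87 hours of expected unavailability per year compared to less than 9 hours for a single service. Every additional remote dependency that is implicitly treated as “always there” pushes the number further down.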
Creating better robust applications using resilient software design
If we look at the “quests”, based on my experience, understanding distributed systems and how to create a good functional design are the biggest blockers in creating better robust applications. Thus, let us dive a bit deeper into those two topics.
Understanding the implication of distributed failure modes is incredibly hard. Things that are no-brainers inside a process become extremely hard or even impossible in distributed systems, and infrastructure cannot hide all those effects from the application level. On the other hand, most, if not all of our IT education in and after university is based on local computing. Additionally, trying to grasp the non-deterministic effects of distribution gives our brains a hard time.
Based on my experience, it is almost impossible for most people outside IT to actually understand the effects of distribution, as most people outside IT understand computers as “a machine that always does what it is told” and it would require a long time to teach them the concepts and challenges of distributed computing – time we usually do not have.
But it is also extremely hard for most developers. If developers get confronted with all the imponderabilities of distributed systems, they are usually overwhelmed. And as most of their IT education completely ignored distributed systems, quite often they even oppose dealing with those topics. This leads to designs and implementations that ignore the effects of distribution which in turn leads to brittle and slow systems at runtime.
As I wrote before, I do not know a simple fix for this problem and I also think, a simple fix does not exist. The best shot I currently have to offer is to add more distributed system design to our education (in and after university) as our system landscapes become more and more distributed and we need to be better aware of the effects of our designs and coding.
The other huge blocker is functional design. If you spread functionality in the wrong way, you end up with a brittle system at runtime. A simple example that can be observed very often: a service A receives an external request. To respond to the request, it needs some information from another service B due to the functional design, i.e., how the functionality is spread between the services. If service B is down, service A cannot answer the external request.
This is a so-called cascading failure. One of the main tasks of resilient software design is to avoid cascading failures. Usually, you would use a simple timeout detection or a circuit breaker to detect that service B is down and then fall back to a backup plan in service A. But due to the way the functionality is spread between the services, there is no backup plan possible, i.e., the circuit breaker will only make the cascading failure visible, but not offer any means to circumvent it.
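For readers who have not seen a circuit breaker yet, the following Python sketch shows the basic mechanics. It is a deliberately simplified, single-threaded illustration and not the API of any real library: the class, its parameters and the injectable `clock` are assumptions made for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after a number of consecutive failures and lets calls through again after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock  # injectable so the behaviour can be tested without waiting
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After the cool-down, calls are let through again (a simplified "half-open" state).
        return self._clock() - self._opened_at < self.reset_timeout_s

    def call(self, remote: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # fail fast without touching the remote service
        try:
            result = remote()
        except Exception:
            # Sketch only: real code would catch communication errors specifically.
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # A successful call closes the breaker again.
        self._failures = 0
        self._opened_at = None
        return result
```

Note that `call` requires a `fallback`. In the scenario with service A and service B, this is exactly the piece that the functional design does not provide: the breaker can detect that B is down, but A has nothing meaningful to return instead.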
This was just one example. There are many more. Typically those designs emerge if you apply the usual “design best practices” to distributed systems. In the given example, service B was a “reusable service”. While reusability is a very desirable property inside a process boundary, it also creates a very strong coupling that exhibits very undesirable properties across process boundaries.
In the past, we ended up with hard-to-maintain systems if we got our functional design wrong, which is bad enough. But today, in a distributed setting, a bad functional design also means brittle, unreliable and slow systems at runtime, which is worse. The problem is that most advice on how to “get design right” only applies to design inside a process boundary. Most of that advice does not work well if applied to distributed systems.
What I have learnt over time is that we basically need to re-learn how to design systems, i.e., how to spread the functionality in a distributed environment.
Most people then mention Domain-Driven Design (or “DDD” for short), but based on my experience this also is not the promised panacea. Do not get me wrong: DDD offers a lot of really good advice regarding better design. But when it comes to the design of distributed systems, DDD alone is not enough. Some additional advice is still missing. The good part of the story: as far as I can observe it, people are trying to extend the original ideas of DDD with distributed systems in mind. Hence, I am curious about the future developments in this area.
Developing the skills needed for creating resilient robust applications
If you want to introduce RSD to your own company, you might ask for a plan for how best to do this. Based on my experience, there is no perfect plan, or at least none that I know about. My recommendation in such a situation is to implement the generic awareness-competence-sustainability pattern, which for RSD basically looks like this:
First it is important to understand why RSD is needed and how to communicate it to people not involved in software development. This involves understanding and accepting the imponderabilities of distributed systems (including avoiding the “100% available” trap) as well as the business case of resilient software design. Additionally, you need to learn how to communicate it to people without deep IT knowledge. It does not help if you know that RSD is needed to create robust systems if you cannot discuss it with your managers or your business owners and help them to understand the topic well enough to make the right decisions.
Building the knowledge is probably the easiest part. By now, there are some resources and trainings available on this topic: just browse the workshop sections of conferences on distributed systems or microservices or, e.g., have a look at [1] or [2] to start with. And of course you need to apply what you learn in your work. Again, getting the functional design right is a hard task, but the resilience patterns themselves are relatively easy to learn and apply.
To build sustainability you first need a working Ops-Dev feedback loop. Without that loop, any resilience initiative is doomed because you lack the feedback if and how well your resilience measures work in practice.
Additionally, you might want to establish a Chaos Engineering initiative. Chaos Engineering is not only helpful to reveal robustness shortcomings in your system landscapes. It especially helps you to make your resilience efforts sustainable by continuously learning in a controlled way, how robust your systems are and how to improve their robustness further.
The term “Chaos Engineering” is a bit misleading, as it is not about creating chaos but about avoiding it. In Chaos Engineering, you design controlled robustness experiments to understand better how robust your system actually is and where additional resilience measures are needed. Chaos engineers always carefully control the potential “blast radius” of the experiments they conduct and discuss them with all affected people before executing any experiments.
It starts with a hypothesis, e.g. “If we cut the connection to this server, an automatic failover will take place and end users will not experience any difference”. Then, this hypothesis is discussed with the affected developers and operations people. Is the hypothesis valid? How can we test it? How can we measure if we were right or not? How can we stop the experiment in a safe way if we were wrong?
After everything is discussed and defined, the actual experiment is conducted and, based on its outcome, any required (RSD) measures are defined. Besides helping to find formerly undiscovered breaking points in your applications, Chaos Engineering over time also significantly improves the confidence that you have in your systems, which is a good feeling.
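The structure of such an experiment (hypothesis, failure injection, measurement, safe stop) can be sketched in a few lines of Python. This is only a thought aid under assumptions, not a real Chaos Engineering tool: the class and all hook names are hypothetical, and the actual injection and measurement logic is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    hypothesis: str                       # e.g. "failover is invisible to end users"
    inject_failure: Callable[[], None]    # hypothetical hook, e.g. cut a connection
    rollback: Callable[[], None]          # undoes the injected failure
    steady_state_ok: Callable[[], bool]   # measurement: do users still get good answers?

    def run(self, checks: int = 5) -> bool:
        """Return True if the hypothesis held for all checks; always roll back afterwards."""
        if not self.steady_state_ok():
            # Never start an experiment on a system that is already unhealthy.
            raise RuntimeError("system not in steady state; experiment not started")
        self.inject_failure()
        try:
            for _ in range(checks):
                if not self.steady_state_ok():
                    return False  # stop early to limit the blast radius
            return True
        finally:
            self.rollback()  # safe stop, no matter how the experiment ended
```

The `finally` block mirrors the question “How can we stop the experiment in a safe way if we were wrong?”: the rollback runs whether the hypothesis held, was refuted, or the measurement itself failed.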
Conclusion
Overall, it can be said that in today’s distributed system landscapes, RSD has become mandatory. While learning how to design and implement resilience patterns is relatively easy, the actual challenges on the way to successful RSD, as so often, do not lie in the coding domain.
Especially the intricacies of distributed systems themselves and functional design for such systems make it hard to implement sustainable resilience. But also challenges like missing feedback loops between Ops and Dev, overly complex resilience design, or a lack of understanding of the business case of RSD often get in the way. But, knowing the challenges is the first step of successfully mastering them …
References
- [1] Michael T. Nygard, Release It!, 2nd edition, Pragmatic Bookshelf, 2018
- [2] Robert S. Hanmer, Patterns for Fault Tolerant Software, Wiley, 2007
About the Author
Uwe Friedrichsen has traveled the IT world for many years. As CTO and fellow of codecentric, he is always in search of innovative ideas and concepts. His current focus areas are (distributed) system design, deep learning and the IT of (the day after) tomorrow. Often, you can find him at conferences sharing his ideas, or as an author of articles, blog posts, tweets and more.