InfoQ Homepage Resilience Content on InfoQ
-
Write More, Talk Less: Building Organizational Resilience through Documentation and InnerSource
Better documentation and knowledge sharing creates transparency that aids onboarding, prevents turnover disruption, and withstands reorganizations. Different practices can help, such as communicating asynchronously, creating incentives for documentation, making docs discoverable, understanding team members' preferences, and providing dedicated writing time. And maybe InnerSource can help too.
-
Debugging Production: eBPF Chaos
This article shares insights into learning eBPF as a new cloud-native technology which aims to improve Observability and Security workflows. You’ll learn how chaos engineering can help, and get an insight into eBPF based observability and security use cases. Breaking them in a professional way also inspires new ideas for chaos engineering itself.
-
How We Improved Application’s Resiliency by Uncovering Our Hidden Issues Using Chaos Testing
This article lists the chaos testing principles which are outlined by Netflix. The readers should be able to understand the advantages and disadvantages that chaos testing offers. This will help them to decide whether they want to perform it or not. The article also explains why we should convince the management to perform chaos tests, considering all benefits over the risks.
-
How Do We Utilize Chaos Engineering to Become Better Cloud-Native Engineers?
Engineers these days are closer to the product and the customer needs—there is still a long way to go and companies are still struggling with how to get engineers closer to their customers to understand in-depth what their business impact is: what do they solve, what’s their influence on the customer, and what is their impact on the product?
-
Chaos Engineering and Observability with Visual Metaphors
This article introduces a new actor for visualising chaos engineering and observability: metaphors. It provides the conceptual foundations of chaos engineering and observability, presents a state of art of visualisation techniques available in the market and shows how treemaps, gauge charts, geocentric and city metaphors can enrich the spectrum of the visual strategies to observe the chaos.
-
DevOps and Cloud InfoQ Trends Report - July 2021
This article summarizes how we see the "cloud computing and DevOps" space in 2021, which focuses on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.
-
Building Reliable Software Systems with Chaos Engineering
Advances in large-scale, distributed software systems are changing the game for software engineering. As an industry, we are quick to adopt practices that improve flexibility and improve feature velocity. If we can move quickly, can we do so without breaking things? Chaos Engineering practices can be used to navigate complexity and build more reliable systems.
-
Continuous Learning as a Tool for Adaptation
The fifth and capstone article in a series on how software companies adapted and continue to adapt to enhance their resilience explores key themes with a special view on the practicality of organizational resilience. It also provides practical guidance to engineering leadership and recommendations on how to create this investment.
-
Software Architecture and Design InfoQ Trends Report—April 2021
An overview of how the InfoQ editorial team sees the Software Architecture and Design topic evolving in 2021, with a focus on what architects are designing for today.
-
Designing & Managing for Resilience
The fourth article in a series on how software companies adapted and continue to adapt to enhance their resilience explores the strategies used by engineering leaders to help create the conditions for sustained resilience. It provides stories, examples, and strategies towards designing an organizational structure to support resilient performance and managing for resilience.
-
Adaptive Frontline Incident Response: Human-Centered Incident Management
The third article in a series on how software companies adapted and continue to adapt to enhance their resilience zeros in on the sources that comprise most of your company’s adaptive resources: your frontline responders. In this article, we draw on our experiences as incident commanders with Twilio to share our reflections on what it means to cultivate resilient people.
-
Learning from Incidents
Jessica DeVita (Netflix) and Nick Stenning (Microsoft) have been working on improving how software teams learn from incidents in production. In this article, they share some of what they’ve learned from the research community in this area, and offer some advice on the practical application of this work.