Resilience Content on InfoQ
-
Cell-Based Architecture Adoption Guidelines
The challenges in building modern, reliable, and understandable distributed systems continue to grow, and cell-based architecture is a valuable way to accept failures, isolate them, and stay reliable when they occur. Organizations must ensure that cell-based architecture is the right fit for them and that the migration will not cause more problems than it solves.
-
Adaptive Responses to Resiliently Handle Hard Problems in Software Operations
As engineers move into more senior positions such as Staff Engineer, Architect, or Sr Tech Lead roles, their knowledge and experience are often applied across the system. This expertise is increasingly needed for handling novel problems or designing innovative solutions to complex problems. This article discusses strategies for approaching your role as a senior member of your organization.
-
Taking Advantage of Cell-Based Architectures to Build Resilient and Fault-Tolerant Systems
Cell-based architectures offer a robust approach to building resilient systems. They achieve this through the core principles of isolation, autonomy, and replication. Each cell manages its resources and makes decisions autonomously. Observability for cell-based architecture requires a tailored approach to address the unique challenges and opportunities presented by this distributed system design.
-
Article Series: Cell-Based Architectures: How to Build Scalable and Resilient Systems
In this article series, we take readers on a journey of discovery and provide a comprehensive overview and in-depth analysis of many key aspects of cell-based architectures, as well as practical advice for applying this approach to existing and new architectures.
-
How Cell-Based Architecture Enhances Modern Distributed Systems
Cell-based architecture has emerged as a response to many challenges associated with distributed systems. It employs the bulkhead pattern to isolate failures to a fraction of the affected infrastructure footprint and prevent widespread impact. Cells can also help organize large architectures into domain-bound deployment and delivery units, which provides essential sociotechnical benefits.
-
Mastering Impact Analysis and Optimizing Change Release Processes
Dynamic IT professional with a proven track record in optimizing production processes and analyzing outages in complex systems handling millions of TPS. The recent CrowdStrike outage highlights the importance of continuous improvement and adherence to best practices. Passionate about elevating operational excellence through strategic reviews and effective process enhancements.
-
Prepare to Be Unprepared: Investing in Capacity to Adapt to Surprises in Software-Reliant Businesses
Incidents are often perceived as extraordinary aberrations, unconnected to "normal" work. For over twenty years, the field of Resilience Engineering has aimed at flipping this approach around — by understanding what makes incidents so rare (relative to when and how they do not happen) and so minor (relative to how much worse they can be) and deliberately enhancing what makes that possible.
-
Generative AI and Organizational Resilience
Generative AI will profoundly transform communication and information sharing over the next decade, but the change will be uneven across industries and roles. Organizations should empower workers to use AI augmentation thoughtfully, while building literacy on capabilities and limits. A balanced, conscientious integration, using iterations and customer feedback, will produce the best outcomes.
-
Orchestrating Resilience: Building Modern Asynchronous Systems
In this article, we discuss the problems we had to solve at Twilio to efficiently build a resilient and scalable asynchronous system that handles a complex workflow, and the advantages we gained from adopting a Workflow Orchestration solution, including abstracting away state management and out-of-the-box support for retries, observability, and auditability.
-
The Incident Lifecycle: How a Culture of Resilience Can Help You Accomplish Your Goals
Don’t get stuck with overwhelmed systems that can cause an outage, like what happened with Taylor Swift concert tickets. Build organizational resilience to incidents through improved coordination and communication during the response, and blameless reviews, root cause analysis, and insightful communication afterward to enable meaningful change.
-
Write More, Talk Less: Building Organizational Resilience through Documentation and InnerSource
Better documentation and knowledge sharing create transparency that aids onboarding, prevents turnover disruption, and withstands reorganizations. Different practices can help, such as communicating asynchronously, creating incentives for documentation, making docs discoverable, understanding team members' preferences, and providing dedicated writing time. And maybe InnerSource can help too.
-
Debugging Production: eBPF Chaos
This article shares insights into learning eBPF as a new cloud-native technology that aims to improve Observability and Security workflows. You’ll learn how chaos engineering can help, and get an insight into eBPF-based observability and security use cases. Breaking them in a professional way also inspires new ideas for chaos engineering itself.