Resilience Content on InfoQ

Articles


Architecture & Design

Applying Flow Metrics to Design Resilient Microservices

Designing software for resilience acknowledges the reality that everything fails. We put metrics in place to help us detect and resolve problems and failures. Flow metrics, commonly used to measure how well teams deliver software, can also be used to measure and improve system resilience.

Mourjo Sen
on Mar 26, 2025
Architecture & Design

Cell-Based Architecture Adoption Guidelines

The challenges in building modern, reliable, and understandable distributed systems continue to grow, and cell-based architecture is a valuable way to accept and isolate failures while staying reliable. Organizations must ensure that cell-based architecture is the right fit for them and that the migration will not cause more problems than it solves.

Guy Coleman
on Nov 04, 2024
Culture & Methods

Adaptive Responses to Resiliently Handle Hard Problems in Software Operations

As engineers move into more senior positions such as Staff Engineer, Architect, or Sr Tech Lead roles, their knowledge and experience are often applied across the system. This expertise is increasingly needed for handling novel problems or designing innovative solutions to complex problems. This article discusses strategies for approaching your role as a senior member of your organization.

Laura Maguire
on Oct 23, 2024
Architecture & Design

Taking Advantage of Cell-Based Architectures to Build Resilient and Fault-Tolerant Systems

Cell-based architectures offer a robust approach to building resilient systems. They achieve this through the core principles of isolation, autonomy, and replication. Each cell manages its resources and makes decisions autonomously. Observability for cell-based architecture requires a tailored approach to address the unique challenges and opportunities presented by this distributed system design.

Yury Niño Roa
on Oct 21, 2024
Architecture & Design

Article Series: Cell-Based Architectures: How to Build Scalable and Resilient Systems

In this article series, we take readers on a journey of discovery and provide a comprehensive overview and in-depth analysis of many key aspects of cell-based architectures, as well as practical advice for applying this approach to existing and new architectures.

Rafal Gancarz
on Oct 14, 2024
Architecture & Design

How Cell-Based Architecture Enhances Modern Distributed Systems

Cell-based architecture has emerged as a response to many challenges associated with distributed systems. It employs the bulkhead pattern to isolate failures to a fraction of the affected infrastructure footprint and prevent widespread impact. Cells can also help organize large architectures into domain-bound deployment and delivery units, which provides essential sociotechnical benefits.

Erica Pisani, Rafal Gancarz
on Oct 14, 2024
DevOps

Mastering Impact Analysis and Optimizing Change Release Processes

Dynamic IT professional with a proven track record in optimizing production processes and analyzing outages in complex systems handling millions of TPS. The recent CrowdStrike outage highlights the importance of continuous improvement and adherence to best practices. Passionate about elevating operational excellence through strategic reviews and effective process enhancements.

Tejas Ghadge
on Sep 11, 2024
Culture & Methods

Prepare to Be Unprepared: Investing in Capacity to Adapt to Surprises in Software-Reliant Businesses

Incidents are often perceived as extraordinary aberrations, unconnected to "normal" work. For over twenty years, the field of Resilience Engineering has aimed at flipping this approach around — by understanding what makes incidents so rare (relative to when and how they do not happen) and so minor (relative to how much worse they can be) and deliberately enhancing what makes that possible.

John Allspaw
on Aug 12, 2024
Culture & Methods

Generative AI and Organizational Resilience

Generative AI will profoundly transform communication and information sharing over the next decade, but the change will be uneven across industries and roles. Organizations should empower workers to use AI augmentation thoughtfully, while building literacy on capabilities and limits. A balanced, conscientious integration, using iterations and customer feedback, will produce the best outcomes.

Alex Cruikshank
on Jan 23, 2024
DevOps

Orchestrating Resilience: Building Modern Asynchronous Systems

This article discusses the problems we had to solve at Twilio to build a resilient, scalable asynchronous system for a complex workflow, and the advantages we gained from adopting a workflow orchestration solution, including abstracting away state management and out-of-the-box support for retries, observability, and auditability.

Sai Pragna Etikyala, Vikranth Etikyala
on Jan 12, 2024
Culture & Methods

The Incident Lifecycle: How a Culture of Resilience Can Help You Accomplish Your Goals

Don’t get stuck with overwhelmed systems that can cause an outage, like what happened with Taylor Swift concert tickets. Build organizational resilience to incidents through improved coordination and communication during the response, and blameless reviews, root cause analysis, and insightful communication afterward to enable meaningful change.

Vanessa Huerta Granda
on Jan 11, 2024
Culture & Methods

Write More, Talk Less: Building Organizational Resilience through Documentation and InnerSource

Better documentation and knowledge sharing create transparency that aids onboarding, prevents turnover disruption, and withstands reorganizations. Different practices can help, such as communicating asynchronously, creating incentives for documentation, making docs discoverable, understanding team members' preferences, and providing dedicated writing time. And maybe InnerSource can help too.

David Grizzanti
on Dec 20, 2023
