Failure Content on InfoQ
-
Risk and Failure on the Path to Staff Engineer
Caleb Hyde discusses their career progression and regressions, as well as the context they used to figure out what to work on and whom to work with, distilling a framework to utilize in one’s own work.
-
Deconstructing an Abstraction to Reconstruct an Outage
Chris Sinjakli explores the aftermath of a complex outage in a Postgres cluster, retracing the steps taken to reliably reproduce the failure in a local environment.
-
How Did It Make Sense at the Time? Understanding Incidents as They Occurred, Not as They are Remembered
Jacob Scott explores the basics of failure in complex systems, the theory and practice of how it made sense at the time, and actions to take.
-
Managing the Risk of Cascading Failure
Laura Nolan discusses some of the mechanisms that cause cascading failures, what can be done to reduce the risk, and what to do if there is a cascading failure situation.
-
Culturing Resiliency with Data: a Taxonomy of Outages
Ranjib Dey gives an overview of how outages at Uber over the past few years were categorized by root cause type.
-
Failing over without Falling over
Adrian Cockcroft shows how to use System Theoretic Process Analysis (STPA), as advocated by Professor Nancy Leveson’s team at MIT, to analyze failover hazards.
-
#FAIL
Kevlin Henney keynotes on some of the failures that people had in various projects and the lessons to be learned from them.
-
Rules in Agile Transformation: 80/20 and “Not Everybody Likes to Dance”
Zbigniew Piecuch discusses why some teams do not manage to master Agile.
-
What Breaks Our Systems: A Taxonomy of Black Swans
Laura Nolan talks about Black Swan events - unforeseen, unanticipated, and catastrophic incidents - that may happen in production and can take the system down.
-
How Did Things Go Right? Learning More from Incidents
Ryan Kitchens describes more rewarding ways to approach incident investigation without overly focusing on failure prevention.
-
How Condé Nast Succeeds by Building a Culture that Embraces Failure
Crystal Hirschorn talks about building a culture that embraces failure through Chaos Engineering practices, and what her teams have learned and adapted for their platforms at Condé Nast.
-
Building Resilient Serverless Systems
John Chapin explains how to use serverless technologies and an infrastructure-as-code approach to architect, build, and operate large-scale systems that are resilient to vendor failures.