Resilience Content on InfoQ
-
Josh Wills on Building Resilient Data Engineering and Machine Learning Products at Slack
Josh Wills, a software engineer working on data engineering problems at Slack, discusses the Slack data architecture and how his team builds and observes its pipelines.
-
Ryan Kitchens on Learning from Incidents at Netflix, the Role of SRE, and Sociotechnical Systems
In today’s podcast, we sit down with Ryan Kitchens, a senior site reliability engineer and member of the CORE team at Netflix. This team is responsible for the entire lifecycle of incident management at Netflix, from incident response to memorialising an issue.
-
Kolton Andrus on Gremlin’s Newly Announced SaaS Chaos Engineering Product and Running Game Days
Gremlin is a Software as a Service product, built by engineers with experience from Netflix, AWS, Dropbox and others, that lets you plan, control and undo chaos engineering experiments. In this podcast, Wes talks to Kolton Andrus about the Gremlin product and architecture, and related topics such as running Game Days.
-
Nora Jones on Establishing, Growing, and Maturing a Chaos Engineering Practice
Nora Jones, a senior software engineer on Netflix’s Chaos Team, talks with Wesley Reisz about what Chaos Engineering means today. She covers what it takes to build a practice and how to establish a strategy, defines the cost of impact, and discusses key technical considerations when leveraging chaos engineering.