Chaos Engineering Content on InfoQ

News

DevOps

Chaos Engineering Observability: Q&A with Russ Miles

In a new O’Reilly report, “Chaos Engineering Observability: Bringing Chaos Experiments into System Observability”, the author, Russ Miles, explores why he believes the topics of observability and chaos engineering “go hand in hand”. He argues that as engineers begin to run chaos experiments, they will need to be able to ask many questions about the underlying system being experimented on.

Daniel Bryant
on Mar 04, 2019
DevOps

Building Production-Ready Applications: Michael Kehoe Shares Lessons Learned from LinkedIn

At QCon San Francisco, Michael Kehoe presented “Building Production-Ready Applications”. Drawing on his experience with site reliability engineering (SRE), he introduced the tenets of “production-readiness” that all engineers across the organisation should focus on: stability and reliability; scalability and performance; fault tolerance and disaster recovery; monitoring; and documentation.

Daniel Bryant
on Nov 12, 2018
DevOps

An Evolution of Chaos Experimentation: Kolton Andrus at ChaosConf 2018

At the inaugural ChaosConf, held in San Francisco, USA, Kolton Andrus presented an evolution of chaos experimentation over the past eight years. He argued that the human and organisational aspects of dealing with failure should not be ignored, and also suggested that tooling should support application- and request-level targeting of failure injection tests in order to minimise the blast radius.

Daniel Bryant
on Oct 08, 2018
DevOps

Gremlin Releases Application Level Fault Injection (ALFI) Platform for Targeted Chaos Experiments

Gremlin Inc has released its second product offering in the “Failure-as-a-Service” domain: Application-Level Fault Injection (ALFI). Building on the company's initial platform, which helped engineers create and run chaos experiments at the infrastructure level, ALFI enables failure injection at the application level via a native language library.

Daniel Bryant
on Oct 07, 2018
Architecture & Design

Russ Miles: Ignored Architects and Chaos Engineering

At the recent Event-Driven Microservices Conference in Amsterdam, Russ Miles claimed that the biggest challenge for an architect is that you get ignored. You have great ideas like event-driven microservices, yet the reaction too often is that it sounds good but is overly complicated for the needs at hand.

Jan Stenberg
on Sep 30, 2018
DevOps

Learning to Bend But Not Break at Netflix: Haley Tucker Discusses Chaos Engineering at QCon NY

At QCon New York, Haley Tucker presented “UNBREAKABLE: Learning to Bend But Not Break at Netflix” and discussed her experience with chaos engineering while working across a number of roles at Netflix. Key takeaways included: use functional sharding for fault isolation; continually tune RPC calls; run chaos experiments with small iterations; and apply the “principles of chaos”.

Daniel Bryant
on Jul 05, 2018
DevOps

Chaos Engineering at LinkedIn: The “LinkedOut” Failure Injection Testing Framework

The LinkedIn Engineering team has recently discussed their “LinkedOut” failure injection testing framework. Hypotheses about service resilience can be formulated and failure triggers injected via the LinkedIn LiX A/B testing framework or via data in a cookie that is passed through the call stack using the Invocation Context (IC) framework. Failure scenarios include errors, delays and timeouts.

Daniel Bryant
on Jun 24, 2018
DevOps

From Darwin to DevOps: John Willis and Gene Kim Talk about Life after The Phoenix Project

IT Revolution recently published an audiobook, “Beyond the Phoenix Project – the Origins and Evolution of DevOps”, containing nearly eight hours of conversation between Gene Kim and John Willis.

Helen Beal
on May 23, 2018
DevOps

Increasing the Resilience of APIs with Chaos Engineering

The Gremlin team has described a simple chaos experiment as a method of validating that an organisation’s APIs are resilient. Using the principles of chaos engineering and techniques like running “game days” (a fire drill for IT systems and people) can provide value, as can the appropriate use of commercial and open source tooling emerging within this space.

Daniel Bryant
on May 20, 2018
DevOps

What Resiliency Means at Sportradar

At this year's QCon London conference, Pablo Jensen, CTO at Sportradar, talked about the practices and procedures in place at Sportradar to ensure its systems meet expected resiliency levels. Jensen explained that reliability is influenced not only by technical concerns but also by organizational structure, governance and client support, and that it requires ongoing effort to continuously improve.

Manuel Pais
on Apr 06, 2018
DevOps

Why the World Needs More Resilient Systems: Tammy Butow Discusses Chaos Engineering at QCon London

At QCon London, Tammy Butow explained why the world needs more resilient systems, and how this can be achieved with the practice of chaos engineering. She outlined three primary prerequisites for chaos engineering (high-severity “SEV” incident management, monitoring, and measuring the impact) and presented a series of guidelines, tools and practices.

Daniel Bryant
on Mar 18, 2018
Cloud

Bloomberg Releases Open Source “PowerfulSeal” Kubernetes-Specific Chaos Testing Tool

At the recent KubeCon North America conference, Bloomberg presented their new open source “PowerfulSeal” tool, which enables chaos testing within Kubernetes clusters via the termination of targeted pods and underlying node infrastructure.

Daniel Bryant
on Jan 25, 2018
DevOps

Chaos Engineering at Twilio

The Twilio team describes their foray into Chaos Engineering, in which they use Gremlin to inject failures into the shards of their homegrown queuing system to test automated recovery.

Hrishikesh Barua
on Dec 25, 2017
Cloud

Werner Vogels on “21st Century [Cloud] Architectures”: Availability, Reliability and Resilience

At the AWS re:Invent 2017 conference, Werner Vogels, CTO of Amazon, presented a keynote that discussed core concepts required for building “21st Century Architectures” on the cloud. Highlights of the talk included discussion of the emerging practices of evolutionary and “cloud native” architectures, the role of security becoming everyone’s responsibility, and the benefits of chaos engineering.

Daniel Bryant
on Dec 03, 2017
DevOps

Expedia's Journey toward Site Resiliency: Embracing Chaos Testing in Dev and Production at QCon SF

At QCon SF, Sahar Samiei and Willie Wheeler presented “Expedia’s Journey Toward Site Resiliency”, and discussed the building of a community of practice around resilience testing within Expedia. The results have generally been positive: Netflix’s Chaos Monkey has been running daily in production since May 15th, and resilience tests have been added to four Tier 1 service pipelines.

Daniel Bryant
on Nov 19, 2017
