Chaos Engineering Content on InfoQ

News

DevOps

An Open Source Chaos Engineering Library from AWS

AWS engineers recently wrote about AWSSSMChaosRunner, an open source chaos engineering tool that they used for fault injection testing in Prime Video. The tool is built on AWS Systems Manager, which can execute arbitrary commands on EC2 instances, and the team used it to mitigate latency-related issues.

Hrishikesh Barua
on Aug 30, 2020
DevOps

Gremlin Announces General Availability of Status Checks

Gremlin recently announced the general availability of Status Checks. This new feature automatically validates that systems are healthy and ready for running chaos experiments in production.

Aditya Kulkarni
on Aug 04, 2020
DevOps

Chaos and Resilience Engineering: Mental Models, Tools and Experiments

In a recent InfoQ podcast, Nora Jones, co-founder and CEO at Jeli, explored the differences between chaos engineering and resilience engineering, and provided advice for planning and running effective chaos experiments, and learning effectively from incidents.

Daniel Bryant
on Jul 09, 2020
DevOps

Improving Incident Management through Role Assignments and Game Days

John Arundel, principal consultant at Bitfield Consulting, shared his thoughts on how to ensure incidents are handled smoothly and quickly. He suggests assigning specific roles to each team member responding to the incident. Red team versus blue team exercises can also be leveraged to ensure the team is prepared to respond accurately and quickly.

Matt Campbell
on Mar 25, 2020
DevOps

Exploring Costs of Coordination During Outages with Laura Maguire at QCon London

Laura Maguire talked at QCon London about how coordination efforts during outages carry a high cognitive cost. Maguire found that coordination during anomaly response is difficult, that existing models can undermine speedy resolution, and that the strategies to control the cost of coordination are adaptive to the type of incident. Moreover, tooling has additional costs of coordination.

Christian Melendez
on Mar 13, 2020
DevOps

Failure Modes and Building Resilient Systems: Adrian Cockcroft at QCon SF

Adrian Cockcroft recently shared his thoughts on how to produce resilient systems that operate successfully in spite of the presence of failures. At the QCon San Francisco event, he also shared what he considers to be good cloud resilience patterns for building with a continuous resilience mindset.

Matt Campbell
on Dec 18, 2019
DevOps

Gremlin Releases Native Kubernetes Chaos Testing

Chaos engineering platform Gremlin released native Kubernetes support for identifying, targeting, and experimenting on Kubernetes objects in order to proactively identify service weaknesses.

K Jonas
on Dec 12, 2019
DevOps

How to Integrate Infosec and DevOps Using Chaos Engineering

Kelly Shortridge from Capsule8 talked at the Velocity conference in Berlin about how chaos engineering can help to integrate Infosec within a DevOps culture. Shortridge discussed how distributed, immutable, and ephemeral infrastructure, or the D.I.E. model, is an organizationally friendly way to build security by design. With this model, users can continuously raise the cost of the attack

Christian Melendez
on Nov 25, 2019
DevOps

Gremlin Introduces Scenarios, Enabling Real-World Chaos Experiments

The Gremlin team announced the addition of Scenarios, which allow for planning and tracking complex chaos experiments that more closely mimic real-world outages. The release includes prepared Scenarios that can be run out of the box or used as a starting template to build custom incidents.

Matt Campbell
on Sep 30, 2019
DevOps

How Did Things Go Right? Learning More from Incidents at Netflix: Ryan Kitchens at QCon New York

At QCon New York, Ryan Kitchens presented “How Did Things Go Right? Learning More from Incidents”. Key takeaways from the talk included: recovery is better than prevention; an incident occurs when there is a “perfect storm” of events -- there is no root cause; “stop reporting on the nines”, as user happiness is more important; and there is value in learning how things go right.

Daniel Bryant
on Jul 05, 2019
DevOps

Solo.io Announces Service Mesh Hub and Chaos Engineering Tool

Solo.io, a cloud native software company, launched what it describes as the industry's first service mesh hub. The hub provides resources to help users adopt service mesh technology in hybrid and multi-cloud environments and features tools such as Istio, Linkerd, Envoy, AWS App Mesh, and HashiCorp Consul.

K Jonas
on Jun 20, 2019
DevOps

Summary of Chaos Community Day v4.0: Resilience, Observability, and Gamedays

Earlier in the year, the fourth edition of “Chaos Community Day” was held at Work-Bench in New York City. Key takeaways from the day included: the topic of chaos engineering draws heavily from other domains, which software engineers can also learn from; understanding systems, and communicating and exchanging the related mental models, is vital for establishing resilience.

Daniel Bryant
on Jun 07, 2019
DevOps

Chaos Engineering Kubernetes with the Litmus Framework

Litmus is an open source chaos engineering framework for Kubernetes environments running stateful applications. Created by MayaData, Litmus enables users to run test suites, capture logs, generate reports, and perform chaos experiments.

K Jonas
on May 31, 2019
Development

QCon NY (Jun 24-28): New Talks, a Focus on the Skills That Matter & Why You Should Join Us This Year

In the recent Stack Overflow 9th annual survey of over 90,000 software developers, we learned that non-development work remains a productivity challenge for software managers and leaders. At QCon New York, the conference for senior software developers, we have many sessions to help you learn how others have overcome those challenges.

Diana Baciu
on May 02, 2019
Culture & Methods

Mature Microservices and How to Operate Them: QCon London Q&A

Microservices is an architectural approach that keeps systems decoupled so that many changes can be released each day, said Sarah Wells in her keynote at QCon London 2019. To build resilient and maintainable systems you need things like load balancing across healthy nodes, backoff and retry, and persistence or fanning out of requests via queues. The best way to know whether your system is resilient is to test it.

Ben Linders
on Apr 25, 2019
