Chaos Engineering Content on InfoQ
-
Chaos and Resilience Engineering: Mental Models, Tools and Experiments
In a recent InfoQ podcast, Nora Jones, co-founder and CEO at Jeli, explored the differences between chaos engineering and resilience engineering, and offered advice on planning and running effective chaos experiments and on learning from incidents.
-
Improving Incident Management through Role Assignments and Game Days
John Arundel, principal consultant at Bitfield Consulting, shared his thoughts on how to ensure incidents are handled smoothly and quickly. He suggests assigning specific roles to each team member responding to the incident. Red team versus blue team exercises can also be leveraged to ensure the team is prepared to respond accurately and quickly.
-
Exploring Costs of Coordination During Outages with Laura Maguire at QCon London
Laura Maguire spoke at QCon London about the high cognitive cost of coordinating during outages. Maguire found that coordination during anomaly response is difficult, that existing models can undermine speedy resolution, and that the strategies to control the cost of coordination are adaptive to the type of incident. Moreover, tooling brings additional costs of coordination.
-
Failure Modes and Building Resilient Systems: Adrian Cockcroft at QCon SF
Adrian Cockcroft recently shared his thoughts on how to produce resilient systems that operate successfully in spite of failures. At the recent QCon San Francisco event, he also shared what he considers to be good cloud resilience patterns for building with a continuous resilience mindset.
-
Gremlin Releases Native Kubernetes Chaos Testing
Chaos engineering platform Gremlin released native Kubernetes support for identifying, targeting, and experimenting on Kubernetes objects in order to proactively identify service weaknesses.
-
How to Integrate Infosec and DevOps Using Chaos Engineering
Kelly Shortridge from Capsule8 talked at the Velocity conference in Berlin about how chaos engineering can help to integrate Infosec within a DevOps culture. Shortridge discussed how distributed, immutable, and ephemeral infrastructure, or the D.I.E. model, is an organizationally friendly way to build security by design. With this model, users can continuously raise the cost of the attack
-
Gremlin Introduces Scenarios, Enabling Real-World Chaos Experiments
The Gremlin team announced the addition of Scenarios, which allow for the simulation of real-world outages. Scenarios allow for planning and tracking complex chaos experiments that more closely mimic real-world outages. The release includes prepared Scenarios that can be run out of the box or used as a starting template to build custom incidents.
-
How Did Things Go Right? Learning More from Incidents at Netflix: Ryan Kitchens at QCon New York
At QCon New York, Ryan Kitchens presented “How Did Things Go Right? Learning More from Incidents”. Key takeaways from the talk included: recovery is better than prevention; an incident occurs when there is a “perfect storm” of events -- there is no root cause; “stop reporting on the nines”, as user happiness is more important; and there is value in learning how things go right.
-
Solo.io Announces Service Mesh Hub and Chaos Engineering Tool
Solo.io, a cloud native software company, launched the industry's first service mesh hub. The hub provides resources to help users adopt service mesh technology in hybrid and multi-cloud environments and features tools such as Istio, Linkerd, Envoy, AWS App Mesh, and HashiCorp Consul.
-
Summary of Chaos Community Day v4.0: Resilience, Observability, and Gamedays
Earlier in the year, the fourth edition of “Chaos Community Day” was held at Work-Bench in New York City. Key takeaways from the day included: the topic of chaos engineering draws heavily from other domains, which software engineers can also learn from; understanding systems, and communicating and exchanging the related mental models, is vital for establishing resilience.
-
Chaos Engineering Kubernetes with the Litmus Framework
Litmus is an open source chaos engineering framework for Kubernetes environments running stateful applications. Created by MayaData, Litmus enables users to run test suites, capture logs, generate reports, and perform chaos experiments.
-
QCon NY (Jun 24-28): New Talks, a Focus on the Skills That Matter & Why You Should Join Us This Year
In the recent Stack Overflow 9th annual survey of over 90,000 software developers, we learned that non-development work remains a productivity challenge for software managers and leaders. At QCon New York, the conference for senior software developers, we have many sessions to help you learn how others have overcome those challenges.
-
Mature Microservices and How to Operate Them: QCon London Q&A
Microservices is an architectural approach to keep systems decoupled for releasing many changes a day, said Sarah Wells in her keynote at QCon London 2019. To build resilient and maintainable systems you need things like load balancing across healthy nodes, backoff and retry, and persistence or fanning out of requests via queues. The best way to know whether your system is resilient is to test it.
-
Amplifying Sources of Resilience: John Allspaw at QCon London
At QCon London, John Allspaw presented “Amplifying Sources of Resilience: What Research Says”. Key takeaways from the talk included: resilience is something a system does, not what a system has; creating and sustaining “adaptive capacity” within an organisation is resilient action; and learning about how people cope with surprise is the path to finding sources of resilience.
-
Gremlin Announces Free Tier for Their Chaos Experimentation Platform
Gremlin has announced “Gremlin Free”, which provides the ability to run chaos engineering experiments on a free tier of their failure-as-a-service SaaS platform. The current version of the free tier allows the execution of shutdown and CPU attacks on hosts or containers, which can be controlled via a simple web-based user interface, API or CLI.