Resilience Content on InfoQ
AWS US-EAST-1 Outage: Postmortem and Lessons Learned
On December 7th, AWS experienced an hours-long outage that affected many services in its most popular region, Northern Virginia (US-EAST-1). The cloud provider released an analysis of the incident, which prompted community discussion about redundancy on AWS and multi-region approaches.
Amazon Introduces AWS Resilience Hub to Monitor and Improve RPO and RTO
Amazon recently announced the availability of AWS Resilience Hub, a service designed to help customers define, measure, and manage the resilience of their applications on the cloud.
Real-Time Exactly-Once Event Processing at Uber with Apache Flink, Kafka, and Pinot
Uber faced some challenges after introducing ads on UberEats. The events they generated had to be processed quickly, reliably, and accurately. These requirements were fulfilled by a system based on Apache Flink, Kafka, and Pinot that can process streams of ad events in real time with exactly-once semantics. An article describing its architecture was published recently on the Uber Engineering blog.
Microsoft Announces Azure Chaos Studio in Public Preview
At its recent Ignite conference, Microsoft announced the public preview of Azure Chaos Studio, a fully managed experimentation service that helps customers track, measure, and mitigate faults through controlled chaos engineering to improve the resilience of their cloud applications.
Litmus 2.0 Release Includes Multi-Tenancy, Chaos Workflows, GitOps, and Observability
Last month, Litmus 2.0 was released for general availability, with the goal of simplifying chaos engineering by adding new features such as chaos center, chaos workflows, GitOps for chaos, multi-tenancy, observability, and private chaos hubs. InfoQ interviewed Umasankar Mukkara, CEO of ChaosNative and co-creator and maintainer of the Litmus chaos engineering platform.
Why the Most Resilient Companies Want More Incidents
According to John Egan, incident management is meant to be a cycle that covers not just the response, but also accounting for the root cause and updating internal processes and practices across the industry. He advises lowering the barrier to reporting incidents, holding effective incident review meetings using blameless postmortems, and giving everyone access to postmortems.
Gremlin Adds Automated Service Discovery for Targeting Chaos Experiments
Gremlin, a chaos engineering platform, recently announced automated service discovery. This new feature automatically discovers services running within dynamic environments and makes them available as targets for chaos experiments. Gremlin has also added role-based access control for its API keys.
Cheryl Hung on Trends in Cloud Native and DevOps for 2021
In a recent keynote for The DEVOPS Conference, Cheryl Hung, VP of ecosystem for the Cloud Native Computing Foundation (CNCF), shared her top 10 predictions for cloud native in the upcoming year. These include improvements in cross-cloud support, growth in GitOps and chaos engineering practices, and an increase in the adoption of FinOps.
InfoQ Live March 16: Explore Ways of Reducing Uncertainty in Software Delivery
InfoQ Live, the one-day virtual event for software engineers and architects, returns on March 16th with a new edition, this time focusing on ways to reduce the uncertainty of your software development cycle.
Gremlin Aims to Reduce Kubernetes Noisy Neighbours through Chaos Engineering
Gremlin has released enhancements to its Chaos Engineering platform aimed at DevOps engineers interested in future-proofing Kubernetes clusters by isolating "noisy neighbours". On Kubernetes, the noisy neighbour issue occurs when multiple applications sharing a cluster compete for resources, leading to degraded performance.
Gremlin Releases State of Chaos Engineering 2021 Report
Gremlin released its State of Chaos Engineering 2021 report, based on a community survey and its own product data. The key findings include a positive correlation between running chaos engineering experiments and increased availability.
Uber Implements Disaster Recovery for Multi-Region Kafka
In a recent blog post, Uber engineers highlighted how they use a replication platform to implement disaster recovery at scale with a multi-region Kafka deployment. Uber has a large deployment of Apache Kafka, processing trillions of messages and multiple petabytes of data per day. The multi-region setup is meant to provide business resilience and continuity in the face of natural and human-made disasters.
AWS Announces Chaos Engineering as a Service Offering
AWS has announced the upcoming release of its chaos engineering as a service offering. The Fault Injection Service (FIS) will provide fully managed chaos experiments across a number of AWS services. The service includes pre-built templates that generate disruptions mimicking common real-world events. It can be integrated into CI pipelines via its API.
Chaos Engineering on Kubernetes: Chaos Mesh Generally Available with v1.0
The Chaos Mesh team announced the general availability (GA) of Chaos Mesh 1.0 after it was accepted as a CNCF sandbox project in July 2020. Chaos Mesh is a tool to perform chaos engineering experiments on Kubernetes applications.
Chaos Conf Q&A: Adrian Cockcroft & Yury Niño Roa
In preparation for ChaosConf 2020, InfoQ sat down with Adrian Cockcroft and Yury Niño Roa to explore topics of interest in the chaos engineering community. Key takeaways included: there are clear benefits to running “game days” to develop psychological safety, and the future of chaos engineering points toward incorporating security and scaling up experiments to test larger failure modes.