InfoQ Homepage Resilience Content on InfoQ
-
Netflix Rolls Out Service-Level Prioritized Load Shedding to Improve Resiliency
Netflix extended its prioritized load-shedding implementation to the individual service level to further improve system resilience. The approach uses cloud capacity more efficiently by shedding low-priority requests only when necessary instead of maintaining separate clusters for failure isolation.
-
Microsoft's Customer Managed Planned Failover Type for Azure Storage Available in Public Preview
Microsoft’s new customer-managed planned failover for Azure Storage enhances disaster recovery by enabling geo-redundancy without data loss or reconfiguration. This proactive solution supports business continuity during outages and large-scale disasters, aligning with competitive offerings from AWS and Google Cloud.
-
Google Cloud Enhances Spanner with Dual-Region Configuration
Google Cloud has introduced a significant update to its fully-managed distributed SQL database service, Spanner, which now offers a dual-region configuration option. The company aims with this enhancement to assist enterprises in complying with data residency norms across countries with limited cloud support while ensuring high availability.
-
Modern Data Architecture, ML, and Resilience Topics Announced for QCon San Francisco 2024
QCon San Francisco returns November 18-22, focusing on innovations and emerging trends you should pay attention to in 2024. With technical talks from international software practitioners, QCon will provide actionable insights and skills you can take back to your teams.
-
QCon London: Scaling Microservices Architecture and Technology Organization at Trainline
During the recent QCon London conference, Trainline’s CTO spoke about the evolution of the company’s system architecture and organizational structure over the last five years. The company had to adapt to market changes and growing customer expectations by improving the performance and reliability of its technology platform.
-
InfoQ & QCon Events: Level up on Generative AI, Security, Platform Engineering, and More Upcoming
As we navigate through these transformative times, the upcoming InfoQ events stand as a platform to help you stay ahead, learn valuable insights, and find practical solutions to your development challenges in 2024 and beyond. The events are carefully curated for senior software engineers, architects, and team leaders, offering practitioner insights into emerging trends, patterns, and practices.
-
Uber Improves Resiliency of Microservices with Adaptive Load Shedding
Uber created a new load-shedding library for its microservice platform, serving over 130 million customers and handling aggregated peaks of millions of requests per second (RPSs). The company replaced the solution based on QALM with Cinnamon library, which, in addition to graceful degradation, can dynamically and continuously adjust the capacity of the service and the amount of load shedding.
-
Zonal Autoshift on AWS: Optimizing Infrastructure Reliability
Zonal autoshift, a new capability of Amazon Route 53 Application Recovery Controller, automatically shifts traffic away from an Availability Zone (AZ) when a potential failure is identified by the cloud provider. The service redirects the traffic back once the AZ failure is resolved.
-
Slack Migrates to Cell-Based Architecture on AWS to Mitigate Gray Failures
Slack migrated most of the critical user-facing services from a monolithic to a cell-based architecture over the last 1.5 years. The move was triggered by the impact of networking outages affecting a single availability zone, causing user-impacting service degradation. The new architecture allows incrementally draining all the traffic away from the affected availability zone within 5 minutes.
-
Roblox Builds New Cellular Infrastructure to Improve Gaming Experience
The online game platform and creation system Roblox has detailed how they have made their infrastructure more efficient and resilient, to support the demands of more than 70 million active daily users engaged in immersive 3D experiences.
-
Chaos Engineering Service Azure Chaos Studio Now Generally Available
Two years after entering public preview, reliability experimentation service Azure Chaos Studio is now generally available. Among its most recent features are experiment templates, dynamic targets, load testing faults, and more.
-
Polly v8 .NET Resilience Library: Resilience Pipelines, Built-in Telemetry, and More
Polly v8 is officially released. This version brings enhancements such as resilience pipelines, built-in telemetry support, and some changes within the configuration for individual resilience strategies.
-
Monzo Employs Targeted Traffic Shedding against Stampeding Herd Effect from the Mobile App
Monzo developed a solution for shedding traffic in case its platform comes under intense and unexpected load that could lead to an outage. Traffic spikes can be generated by the mobile app and triggered by push notifications or other bursts in user activity. The solution can reduce the read traffic by almost 50% with 90% overall accuracy without noticeable customer impact.
-
How Amazon Prime Video Delivers 99.999% Availability While Reducing Costs
Amazon Prime Video created a highly available live video streaming architecture by combining redundant components to achieve the five-nines of availability that they require for their platform. The company optimized the deployment topology and video encoding to reduce costs while ensuring optimal video quality for users.
-
QCon San Francisco 2023 Day 3: Architecting the Cloud, Deep Tech, Frontend Trends, Org Resilience
The 17th annual QCon San Francisco conference was held at the Hyatt Regency San Francisco in San Francisco, California. This five-day event, organized by C4Media, consists of three days of presentations and two days of workshops. Day Three, scheduled on October 4th, 2023, included a keynote address by Will Larson and presentations from four conference tracks and one sponsored track.