InfoQ Homepage Reliability Content on InfoQ
-
Stripe Rearchitects Its Observability Platform with Managed Prometheus and Grafana on AWS
Stripe replaced its observability platform, which used a third-party vendor solution, with a new architecture utilizing managed services on AWS. The company made the move due to scalability limits, reliability issues, and increasing costs while transitioning to microservices. The migration involved dual-writing metrics, translating assets, validation, and user training.
-
Netflix Rolls Out Service-Level Prioritized Load Shedding to Improve Resiliency
Netflix extended its prioritized load-shedding implementation to the individual service level to further improve system resilience. The approach uses cloud capacity more efficiently by shedding low-priority requests only when necessary instead of maintaining separate clusters for failure isolation.
-
Netflix’s Pushy: Evolution of Scalable WebSocket Platform That Handles 100Ms Concurrent Connections
Netflix shared details on the evolution of Pushy, a WebSocket messaging platform that supports push notifications and inter-device communication across many different devices for the company’s products. Netflix’s engineers implemented many improvements across the Pushy ecosystem to ensure the platform's scalability and reliability and support new capabilities.
-
How Google Does Chaos Testing to Improve Spanner's Reliability
To ensure their Spanner database keeps working reliably, Google engineers use chaos testing to inject faults into production-like instances and stress the system's ability to behave in a correct way in the face of unexpected failures.
-
QCon London: Scaling Microservices Architecture and Technology Organization at Trainline
During the recent QCon London conference, Trainline’s CTO spoke about the evolution of the company’s system architecture and organizational structure over the last five years. The company had to adapt to market changes and growing customer expectations by improving the performance and reliability of its technology platform.
-
Decathlon Adopts Backend for Frontend (BFF) Pattern to Empower FE Teams
Decathlon established the Backend For Frontend (BFF) architectural pattern as a company-wide recommendation and provided guidelines for its adoption among engineering teams. The four-part series introduces the pattern and explores its benefits and potential pitfalls. The company also shares available alternatives to using the BFF pattern and reviews architectural considerations.
-
Erlang-Runtime Statically-Typed Functional Language Gleam Reaches 1.0
Gleam, an actor-based highly-concurrent functional language running on the Erlang virtual machine (BEAM), has reached version 1.0, which means it is now ready to be used in production systems with a guarantee of backward compatibility based on semantic versioning.
-
Uber Builds Scalable Chat Using Microservices with GraphQL Subscriptions and Kafka
Uber replaced a legacy architecture built using the WAMP protocol with a new solution that takes advantage of GraphQL subscriptions. The main drivers for creating a new architecture were challenges around reliability, scalability, observability/debugibility, as well as technical debt impeding the team’s ability to maintain the existing solution.
-
Grab Improves Kafka on Kubernetes Fault Tolerance with Strimzi, AWS AddOns and EBS
Grab updated its Kafka on Kubernetes setup to improve fault tolerance and completely eliminate human intervention in case of unexpected Kafka broker terminations. To address the shortcomings of the initial design, the team integrated with AWS Node Termination Handler (NTH), used the Load Balancer Controller for target group mapping, and switched to ELB volumes for storage.
-
Uber Improves Resiliency of Microservices with Adaptive Load Shedding
Uber created a new load-shedding library for its microservice platform, serving over 130 million customers and handling aggregated peaks of millions of requests per second (RPSs). The company replaced the solution based on QALM with Cinnamon library, which, in addition to graceful degradation, can dynamically and continuously adjust the capacity of the service and the amount of load shedding.
-
Zonal Autoshift on AWS: Optimizing Infrastructure Reliability
Zonal autoshift, a new capability of Amazon Route 53 Application Recovery Controller, automatically shifts traffic away from an Availability Zone (AZ) when a potential failure is identified by the cloud provider. The service redirects the traffic back once the AZ failure is resolved.
-
Microsoft Refreshes its Well-Architected Framework
Microsoft recently announced a comprehensive refresh of the Well-Architected Framework (WAF) for designing and running optimized workloads on Azure.
-
AWS Restructures and Consolidates Its Well-Architected Framework
AWS published a new set of updates to its Well-Architected Framework, with changes across all six pillars of the framework. The performance efficiency and operational excellence pillars have been restructured and consolidated to reduce the number of best practices. Other pillars received improved implementation guidance, including recommendations and steps on reusable architecture patterns.
-
Google Delivers Comprehensive Cloud Infrastructure Reliability Guide
Google recently delivered a cloud infrastructure reliability guide combining best practices and expertise from its engineers for its customers.
-
Azure Cosmos DB: Low Latency and High Availability at Planet Scale
Mei-Chin Sei and Vinod Sridharan spoke at QCon San Francisco on Azure Cosmos DB: Low Latency and High Availability at Planet Scale. The talk was part of the "Architectures You've Always Wondered About" track.