Reliability Content on InfoQ

News

Architecture & Design

Stripe Rearchitects Its Observability Platform with Managed Prometheus and Grafana on AWS

Stripe replaced its observability platform, which used a third-party vendor solution, with a new architecture utilizing managed services on AWS. The company made the move due to scalability limits, reliability issues, and increasing costs while transitioning to microservices. The migration involved dual-writing metrics, translating assets, validation, and user training.

Rafal Gancarz
on Nov 27, 2024
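
The dual-writing step mentioned in the summary can be pictured with a minimal sketch. This is not Stripe's implementation: `InMemoryBackend`, `DualWriter`, and `mismatched_metrics` are hypothetical names invented for illustration, and the in-memory backends stand in for the old vendor platform and the new managed stack.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryBackend:
    """Hypothetical stand-in for a metrics backend (old vendor or new stack)."""
    name: str
    samples: dict = field(default_factory=dict)

    def write(self, metric: str, value: float) -> None:
        self.samples.setdefault(metric, []).append(value)


class DualWriter:
    """Send every sample to the primary backend and mirror it to the shadow one."""

    def __init__(self, primary, shadow):
        self.primary = primary
        self.shadow = shadow
        self.shadow_errors = 0

    def write(self, metric: str, value: float) -> None:
        # The primary write must succeed; its errors propagate to the caller.
        self.primary.write(metric, value)
        try:
            self.shadow.write(metric, value)
        except Exception:
            # Assumption: a failing shadow backend must never break producers,
            # so the failure is only counted.
            self.shadow_errors += 1


def mismatched_metrics(primary, shadow) -> list:
    """Validation step: names of metrics whose samples differ between backends."""
    names = set(primary.samples) | set(shadow.samples)
    return sorted(n for n in names if primary.samples.get(n) != shadow.samples.get(n))
```

Running both backends side by side like this lets a team compare the data before switching dashboards and alerts over to the new system.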
Architecture & Design

Netflix Rolls Out Service-Level Prioritized Load Shedding to Improve Resiliency

Netflix extended its prioritized load-shedding implementation to the individual service level to further improve system resilience. The approach uses cloud capacity more efficiently by shedding low-priority requests only when necessary instead of maintaining separate clusters for failure isolation.

Rafal Gancarz
on Nov 23, 2024
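
As a rough illustration of prioritized load shedding, the sketch below rejects lower-priority requests first as utilization rises and never sheds critical traffic. It is not Netflix's implementation: the `Priority` levels, the `SHED_ABOVE` thresholds, and the `should_shed` function are all assumptions made for this example.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical request priority levels; lower value means more important."""
    CRITICAL = 0
    DEGRADED = 1
    BEST_EFFORT = 2
    BULK = 3


# Assumed utilization thresholds (0.0 to 1.0) above which each level is shed.
# CRITICAL has no entry, so it is never shed by this policy.
SHED_ABOVE = {
    Priority.BULK: 0.60,
    Priority.BEST_EFFORT: 0.75,
    Priority.DEGRADED: 0.90,
}


def should_shed(priority: Priority, utilization: float) -> bool:
    """Return True if a request of this priority should be rejected right now."""
    threshold = SHED_ABOVE.get(priority)
    return threshold is not None and utilization > threshold
```

For example, at 70% utilization `should_shed(Priority.BULK, 0.70)` is true while `should_shed(Priority.BEST_EFFORT, 0.70)` is false, so a single cluster keeps serving important traffic without a separate cluster for isolation.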
Architecture & Design

Netflix’s Pushy: Evolution of Scalable WebSocket Platform That Handles 100Ms Concurrent Connections

Netflix shared details on the evolution of Pushy, a WebSocket messaging platform that supports push notifications and inter-device communication across many different devices for the company’s products. Netflix’s engineers implemented many improvements across the Pushy ecosystem to ensure the platform's scalability and reliability and support new capabilities.

Rafal Gancarz
on Sep 23, 2024
DevOps

How Google Does Chaos Testing to Improve Spanner's Reliability

To keep its Spanner database reliable, Google engineers use chaos testing: they inject faults into production-like instances and check that the system still behaves correctly in the face of unexpected failures.

Sergio De Simone
on May 21, 2024
Architecture & Design

QCon London: Scaling Microservices Architecture and Technology Organization at Trainline

During the recent QCon London conference, Trainline’s CTO spoke about the evolution of the company’s system architecture and organizational structure over the last five years. The company had to adapt to market changes and growing customer expectations by improving the performance and reliability of its technology platform.

Rafal Gancarz
on Apr 17, 2024
Architecture & Design

Decathlon Adopts Backend for Frontend (BFF) Pattern to Empower FE Teams

Decathlon established the Backend For Frontend (BFF) architectural pattern as a company-wide recommendation and provided guidelines for its adoption among engineering teams. In a four-part series, the company introduces the pattern, explores its benefits and potential pitfalls, presents alternatives to BFF, and reviews architectural considerations.

Rafal Gancarz
on Mar 25, 2024
Development

Erlang-Runtime Statically-Typed Functional Language Gleam Reaches 1.0

Gleam, an actor-based highly-concurrent functional language running on the Erlang virtual machine (BEAM), has reached version 1.0, which means it is now ready to be used in production systems with a guarantee of backward compatibility based on semantic versioning.

Sergio De Simone
on Mar 16, 2024
Architecture & Design

Uber Builds Scalable Chat Using Microservices with GraphQL Subscriptions and Kafka

Uber replaced a legacy architecture built using the WAMP protocol with a new solution that takes advantage of GraphQL subscriptions. The main drivers for creating a new architecture were challenges around reliability, scalability, and observability/debuggability, as well as technical debt impeding the team’s ability to maintain the existing solution.

Rafal Gancarz
on Mar 07, 2024
Architecture & Design

Grab Improves Kafka on Kubernetes Fault Tolerance with Strimzi, AWS AddOns and EBS

Grab updated its Kafka on Kubernetes setup to improve fault tolerance and completely eliminate human intervention in case of unexpected Kafka broker terminations. To address the shortcomings of the initial design, the team integrated with AWS Node Termination Handler (NTH), used the Load Balancer Controller for target group mapping, and switched to EBS volumes for storage.

Rafal Gancarz
on Feb 21, 2024
Architecture & Design

Uber Improves Resiliency of Microservices with Adaptive Load Shedding

Uber created a new load-shedding library for its microservice platform, which serves over 130 million customers and handles aggregated peaks of millions of requests per second (RPS). The company replaced the QALM-based solution with the Cinnamon library, which, in addition to graceful degradation, can dynamically and continuously adjust the capacity of the service and the amount of load shedding.

Rafal Gancarz
on Feb 06, 2024
Cloud

Zonal Autoshift on AWS: Optimizing Infrastructure Reliability

Zonal autoshift, a new capability of Amazon Route 53 Application Recovery Controller, automatically shifts traffic away from an Availability Zone (AZ) when a potential failure is identified by the cloud provider. The service redirects the traffic back once the AZ failure is resolved.

Renato Losio
on Jan 30, 2024
Cloud

Microsoft Refreshes its Well-Architected Framework

Microsoft recently announced a comprehensive refresh of the Well-Architected Framework (WAF) for designing and running optimized workloads on Azure.

Steef-Jan Wiggers
on Nov 15, 2023
Architecture & Design

AWS Restructures and Consolidates Its Well-Architected Framework

AWS published a new set of updates to its Well-Architected Framework, with changes across all six pillars of the framework. The performance efficiency and operational excellence pillars have been restructured and consolidated to reduce the number of best practices. Other pillars received improved implementation guidance, including recommendations and steps on reusable architecture patterns.

Rafal Gancarz
on Nov 08, 2023
Cloud

Google Delivers Comprehensive Cloud Infrastructure Reliability Guide

Google recently delivered a cloud infrastructure reliability guide combining best practices and expertise from its engineers for its customers.

Steef-Jan Wiggers
on Jan 24, 2023
Cloud

Azure Cosmos DB: Low Latency and High Availability at Planet Scale

Mei-Chin Sei and Vinod Sridharan presented this talk at QCon San Francisco as part of the "Architectures You've Always Wondered About" track.

Steef-Jan Wiggers
on Oct 30, 2022
