Reliability Content on InfoQ
-
How We Created a High-Scale Notification System at Duolingo
Vitor Pellegrino and Zhen Zhou discuss how they built and tested Duolingo's high-scale on-demand notification system, including what it takes to manage resources and site reliability concurrently.
-
How Netflix Ensures Highly-Reliable Online Stateful Systems
Joseph Lynch discusses the architecture of Netflix's stateful caches and databases, including how they capacity plan, bulkhead, and deploy software to their global, full-active, data topology.
-
Reliable Architectures through Observability
Kent Quirk gives an overview of observability tools and techniques, with specific recommendations for fitting observability into system designs and day-to-day development processes.
-
How to Build a Reliable Kafka Data Processing Pipeline, Focusing on Contention, Uptime and Latency
Lily Mara shares how OneSignal improved the performance and maintainability of its highest-throughput HTTP endpoints (backed by a Kafka consumer in Rust) by making them asynchronous.
-
Architecting a Production Development Environment for Reliability
At Meta, developers use servers (devservers) to perform their daily work. This talk discusses their software architecture and the mechanisms employed to ensure they remain reliable and available.
-
Building Reliability One Step at a Time
Ana Margarita Medina shares how she has been using Chaos Engineering and how it can be used to decouple our system’s weak points, learn from incidents and improve monitoring and observability.
-
Less Mess, Less Stress: the Reliability Benefits of Custom Tools
Daniel Hochman discusses how an overreliance on vendor tooling leads to worse reliability outcomes, how Lyft lowered MTTR for its most common alerts using custom tooling, and how Clutch can help.
-
InfoQ Live Roundtable: Production Readiness: Building Resilient Systems
The panelists discuss observability, security, the software supply chain, CI/CD, chaos engineering, deployment techniques, canaries, and blue-green deployments, all in pursuit of production resiliency.
-
Chaos Engineering: the Path to Reliability
Kolton Andrus shares examples of what works, what doesn’t, and what the future holds in using Chaos Engineering to build reliability in a system.
-
Reliability Matters More Than Ever
Tammy Butow discusses why reliability and resilience matter now more than ever, and how one can achieve them.
-
High Performance Cooperative Distributed Systems in Adtech
Stan Rosenberg explores a set of core building blocks exhibited by Adtech platforms and applies them towards building a fraud detection platform.
-
PID Loops and the Art of Keeping Systems Stable
Colm MacCárthaigh shows what PID loops look like in the context of modern systems, and how exponential backoff, flow-control, and other techniques can be wielded to build self-healing systems.