Resilience Content on InfoQ
-
How LinkedIn Serves over 4.8 Million Member Profiles per Second
LinkedIn introduced Couchbase as a centralized caching tier to scale member profile reads and handle increasing traffic that had outgrown its existing database cluster. The new solution achieved a hit rate of over 99% and helped reduce tail latencies by more than 60% and annual costs by 10%.
-
How Resilience Can Help to Get Better at Resolving Incidents
Applying resilience throughout the incident lifecycle by taking a holistic look at the sociotechnical system can help to turn incidents into learning opportunities. Resilience can help folks get better at resolving incidents and improve collaboration. It can also give organizations time to realize their plans.
-
Using Code Instrumentation for Fault Injection at the Application Level at eBay
eBay engineers have been using fault injection techniques to improve the reliability of the notification platform and explore its weaknesses. While fault injection is a common industry practice, eBay attempted a novel approach that leverages code instrumentation to bring fault injection to the application level.
-
Resilience4j 2.0.0 Delivers Support for JDK 17
Resilience4j, a lightweight fault tolerance library designed for functional programming, has released version 2.0 featuring support for Java 17 and dependency upgrades to Kotlin, Spring Boot and Micronaut. This new version also removes the dependency on Vavr, a functional library for Java, in order to become a more lightweight library.
-
Java News Roundup: Major Spring Releases, Resilience4j, Open Liberty, GlassFish, Kotlin 1.8-Beta
This week's Java roundup for November 21st, 2022, features news from JDK 20, major, point and patch releases for Spring (namely Boot, Web Services, Security, Batch, Authorization Server, REST Docs, Framework, Modulith, GraphQL, Apache Kafka and RabbitMQ), Open Liberty 22.0.0.12, GlassFish 7.0-M10, GraalVM Native Build Tools 0.9.18, Resilience4j 2.0, Apache Tomcat 8.5.84 and Kotlin 1.8-Beta.
-
Filibuster: Automated Fault Injection Tool to Improve DoorDash's Reliability
DoorDash recently revealed how they are using Filibuster, an automated fault injection tool, to identify resilience issues in microservice applications early on and improve platform reliability.
-
Dropbox Unplugs Data Center to Test Resilience
Dropbox has published a detailed account of why and how they unplugged an entire data center to test their disaster readiness. The disaster readiness team began building tools to make frequent failovers possible, and ran their first formalized failover in 2019. Eventually, with new tooling and procedures in place, the data center was unplugged. The effort resulted in a significantly reduced recovery time objective (RTO).
-
Building Resiliency into the Twitter Ad Pacing Service
Twitter’s ad pacing algorithms were initially part of an ad-serving monolith. Twitter’s engineers later extracted them into a separate service to make them easier to develop. Because the service is important, it needs to be very reliable. A recently published article describes how the team built a reliable service by making economical design choices for handling different failure scenarios.
-
Netflix’s RENO Keeps Experience Consistent across Devices
Netflix has developed the Rapid Event Notification System (RENO) to create a consistent user experience across various platforms and devices. RENO reacts more quickly and consistently than the traditional request/response model to user-generated actions ranging from watching a title to changing profile information.
-
Failsafe 3.2 Released with New Resilience Policies
Failsafe, a lightweight fault tolerance library for Java 8+, launched the major 3.0 release in November 2021. More recently, Failsafe announced the availability of version 3.2 which introduced new Rate Limiter and Bulkhead policies. Failsafe also integrates with asynchronous code like Java’s CompletableFuture.
-
AWS US-EAST-1 Outage: Postmortem and Lessons Learned
On December 7th AWS experienced an hours-long outage that affected many services in its most popular region, Northern Virginia. The cloud provider released an analysis of the incident that started threads in the community about redundancy on AWS and multi-region approaches.
-
Amazon Introduces AWS Resilience Hub to Monitor and Improve RPO and RTO
Amazon recently announced the availability of AWS Resilience Hub, a service designed to help customers define, measure, and manage the resilience of their applications on the cloud.
-
Real-Time Exactly-Once Event Processing at Uber with Apache Flink, Kafka, and Pinot
Uber faced some challenges after introducing ads on UberEats. The events generated by these ads had to be processed quickly, reliably and accurately. These requirements were fulfilled by a system based on Apache Flink, Kafka, and Pinot that can process streams of ad events in real time with exactly-once semantics. An article describing its architecture was published recently on the Uber Engineering blog.
-
Microsoft Announces Azure Chaos Studio in Public Preview
At its recent Ignite conference, Microsoft announced the public preview of Azure Chaos Studio, a fully managed experimentation service that helps customers track, measure, and mitigate faults through controlled chaos engineering to improve the resilience of their cloud applications.
-
Litmus 2.0 Release Includes Multi-Tenancy, Chaos Workflows, GitOps, and Observability
Last month, Litmus 2.0 was released for general availability, with the goal of simplifying chaos engineering by adding new features like chaos center, chaos workflows, GitOps for chaos, multi-tenancy, observability, and private chaos hubs. InfoQ interviewed Umasankar Mukkara, CEO of ChaosNative and co-creator and maintainer of the Litmus engineering platform.