Resilience Content on InfoQ
-
Disaster Recovery Across a Million Pieces: Michelle Brush at QCon San Francisco
During the second day of QCon San Francisco 2023, Michelle Brush, an engineering director for SRE at Google, discussed challenges, patterns, and practices for disaster recovery in massively distributed systems. Her session was part of the "Designing for Resilience" track.
-
6 Tracks Not to Miss at QCon San Francisco, October 2-6, 2023: ML, Architecture, Resilience & More!
At InfoQ’s international software development conference, QCon San Francisco (October 2-6, 2023), senior software practitioners driving innovation and change in software development will explore real-world architectures, technology, and techniques to help you solve your own engineering challenges.
-
Microsoft Azure Cross-Region (Global) Load Balancer Now Generally Available
Microsoft recently announced the general availability (GA) of Azure cross-region (Global) Load Balancer in all Azure public and national cloud regions.
-
How LinkedIn Serves over 4.8 Million Member Profiles per Second
LinkedIn introduced Couchbase as a centralized caching tier to scale member profile reads after increasing traffic outgrew its existing database cluster. The new solution achieved a hit rate of over 99%, helped reduce tail latencies by more than 60%, and cut costs by 10% annually.
-
How Resilience Can Help to Get Better at Resolving Incidents
Applying resilience throughout the incident lifecycle by taking a holistic look at the sociotechnical system can help to turn incidents into learning opportunities. Resilience can help folks get better at resolving incidents and improve collaboration. It can also give organizations time to realize their plans.
-
Using Code Instrumentation for Fault Injection at the Application Level at eBay
eBay engineers have been using fault injection techniques to improve the reliability of the notification platform and explore its weaknesses. While fault injection is a common industry practice, eBay attempted a novel approach that uses code instrumentation to bring fault injection to the application level.
-
Resilience4j 2.0.0 Delivers Support for JDK 17
Resilience4j, a lightweight fault tolerance library designed for functional programming, has released version 2.0 featuring support for Java 17 and dependency upgrades to Kotlin, Spring Boot and Micronaut. This new version also removes the dependency on Vavr, a functional library for Java, in order to become a more lightweight library.
-
Java News Roundup: Major Spring Releases, Resilience4j, Open Liberty, GlassFish, Kotlin 1.8-Beta
This week's Java roundup for November 21st, 2022, features news from JDK 20, major, point and patch releases for Spring (namely Boot, Web Services, Security, Batch, Authorization Server, REST Docs, Framework, Modulith, GraphQL, Apache Kafka and RabbitMQ), Open Liberty 22.0.0.12, GlassFish 7.0-M10, GraalVM Native Build Tools 0.9.18, Resilience4j 2.0, Apache Tomcat 8.5.84 and Kotlin 1.8-Beta.
-
Filibuster: Automated Fault Injection Tool to Improve DoorDash's Reliability
DoorDash recently revealed how they are using Filibuster, an automated fault injection tool, to identify resilience issues in microservice applications early on and improve platform reliability.
-
Dropbox Unplugs Data Center to Test Resilience
Dropbox has published a detailed account of why and how they unplugged an entire data center to test their disaster readiness. The disaster readiness team began building tools to make frequent failovers possible, and ran their first formalized failover in 2019. Eventually, with new tooling and procedures in place, the data center was unplugged. The exercise significantly reduced their recovery time objective (RTO).
-
Building Resiliency into the Twitter Ad Pacing Service
Twitter’s ad pacing algorithms were initially part of an ad-serving monolith. Later, Twitter’s engineers extracted them into a separate service to make development easier. Because the service is business-critical, it needs to be very reliable. A recently published article describes how the team built a reliable service by making economical design choices for handling different failure scenarios.
-
Netflix’s RENO Keeps Experience Consistent across Devices
Netflix has developed the Rapid Event Notification System (RENO) to create a consistent user experience across various platforms and devices. RENO reacts more quickly and consistently than the traditional request/response model to user-generated actions ranging from watching a title to changing profile information.
-
Failsafe 3.2 Released with New Resilience Policies
Failsafe, a lightweight fault tolerance library for Java 8+, launched the major 3.0 release in November 2021. More recently, Failsafe announced the availability of version 3.2 which introduced new Rate Limiter and Bulkhead policies. Failsafe also integrates with asynchronous code like Java’s CompletableFuture.
-
AWS US-EAST-1 Outage: Postmortem and Lessons Learned
On December 7th, 2021, AWS experienced an hours-long outage that affected many services in its most popular region, Northern Virginia. The cloud provider released an analysis of the incident, which sparked discussions in the community about redundancy on AWS and multi-region approaches.
-
Amazon Introduces AWS Resilience Hub to Monitor and Improve RPO and RTO
Amazon recently announced the availability of AWS Resilience Hub, a service designed to help customers define, measure, and manage the resilience of their applications on the cloud.