Fault Tolerance Content on InfoQ
-
Designing Fault-Tolerant Software with Control System Transparency
Jon Moore discusses four principles from the architectural paper "GN&C [Guidance, Navigation, and Control] Fault Protection Fundamentals" by Robert D. Rasmussen for building fault-tolerant software.
-
How to Test Your Fault Isolation Boundaries in the Cloud
Jason Barto discusses fault isolation boundaries and ways to take advantage of fault isolation in AWS, demonstrating initial tests used to ensure a system has successfully isolated faults.
-
Fault Tolerance at Speed
Todd Montgomery discusses the techniques and lessons learned from implementing Aeron Cluster. His focus is on how Raft can be implemented on Aeron.
-
Drinking from the Elixir Fountain of Resilience
Jearvon Dharrie talks about the factors, beyond the Open Telecom Platform (OTP), that make Elixir a perfect match for fault tolerance and resiliency.
-
Orchestrating Chaos: Applying Database Research in the Wild
Peter Alvaro describes the theoretical roots of Lineage-driven Fault Injection (LDFI) in database research, presenting early results from the field and opportunities for near- and long-term future research.
-
Fault-Tolerant Sensor Nodes with Erlang/OTP and Arduino
Kenji Rikitake discusses using Erlang/OTP for IoT, covering communication protocols, design principles and overcoming hardware limitations for endpoint devices in fault-tolerant systems.
-
Monkeys in Lab Coats: Applying Failure Testing Research @Netflix
The authors present how lineage-driven fault injection evolved from a theoretical model into an automated failure testing system that leverages Netflix’s fault injection and tracing infrastructures.
-
Scaling Distributed Systems
Natalia Chechina outlines features of the actor and functional programming models, and the reasons these models attract so much interest in the world of parallel, concurrent, and scalable systems.
-
Distributed Eventually Consistent Computations
Christopher Meiklejohn looks at applying two techniques together, deterministic data flow programming and conflict-free replicated data types, to create highly available and fault-tolerant systems.
-
Distributed Scheduling with Apache Mesos in the Cloud
Diptanu Choudhury discusses the design of Netflix’s distributed scheduler based on Mesos and Titan, focusing on bin packing algorithms, scaling in and out of clusters, fault tolerance, and redundancy.
-
Thinking in a Highly Concurrent, Mostly-functional Language
Francesco Cesarini illustrates how the Erlang way of thinking about problems leads to scalable and fault-tolerant designs, describing three ways of clustering Erlang nodes within the server-side domain.
-
Tumblr - Bits to Gifs
John Bunting talks about different services Tumblr has built and how their architecture helps them be fault tolerant as they continue to grow.