Reliability Content on InfoQ
-
Adopting Continuous Deployment: Tom Wanielista at QCon San Francisco 2022
At QCon San Francisco 2022, Tom Wanielista, a staff engineer on infrastructure at Lyft, presented on adopting continuous deployment at his company. The talk was part of the editorial track "Architecting Change at Scale."
-
Filibuster: Automated Fault Injection Tool to Improve DoorDash's Reliability
DoorDash recently revealed how they are using Filibuster, an automated fault injection tool, to identify resilience issues in microservice applications early on and improve platform reliability.
-
Google Introduces Cloud Backup and Disaster Recovery
Google recently introduced Cloud Backup and Disaster Recovery (DR), allowing customers to enable centralized backup management directly from the Google Cloud console. The new backup and recovery service is designed to work with cloud storage repositories, databases, and applications.
-
Developing and Evolving SaaS Infrastructures for Enterprises
SaaS companies that are focused on the enterprise market need to evolve their infrastructure to meet the security, reliability, and other IT requirements of their customers. IT admins and large customers are two important sources of requirements to drive development.
-
Building Resiliency into the Twitter Ad Pacing Service
Twitter’s ad pacing algorithms were initially part of an ad-serving monolith. Twitter’s engineering team later extracted them into a separate service to make them easier to develop. Because the service is important, it needs to be very reliable. A recent article describes how the team built a reliable service by making economical design choices for handling different failure scenarios.
-
AWS Increases the Availability and Reliability of Amazon EventBridge with Global Endpoints
Recently, AWS introduced a new capability called global endpoints for its serverless event bus service Amazon EventBridge to improve availability and reliability.
-
Measuring the Environmental Impact of Software and Cloud Services
Software affects how long hardware stays in service and how much energy it consumes. It’s possible to measure the environmental impact of cloud services. The design of the software architecture determines how much hardware and electrical power is required. Software can be economical or wasteful with hardware resources.
-
Real-Time Exactly-Once Event Processing at Uber with Apache Flink, Kafka, and Pinot
Uber faced some challenges after introducing ads on UberEats. The events the ads generated had to be processed quickly, reliably, and accurately. These requirements were met by a system based on Apache Flink, Kafka, and Pinot that can process streams of ad events in real time with exactly-once semantics. An article describing its architecture was published recently on the Uber Engineering blog.
-
How GitHub Partitioned Its Relational Database to Improve Reliability at Scale
GitHub has been working for the last couple of years on partitioning their relational database and moving the data to multiple independent clusters. This effort led to a 50% load reduction and a significant reduction of database-related incidents, explains GitHub engineer Thomas Maurer.
-
Reviewing the Eight Fallacies of Distributed Computing
In a recent article on the Ably blog, Alex Diaconu reviewed the eight fallacies of distributed computing and offered a number of hints on how to handle them. InfoQ took the chance to talk with Diaconu to learn more about how Ably engineers deal with the fallacies.
-
Artificial Intelligence for IT Operations: an Overview
Artificial intelligence for IT operations (AIOps) combines sophisticated methods from deep learning, data stream processing, and domain knowledge to analyse infrastructure data from internal and external sources, automating operations and detecting anomalies (unusual system behavior) before they impact the quality of service.
-
Auth0's Move to a Single-Cloud Architecture on AWS
Auth0, a provider of authentication, authorization, and single sign-on services, moved its infrastructure from multiple cloud providers (AWS, Azure, and Google Cloud) to just AWS. An increasing dependency on AWS services necessitated the move, and today its systems are spread across four AWS regions, with services replicated across zones.
-
How DevOps Principles Are Being Applied to Networking
Practices from the DevOps world are being adopted for managing networking services. Vendor hardware, configuration tools, and deployment modes have made it easier to configure and automate network devices and functions programmatically.
-
Using Models in Developing Software for Self-Driving Cars
Models play an important role in developing software for autonomous systems like self-driving cars; they are used to simulate and verify behavior, document the system, and generate code. Jonathan Sprinkle explains how to model software used in autonomous systems, the benefits of modeling, how to use test data to validate the software that drives a car, and techniques for writing reliable code.
-
GitHub’s DGit Improves Reliability, Performance, and Availability
GitHub has been quietly rolling out DGit, short for “distributed Git”, a new distributed storage system built on top of Git with the aim of improving reliability, availability, and performance of using GitHub.