Distributed Systems Content on InfoQ
-
ClickHouse Keeper: Efficient Apache ZooKeeper Alternative Created with C++ and Raft
The ClickHouse project team created an in-house replacement for Apache ZooKeeper because it needed a more efficient implementation that would also address some of ZooKeeper's shortcomings. ClickHouse Keeper is now an essential part of the ClickHouse project and a cornerstone of this open-source analytical database, but it can also be used independently for many distributed coordination use cases.
-
Microsoft Releases DeepSpeed-FastGen for High-Throughput Text Generation
Microsoft has announced the alpha release of DeepSpeed-FastGen, a system designed to improve the deployment and serving of large language models (LLMs). DeepSpeed-FastGen combines DeepSpeed-MII and DeepSpeed-Inference and is based on the Dynamic SplitFuse technique. The system currently supports several model architectures.
-
How DoorDash Rearchitected its Cache to Improve Scalability and Performance
DoorDash rearchitected the heterogeneous caching systems used across its microservices into a common, multi-layered cache that provides a generic mechanism and resolves a number of issues caused by fragmented caching.
-
Disaster Recovery Across a Million Pieces: Michelle Brush at QCon San Francisco
During the second day of QCon San Francisco 2023, Michelle Brush, an engineering director, SRE at Google, discussed challenges, patterns, and practices for disaster recovery in massively distributed systems. Her session was part of the "Designing for Resilience" track.
-
LinkedIn's Open-Source "iris-message-processor" Achieves 86.6x Faster Escalation Management Speeds
LinkedIn developed a new open-source service called "iris-message-processor" to enhance the performance and reliability of its existing Iris escalation management system. "iris-message-processor" significantly improves processing speeds, being ~4.6x faster under average loads and ~86.6x faster under high loads than its predecessor.
-
Pinterest Revamps Its Asynchronous Computing Platform with Kubernetes and Apache Helix
Pinterest created the next-generation asynchronous computing platform, Pacer, to replace the older solution, Pinlater, which the company outgrew, resulting in scalability and reliability challenges. The new architecture leverages Kubernetes for scheduling job-execution workers and Apache Helix for cluster management.
-
Cadence 1.0: Uber Releases Its Scalable Workflow Orchestration Platform
Uber released a major version of its workflow orchestration platform named Cadence after six years in development. Uber and other companies use Cadence to build stateful services at scale using native programming languages.
-
Apache Pulsar 3.0 Delivers a New LTS Version and Efficiency Improvements
The Apache Software Foundation has released version 3.0 of Apache Pulsar, the distributed messaging and streaming platform. Pulsar 3.0 introduces the Long-Term Support release and many performance and scalability improvements.
-
Preventing Serverless Vendor Lock-in with Design Patterns
Gregor Hohpe recently published an article proposing a paradigm shift to address vendor lock-in concerns on serverless cloud applications. Designing a solution using well-known patterns decouples its functional characteristics from the underlying cloud implementation, making it easier to avoid lock-in or to go multi-cloud.
-
A Distributed System is Knowable: an Impossible Thing for Developers
Failure in distributed systems is normal. A distributed system can provide only two of three guarantees: consistency, availability, and partition tolerance. According to Kevlin Henney, this limits how much you can know about how a distributed system will behave. He gave a keynote about Six Impossible Things at QCon London 2022 and at QCon Plus May 10-20, 2022.
-
Cloudflare D1 Provides Distributed SQLite for Cloudflare Workers
Soon to enter beta, D1 is Cloudflare's first step into the cloud-based SQL storage arena. D1 is built on top of SQLite and adds a distributed replication mechanism, batch operation support, embedded compute, automatic backups and redundancy, and more.
-
Dealing with Thundering Herd at Braintree
Braintree engineer Anthony Ross explained in a recent article how introducing random jitter into the retry intervals for failed tasks solved a thundering herd issue that was impacting the efficiency of their payment dispute management API.
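To illustrate the general idea of jittered retries, here is a minimal sketch in Python. It is not Braintree's implementation: the function name `backoff_with_jitter`, the parameters `base` and `cap`, and the choice of exponential backoff with uniform ("full") jitter are all assumptions made for this example.

```python
import random


def backoff_with_jitter(attempt, base=1.0, cap=60.0, rng=random):
    """Return a retry delay in seconds for a 0-based attempt number.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # The upper bound doubles with each attempt, but never exceeds `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Pick a uniformly random delay in [0, ceiling] so that tasks which
    # failed at the same moment do not all retry at the same moment.
    return rng.uniform(0, ceiling)


# Example: delays for the first five retries of one failed task.
delays = [backoff_with_jitter(n) for n in range(5)]
```

Without the random component, every task that failed together would retry on the same schedule and hit the service in synchronized waves; spreading the delays over an interval smooths that load.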
-
Managing Complex Dependencies with Distributed Architecture at eBay
The eBay engineering team recently outlined how they built a scalable release system. The solution uses a distributed architecture to release more than 3,000 dependent libraries in about two hours. The team performs the release with Jenkins in combination with Groovy scripts.
-
Microservice Calls’ Critical Path Analysis with Jaeger and Uber’s CRISP
Discovering which services need to be optimised to reduce end-to-end latency in a microservices-based system can be challenging because call graphs may be too complicated to read. Uber described an open-source tool called CRISP built to solve this problem by finding the critical paths in these graphs. These paths identify those operations whose optimisation benefits the overall system.
-
Dapr Joins CNCF Incubator: Q&A with Yaron Schneider
The Cloud Native Computing Foundation (CNCF) recently announced that it accepted the Distributed Application Runtime (Dapr) as a CNCF incubating project. The acceptance follows an earlier announcement by Dapr of the formation of the Dapr project's Steering and Technical Committee (STC).