The ClickHouse project team created an in-house replacement for Apache ZooKeeper because it needed a more efficient implementation that would also address some of ZooKeeper's shortcomings. ClickHouse Keeper is now an essential part of the ClickHouse project and a cornerstone of this open-source analytical database, but it can also be used independently for many distributed coordination use cases.
ClickHouse, an open-source real-time analytical database system using column-oriented storage, required a coordination mechanism for its control plane to manage metadata and synchronize distributed operations. The initial architecture of ClickHouse leveraged Apache ZooKeeper for that purpose, as it was a well-established solution offering a powerful API and reasonable performance. However, after using ZooKeeper for a while, the team observed that the subsystem was not very efficient and also lacked key capabilities such as linearizable reads, to name just a few deficiencies.
Tom Schreiber, senior product marketing engineer, and Derek Chia, technical support engineer, share the reasons for writing a ZooKeeper-compatible system in C++:
ZooKeeper, being a Java ecosystem project, did not fit into our primarily C++ codebase very elegantly, and as we used it at a higher and higher scale, we started running into resource usage and operational challenges. In order to overcome these shortcomings of ZooKeeper, we built ClickHouse Keeper from scratch, taking into account additional requirements and goals our project needed to address.
Initially, ClickHouse Keeper was implemented as an embedded service inside ClickHouse in February 2021, but later that year, a standalone mode was introduced. Keeper was marked as production-ready in April 2022 and has since been used at scale in ClickHouse Cloud, the SaaS product running managed ClickHouse clusters.
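In standalone mode, Keeper runs as its own process configured in ClickHouse's XML format. The fragment below is a minimal single-node sketch based on the documented `keeper_server` settings; the ports, paths, and hostname are illustrative values, not recommendations:

```xml
<clickhouse>
    <keeper_server>
        <!-- Port clients connect to using the ZooKeeper wire protocol -->
        <tcp_port>9181</tcp_port>
        <!-- Unique id of this Keeper server within the cluster -->
        <server_id>1</server_id>
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
        <raft_configuration>
            <!-- Single-node quorum; add more <server> entries for a real cluster -->
            <server>
                <id>1</id>
                <hostname>localhost</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
```

Because Keeper speaks the ZooKeeper wire protocol, existing ZooKeeper clients can connect to the `tcp_port` above unchanged.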
The Architecture of ClickHouse with Keeper Used for Coordination (Source: ClickHouse Blog)
Keeper is used extensively in ClickHouse whenever strong consistency is required between database cluster nodes, most notably supporting storage metadata, replication, backups, access control, task scheduling, and a highly consistent key-value store used for data ingestion from Apache Kafka and AWS S3.
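These use cases all build on the hierarchical key-value data model that Keeper inherits from ZooKeeper: values ("znodes") addressed by slash-separated paths, with optional sequential nodes that the server numbers atomically. The following is a toy in-memory sketch of that data model to make the concepts concrete; it is not Keeper's implementation, and all names are illustrative:

```python
# Toy model of the ZooKeeper-style data model Keeper exposes: a tree of
# "znodes" addressed by slash-separated paths, with optional sequential
# nodes numbered by the server. Illustrative only.

class ZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}   # path -> value
        self.seq = {}             # path prefix -> next sequence number

    def create(self, path, value=b"", sequential=False):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        if sequential:
            n = self.seq.get(path, 0)
            self.seq[path] = n + 1
            path = f"{path}{n:010d}"  # ZooKeeper-style 10-digit suffix
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = value
        return path

    def get(self, path):
        return self.nodes[path]

    def children(self, path):
        prefix = "/" if path == "/" else path + "/"
        out = []
        for p in self.nodes:
            rest = p[len(prefix):]
            if p.startswith(prefix) and rest and "/" not in rest:
                out.append(rest)
        return sorted(out)

tree = ZNodeTree()
tree.create("/tasks")
print(tree.create("/tasks/task-", sequential=True))  # /tasks/task-0000000000
print(tree.create("/tasks/task-", sequential=True))  # /tasks/task-0000000001
```

Sequential nodes like these are the building block behind distributed queues and task scheduling, one of the Keeper use cases listed above.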
Keeper relies on the Raft consensus algorithm, unlike ZooKeeper, which uses ZAB. Despite ZAB being a more established option (in development since at least 2008), the team chose Raft because of its relative simplicity and the ease of integrating an existing C++ implementation.
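Part of Raft's appeal is how simple its core rules are. For example, leader election reduces to two constraints: each node grants at most one vote per term, and a candidate wins only with a strict majority. The toy sketch below illustrates just that voting rule; it is not Keeper's Raft implementation:

```python
# Toy illustration of Raft's leader-election rule: one vote per node per
# term, and a candidate needs a strict majority to win. Not a real
# implementation -- no timeouts, logs, or networking.

def run_election(term, nodes):
    """Each node grants its vote if it has not yet voted in this term."""
    votes = 0
    for node in nodes:
        if node["voted_term"] < term:
            node["voted_term"] = term
            votes += 1
    majority = len(nodes) // 2 + 1
    return votes >= majority

nodes = [{"voted_term": 0} for _ in range(5)]
print(run_election(term=1, nodes=nodes))  # True: all 5 votes available
print(run_election(term=1, nodes=nodes))  # False: votes for term 1 are spent
print(run_election(term=2, nodes=nodes))  # True: a new term resets voting
```

The one-vote-per-term rule is what guarantees at most one leader can be elected in any given term, since two candidates cannot both collect a majority.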
The team is planning to invest effort in bringing in multi-group Raft, where multiple Raft instances are used instead of a single one to improve scalability. This approach requires partitioning the data and addressing several challenges to avoid wasting compute and network resources and to ensure the correctness of transaction processing in all cases.
Parallel Execution of Transactions Using Multi-Group Raft (Source: ClickHouse Blog)
The team behind Keeper created a performance benchmark suite that they use to compare Keeper's performance against Apache ZooKeeper. Comparing recent versions of both solutions on identical infrastructure, they observed that Keeper uses approximately 46 times less main memory than ZooKeeper while processing the same number of requests.