The Teads Engineering team evaluated BigGraphite - Graphite with a Cassandra backend - as an alternative to vanilla Graphite, but had to abandon it due to performance issues. They fell back on a custom architecture with the standard WhisperDB, but using go-graphite - a Go implementation of Graphite components. Using AWS's load balancers, custom tooling and spreading nodes across availability zones (AZs), they were able to achieve a highly available setup. InfoQ reached out to Vincent Miszczak, cloud infrastructure engineer at Teads.tv, to learn more about the migration and its challenges.
Teads' first physical Graphite server was a single point of failure and ran into hardware limits as the number of metrics increased. The team considered options like Graphite clusters, Prometheus and BigGraphite. Scaling Graphite using its standard Whisper storage is hard. The Teads team did not choose Prometheus to avoid breaking compatibility with existing customers, as the metrics ingestion formats are different. BigGraphite is an open source project by Criteo, which replaces Whisper with Cassandra as a backend for storing metrics. Teads already ran Cassandra clusters and was familiar with its operational aspects. However, they ran into performance issues in indexing. They "were not able to scale the indexing workload properly and the data workload would cost a lot of computing resources", and abandoned BigGraphite.
Moving back to Whisper, the team designed a multi-relay-node architecture, with an AWS Network Load Balancer for writes and Application Load Balancer for reads. Nodes used ephemeral storage, as EBS was not sufficient. The other major decision they made was to adopt the go-graphite stack, a Go implementation of Graphite, to get around the default Python-based Graphite's process management and performance issues. Booking.com had originally open sourced some parts of go-graphite.
Image used with explicit permission.
Miszczak notes that their current solution, as well as BigGraphite, are possibly less efficient than a couple of other solutions available today - ClickHouse and MetricTank:
They are three main issues with both solutions. They write points individually, meaning the storage performance required very high. For Whisper, one point means at least one IO, for BG/Cassandra, one row per point, even if it uses batching. The solution is to use batching compressed/encoded chunks. The second is they do not shard by time. If you query for metrics that have a given pattern, it won't restrict the search to the time range you are using. Sharding by time solves this. The third issue is that they do not compress/encode.
Miszczak says that their "Graphite stack is spread over three AZs and everything is active at all times. The writes are spread across two (AZs) and reads are dispatched to all servers." They implemented a replication algorithm to ensure each datapoint is replicated across AZs. However, node failures and maintenance downtimes can still lead to loss of data. The team wrote shell script wrappers around a tool called Carbonatet to mitigate this. Miszczak elaborates on how they handle node failures:
We tolerate a single node failure at anytime. When this happens, we manually handle it - we use Terraform, but we must start it manually. We replace the node with a new one at the same position in the servers list. When bootstrapped, the new node takes incoming writes. We start a script on it (using Carbonate) to gather data it should hold. As a reminder, a data destined for one node is replicated to the next. Because of this replication logic, the data the new node must hold is the combination of the previous neighbor data that belongs to that node and the next neighbor data that does not belong to that node (it's the replication data from the previous node = the node we are replacing).
Stateful backend systems - Cassandra, Kafka and Graphite - at Teads are monitored with Datadog. CloudWatch monitors cloud components like load balancers and RDS. Graphite metrics are used for all stateless systems (web services) and business monitoring. Miszczak adds that they have alerting on Graphite for some business metrics, but have decided not to promote it as a first class citizen as they already use two others services.