Booking.com's engineering team scaled their Graphite deployment from a small cluster to one that handles millions of metrics per second. Along the way, they modified and optimized Graphite's core components: carbon-relay, carbon-cache, and the rendering API.
Booking.com tracks both technical and business metrics in Graphite. They started using Graphite in 2012 on a RAID 1+0 setup, the same configuration they used for their databases, but it did not scale well for Graphite. They then sharded requests to distribute traffic across storage nodes. This proved hard to maintain, however, and they switched to SSDs in a RAID 1 configuration.
The default carbon-relay, written in Python, ran into CPU bottlenecks and became a single point of failure. The team rewrote it in C and changed the deployment model so that each monitored host ran its own relay. Each host-level relay sent metrics to endpoints in multiple datacenters, which were backed by bigger relays, and buffered data locally when there was a failure. To get around the uneven balancing of metrics across servers, they implemented a consistent hashing algorithm. Even so, adding new storage nodes remained difficult, data was synced between datacenters with shell scripts, and SSDs had to be replaced regularly (each lasted 12-18 months) because of the frequent writes (updates) made to disk. At one point the team considered HBase, Riak and Cassandra as storage backends, but it is unclear whether they pursued those efforts. Other engineering teams have successfully used Cassandra as a scalable Graphite backend.
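To illustrate the consistent hashing idea mentioned above, here is a minimal sketch in Python. It is not Booking.com's relay code (theirs was written in C, and its details are not described here); the `HashRing` class, the use of MD5, the number of virtual nodes, and the `store-*` node names are all assumptions chosen for illustration. The point it shows is that each metric name maps to a position on a ring, and adding a storage node only reassigns the metrics that fall just before the new node's positions.

```python
import bisect
import hashlib


class HashRing:
    """Illustrative consistent-hash ring with virtual nodes (hypothetical class)."""

    def __init__(self, nodes, replicas=100):
        # replicas = number of virtual points per node; 100 is an arbitrary choice.
        self.replicas = replicas
        self._ring = []  # sorted list of (position, node) tuples
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(key):
        # Map a string to a 32-bit position on the ring.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest[:8], 16)

    def add_node(self, node):
        # Insert the node's virtual points, keeping the ring sorted.
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._position(f"{node}:{i}"), node))

    def get_node(self, metric):
        # Walk clockwise from the metric's position to the next virtual point.
        if not self._ring:
            raise ValueError("ring is empty")
        idx = bisect.bisect(self._ring, (self._position(metric), ""))
        if idx == len(self._ring):
            idx = 0  # wrap around to the start of the ring
        return self._ring[idx][1]


# Hypothetical storage node names and metric name.
ring = HashRing(["store-a", "store-b", "store-c"])
print(ring.get_node("sys.web01.cpu.user"))
```

With plain modulo hashing, adding a fourth node would remap most metrics. With a ring like this one, a metric either stays where it was or moves to the new node, which limits how much data has to be migrated.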
One of the team's early optimizations was carbonzipper, which proxies requests from the web front end to a cluster of storage backends, according to Vladimir Smirnov, former System Administrator at Booking.com. Ultimately, the team had to replace the standard carbon-cache and their rewritten relay with Go-based implementations. The go-carbon project is the current implementation of the Graphite/carbon stack in Go.
Booking.com has a distributed "Event Graphite Processor" that generates Graphite metrics from event streams, processing more than 500k events per second. Event streams are generated across the stack and stored in Riak. "Booking.com heavily uses graphs at every layer of its technical stack. A big portion of graphs are generated from events", say Damien Krotkine, senior software engineer, and Ivan Paponov, senior software developer, at Booking.com. Apart from events, various collectors and scripts collect metrics from Booking.com's systems. The team started with collectd and later switched to Diamond, a Python-based system metrics collector. Did they standardize on metric naming conventions? To an extent: they started by reserving sys.* (for system metrics) and user.* (for testing etc.), and left it to developers to choose all other metric names.
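As a rough illustration of turning events into metrics, the sketch below counts events by type and renders them in Graphite's plaintext format (`<metric path> <value> <timestamp>`). It is not the Event Graphite Processor itself, whose internals are not described here. The event shape (a dict with a `type` key), the function names, and the `events.web` prefix are hypothetical; the only details taken from the article are the reserved `sys.*` and `user.*` namespaces.

```python
import time
from collections import Counter

# Namespaces the article says were reserved; everything else was left to developers.
RESERVED_PREFIXES = ("sys.", "user.")


def is_reserved(metric):
    """Return True if the metric name falls in a reserved namespace."""
    return metric.startswith(RESERVED_PREFIXES)


def events_to_metrics(events, prefix, timestamp=None):
    """Count events per type and render Graphite plaintext lines.

    The event format (dicts with a "type" key) is an assumption for this sketch.
    """
    if is_reserved(prefix + "."):
        raise ValueError(f"prefix {prefix!r} is in a reserved namespace")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    counts = Counter(event["type"] for event in events)
    return [
        f"{prefix}.{name}.count {count} {ts}"
        for name, count in sorted(counts.items())
    ]


# Hypothetical batch of events collected over one interval.
batch = [{"type": "booking"}, {"type": "search"}, {"type": "booking"}]
for line in events_to_metrics(batch, "events.web", timestamp=1700000000):
    print(line)
```

Each resulting line could then be sent to a relay over a TCP connection, one metric per line.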
Apart from capacity planning and troubleshooting, Booking.com uses Graphite to "correlate business trends with system, network and application level trends". They use Grafana as the front-end for visualization.