One of the outcomes of Bloomberg's adoption of Site Reliability Engineering (SRE) practices across its development teams was the creation of a new monitoring system, backed by the Metrictank time-series database. The system added functions for deriving metric calculations, configurable retention periods, metadata queries, and improved scalability.
Bloomberg’s infrastructure is spread across 200 node sites in two self-operated datacenters, catering to around 325,000 customers, with a development team of 5,000 engineers. For a long time, developers were responsible for the production monitoring of the products they built and deployed. However, monitoring was often added as an afterthought. This resulted in a lack of standardization: multiple data collectors often measured the same thing, leading to duplicated effort.
There was also no global view of the systems. According to Stig Sorensen, head of telemetry at Bloomberg, the role of operations ranges across "everything from our commercial website to market data feeds, to our main product, the Bloomberg Professional Terminal which hundreds of thousands of the key influencers around the world rely on." A variety of tech stacks compounded the complexity.
Sorensen started leading the SRE initiative at Bloomberg in 2016. Along with pushing SRE principles and practices, his team aimed to build monitoring and alerting as a company-wide service. The first iteration was a homegrown StatsD agent with support for tags, which focused on getting the metrics out to the central systems as fast as possible; a sketch of such a tagged emission appears after the quote below. Once the metrics were collected, most of the validation, aggregation, rule processing, and persistence was done on machines behind a Kafka cluster. This system soon faced issues with scale, as Sean Hanson, software developer at Bloomberg, noted in his talk:
After these two years, we’re at two and a half million data points per second, 100 million time series. Some metrics have high cardinality, like 500,000. So our initial solution did scale fairly well for us. We were able to push that to 20 million data points a second sustained. But we couldn’t actually query anything out of it while it was doing that, and it still was really poor at handling high cardinality metrics, which was a pretty common use case.
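For context on that first iteration's ingestion path, the following is a minimal sketch of what a tagged StatsD-style emission can look like on the wire. Bloomberg's agent is homegrown, so the host, port, metric name, tags, and the DogStatsD-style "|#key:value" tag syntax here are assumptions for illustration only.

```python
import socket

# Minimal sketch of emitting a tagged StatsD-style counter over UDP.
# The host, port, metric name, and DogStatsD-style tag syntax are
# illustrative assumptions; Bloomberg's agent is homegrown.
STATSD_HOST, STATSD_PORT = "127.0.0.1", 8125

def emit_counter(name, value, tags):
    """Format a counter increment with comma-separated key:value tags and send it."""
    tag_str = ",".join(f"{k}:{v}" for k, v in tags.items())
    payload = f"{name}:{value}|c|#{tag_str}"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (STATSD_HOST, STATSD_PORT))

emit_counter("requests.served", 1, {"service": "ticker-plant", "datacenter": "ny"})
```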
The new system the team built had a new set of requirements: functions to derive metric calculations, configurable retention periods, metadata queries, and scalability. Metrictank, a multi-tenant time-series database backed by Cassandra that can serve as a storage backend for Graphite, met most of their requirements. Based on Facebook's Gorilla paper, it was orders of magnitude faster than their previous system for high-cardinality data. It paved the way for queries that spanned metrics from across the organization.
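Because Metrictank is typically queried through the Graphite render API, derived metric calculations can be expressed with standard Graphite functions at query time rather than being precomputed and stored. The sketch below is a hypothetical example: the endpoint URL and metric names are assumptions, while divideSeries, sumSeries, and the /render endpoint's JSON output are standard Graphite features.

```python
import requests

# Minimal sketch of a derived-metric query against a Graphite-compatible API
# (Metrictank is usually fronted by Graphite's /render endpoint).
# The URL and metric names are assumptions for illustration.
GRAPHITE_URL = "http://graphite.example.com/render"

params = {
    # Error rate derived at query time from two raw series,
    # instead of being stored as its own metric.
    "target": "divideSeries(sumSeries(app.*.errors), sumSeries(app.*.requests))",
    "from": "-1h",        # query window; retention is configurable server-side
    "format": "json",
}

resp = requests.get(GRAPHITE_URL, params=params, timeout=10)
resp.raise_for_status()
for series in resp.json():
    # Each series carries a target name and [value, timestamp] pairs.
    print(series["target"], series["datapoints"][-5:])
```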
The Bloomberg team optimized a few resource-intensive areas and contributed the code back to Metrictank. Other organizations have also used Cassandra as a backend to scale Graphite.
Along with the monitoring system, the adoption of SRE has been focused on standardizing the way that things are done. Sorensen elaborates:
We don’t actually have a centralized SRE team today. We rolled it out in a way where we aligned the SRE teams with the application teams. SRE teams are pulled from both app and core infra teams. It’s either people within operational or system admin background that’s sort of picked up programming and moving that way, or we have application engineers with a more active view towards systems and towards availability – because we see SREs as software engineers just doing something – building a different type of software.
With the adoption of a standardized monitoring system comes a parallel need to track progress, starting with how availability is measured. This is something the team is still working on, Sorensen says, because "measuring availability is not black and white. It’s not how many failures you had on a website, because if you are a certain market player and the real-time market data is delayed by a few – by one millisecond or hundreds of milliseconds, it could make a big difference for you."