Matt Bostock’s SREcon 2017 Europe talk covers how CloudFlare, a CDN, DNS and DDoS mitigation provider, uses Prometheus, a metrics-based monitoring tool, to monitor its globally distributed infrastructure and network.
The Prometheus metrics-based open source monitoring project has been around since 2012 and is hosted by the Cloud Native Computing Foundation (CNCF). Prometheus’s dynamic configuration and its query language, PromQL, let users write complex queries in alerts. CloudFlare provides a content delivery network (CDN), distributed DNS and DDoS mitigation services, which means its infrastructure is spread across the globe. Monitoring such an infrastructure and its network is complex, and the talk describes the role Prometheus plays in this. At CloudFlare, Prometheus has replaced 87% of what Nagios previously did.
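As an illustration of the kind of PromQL expression an alert might be built on — the talk does not show CloudFlare’s actual rules, and the server address and metric names below are hypothetical — here is a minimal sketch that evaluates an error-ratio query using the official Go client’s HTTP API package:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical Prometheus server address.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example.com:9090"})
	if err != nil {
		log.Fatal(err)
	}

	// A PromQL expression of the kind an alerting rule might use:
	// the ratio of 5xx responses to all responses over five minutes.
	query := `sum(rate(http_requests_total{status=~"5.."}[5m]))
	            / sum(rate(http_requests_total[5m])) > 0.01`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := v1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}
	fmt.Println(result) // non-empty only while the error ratio exceeds 1%
}
```

In a real deployment the same expression would live in a rule file evaluated by the Prometheus server itself rather than in client code.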
CloudFlare provides CDN-like services using Anycast: Anycast DNS serves DNS queries from the server nearest to the user, and Anycast HTTP likewise serves content from the nearest server. Acting as an intermediary between the origin website and the user, CloudFlare also checks the visitor’s traffic for threat patterns. It has 116 datacenters across the globe and handles 5 million HTTP requests and 1.2 million DNS requests per second, which add up to 10% of global internet requests. Each point-of-presence (PoP) provides HTTP, DNS, attack mitigation and a key-value store. To monitor all of this, 188 Prometheus servers were in production at the time of the talk.
Image Courtesy - https://promcon.io/2017-munich/talks/monitoring-cloudflares-planet-scale-edge-network-with-prometheus/
Prometheus is metrics-based: it collects time-series metrics, and the rest of its features are built on top of them. It works on a pull model, where each monitored server runs a process called an exporter that exposes its collected metrics over HTTP. CloudFlare deploys one exporter per service domain and uses exporters that collect system (CPU, memory, TCP, disk), network (HTTP, ping), log-match (error messages) and container/namespace metrics; cAdvisor, an open source project from Google, is used for the last of these. Prometheus does not retain data forever, since it focuses on here-and-now monitoring; in CloudFlare’s setup, data is retained for 15 days with no downsampling.
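To make the pull model concrete, here is a minimal exporter sketch using the Go client library — not one of CloudFlare’s actual exporters, and the metric name is hypothetical:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A hypothetical counter; real exporters collect the system, network,
// log-match and container metrics listed above.
var requestsServed = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myservice_requests_served_total",
		Help: "Total requests served, by status.",
	},
	[]string{"status"},
)

func main() {
	prometheus.MustRegister(requestsServed)
	requestsServed.WithLabelValues("ok").Inc()

	// Prometheus pulls from this endpoint on its configured scrape interval.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```

Prometheus is then configured to scrape the server’s :9100/metrics endpoint like any other target.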
The services in a CloudFlare core datacenter include log access, analytics and APIs, built with a stack consisting of Marathon, Mesos, Chronos, Docker, Sentry, Ceph (for storage), Kafka, Spark, Elasticsearch and Kibana. In each PoP, Prometheus queries the servers/services for metrics via the exporters. High availability per PoP is achieved by running multiple Prometheus servers.
Prometheus’s alert manager is called, simply, Alertmanager. CloudFlare’s deployment has a single Alertmanager to which the individual Prometheus servers push alerts; high availability for this setup was in the works at the time of the talk. Alerts are tested against past data to ensure that they behave correctly, a feature also found in newer monitoring tools like Bosun. Beyond that, a good alert has a descriptive name, is simple, and carries enough information for the recipient to take immediate action.
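The push from Prometheus to Alertmanager happens over a plain HTTP API. A minimal sketch of that interaction — with a hypothetical Alertmanager address and alert, using the v1 alerts endpoint — looks like this:

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// alert mirrors the JSON body accepted by Alertmanager's alerts endpoint.
type alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
}

func main() {
	alerts := []alert{{
		// A descriptive name and actionable annotations, per the talk's advice.
		Labels: map[string]string{
			"alertname": "HighHTTPErrorRate",
			"severity":  "page",
		},
		Annotations: map[string]string{
			"summary":     "5xx ratio has been above 1% for 5 minutes",
			"runbook_url": "https://wiki.example.com/runbooks/http-errors", // hypothetical
		},
		StartsAt: time.Now(),
	}}

	body, err := json.Marshal(alerts)
	if err != nil {
		log.Fatal(err)
	}
	// Hypothetical Alertmanager address.
	resp, err := http.Post("http://alertmanager.example.com:9093/api/v1/alerts",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("Alertmanager responded with", resp.Status)
}
```

In normal operation Prometheus sends these requests itself for every firing alert; posting by hand like this is mostly useful for testing receivers.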
The CloudFlare team uses jiralerts to integrate the JIRA ticketing system with Alertmanager. JIRA allows for customizable workflows, so monitoring alerts can carry custom states specific to the monitoring workflow. Another tool, alertmanager2es, receives alerts and writes them to an Elasticsearch index for later search and analysis. CloudFlare has also built its own dashboard for Alertmanager, called unsee.
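Integrations like alertmanager2es work by registering as a webhook receiver in Alertmanager’s configuration. A minimal sketch of such a receiver — decoding only a subset of the documented webhook payload, with the Elasticsearch indexing step left as a stub — might look like:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// webhookMessage captures a subset of Alertmanager's webhook payload.
type webhookMessage struct {
	Version string `json:"version"`
	Status  string `json:"status"`
	Alerts  []struct {
		Labels      map[string]string `json:"labels"`
		Annotations map[string]string `json:"annotations"`
	} `json:"alerts"`
}

func main() {
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var msg webhookMessage
		if err := json.NewDecoder(r.Body).Decode(&msg); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		for _, a := range msg.Alerts {
			// A real tool would index each alert into Elasticsearch here.
			log.Printf("received %s alert: %v", msg.Status, a.Labels)
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```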
How is Prometheus itself monitored? There are two approaches. One is a mesh-like approach, where each Prometheus monitors the other Prometheus servers in the same datacenter. The other is a top-down approach, where top-level Prometheus servers monitor datacenter-level Prometheus servers.
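In practice, meta-monitoring usually just means each Prometheus scrapes its peers like any other target, and an alert fires when a peer’s up metric drops to 0. The same idea as a standalone check — a sketch with hypothetical peer addresses, probing the /-/healthy endpoint that recent Prometheus versions expose — is:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Hypothetical peers in the same datacenter.
	peers := []string{
		"http://prometheus-1.pop.example.com:9090",
		"http://prometheus-2.pop.example.com:9090",
	}
	client := &http.Client{Timeout: 5 * time.Second}
	for _, p := range peers {
		resp, err := client.Get(p + "/-/healthy")
		if err != nil {
			log.Printf("peer %s unreachable: %v", p, err)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			log.Printf("peer %s unhealthy: %s", p, resp.Status)
			continue
		}
		log.Printf("peer %s healthy", p)
	}
}
```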
Among the CloudFlare SRE team’s learnings are standardizing labels and identifying groups, such as environments and clusters, early on. Others relate to creating visibility and generating buy-in from peers and stakeholders: engaging teams early helps integrate their services with the monitoring system faster. The alerts themselves need multiple iterations to tune and improve, and that is an ongoing process.
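One way label standardization shows up at the instrumentation level — a sketch with hypothetical label names and values; in a server-side setup the same labels would typically be attached via configuration instead — is baking the agreed-upon labels into every metric:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Standardized labels agreed on early keep queries and alerts portable
// across teams; the names and values here are hypothetical.
var standardLabels = prometheus.Labels{
	"environment": "production",
	"cluster":     "pop-ams01",
}

var queueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
	Name:        "myservice_queue_depth",
	Help:        "Current depth of the work queue.",
	ConstLabels: standardLabels,
})

func main() {
	prometheus.MustRegister(queueDepth)
	queueDepth.Set(42)
}
```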