Twitter Open Sources Its Telemetry Tool Rezolus for Detection of Short-Lived Anomalies

The Twitter engineering team has open sourced Rezolus, a telemetry tool that can detect anomalies in system performance metrics by sampling them at a higher rate than typical tools. Rezolus collects samples at a high, configurable rate, builds histograms from them, and exports the resulting percentile data to the backend time series storage. Brief anomalies can then be detected "in the tail portion of the percentiles."
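To make the idea concrete, here is a minimal sketch of summarizing one window of high-rate samples into percentiles. It is not Rezolus code: the function names, the chosen percentiles, and the CPU-utilization numbers are all assumptions made for illustration, and it sorts the raw samples exactly instead of using a histogram as Rezolus does.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, for 0 < p <= 100."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def summarize_window(samples, percentiles=(50, 90, 99, 100)):
    """Reduce one window of raw samples to a few percentile values.

    Hypothetical helper: only these values would be exported, not the
    raw samples themselves.
    """
    ordered = sorted(samples)
    return {f"p{p}": percentile(ordered, p) for p in percentiles}

# Assumed data: one minute of a CPU-utilization gauge sampled at 10 Hz,
# i.e. 600 readings at 20%, with a 2-second burst (20 readings) at 95%.
window = [20.0] * 600
for i in range(300, 320):
    window[i] = 95.0

summary = summarize_window(window)
average = sum(window) / len(window)

print(summary)   # {'p50': 20.0, 'p90': 20.0, 'p99': 95.0, 'p100': 95.0}
print(average)   # 22.5
```

In this made-up window the average barely moves (22.5%) and the median stays at 20%, while the 99th and 100th percentiles jump to 95%. That is the sense in which a short burst becomes visible in the tail.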

Rezolus was born out of a need to detect anomalies that lasted for less time than the interval at which telemetry tools were sampling data. For example, most tools are configured to sample at 15-second or 1-minute intervals, since sampling any faster produces too many data points and hence higher storage costs. At Twitter, some high-throughput benchmarks produced anomalies that existing tools were not detecting.
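The sampling-interval problem can be shown with a small simulation. This is only a sketch under stated assumptions: `signal` is a hypothetical metric that sits at 20 and spikes to 95 for two seconds, and the two sampling periods (15 seconds and 100 milliseconds) are picked to mirror the intervals mentioned above.

```python
def signal(t_ms):
    """Hypothetical metric: 20 normally, 95 during a 2-second burst at t = 31 s."""
    return 95.0 if 31_000 <= t_ms < 33_000 else 20.0

def sample(period_ms, duration_ms=60_000):
    """Read the metric once every period_ms over one minute."""
    return [signal(t) for t in range(0, duration_ms, period_ms)]

coarse = sample(15_000)  # one reading every 15 seconds
fine = sample(100)       # 10 readings per second (10 Hz)

print(len(coarse), max(coarse))  # 4 20.0   (the burst falls between readings)
print(len(fine), max(fine))      # 600 95.0 (the burst is captured)
```

The coarse sampler takes its readings at 0, 15, 30, and 45 seconds and never sees the burst. The fine sampler sees it, but at the cost of 600 raw readings per minute for a single metric, which is the storage pressure described above.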

Rezolus can collect data from multiple sources using plugins called "samplers". Some samplers read data from Linux kernel sources. These include metrics for CPU, network, and disk usage, obtained from traditional sources like procfs and sysfs. Apart from these traditional sources, Rezolus also supports perf_events, a performance monitoring interface built into the Linux kernel. perf_events can be used to gather data on both hardware and software events, such as page faults, context switches, and cache hits and misses. Extended Berkeley Packet Filter (eBPF) is also supported for collecting metrics on kernel components. eBPF is a low-level kernel facility for running small sandboxed programs inside the kernel, which makes it easier to build profiling tools for Linux. In addition, Rezolus can act as a bridge between an app and another metrics collector by converting high-frequency metrics into histogram-based ones.
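As an illustration of what a procfs-based sampler has to do, the sketch below parses the aggregate `cpu` line of `/proc/stat` and turns two successive readings into a busy fraction. It is not taken from Rezolus: the function names are hypothetical, and the `BEFORE` and `AFTER` snapshots are invented sample text so that the example runs on any platform.

```python
# Leading counters on the aggregate "cpu" line of /proc/stat, in order.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")

def busy_fraction(before, after):
    """Fraction of CPU time spent non-idle between two counter snapshots."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    idle = deltas["idle"] + deltas["iowait"]
    return 0.0 if total == 0 else (total - idle) / total

# Invented snapshots standing in for two reads of /proc/stat.
BEFORE = "cpu  1000 0 500 8000 100 0 50 0 0 0\ncpu0 1000 0 500 8000 100 0 50 0 0 0\n"
AFTER = "cpu  1060 0 520 8015 100 0 55 0 0 0\ncpu0 1060 0 520 8015 100 0 55 0 0 0\n"

busy = busy_fraction(parse_cpu_line(BEFORE), parse_cpu_line(AFTER))
print(busy)  # 0.85
```

On a Linux host the two snapshots would instead come from reading `/proc/stat` twice, a short interval apart. The counters are cumulative, so a sampler works with the difference between consecutive reads.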

According to the documentation, Rezolus, "by collecting telemetry in-kernel, can gather data about events that happen at extremely high rates - e.g., task scheduling - with minimal performance overhead for collecting the telemetry." It typically uses 15% of a CPU core to sample at a frequency of around 10Hz (10 samples per second), says Brian Martin, one of the authors of Rezolus and a staff site reliability engineer at Twitter. Disabling eBPF reduces resource consumption.

Rezolus is written in Rust and has been released under the Apache 2.0 license. It reuses the metrics library from rpc-perf, Twitter's tool for benchmarking RPC services. The Twitter engineering team has been using Rezolus for over a year.
