SoundCloud's engineering team wrote about their exception monitoring software, Periskop, which collects and aggregates exceptions across instances and reports them to a central server for analysis.
Periskop started as an internal project at SoundCloud. It collects and aggregates exceptions in the client libraries, uses a pull-based model to scrape clients, and then aggregates them further across instances on the server side. The SoundCloud team previously used Airbrake but moved to a custom-built solution to get around cost issues caused by spikes in the exception rate. One of the motivations was to easily discover and scrape services for exceptions, and the current version of Periskop supports DNS-based service discovery.
InfoQ reached out to Jorge Creixell, tech lead at SoundCloud, to learn more about it.
Periskop's architecture consists of client libraries and a server component. The client library has an "exception collector". Exceptions can be added to the collector in two ways: at the place where they are handled (like a try-catch block), or in a top-level exception handler for all unhandled exceptions. The collector generates a unique key from the exception message and a hash of the stack trace. Exceptions are aggregated in memory by this key, and the most recent ones are kept in a queue. The exception aggregates are exported on an HTTP endpoint, similar to how Prometheus exporters expose metrics, using a published schema. Periskop clients maintain the total number of exceptions per instance, and the counter gets reset when there is a new deployment (or the process is restarted).
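The key derivation described above can be sketched as follows. This is an illustrative example only, not the actual Periskop client code: the function name `aggregationKey`, the hash choice, and the key format are all assumptions made for the sketch.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// aggregationKey sketches how a Periskop-style collector could derive a
// unique key for an exception: the error message combined with a hash of
// the stack trace, so the same failure from different call paths is
// aggregated separately. (Illustrative; not Periskop's implementation.)
func aggregationKey(message, stackTrace string) string {
	sum := sha1.Sum([]byte(stackTrace))
	// Use a short hex prefix of the hash for readability.
	return fmt.Sprintf("%s@%x", message, sum[:4])
}

func main() {
	key := aggregationKey("connection refused", "main.go:42\nnet/dial.go:17")
	fmt.Println(key)
}
```

Counting occurrences per key gives the in-memory aggregate that the client then exposes over HTTP.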
Image courtesy: https://developers.soundcloud.com/blog/periskop-exception-monitoring-service, used with explicit permission.
On the server side, service discovery modules discover all instances of a service. The server scrapes the exported exception aggregate data and further aggregates it across instances. Periskop can auto-discover services and uses Prometheus’ library for service discovery. Creixell says that:
The initial version uses only the DNS module, as that's what we needed internally at SoundCloud at the time. We are currently working on implementing the remaining service discovery mechanisms supported by Prometheus. We expect the configuration formats to mirror as close as possible the ones found in Prometheus.
Periskop's discovery configuration mimics that of Prometheus. A typical DNS-based configuration looks like:
services:
  - name: myservice
    dns_sd_configs:
      names:
        - telemetry.api.prod.myservice
      refresh_interval: 45s
      type: SRV
    scraper:
      endpoint: /-/exceptions
      refresh_interval: 30s
Periskop does not support Kubernetes-native service discovery yet, but "this will be possible as soon as we finish integrating the remaining service discovery mechanisms from Prometheus. Kubernetes is our top priority as we will be using it internally at SoundCloud", says Creixell. Environments and components are disambiguated via unique service names in the configuration; tagging exceptions by environment or other attributes is not a feature yet.
A Periskop server scrapes the configured (or discovered) instances at periodic intervals. If there are a large number of instances, the server might end up "skipping scrape cycles". There might be some data loss in this case, but SoundCloud has not faced any issues with this approach. Creixell elaborates:
Periskop only maintains a snapshot of service errors, not a time series. For this reason, we opted for a much simpler design (compared to Prometheus), using a timer to schedule a new scrape every time the previous one is completed, using the configured refresh interval for each service.
Periskop has a UI for browsing through the exceptions. It was initially designed to support troubleshooting and incident response, and to provide an accurate picture of what is happening inside services at SoundCloud.
The Periskop source code is available on GitHub.