The Miro Data Engineering team recently discussed how it systematised alerts and incident management. Along with standardising observability metrics and alert definitions, the team adopted OpsGenie for incident management. This helped the team address scaling challenges such as the lack of a standard format for metric labelling, inconsistent alert definitions, and unclear on-call duties.
Gonçalo Costa and Ricardo Souza, data engineers at Miro, posted a Medium blog describing the team's journey as Miro entered "hypergrowth" mode. Costa and Souza elaborated on the architecture and its components: the infrastructure and data processing ran on the AWS stack; Prometheus collected metrics from the different components and routed alerts to their recipients via Alertmanager; Grafana visualised the Prometheus metrics in dashboards; and alerts corresponding to anomalies were sent to a Slack channel.
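The wiring between these components follows the standard Prometheus setup: Prometheus evaluates alerting rules and forwards firing alerts to Alertmanager, while Grafana reads from Prometheus as a data source. A minimal sketch of how such a configuration could look is shown below; the target address and file path are illustrative, not Miro's actual setup.

```yaml
# prometheus.yml -- minimal sketch; targets and paths are hypothetical
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # Alertmanager fans alerts out to Slack and other receivers
rule_files:
  - "rules/*.yml"                          # alert definitions evaluated by Prometheus
```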
Source: Miro Data Engineering team’s journey to monitoring
Initially, the team observed three categories of metrics through Prometheus: System, Middleware, and Application. The metrics carried labels providing context such as where they came from and who owned them. Using relabeling, the team transformed the received labels into information that is understandable in alert definitions and dashboards.
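As an illustration of what such a transformation could look like, the sketch below uses Prometheus metric_relabel_configs to rename a terse exporter label into a more readable one; the job name, target, and label names are assumptions made for the example, not Miro's configuration.

```yaml
scrape_configs:
  - job_name: "airflow"                       # hypothetical job and exporter
    static_configs:
      - targets: ["airflow-exporter:9112"]
    metric_relabel_configs:
      # Copy the exporter's terse "dag" label into a clearer "dag_id" label
      - source_labels: [dag]
        target_label: dag_id
      # Drop the original label now that it has been renamed
      - regex: "dag"
        action: labeldrop
```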
Source: Miro Data Engineering team’s journey to monitoring
Alertmanager routed the alerts to Slack. However, as the team grew, so did challenges such as defining on-call schedules and missed incidents. To address these challenges, the team came up with priority levels for the different components and the significance of alerts for each of them.
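Priority levels of this kind typically end up as labels that Alertmanager routes on. The sketch below shows how a priority label could steer alerts to different Slack channels; the channel names, webhook URL, and label values are placeholders rather than Miro's configuration.

```yaml
route:
  receiver: "slack-default"
  group_by: ["alertname", "miro_service"]
  routes:
    # High-priority alerts go to a dedicated incident channel
    - matchers:
        - priority="P1"
      receiver: "slack-critical"
receivers:
  - name: "slack-default"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#data-eng-alerts"
  - name: "slack-critical"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#data-eng-incidents"
        send_resolved: true
```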
In the next step, the team created a clear and actionable message specifying the function of the affected component and the related resources. With the help of Site Reliability Engineers (SREs), the data engineering team defined a standard set of labels for the existing Prometheus metrics, shown in use in the sketch after the list. Some examples are:
- miro_environment: environment associated with the component
- miro_service: the service highlighted by the alert
- miro_owner: the team owning the metric
- miro_function: function of the component related to the metric
- [component]_id: component identifier
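One way such a label set could be attached to a component's metrics is directly on the scrape target. The sketch below is an assumption about how that might look, with the service name, target, and label values invented for illustration; per-component identifiers such as a dag_id would normally come from the exporter itself.

```yaml
scrape_configs:
  - job_name: "airflow"                       # hypothetical component
    static_configs:
      - targets: ["airflow-exporter:9112"]
        labels:
          miro_environment: "production"
          miro_service: "airflow"
          miro_owner: "data-engineering"
          miro_function: "workflow-orchestration"
```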
To handle the amount of information related to a metric, the team defined a set of steps for writing a standard alert definition. This helped the team understand component issues in detail, as each definition included the metric name, its description, the threshold values for triggering the alert, its importance level, a filled-in runbook, and so on.
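Put together, such a standardised definition maps naturally onto a Prometheus alerting rule. The rule below is a sketch of what the result could look like; the metric name, threshold, priority value, and runbook URL are illustrative, not taken from Miro's setup.

```yaml
groups:
  - name: data-platform-alerts
    rules:
      - alert: AirflowDagRunDurationHigh       # hypothetical alert
        expr: airflow_dag_run_duration_seconds{miro_environment="production"} > 3600
        for: 10m
        labels:
          priority: "P2"
          miro_owner: "data-engineering"
          miro_service: "airflow"
        annotations:
          summary: "DAG run is taking longer than one hour"
          description: "DAG {{ $labels.dag_id }} in {{ $labels.miro_environment }} has been running for over 1h."
          runbook_url: "https://wiki.example.com/runbooks/airflow-dag-duration"
```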
To address failures of important components, the data engineering team chose OpsGenie as its incident management platform. Using OpsGenie, the team could configure an on-call schedule with multiple rotations and overrides. In related news, we are seeing that a transition from Slack channels to tools like OpsGenie or PagerDuty helps teams mature their SRE practice in an agile manner.
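On the Alertmanager side, OpsGenie is wired in as an additional receiver. The sketch below shows how high-priority alerts could be forwarded to an OpsGenie team; the API key, team name, and priority value are placeholders, not Miro's actual configuration.

```yaml
receivers:
  - name: "opsgenie-data-eng"
    opsgenie_configs:
      - api_key: "<opsgenie-api-key>"          # placeholder, normally loaded from a secret
        responders:
          - name: "data-engineering"           # hypothetical OpsGenie team
            type: "team"
        priority: "P1"                         # this receiver is targeted by the P1 route
        send_resolved: true
```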
With standard metric labelling and alert definitions, identifying alerts became easier, making the team more responsive in tackling incidents. Unlike in its earlier way of working, the team started maintaining a runbook, which sped up troubleshooting.
As an aside, we are seeing Data Observability securing its place on Gartner's 2022 Hype Cycle. Gartner has also included OpenTelemetry, which defines a standard for how logs, traces, and metrics should be extracted from servers, infrastructure, and applications.
Changing its approach to observability has helped the Miro data engineering team recognize failures of vital components and plan for fast mitigation. The team acknowledges that there is still a lot to improve as the data volume, variety, and velocity at Miro continue to grow.