Pinterest’s Mobile Builds team recently described how it uses Honeycomb, an observability platform, to improve the efficiency and stability of its Continuous Integration (CI) processes. The team adopted Honeycomb in 2021, which enabled it to monitor build metrics, analyze trends, and address performance bottlenecks.
Oliver Koo, staff software engineer at Pinterest, described the team's data-driven approach to observability in a blog post. Honeycomb gives Pinterest data visualization tools that strengthen its CI monitoring. Using Honeycomb features such as derived columns and fast queries, the team can process millions of daily events and identify anomalies in build times or pipeline performance in near real time.
This allows the team to diagnose issues quickly and implement targeted improvements. For example, Koo described using Honeycomb's trace view to analyze specific builds and identify problematic jobs or processes that slow down the CI pipeline. When the team examined build counts alongside p50 and p95 build times, two scenarios stood out. In one, the build count spiked with no change in build times, so the team could focus on other tasks. In the other, build volume stayed consistent but p95 build time spiked noticeably, which called for further investigation.
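The two scenarios can be illustrated with a minimal Python sketch. It is not Pinterest's or Honeycomb's implementation: the function names, the nearest-rank percentile method, and the 1.5x spike threshold are all assumptions chosen for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations):
    """Reduce one time window of build durations (minutes) to count, p50, and p95."""
    return {
        "count": len(durations),
        "p50": percentile(durations, 50),
        "p95": percentile(durations, 95),
    }


def classify(baseline, current, ratio=1.5):
    """Compare a window against a baseline. The 1.5x ratio is an arbitrary assumption."""
    if current["p95"] > ratio * baseline["p95"]:
        return "investigate: p95 build time spiked"
    if current["count"] > ratio * baseline["count"]:
        return "ok: more builds, build times unchanged"
    return "ok: no significant change"


if __name__ == "__main__":
    normal = summarize([10] * 18 + [12, 14])         # typical window
    busy = summarize(([10] * 18 + [12, 14]) * 2)     # twice the builds, same durations
    slow = summarize([10] * 18 + [30, 45])           # same volume, slow tail
    print(classify(normal, busy))
    print(classify(normal, slow))
```

Checking p95 before build count reflects the article's point: a volume spike alone is not a problem, while a tail-latency spike is worth a closer look regardless of volume.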
Using Honeycomb's trace view, the team identified a job labeled "super secretive tests" as the bottleneck that was causing the spike in p95 build time. This led to the assumption that similar slowdowns might be occurring in other builds. To investigate further, the team used Honeycomb's web_url attribute to analyze more builds directly in Buildkite.
Source: How Pinterest Leverages Honeycomb to Enhance CI Observability and Improve CI Build Stability
Although Honeycomb's trace view is similar to Buildkite's Waterfall View, introduced in 2023, the team prefers it for its seamless integration and flexibility. It breaks builds down into detailed segments such as agent wait time and script execution, so that critical build and job processes can be logged and analyzed.
The image below shows how each Buildkite job can be broken down into execution sequences and how Bazel build scripts can be instrumented to log specific execution times. This approach helps answer key questions, such as the average time to clone the repository and the p50 and p95 times of each build stage.
Source: How Pinterest Leverages Honeycomb to Enhance CI Observability and Improve CI Build Stability
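As a rough illustration of instrumenting a build script to log per-stage execution times, the Python sketch below times each stage with a context manager and serializes the results as JSON lines. The `BuildTimer` class, the stage names, and the event fields are hypothetical; the sketch does not use the Honeycomb SDK, and how Pinterest actually ships these events is not described here.

```python
import json
import time
from contextlib import contextmanager


class BuildTimer:
    """Collects one timing event per build stage (hypothetical helper)."""

    def __init__(self, build_id, clock=time.monotonic):
        self.build_id = build_id
        self._clock = clock  # injectable so the timer can be tested deterministically
        self.events = []

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the stage even if it raised, so failed stages still show up.
            self.events.append({
                "build_id": self.build_id,
                "stage": name,
                "duration_s": round(self._clock() - start, 3),
            })

    def to_json_lines(self):
        """One JSON object per line, a common shape for event ingestion."""
        return "\n".join(json.dumps(event) for event in self.events)


if __name__ == "__main__":
    timer = BuildTimer("example-build")
    with timer.stage("clone_repo"):
        time.sleep(0.01)  # stand-in for the real clone step
    with timer.stage("bazel_build"):
        time.sleep(0.02)  # stand-in for the real build step
    print(timer.to_json_lines())
```

With one event per stage, questions such as the average clone time or the p95 of a given stage become simple aggregations over the `stage` and `duration_s` fields.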
The blog post was shared on LinkedIn, where it caught the attention of the tech community. Christine Yen, Honeycomb’s CEO, shared the post and commented:
Love seeing how our friends over at Pinterest Engineering rely on Honeycomb for ensuring that builds build fast, and engineers can troubleshoot when behavior isn't what they expect!
Koo also discussed another use case: identifying bottlenecks in CI jobs. While analyzing build traces, Pinterest engineers discovered that certain jobs were causing spikes in build times. They used Honeycomb’s correlation features to overlay data from multiple dashboards and pinpoint root causes, such as increased wait times for CI agents. This approach significantly reduced troubleshooting time compared with manual analysis.
Additionally, Pinterest implemented error categorization using Honeycomb to streamline on-call workflows and improve failure management. Categorizing errors in real time lets the team automate alerts and route them to the appropriate teams, which reduces noise and improves response efficiency. The system has helped prioritize critical issues such as flaky tests and network failures while minimizing unnecessary interruptions for the engineering team.
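The idea of categorizing errors and routing them can be sketched as a small rule table in Python. The categories, regular expressions, and team names below are invented for illustration and are not Pinterest's actual rules.

```python
import re

# (category, pattern, owning team): all values are hypothetical examples.
RULES = [
    ("flaky_test", re.compile(r"flaky|passed on retry", re.IGNORECASE), "test-owners"),
    ("network", re.compile(r"connection (reset|refused)|timed out", re.IGNORECASE), "infra-oncall"),
    ("compile", re.compile(r"compilation failed|cannot find symbol", re.IGNORECASE), "code-owners"),
]

DEFAULT = ("uncategorized", "builds-oncall")


def categorize(log_line):
    """Return (category, team) for the first rule that matches the log line."""
    for category, pattern, team in RULES:
        if pattern.search(log_line):
            return category, team
    return DEFAULT


if __name__ == "__main__":
    for line in [
        "java.net.SocketException: Connection reset",
        "FooTest is flaky: passed on retry",
        "unexpected exit code 137",
    ]:
        print(line, "->", categorize(line))
```

Rules are evaluated in order and the first match wins, so more specific patterns should come first. Anything unmatched falls through to a default owner rather than being dropped.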
Beyond CI workflows, Pinterest has used Honeycomb to analyze local build metrics for iOS developers, which helps optimize hardware upgrades, and to track Android build performance data for further insights.