Prezi's Journey from Prometheus to VictoriaMetrics

Prezi’s engineering team recently discussed their transition from a Prometheus-based monitoring system to VictoriaMetrics, focusing on cost optimization, performance improvements, and architectural simplicity. The transition cut costs by approximately 30% and reduced completion times for heavy queries from over 30 seconds to 3-7 seconds.

Grzegorz Skołyszewski, senior site reliability engineer at Prezi, summarised this journey in a blog post. By 2024, Prezi’s Prometheus setup had become outdated and costly, running on a deprecated internal platform that required significant resources to maintain. The team sought to modernize its metrics collection and storage system by reducing complexity, transitioning to Kubernetes, and lowering operational costs.

However, the existing Prometheus system posed challenges, including high resource requirements due to its scale (5 million active series), complexity in managing multiple instances for dashboarding and alerting, and reliance on legacy infrastructure.
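Figures like the 5 million active series can be read directly from Prometheus's TSDB status endpoint; a minimal sketch, assuming a Prometheus instance reachable at `prometheus:9090` (the hostname is illustrative):

```shell
# Query Prometheus's TSDB stats endpoint (available since Prometheus 2.14);
# headStats.numSeries reports the number of currently active series.
curl -s 'http://prometheus:9090/api/v1/status/tsdb' \
  | jq '.data.headStats.numSeries'
```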

To address these issues, the team explored alternatives, evaluating both managed and self-hosted solutions. Managed options were ruled out for being expensive, while self-hosted solutions like Thanos, Cortex/Mimir, and VictoriaMetrics were considered. The Prezi engineering team chose VictoriaMetrics due to its simplicity, cost-efficiency, and performance advantages.

Unlike other tools, which rely on object storage such as AWS S3, VictoriaMetrics uses block storage, which is cheaper and more performant and eliminates the need for an external caching subsystem. A proof of concept confirmed its benefits: queries that previously timed out in Prometheus completed in 3–7 seconds in VictoriaMetrics, while storage usage dropped by 70%, memory usage by 60%, and CPU time by 30%.
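Because VictoriaMetrics exposes a Prometheus-compatible query API, a proof of concept of this kind can be as simple as timing the same range query against both backends; a sketch, with hostnames, time range, and the query itself all illustrative:

```shell
# Time an identical heavy range query against both backends.
# Both endpoints speak the Prometheus HTTP API, so only the host differs.
QUERY='sum(rate(http_requests_total[5m])) by (service)'
for HOST in prometheus:9090 victoria-metrics:8428; do
  echo "== ${HOST} =="
  time curl -sG "http://${HOST}/api/v1/query_range" \
    --data-urlencode "query=${QUERY}" \
    --data-urlencode 'start=2024-01-01T00:00:00Z' \
    --data-urlencode 'end=2024-01-08T00:00:00Z' \
    --data-urlencode 'step=5m' > /dev/null
done
```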

Initially, the team deployed a clustered version of VictoriaMetrics across multiple AWS Availability Zones (AZs) to ensure high availability. However, this setup significantly increased costs due to inter-zone network traffic.

Each metric write or query involved extra hops between components like VMInsert and VMStorage, amplifying data transfer fees. To resolve this issue, they replaced the clustered setup with two separate single-instance deployments of VictoriaMetrics Single in different AZs.
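Replicating writes to two independent single-node instances needs no extra tooling when collection goes through vmagent, which replicates data to every `-remoteWrite.url` it is given; a sketch with illustrative hostnames and paths:

```shell
# vmagent sends a full copy of all collected samples to each
# -remoteWrite.url, so both single-node instances stay in sync.
/vmagent-prod \
  -promscrape.config=/etc/vmagent/scrape.yml \
  -remoteWrite.url=http://victoria-az-a:8428/api/v1/write \
  -remoteWrite.url=http://victoria-az-b:8428/api/v1/write
```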

Source: How using Availability Zones can eat up your budget — our journey from Prometheus to VictoriaMetrics

A load balancer was introduced for failover redundancy, and agents were configured to buffer data during instance downtimes to prevent data loss. This architecture minimized inter-AZ traffic while maintaining reliability.
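The buffering the team relied on is built into vmagent, which spools unsent samples to disk while a remote endpoint is down and replays them on recovery; a sketch of the relevant flags (paths and sizes are illustrative):

```shell
# While an instance is unreachable, vmagent buffers samples under
# -remoteWrite.tmpDataPath and replays them once the endpoint recovers;
# -remoteWrite.maxDiskUsagePerURL caps the on-disk buffer per endpoint.
/vmagent-prod \
  -remoteWrite.url=http://victoria-az-a:8428/api/v1/write \
  -remoteWrite.url=http://victoria-az-b:8428/api/v1/write \
  -remoteWrite.tmpDataPath=/var/lib/vmagent-buffer \
  -remoteWrite.maxDiskUsagePerURL=10GB
```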

The team also made additional enhancements to improve the system further. For long-term storage of metrics without incurring extra costs from enterprise licenses or external services like Grafana Cloud, they deployed another VictoriaMetrics Single instance with custom retention settings.
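In VictoriaMetrics Single, retention is controlled by a single flag, so a long-term storage instance is just another deployment with a longer `-retentionPeriod`; a sketch with illustrative values:

```shell
# A dedicated long-term instance: same binary, longer retention.
# The default retention is 1 month; suffixes such as d, w, and y are accepted.
/victoria-metrics-prod \
  -storageDataPath=/var/lib/victoria-metrics-longterm \
  -retentionPeriod=2y
```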

To simplify configuration management, they adopted the VictoriaMetrics Kubernetes Operator, enabling product teams to manage alerting configurations directly from their repositories. For non-Kubernetes workloads, they deployed additional agents with static configurations. They also consolidated Grafana instances using Grafana Private Data Connect to integrate self-hosted metrics with Grafana Cloud seamlessly.
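With the operator, a product team's alerting rules become a `VMRule` custom resource that can live in the team's own repository; a hedged sketch of what such a resource might look like (the names and the alert itself are illustrative, not Prezi's actual configuration):

```shell
# Apply an alerting rule as a VMRule custom resource; the operator picks it
# up and feeds it to vmalert without any change to central configuration.
kubectl apply -f - <<'EOF'
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: payments-alerts
  namespace: payments
spec:
  groups:
    - name: payments.rules
      rules:
        - alert: HighErrorRate
          expr: sum(rate(http_requests_total{job="payments",code=~"5.."}[5m])) > 1
          for: 10m
          labels:
            severity: page
EOF
```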

The article sparked discussions in the tech community on Hacker News and Reddit. A Hacker News thread debated the trade-offs in cloud computing between high inter-AZ data transfer costs, seen by some as distorting best practices, and the justification that such pricing reflects the expense of scaling inter-datacenter bandwidth.

On Reddit, the conversation centered on trade-offs in monitoring system design. One perspective warned that remote-write systems introduce delays and dependencies on central rule evaluation, risking failure during backlogs. Another argued that stateless agents with delayed rule evaluation offer scalability, consistency, and easier maintenance than stateful systems like Prometheus, leaving the choice to user preference.

The migration resulted in significant benefits for Prezi’s engineering operations. Aside from the impact on cost and query time, metrics became more accessible through Kubernetes-native tools. The new system is also better equipped to handle future growth with improved scalability and reliability.
