Observability Content on InfoQ
-
How Observability Can Improve the UX of LLM Based Systems: Insights of Honeycomb's CEO at KubeCon EU
During her KubeCon Europe keynote, Christine Yen, CEO and co-founder of Honeycomb, provided insights on how observability can help teams cope with the rapid shifts introduced by integrating LLMs into software systems, which have transformed not only how we develop software but also how we release it. She explained how to adapt the development feedback loop based on production observations.
-
Grafana Loki Introduces v3.4 with Standardized Storage and Unified Telemetry
Grafana Loki recently released version 3.4, which includes enhancements aimed at improving efficiency and standardizing log management. One of the key updates is the integration of the Thanos Object Storage Client, which aligns Loki's storage configuration with other Grafana databases, such as Mimir and Pyroscope.
-
Traefik v3.3 Release: Enhanced Observability and Documentation
Traefik Labs recently announced the latest release of Traefik Proxy, v3.3 (codenamed "saint-nectaire" after a French cheese). This release focuses primarily on two areas: observability capabilities and an improved documentation structure. These enhancements aim to make the popular open-source reverse proxy even more powerful for platform engineers working in complex cloud-native environments.
-
Most Companies Experience Weekly Outages: The State of Resilience 2025 Report
According to The State of Resilience 2025 Report, published by Cockroach Labs, outages are commonplace in most organizations, with 55% of companies reporting weekly outages and 14% reporting daily outages. A staggering 100% of survey participants experienced revenue losses due to outages, with some companies (8%) reporting losses of USD 1 million or more over the last 12 months.
-
Prezi's Journey from Prometheus to VictoriaMetrics
Prezi’s engineering team recently discussed their transition from a Prometheus-based monitoring system to VictoriaMetrics, focusing on cost optimization, performance improvements, and architectural simplicity. The transition reduced costs by approximately 30% and cut the completion time for heavy queries from more than 30 seconds to 3-7 seconds.
-
AWS Adds Container Insights with Enhanced Observability to Elastic Container Service
AWS recently announced the launch of Container Insights with Enhanced Observability for Amazon Elastic Container Service (ECS). The launch follows a similar feature previously introduced for Amazon Elastic Kubernetes Service (EKS). The new capability aims to improve monitoring and troubleshooting for container workloads.
-
Prometheus 3.0 Brings New UI, OpenTelemetry Support and More
Version 3.0 of the popular open-source monitoring system Prometheus has been released, marking the tool's first major update in seven years. A variety of new features have been added, along with improvements aimed at enhancing the user experience and streamlining workflows.
-
Pinterest's Use of Honeycomb for Enhanced CI Observability and Build Stability
Recently, Pinterest’s Mobile Builds team discussed how it used Honeycomb, a data observability platform, to enhance the efficiency and stability of its Continuous Integration (CI) processes. The team adopted Honeycomb in 2021, enabling it to monitor build metrics, analyze trends, and address performance bottlenecks.
-
Stripe Rearchitects Its Observability Platform with Managed Prometheus and Grafana on AWS
Stripe replaced its observability platform, which used a third-party vendor solution, with a new architecture utilizing managed services on AWS. The company made the move due to scalability limits, reliability issues, and increasing costs while transitioning to microservices. The migration involved dual-writing metrics, translating assets, validation, and user training.
-
Netflix Rolls Out Service-Level Prioritized Load Shedding to Improve Resiliency
Netflix extended its prioritized load-shedding implementation to the individual service level to further improve system resilience. The approach uses cloud capacity more efficiently by shedding low-priority requests only when necessary instead of maintaining separate clusters for failure isolation.
-
Google Cloud Boosts Observability Capabilities with Log Scopes
Google Cloud recently introduced log scopes for Cloud Logging, aimed at improving how organizations manage and analyze their logs. This enhancement addresses the common challenge of finding relevant data in the vast amount of information held in observability tools.
-
How Allegro Reduced the Cost of Running a GCP Dataflow Pipeline by 60%
Allegro cut the cost of one of its Dataflow pipelines running on GCP Big Data by 60%. The company continues working on improving the cost-effectiveness of its data workflows by evaluating resource utilization, enhancing pipeline configurations, optimizing input and output datasets, and improving storage strategies.
-
Planning, Automation and Monorepo: How Monzo Does Code Migrations Across 2800 Microservices
Monzo products are supported by an extensive microservice-based platform of over 2800 services. The company relies on planning and heavy automation to drive code migrations at scale and uses its config service to support gradual roll-forwards and quick rollbacks in case of issues. Migrations are managed by a central team rather than by service owner teams to avoid delays and inconsistencies.
-
OpenTelemetry Adopts Continuous Profiling; Elastic Donates Their Agent
OpenTelemetry has announced that it has incorporated continuous profiling as a core telemetry signal, and Elastic has donated its continuous profiling agent to the OpenTelemetry project.
-
Ngrok Traffic Inspector Provides Observability for Network Traffic
The ngrok Traffic Inspector provides observability for traffic towards APIs or services to better understand what is happening and help identify any issues. Since it was previewed earlier this year, the Traffic Inspector has acquired new capabilities based on user feedback and is now officially available through the ngrok dashboard.