Observability Content on InfoQ
-
Most Companies Experience Weekly Outages: The State of Resilience 2025 Report
According to The State of Resilience 2025 Report, published by Cockroach Labs, outages are commonplace in most organizations, with 55% of companies reporting weekly outages and 14% reporting daily ones. A staggering 100% of survey participants experienced revenue losses due to outages, with some companies (8%) reporting losses of USD 1 million or more over the last 12 months.
-
AWS Adds Container Insights with Enhanced Observability to Elastic Container Service
AWS recently announced the launch of Container Insights with Enhanced Observability for Amazon Elastic Container Service (ECS). This follows a similar feature previously introduced for Amazon Elastic Kubernetes Service (EKS). The new capability aims to improve monitoring and troubleshooting for container workloads.
-
Prometheus 3.0 Brings New UI, OpenTelemetry Support and More
Version 3.0 of the popular open-source monitoring system Prometheus has been released, marking the tool's first major update in seven years. The release adds a variety of new features, along with improvements aimed at enhancing the user experience and streamlining workflows.
-
Pinterest's Use of Honeycomb for Enhanced CI Observability and Build Stability
Pinterest’s Mobile Builds team recently discussed how it uses Honeycomb, a data observability platform, to enhance the efficiency and stability of its Continuous Integration (CI) processes. The team adopted Honeycomb in 2021, enabling it to monitor build metrics, analyze trends, and address performance bottlenecks.
-
Stripe Rearchitects Its Observability Platform with Managed Prometheus and Grafana on AWS
Stripe replaced its observability platform, which used a third-party vendor solution, with a new architecture utilizing managed services on AWS. The company made the move due to scalability limits, reliability issues, and increasing costs while transitioning to microservices. The migration involved dual-writing metrics, translating assets, validation, and user training.
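As a rough illustration of the dual-writing and validation steps mentioned above, the following minimal Python sketch sends every metric sample to both an old and a new backend and then reports which metric names disagree. It is not Stripe's implementation: the names `make_dual_writer` and `mismatched_metrics`, the in-memory dictionaries standing in for the two backends, and the tolerance value are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# A sink is any callable that accepts a metric name and a value.
Sink = Callable[[str, float], None]


def make_dual_writer(primary: Sink, secondary: Sink) -> Sink:
    """Return a sink that writes each sample to both backends."""

    def write(name: str, value: float) -> None:
        # The existing backend stays the source of truth during migration.
        primary(name, value)
        try:
            secondary(name, value)
        except Exception:
            # Assumption: a failure in the new backend must not break
            # the existing write path while it is still being validated.
            pass

    return write


def mismatched_metrics(
    old: Dict[str, float], new: Dict[str, float], tolerance: float = 1e-9
) -> List[str]:
    """List metric names that are missing on one side or differ in value."""
    names = old.keys() | new.keys()
    return sorted(
        name
        for name in names
        if name not in old
        or name not in new
        or abs(old[name] - new[name]) > tolerance
    )


# Usage: two dictionaries stand in for the old and new metric stores.
old_store: Dict[str, float] = {}
new_store: Dict[str, float] = {}
write = make_dual_writer(old_store.__setitem__, new_store.__setitem__)
write("requests_total", 3.0)
write("errors_total", 1.0)
```

Running the comparison periodically during the dual-write window gives a simple signal for when the new backend can be trusted on its own.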
-
Netflix Rolls Out Service-Level Prioritized Load Shedding to Improve Resiliency
Netflix extended its prioritized load-shedding implementation to the individual service level to further improve system resilience. The approach uses cloud capacity more efficiently by shedding low-priority requests only when necessary instead of maintaining separate clusters for failure isolation.
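The general idea of priority-based load shedding can be sketched in a few lines of Python. This is a simplified illustration, not Netflix's implementation: the priority names, the utilization thresholds, and the `should_admit` function are hypothetical choices made for the example.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical request priorities; lower value means more important."""

    CRITICAL = 0
    DEGRADED = 1
    BEST_EFFORT = 2
    BULK = 3


# Assumed thresholds: a request is shed once utilization (0.0 to 1.0)
# rises above the value for its priority. Critical traffic is never shed.
SHED_ABOVE = {
    Priority.CRITICAL: float("inf"),
    Priority.DEGRADED: 0.90,
    Priority.BEST_EFFORT: 0.80,
    Priority.BULK: 0.70,
}


def should_admit(priority: Priority, utilization: float) -> bool:
    """Admit the request unless utilization exceeds its priority's threshold."""
    return utilization <= SHED_ABOVE[priority]


# Usage: at 75% utilization only the lowest-priority traffic is rejected.
admitted = [p.name for p in Priority if should_admit(p, 0.75)]
```

Because nothing is shed while utilization stays below the lowest threshold, all request classes can share the same capacity under normal load, which is the efficiency argument made in the summary above.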
-
Google Cloud Boosts Observability Capabilities with Log Scopes
Google Cloud recently introduced log scopes for Cloud Logging, aimed at improving how organizations manage and analyze their logs. This enhancement addresses the common challenge of finding relevant data from the vast amount of information in observability tools.
-
How Allegro Reduced the Cost of Running a GCP Dataflow Pipeline by 60%
Allegro cut the cost of one of its Dataflow pipelines running on GCP Big Data by 60%. The company continues working on improving the cost-effectiveness of its data workflows by evaluating resource utilization, enhancing pipeline configurations, optimizing input and output datasets, and improving storage strategies.
-
Planning, Automation and Monorepo: How Monzo Does Code Migrations Across 2800 Microservices
Monzo's products are supported by an extensive microservice-based platform of over 2800 services. The company relies on planning and heavy automation to drive code migrations at scale and leverages a config service to support gradual roll forwards and quick rollbacks in case of issues. Migrations are managed by a central team rather than service owner teams to avoid delays and inconsistencies.
-
OpenTelemetry Adopts Continuous Profiling; Elastic Donates Their Agent
OpenTelemetry has announced that it has incorporated continuous profiling as a core telemetry signal, and Elastic has donated its continuous profiling agent to the OpenTelemetry project.
-
Ngrok Traffic Inspector Provides Observability for Network Traffic
The ngrok Traffic Inspector provides observability for traffic towards APIs or services to better understand what is happening and help identify any issues. Since it was previewed earlier this year, the Traffic Inspector has acquired new capabilities based on user feedback and is now officially available through the ngrok dashboard.
-
Microsoft Enhances Azure Monitor with Query Editor and Support for PromQL
Microsoft has recently released the public preview of the Query Editor in Azure Monitor Metrics, enabling users to create and execute PromQL queries directly within their Azure Monitor workspace. This eliminates the need to switch between tools, streamlining the workflow and boosting productivity when working with various types of metric data.
-
Combatting Alert Fatigue at Cloudflare
In a detailed blog post, Monika Singh at Cloudflare explores the stressful environment on-call personnel face. On-call staff frequently deal with numerous alerts, leading to alert fatigue—a state of exhaustion caused by responding to non-prioritised or unclear alerts. To combat this, Cloudflare teams conduct periodic alert analyses to enhance the accuracy and actionability of alerts.
-
Falco 0.38.0 Released with Enhanced Driver Selection, Configurations and Real-Time Monitoring
The maintainers of Falco have announced its latest version, 0.38.0. This is the first release since the project's graduation within the CNCF.
-
Google Cloud Introduces Customizable Dashboards
Google Cloud has recently expanded its customizable observability dashboards to over 10 services, including Google Kubernetes Engine (GKE), Compute Engine, Cloud Run, Cloud Functions, Cloud Storage, Dataproc, Dataflow, MySQL System Insights, and a few others.