InfoQ Homepage Articles
-
Stragglers, Not Failures: How Adaptive Hedged Requests Reduce p99 Latency by 74 Percent
In fan-out microservice architectures, slow-but-completing requests accumulate across services and drive p99 latency far higher than per-service metrics suggest. This article presents an adaptive hedging mechanism that uses DDSketch for real-time quantile estimation, windowed rotation to handle distribution drift, and a token-bucket budget to prevent load amplification.
-
Architecting Cloud-Native Kafka: From Tiered Storage Towards a Diskless Future
This article explores Kafka's transition toward a cloud-native architecture, examining how tiered storage, FinOps telemetry, elastic consumer scaling, virtual clusters, and Share Groups reshape the operational and economic model of event streaming platforms. It also analyzes emerging diskless-storage proposals and their architectural trade-offs.
-
The Schema Proliferation Problem in Kafka and Flink Pipelines: How to Solve It
Schema proliferation builds slowly and gets expensive fast. One schema per event type feels right until there are ten tables, union queries spanning all of them, and a single field rename touching every schema. Discriminator-based schema consolidation collapses that to two tables, turning multi-table unions into a single query, while new variants are additive and don't break existing consumers.
-
The Mathematics of Backlogs: Capacity Planning for Queue Recovery
Backlogs in distributed systems are arithmetic problems, not mysteries. This article provides practical formulas for calculating backlog drain time, sizing consumer headroom, and setting auto-scaling triggers. It covers key failure modes — retry amplification, metastable states, and cascading pipeline bottlenecks — plus when to shed load instead of draining.
-
Kernel-Level Ground Truth: Why eBPF is Replacing User-Space Agents for Security Observability
eBPF is emerging as a preferred method for security observability over traditional user-space agents. By attaching probes directly to the Linux kernel's syscall interface, it provides consistent visibility even during container-level compromises. eBPF reduces security-related CPU consumption and limits data volume by performing filtering at the kernel level, enhancing operational efficiency.
-
Building a Secure MCP Server on AWS for a Million-Company B2B Platform
We wanted to expose a B2B intelligence platform built on more than one million company profiles to an LLM client through an MCP server so a user can ask “find SaaS companies in Germany with 50-200 employees” and receive results through the LLM client. The engineering problem was: how do you make that workflow useful without creating an unsafe bridge between an LLM and production data?
-
Time-Series Storage: Design Choices That Shape Cost and Performance
Every time-series database makes a set of storage design decisions: how to lay out rows, when to compress, what to partition on. These decisions determine cost and query performance more than the choice of database itself. This article works through those fundamentals from first principles, using widely available tools like PostgreSQL and Apache Parquet to make each trade-off measurable.
-
Local-First AI Inference: a Cloud Architecture Pattern for Cost-Effective Document Processing
The Local-First AI Inference pattern routes 70–80% of documents to deterministic local extraction at zero API cost, reserving Azure OpenAI calls for edge cases and flagging low-confidence results for human review. Deployed on 4,700 engineering drawing PDFs, it cut API costs by 75% and processing time by 55%, while bounding errors through a human review tier.
-
Implementing the Sidecar Pattern in Microservices-Based ASP.NET Core Applications
Today's applications require monitoring, logging, configuration, and similar supporting capabilities. Each of these concerns can be implemented as a component or a service, and these cross-cutting concerns can be tightly integrated into the application. While this tight coupling ensures effective use of shared resources, an outage in any of these components can take your application down. Enter the sidecar design pattern.
-
Beyond the Benchmark: a Metrics-Driven Approach to Sustained iOS Performance on Real Devices
iOS performance engineering often defaults to a mental model where performance is a property of a component. Performance is instead an emergent behavior of the interaction between application code, device hardware, OS resource management, network conditions, and user behavior patterns over time. This article gives a direct, first-party path to capturing performance issues using Xcode Instruments.
-
Three Pillars of Platform Engineering: a Virtuous Cycle
Platform engineering succeeds when reliability and ergonomics reinforce each other rather than compete. This article explores three foundational pillars: automated reliability, developer ergonomics, and operator ergonomics. Together, they establish a virtuous cycle that strengthens system stability, reduces operational burden, and empowers teams to scale infrastructure with confidence.
-
From Batch to Micro-Batch Streaming: Lessons Learned the Hard Way in a Delta Index Pipeline
This article describes how a production delta-index pipeline migrated from scheduled batch to micro-batch Spark Structured Streaming. It covers why record-level streaming was rejected, how partition-based watermarks replaced fragile S3 completion markers, overlap-window correctness, and restart-as-design strategies for better predictability in object-store–based ingestion systems.