Datadog created a dedicated data ingestion architecture offering exactly-once semantics for their third-generation event store, Husky. The event-driven architecture (EDA) can accommodate traffic bursts on the multi-tenant platform with reasonable ingestion latency and acceptable operational costs.
Datadog launched Husky in 2022, drawing on its experience running two previous architectures, which the company had outgrown as more clients used the platform and new products arrived with specific data storage and query requirements.
Husky's architecture separates the data ingestion, data compaction, and data reading workloads, allowing each to be scaled independently. All three workloads rely on a shared metadata store built on FoundationDB and a blob storage service backed by AWS S3. The data ingestion workload uses Apache Kafka to deliver events into the storage platform and route them internally to data writers.
Source: https://www.datadoghq.com/blog/engineering/introducing-husky/
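The blog posts describe this separation at the architecture level only; as a rough illustration, the sketch below models the three workloads as independently scaled services that share nothing except the metadata store and blob storage. The interfaces and method names are assumptions for illustration, not Husky's actual APIs.

```python
from typing import List, Protocol

class MetadataStore(Protocol):
    """Shared metadata store (FoundationDB in Husky); method names are assumed."""
    def register_file(self, tenant: str, path: str, event_ids: List[str]) -> None: ...
    def list_files(self, tenant: str, start_ts: int, end_ts: int) -> List[str]: ...

class BlobStorage(Protocol):
    """Shared blob storage (AWS S3 in Husky); method names are assumed."""
    def put(self, path: str, data: bytes) -> None: ...
    def get(self, path: str) -> bytes: ...

# Each workload runs as its own service and scales independently; the
# services coordinate only through the two shared stores above.
class Writer: ...      # consumes events from Kafka, writes files, registers metadata
class Compactor: ...   # merges small files into larger ones, rewrites metadata
class Reader: ...      # answers queries by listing metadata and fetching blobs
```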
Daniel Intskirveli, a senior software engineer at Datadog, explains the unique challenges of building an efficient data ingestion solution:
While it [Husky] can perform point lookups and run needle-in-the-haystack search queries, it’s not designed to perform point lookups at high volume and with low latency. This design posed a challenge on the ingestion side: how can we guarantee that data is ingested into Husky exactly once, ensuring that there are never duplicate events?
Exactly-once ingestion semantics are crucial for Datadog, as duplicate events can cause false positives or false negatives in alerting monitor evaluations and skew the usage reporting that drives customer billing.
The solution is an internal routing mechanism that deterministically splits the incoming stream of events into multiple shards for each tenant. Events from tenant shards can then be ingested into the storage engine by downstream write workers (or writers), which are responsible for in-memory event deduplication. This approach makes deduplication straightforward because tenant data is local to its shard, which also improves performance. Since tenant data is isolated at the storage level (stored in separate files), the routing mechanism limits the number of tenants per shard to keep storage costs down.
Source: https://www.datadoghq.com/blog/engineering/husky-deep-dive/
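The deep-dive post does not include the routing code itself; the following is a minimal sketch of deterministic, tenant-aware sharding under assumed parameters (NUM_SHARDS and SHARDS_PER_TENANT) and an illustrative route function, showing how a tenant's events can be confined to a small, stable group of shards.

```python
import hashlib

NUM_SHARDS = 1024         # total shards feeding the writers (assumed value)
SHARDS_PER_TENANT = 4     # cap on how many shards one tenant spans (assumed value)

def stable_hash(value: str) -> int:
    """Stable across processes and restarts, unlike Python's built-in hash()."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

def route(tenant_id: str, event_id: str) -> int:
    """Deterministically map an event to one of a few shards reserved for
    its tenant, so replays route identically and tenant data stays local."""
    base = stable_hash(tenant_id) % NUM_SHARDS
    offset = stable_hash(event_id) % SHARDS_PER_TENANT
    return (base + offset) % NUM_SHARDS
```

Because the mapping depends only on the tenant and event IDs, a replayed event always lands on the same shard, so the writer consuming that shard can deduplicate it against local state.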
Write workers (or writers) consume events from their assigned shards and persist them to make them queryable. Based on previous experience, the team opted for a stateless writer design to enable auto-scaling and load balancing. To support event deduplication, stateless writers must save previously processed event IDs in a persistent datastore. Event IDs are inserted into separate tables in FoundationDB and committed in a single transaction together with event metadata, ensuring atomicity and consistency. Writers also cache event IDs in memory using an LRU (least recently used) cache.
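As a self-contained sketch of that write path, the code below uses an in-memory dictionary in place of the FoundationDB tables and a plain OrderedDict as the LRU cache; class and method names are illustrative assumptions rather than Datadog's implementation.

```python
from collections import OrderedDict

class InMemoryMetadataStore:
    """Stand-in for the FoundationDB metadata store; in the real system the
    event-ID tables and file metadata are committed in one transaction."""

    def __init__(self):
        self.event_ids = {}   # shard -> set of event IDs already ingested
        self.files = []       # committed file metadata records

    def has_event_id(self, shard, event_id):
        return event_id in self.event_ids.get(shard, set())

    def commit(self, shard, event_ids, file_metadata):
        # Simulates a single atomic transaction over both "tables".
        self.event_ids.setdefault(shard, set()).update(event_ids)
        self.files.append(file_metadata)


class DedupWriter:
    """Stateless-style writer: the durable dedup state lives in the metadata
    store, so the process can be replaced at any time; the LRU cache is only
    a local optimization that can be rebuilt from the store."""

    def __init__(self, store, cache_size=100_000):
        self.store = store
        self.cache_size = cache_size
        self.seen = OrderedDict()    # LRU cache of recently ingested event IDs

    def _remember(self, event_id):
        self.seen[event_id] = True
        self.seen.move_to_end(event_id)
        if len(self.seen) > self.cache_size:
            self.seen.popitem(last=False)   # evict the least recently used ID

    def write_batch(self, shard, events):
        fresh = [e for e in events
                 if e["id"] not in self.seen
                 and not self.store.has_event_id(shard, e["id"])]
        if fresh:
            # Hypothetical file reference; the real writer uploads a file to
            # blob storage and records its location here.
            file_ref = {"shard": shard, "event_count": len(fresh)}
            # Event IDs and file metadata are committed together so that both
            # become visible atomically, or not at all.
            self.store.commit(shard, [e["id"] for e in fresh], file_ref)
        for e in events:
            self._remember(e["id"])
```

Replaying the same batch against the same store produces no new file commit, since every event ID is filtered out by the cache or the store lookup.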
The design supports conflict detection and resolution when a shard is reassigned to another worker after a scale-up event, a redeployment, or a restart caused by an infrastructure issue. Using optimistic concurrency control, updates to the event-ID tables are versioned, and any out-of-order updates are rejected. When a worker detects a conflict, it refreshes its event-ID cache from the FoundationDB table and resets its offset in the Kafka topic.
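The post describes this mechanism in prose only; the sketch below illustrates the versioned-update idea with in-memory stand-ins, where the version field, the conflict handling, and the Kafka rewind helper are assumptions rather than Husky's actual interfaces.

```python
class ConflictError(Exception):
    """Another writer has advanced the shard's version (e.g. after a rebalance)."""

class VersionedShardTable:
    """Stand-in for a versioned event-ID table; in the real system the version
    check and the writes would share a single FoundationDB transaction."""

    def __init__(self):
        self.version = 0
        self.event_ids = set()

    def commit(self, expected_version, new_event_ids):
        # Optimistic concurrency control: reject updates from a writer whose
        # view of the shard is stale (an out-of-order update).
        if expected_version != self.version:
            raise ConflictError(
                f"expected v{expected_version}, table is at v{self.version}")
        self.event_ids.update(new_event_ids)
        self.version += 1
        return self.version

class StubConsumer:
    """Placeholder for the Kafka consumer; seek_to_last_committed is a
    hypothetical helper, not a real Kafka client method."""
    def seek_to_last_committed(self):
        pass

def commit_batch(table, consumer, cache, known_version, batch):
    """One step of a hypothetical write loop: on conflict, reload the local
    event-ID cache from the store and rewind the Kafka consumer before retrying."""
    try:
        return table.commit(known_version, {e["id"] for e in batch})
    except ConflictError:
        cache.clear()
        cache.update(dict.fromkeys(table.event_ids, True))  # refresh the event-ID cache
        consumer.seek_to_last_committed()  # reset the offset in the shard's Kafka topic
        return table.version
```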