Data Lake Content on InfoQ

News

Architecture & Design

350PB, Millions of Events, One System: inside Uber’s Cross-Region Data Lake and Disaster Recovery

Uber’s HiveSync is a sharded, cross-region batch replication system keeping Hive/HDFS data consistent across multiple regions. Handling 5M daily Hive events and 8PB of data replication, it uses event-driven jobs, hybrid RPC and DistCp strategies, DAG-based orchestration, and dynamic sharding, enabling disaster recovery, horizontal scaling, and 99.99% cross-region data accuracy.

Leela Kumili
on Jan 16, 2026
Cloud

Cloudflare Introduces Data Platform with Zero Egress Fees

Cloudflare has recently announced the open beta of Cloudflare Data Platform, a managed solution for ingesting, storing, and querying analytical data tables using open standards such as Apache Iceberg.

Renato Losio
on Nov 01, 2025
AI, ML & Data Engineering

How Netflix is Reimagining Data Engineering for Video, Audio, and Text

Netflix has introduced a new engineering specialization, Media ML Data Engineering, alongside a Media Data Lake designed to handle video, audio, text, and image assets at scale. Early results include richer ML models trained on standardized media, faster evaluation cycles, and deeper insights into creative workflows.

Matt Foster
on Aug 25, 2025
AI, ML & Data Engineering

Apache Hudi 1.0 Now Generally Available

The Apache Software Foundation has recently announced the general availability of Apache Hudi 1.0, the transactional data lake platform with support for near real-time analytics. Initially introduced in 2017, Apache Hudi provides an open table format optimized for efficient writes in incremental data pipelines and fast query performance.

Renato Losio
on Jan 18, 2025
Cloud

AWS Introduces S3 Tables Bucket: Is S3 Becoming a Data Lakehouse?

AWS has recently announced S3 Tables Bucket, managed Apache Iceberg tables optimized for analytics workloads. According to the cloud provider, the new option delivers up to 3x faster query performance and up to 10x higher transaction rates for Apache Iceberg tables compared to standard S3 storage.

Renato Losio
on Jan 04, 2025
Culture & Methods

How Data Mesh Platforms Connect Data Producers and Consumers

A challenge that companies often face when exploiting their data in data warehouses or data lakes is that ownership of analytical data is weak or non-existent, and quality can suffer as a result. A data mesh is an organizational paradigm shift in how companies create value from data where responsibilities go back into the hands of producers and consumers.

Ben Linders
on Jun 27, 2024
Cloud

Data Solutions Framework: an Open Source Project for Building Data Solutions on AWS

AWS recently released the Data Solutions Framework (DSF), an opinionated open-source framework designed to accelerate the creation of data solutions on AWS. Built using the AWS CDK, the framework exposes abstractions and patterns as building blocks for constructing data solutions and is available in TypeScript (npm) and Python (PyPI).

Renato Losio
on Mar 02, 2024
Cloud

B2B Data Interchange: Managed Electronic Data Interchange (EDI) on AWS

AWS recently introduced B2B Data Interchange, a platform allowing organizations to automate and monitor the transformation of EDI-based business transactions. The service provides a low-code interface for managing trading partners and translating EDI documents into JSON and XML formats.

Renato Losio
on Jan 20, 2024
AI, ML & Data Engineering

Grammarly Replaces its in-House Data Lake with Databricks Platform Using Medallion Architecture

Grammarly adopted the medallion architecture while migrating from their in-house data lake, storing Parquet files in AWS S3, to the Delta Lake lakehouse. The company created a new event store for over 6000 event types from 40 internal and external clients and, in the process, improved data quality and reduced the data-delivery time by 94%.

Rafal Gancarz
on Jul 24, 2023
AI, ML & Data Engineering

Databricks Unveils Lakehouse AI and MosaicML Acquisition at Data + AI Summit

The Data and AI company Databricks recently unveiled Lakehouse AI, a suite of tools for building and governing generative AI models, including large language models (LLMs), within the Databricks platform. Among the tools was LakehouseIQ, a "knowledge engine" that uses AI to understand a company's unique data, culture, and language in order to improve natural language interfaces like chatbots.

Andrew Hoblitzell
on Jul 18, 2023
Cloud

Amazon Security Lake for Centralized Security Data Management Now GA

AWS recently announced the general availability of Security Lake, a managed service to automate the sourcing, aggregation, normalization, and data management of security data. The new service centralizes data from AWS environments, SaaS providers, on-premises, and cloud sources into a data lake stored in an AWS account.

Renato Losio
on Jun 11, 2023
Cloud

Unified Analytics Platform: Microsoft Fabric

At its recent annual Build conference, Microsoft introduced Microsoft Fabric, a unified analytics platform that brings together all the data and analytics tools that organizations need.

Steef-Jan Wiggers
on Jun 01, 2023
Cloud

Amazon Athena Now Supports Apache Spark Engine

Amazon Athena now supports the open-source distributed processing system Apache Spark to run fast analytics workloads. Data analysts and engineers can use Jupyter Notebook in Athena to perform data processing and programmatically interact with Spark applications.

Renato Losio
on Jan 22, 2023
Cloud

AWS Announces Preview Release of Amazon Security Lake

At re:Invent, AWS announced the preview release of Amazon Security Lake. This managed service automatically centralizes an organization’s security data from cloud and on-premises sources into a purpose-built data lake stored in the organization’s own account.

Steef-Jan Wiggers
on Dec 06, 2022
Cloud

Google Launches a New Cross-Platform Data Storage Engine BigLake in Preview

At the recent Cloud Data Summit, Google announced the preview of BigLake, a new data lake storage engine that makes it easier for enterprises to analyze the data in their data warehouses and data lakes.

Steef-Jan Wiggers
on Apr 19, 2022
