ETL Content on InfoQ

News

Architecture & Design

350PB, Millions of Events, One System: inside Uber’s Cross-Region Data Lake and Disaster Recovery

Uber’s HiveSync is a sharded, cross-region batch replication system keeping Hive/HDFS data consistent across multiple regions. Handling 5M daily Hive events and 8PB of data replication, it uses event-driven jobs, hybrid RPC and DistCp strategies, DAG-based orchestration, and dynamic sharding, enabling disaster recovery, horizontal scaling, and 99.99% cross-region data accuracy.

Leela Kumili
on Jan 16, 2026
Cloud

Google Spanner Unifies OLTP and OLAP with Columnar Engine

Google Spanner now features a columnar engine, allowing its distributed database to handle both OLTP and OLAP workloads on a single platform. This hybrid architecture eliminates the need for separate data warehouses and ETL pipelines. The engine's columnar storage and vectorized execution accelerate analytical queries up to 200x on live data, which is especially beneficial for AI applications.

Steef-Jan Wiggers
on Sep 05, 2025
Architecture & Design

Inside Atlassian Lithium: How a Dynamic ETL Platform is Transforming Data Movement and Cutting Costs

Atlassian recently introduced Lithium, an in-house ETL platform designed to meet the requirements of dynamic data movement. Lithium streamlines tasks such as cloud migrations, scheduled backups, and in-flight data validations by supporting ephemeral pipelines and tenant-level isolation while ensuring efficiency and scalability, resulting in significant cost savings.

Eran Stiller
on Jan 29, 2025
Cloud

Amazon RDS for MySQL Zero-ETL Integration with Amazon Redshift

Amazon RDS for MySQL's zero-ETL integration with Amazon Redshift is now generally available, enabling near real-time analytics and machine learning on transactional data. The feature allows customized data replication from a single RDS database and facilitates scalability, helping businesses gain insights while controlling costs and keeping data fresh.

Steef-Jan Wiggers
on Sep 30, 2024
Cloud

Amazon Q Data Integration in AWS Glue Simplifies Data Transformation on AWS

Recently, AWS announced the preview of a new feature for AWS Glue, enabling customers to use natural language for authoring and troubleshooting data integration jobs. With Amazon Q data integration in AWS Glue, developers can provide a description of their data integration workload, and the service will generate an ETL script.

Renato Losio
on Feb 25, 2024
DevOps

Amazon OpenSearch Zero ETL with S3 and New OR1 Instances

Amazon has announced the preview of the Amazon OpenSearch Service's zero-ETL (extract, transform, and load) integration with Amazon S3, offering a new way to analyze operational logs in Amazon S3 and S3-based data lakes without the need to switch between services. Amazon also announced the new OR1 instances for Amazon OpenSearch Service.

Claudio Masolo
on Jan 12, 2024
Cloud

Confluent Announces Apache Flink on Confluent Cloud in Open Preview

Confluent recently announced the open preview of Apache Flink on Confluent Cloud as a fully-managed service for stream processing. The company claims that the managed service will make it easier for companies to filter, join, and enrich data streams with Flink.

Steef-Jan Wiggers
on Sep 29, 2023
AI, ML & Data Engineering

Grammarly Replaces its in-House Data Lake with Databricks Platform Using Medallion Architecture

Grammarly adopted the medallion architecture while migrating from their in-house data lake, storing Parquet files in AWS S3, to the Delta Lake lakehouse. The company created a new event store for over 6000 event types from 40 internal and external clients and, in the process, improved data quality and reduced the data-delivery time by 94%.

Rafal Gancarz
on Jul 24, 2023
Cloud

AWS Glue Now Supports Crawler History

AWS recently launched support for histories of AWS Glue Crawlers, which allows the interrogation of Crawler executions and associated schema changes for the last 12 months.

Nsikan Essien
on Sep 19, 2022
Cloud

Google Introduces Zero-ETL Approach to Analytics on Bigtable Data Using BigQuery

Google recently announced the general availability of Bigtable federated queries with BigQuery, allowing customers to query data residing in Bigtable from BigQuery faster. Queries run without moving or copying the data, are available in all Google Cloud regions, and come with increased federated query concurrency limits, closing the longstanding gap between operational data and analytics.

Steef-Jan Wiggers
on Aug 11, 2022
Architecture & Design

Data Mesh Principles and Logical Architecture Defined

The concept of a data mesh provides new ways to address common problems around managing data at scale. Zhamak Dehghani has provided additional clarity around the four principles of a data mesh, with a corresponding logical architecture and organizational structure.

Thomas Betts
on Dec 14, 2020
Cloud

Google Announces a New, More Services-Based Architecture Called Runner V2 to Dataflow

Google Cloud Dataflow is a fully-managed service for executing Apache Beam pipelines within the Google Cloud Platform (GCP). In a recent blog post, Google announced a new, more services-based architecture for Dataflow called Runner v2, which will include multi-language support for all of its language SDKs.

Steef-Jan Wiggers
on Aug 30, 2020
Cloud

Amazon Announces the General Availability of AWS Glue 2.0

AWS Glue is a fully-managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. With AWS Glue, customers don’t have to provision or manage any resources, and only pay for resources when the service is running.

Steef-Jan Wiggers
on Aug 19, 2020
AI, ML & Data Engineering

Boosting Apache Spark with GPUs and the RAPIDS Library

At the 2019 Spark AI Summit Europe conference, NVIDIA software engineers Thomas Graves and Miguel Martinez hosted a session on Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS Library. InfoQ recently talked with Jim Scott, head of developer relations at NVIDIA, to learn more about accelerating Apache Spark with GPUs and the RAPIDS library.

Carol McDonald
on Feb 25, 2020
Architecture & Design

The Distributed Data Mesh as a Solution to Centralized Data Monoliths

Instead of building large, centralized data platforms, corporations and data architects should create distributed data meshes.

Thomas Betts
on Jan 31, 2020
