ETL Content on InfoQ
-
Amazon RDS for MySQL Zero-ETL Integration with Amazon Redshift
Amazon RDS for MySQL's zero-ETL integration with Amazon Redshift is now generally available, enabling near real-time analytics and machine learning on transactional data. This powerful feature allows customized data replication from a single RDS database and facilitates seamless scalability, ensuring businesses gain insights while controlling costs and maintaining data freshness.
-
Amazon Q Data Integration in AWS Glue Simplifies Data Transformation on AWS
Recently, AWS announced the preview of a new feature for AWS Glue, enabling customers to use natural language for authoring and troubleshooting data integration jobs. With Amazon Q data integration in AWS Glue, developers can provide a description of their data integration workload, and the service will generate an ETL script.
-
Amazon OpenSearch Zero ETL with S3 and New OR1 Instances
Amazon has announced the preview of the Amazon OpenSearch Service zero-ETL (extract, transform, and load) integration with Amazon S3, offering a new way to analyze operational logs in Amazon S3 and S3-based data lakes without the need to switch between services. Amazon also announced the new OR1 instances for Amazon OpenSearch Service.
-
Confluent Announces Apache Flink on Confluent Cloud in Open Preview
Confluent recently announced the open preview of Apache Flink on Confluent Cloud as a fully-managed service for stream processing. The company claims that the managed service will make it easier for companies to filter, join, and enrich data streams with Flink.
-
Grammarly Replaces its in-House Data Lake with Databricks Platform Using Medallion Architecture
Grammarly adopted the medallion architecture while migrating from their in-house data lake, storing Parquet files in AWS S3, to the Delta Lake lakehouse. The company created a new event store for over 6000 event types from 40 internal and external clients and, in the process, improved data quality and reduced the data-delivery time by 94%.
-
AWS Glue Now Supports Crawler History
AWS recently launched support for histories of AWS Glue Crawlers, which allows the interrogation of Crawler executions and associated schema changes for the last 12 months.
-
Google Introduces Zero-ETL Approach to Analytics on Bigtable Data Using BigQuery
Recently, Google announced the general availability of Bigtable federated queries with BigQuery, allowing customers to query data residing in Bigtable from BigQuery faster. Moreover, queries run without moving or copying the data, are available in all Google Cloud regions, and come with increased federated query concurrency limits, closing the longstanding gap between operational data and analytics.
-
Data Mesh Principles and Logical Architecture Defined
The concept of a data mesh provides new ways to address common problems around managing data at scale. Zhamak Dehghani has provided additional clarity around the four principles of a data mesh, with a corresponding logical architecture and organizational structure.
-
Google Announces a New, More Services-Based Architecture Called Runner V2 to Dataflow
Google Cloud Dataflow is a fully-managed service for executing Apache Beam pipelines within the Google Cloud Platform (GCP). In a recent blog post, Google announced a new, more services-based architecture for Dataflow called Runner v2, which will include multi-language support for all of its language SDKs.
-
Amazon Announces the General Availability of AWS Glue 2.0
AWS Glue is a fully-managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. With AWS Glue, customers don’t have to provision or manage any resources, and only pay for resources when the service is running.
-
Boosting Apache Spark with GPUs and the RAPIDS Library
At the 2019 Spark AI Summit Europe conference, NVIDIA software engineers Thomas Graves and Miguel Martinez hosted a session titled "Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS Library". InfoQ recently talked with Jim Scott, head of developer relations at NVIDIA, to learn more about accelerating Apache Spark with GPUs and the RAPIDS library.
-
The Distributed Data Mesh as a Solution to Centralized Data Monoliths
Instead of building large, centralized data platforms, corporations and data architects should create distributed data meshes.
-
Microsoft Announces Azure Synapse for Data Warehousing and Analytics
During Microsoft's annual Ignite conference, the company announced a new analytics service called Azure Synapse. The service, which is a continuation of Azure SQL Data Warehouse, focuses on bringing enterprise data warehousing and big data analytics into a single service.
-
High-Performance Data Processing with Spring Cloud Data Flow and Geode
Cahlen Humphreys and Tiffany Chang spoke recently at the SpringOne Platform 2019 Conference about data processing with Spring Cloud Data Flow and Apache Geode frameworks.
-
Simplifying ETL in the Cloud, Microsoft Releases Azure Data Factory Mapping Data Flows
In a recent blog post, Microsoft announced the general availability (GA) of their serverless, code-free Extract-Transform-Load (ETL) capability inside of Azure Data Factory called Mapping Data Flows. This tool allows organizations to embrace a data-driven culture without the need to manage large infrastructure footprints while having the ability to dynamically scale data processing workloads.