Data Pipelines Content on InfoQ
-
How Agoda Unified Multiple Data Pipelines into a Single Source of Truth
Agoda recently described how it consolidated multiple independent data pipelines into a centralized Apache Spark-based platform to eliminate inconsistencies in financial data. The company implemented a multi-layered quality framework that combines automated validations, machine-learning-based anomaly detection, and data contracts, while processing millions of daily booking transactions.
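Agoda's framework itself is internal, but a minimal PySpark sketch illustrates the kind of automated validation such a quality layer runs; the table and column names below are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("booking-quality-checks").getOrCreate()

# Hypothetical tables: the consolidated ledger and a derived reporting view.
source = spark.table("finance.bookings_ledger")
derived = spark.table("reporting.bookings_daily")

# Check 1: row-count reconciliation between the two layers.
src_count, dst_count = source.count(), derived.count()
assert src_count == dst_count, f"row count mismatch: {src_count} vs {dst_count}"

# Check 2: no null booking IDs or negative amounts downstream.
bad_rows = derived.filter(
    F.col("booking_id").isNull() | (F.col("amount") < 0)
).count()
assert bad_rows == 0, f"{bad_rows} rows failed validation"
```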
-
Solving Fragmented Mobile Analytics: Uber’s Platform-Led Approach
Uber Engineering outlines its platform-led mobile analytics redesign, standardizing event instrumentation across iOS and Android to improve cross-platform consistency, reduce engineering effort, and provide reliable insights for product and data teams.
-
Cloudflare Workflows Adds Python Support for Durable AI Pipelines
Cloudflare Workflows now supports Python alongside TypeScript, letting developers orchestrate complex, multi-step applications. With durable execution and state persistence, Workflows simplifies building reliable data pipelines and AI/ML workloads, and brings its orchestration model and concurrency features to Python developers.
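A minimal Python Workflow sketch, assuming the documented `@step.do` step pattern; the step names and data are illustrative:

```python
from workers import WorkflowEntrypoint

class DataPipeline(WorkflowEntrypoint):
    async def run(self, event, step):
        @step.do("extract")
        async def extract():
            # Stand-in for fetching source records.
            return [1, 2, 3]

        records = await extract()

        @step.do("transform")
        async def transform():
            # Step results are persisted, so a retried run resumes
            # here without re-running "extract".
            return [r * 2 for r in records]

        await transform()
```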
-
Inside Atlassian Lithium: How a Dynamic ETL Platform is Transforming Data Movement and Cutting Costs
Atlassian recently introduced Lithium, an in-house ETL platform designed to meet the requirements of dynamic data movement. Lithium streamlines tasks such as cloud migrations, scheduled backups, and in-flight data validations by supporting ephemeral pipelines and tenant-level isolation while ensuring efficiency and scalability, resulting in significant cost savings.
-
Netflix Enhances Metaflow with New Configuration Capabilities
Netflix has introduced a significant enhancement to its Metaflow machine learning infrastructure: a new Config object that brings powerful configuration management to ML workflows. This addition addresses a common challenge faced by Netflix's teams, which manage thousands of unique Metaflow flows across diverse ML and AI use cases.
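In outline, a flow declares a `Config` as a class attribute and reads it like any other artifact; the `config.json` key below is illustrative:

```python
from metaflow import FlowSpec, Config, step

class TrainFlow(FlowSpec):
    # Loads config.json when the flow is deployed; values are then
    # frozen and versioned together with the run.
    config = Config("config", default="config.json")

    @step
    def start(self):
        print("learning rate:", self.config.learning_rate)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()
```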
-
How Allegro Reduced the Cost of Running a GCP Dataflow Pipeline by 60%
Allegro cut the cost of one of its Dataflow pipelines running on GCP by 60%. The company continues to improve the cost-effectiveness of its data workflows by evaluating resource utilization, tuning pipeline configurations, optimizing input and output datasets, and improving storage strategies.
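The exact changes are specific to Allegro's jobs, but worker sizing is one of the levers; a sketch of cost-oriented pipeline options in the Beam Python SDK, with illustrative values:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative knobs only; Allegro's savings also came from dataset
# and storage optimizations, not just worker settings.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",          # hypothetical project ID
    region="europe-west1",
    machine_type="e2-standard-2",  # right-size workers for the job
    max_num_workers=10,            # cap autoscaling to bound cost
    disk_size_gb=50,               # shrink the default worker disk
)

with beam.Pipeline(options=options) as p:
    pass  # pipeline transforms go here
```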
-
Canva Opts for Amazon KDS over SNS+SQS to Save 85% with 25 Billion Events per Day
Canva evaluated different data messaging solutions for its Product Analytics Platform, including the combination of AWS SNS and SQS, Amazon MSK, and Amazon KDS, and eventually chose KDS, primarily because of its much lower cost. The company compared many aspects of these solutions, including performance, maintenance effort, and cost.
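For scale context, producing to KDS is a thin API: events are batched into `PutRecords` calls of up to 500 records each. A boto3 sketch with a hypothetical stream name and event shape:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical analytics events.
events = [{"event": "page_view", "user": i} for i in range(100)]

response = kinesis.put_records(
    StreamName="product-analytics",  # hypothetical stream name
    Records=[
        {"Data": json.dumps(e).encode(), "PartitionKey": str(e["user"])}
        for e in events
    ],
)
print("failed records:", response["FailedRecordCount"])
```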
-
Local Emulator for Azure Event Hubs in Preview: Offering Developers a Local Development Experience
Microsoft recently launched a preview release of a local emulator for Azure Event Hubs. The emulator is designed to give developers a local development experience for Azure Event Hubs, allowing them to develop and test code against the service in isolation.
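Code targeting the emulator uses the regular SDK with a local connection string; the format below follows Microsoft's documented emulator string, with an illustrative key and hub name:

```python
from azure.eventhub import EventData, EventHubProducerClient

# Documented emulator connection-string format; the key and hub name
# are illustrative and must match the emulator's configuration.
CONN_STR = (
    "Endpoint=sb://localhost;"
    "SharedAccessKeyName=RootManageSharedAccessKey;"
    "SharedAccessKey=SAS_KEY_VALUE;"
    "UseDevelopmentEmulator=true;"
)

producer = EventHubProducerClient.from_connection_string(
    CONN_STR, eventhub_name="eh1"
)
with producer:
    batch = producer.create_batch()
    batch.add(EventData("test event"))
    producer.send_batch(batch)
```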
-
Yelp Overhauls Its Streaming Architecture with Apache Beam and Apache Flink
Yelp reworked its data streaming architecture by employing Apache Beam and Apache Flink. The company replaced a fragmented set of pipelines that streamed transactional data into its analytical systems, such as Amazon Redshift and an in-house data lake, with a unified and flexible solution built on the two Apache streaming projects.
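The shape of such a pipeline in the Beam Python SDK, reduced to a sketch (broker, topic, and sink are stand-ins; Yelp's real sinks are Redshift and its data lake):

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (
        p
        # ReadFromKafka is a cross-language transform and needs a
        # Java expansion service available at runtime.
        | ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker:9092"},
            topics=["transactions"],
        )
        | beam.Map(lambda kv: kv[1].decode("utf-8"))  # keep message value
        | beam.Map(print)  # stand-in for the Redshift/data-lake sinks
    )
```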
-
Netflix Creates Incremental Processing Solution Using Maestro and Apache Iceberg
Netflix created a new solution for incremental processing in its data platform. The incremental approach reduces the cost of computing resources and execution time significantly as it avoids processing complete datasets. The company used its Maestro workflow engine and Apache Iceberg to improve data freshness and accuracy and plans to provide managed backfill capabilities.
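The Iceberg primitive underneath is an incremental read between table snapshots, which is what lets a run skip already-processed data; a PySpark sketch with placeholder snapshot IDs and table name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-read").getOrCreate()

# Read only rows appended between two table snapshots instead of
# scanning the full dataset.
increment = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "1000")
    .option("end-snapshot-id", "2000")
    .load("db.events")
)
increment.show()
```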
-
Goldsky’s Streaming-First Architecture for Blockchain Data with Flink, Redpanda and Kubernetes
Goldsky created a platform for the real-time processing of blockchain data. The platform allows clients to extract data from blockchains into their own databases to support product features, but without running the data pipeline infrastructure. The event-driven architecture (EDA) of Goldsky leverages Apache Flink, Redpanda, Kubernetes, and cloud provider services.
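Because Redpanda speaks the Kafka protocol, a Flink job consumes it through the standard Kafka connector; a PyFlink sketch with a hypothetical broker and topic (the Kafka connector jar must be on the classpath):

```python
from pyflink.common import WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaOffsetsInitializer,
    KafkaSource,
)

env = StreamExecutionEnvironment.get_execution_environment()

source = (
    KafkaSource.builder()
    .set_bootstrap_servers("redpanda:9092")   # hypothetical broker
    .set_topics("chain.blocks")               # hypothetical topic
    .set_group_id("blocks-consumer")
    .set_starting_offsets(KafkaOffsetsInitializer.earliest())
    .set_value_only_deserializer(SimpleStringSchema())
    .build()
)

stream = env.from_source(source, WatermarkStrategy.no_watermarks(), "blocks")
stream.print()  # stand-in for writing into a customer database
env.execute("blockchain-stream-sketch")
```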
-
A Modern Compute Stack for Scaling Large AI, ML, & LLM Workloads at QCon SF
Jules Damji, a lead developer advocate at Anyscale Inc., discussed the difficulties data scientists encounter when managing infrastructure for machine learning models. He emphasized the necessity for a framework that supports the latest machine learning libraries, is easily manageable, and can scale to accommodate large datasets and models. Damji introduced Ray as a potential solution.
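Ray's core abstraction is small: decorate a function to make it a distributed task, then fan work out across the cluster. A minimal sketch:

```python
import ray

ray.init()  # on a cluster, this connects to the head node

@ray.remote
def score(batch):
    # Stand-in for model inference over one shard of a dataset.
    return sum(batch)

# Fan out ten tasks and gather their results.
futures = [score.remote(list(range(i, i + 10))) for i in range(0, 100, 10)]
print(ray.get(futures))
```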
-
Confluent Announces Apache Flink on Confluent Cloud in Open Preview
Confluent recently announced the open preview of Apache Flink on Confluent Cloud as a fully managed service for stream processing. The company claims the managed service will make it easier for organizations to filter, join, and enrich data streams with Flink.
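On Confluent Cloud the service is driven through Flink SQL; the same filter-and-enrich pattern can be sketched locally with PyFlink and a stand-in source (the table and columns are hypothetical):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Stand-in source; on Confluent Cloud a table maps to a Kafka topic.
t_env.execute_sql("""
    CREATE TEMPORARY TABLE orders (
        user_id INT,
        amount DOUBLE
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '1')
""")

# Filter and enrich the stream with SQL.
result = t_env.sql_query(
    "SELECT user_id, amount * 1.2 AS amount_with_tax "
    "FROM orders WHERE amount > 0.5"
)
result.execute().print()
```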
-
Running Apache Flink Applications on AWS KDA: Lessons Learnt at Deliveroo
Deliveroo introduced Apache Flink into its technology stack for enriching and merging events consumed from Apache Kafka or Kinesis Streams. The company opted to use AWS Kinesis Data Analytics (KDA) service to manage Apache Flink clusters on AWS and shared its experiences from running Flink applications on KDA.
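Operationally, KDA takes a packaged Flink job from S3 and runs the cluster for you; a boto3 sketch with illustrative names, ARNs, and runtime version:

```python
import boto3

kda = boto3.client("kinesisanalyticsv2")

# All names, ARNs, and the runtime version below are illustrative.
kda.create_application(
    ApplicationName="event-enrichment",
    RuntimeEnvironment="FLINK-1_15",
    ServiceExecutionRole="arn:aws:iam::123456789012:role/kda-role",
    ApplicationConfiguration={
        "ApplicationCodeConfiguration": {
            "CodeContent": {
                "S3ContentLocation": {
                    "BucketARN": "arn:aws:s3:::flink-artifacts",
                    "FileKey": "enrichment-job.jar",
                }
            },
            "CodeContentType": "ZIPFILE",
        }
    },
)
kda.start_application(ApplicationName="event-enrichment", RunConfiguration={})
```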
-
Pfizer Uses Serverless Architecture on AWS to Scale Processing of Digital Biomarkers
Pfizer upgraded its serverless architecture for processing digital biomarker data at scale to make it more flexible and configurable. The team created a framework that combines a file-processing pipeline built with AWS Step Functions and other serverless services with a custom Python package for data ingestion and processing.
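In such a pipeline an execution is typically started per arriving file; a boto3 sketch with a hypothetical state machine ARN and payload:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# State-machine ARN and input shape are hypothetical; the payload
# usually points at the newly arrived data object.
response = sfn.start_execution(
    stateMachineArn=(
        "arn:aws:states:us-east-1:123456789012:"
        "stateMachine:biomarker-file-pipeline"
    ),
    input=json.dumps({
        "bucket": "raw-sensor-data",
        "key": "device-123/2023-01-01.avro",
    }),
)
print(response["executionArn"])
```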