Data Pipelines Content on InfoQ
-
Yelp Rebuilds Corrupted Cassandra Cluster Using Its Data Streaming Architecture
Yelp created a solution to sanitize data from a corrupted Apache Cassandra cluster using its data streaming architecture. The team explored many potential options to address the data corruption issue but ultimately had to move the data into a new cluster, removing corrupted records in the process.
-
Instacart Creates a Self-Serve Apache Flink Platform on Kubernetes
Instacart moved their Apache Flink workloads from AWS EMR to Kubernetes to meet the high demand for data processing use cases using Flink within the organization, as using EMR became problematic for many teams with different requirements. As a result, they made the platform easier to use and reduced their operational and infrastructure costs.
-
Strategies and Principles to Scale and Evolve MLOps - at QCon London
At the QCon London conference, Hien Luu, senior engineering manager for the Machine Learning Platform at DoorDash, discussed strategies and principles for scaling and evolving MLOps. With 85% of ML projects failing, understanding MLOps at an engineering level is crucial. Luu shared three core principles: "Dream Big, Start Small," "1% Better Every Day," and "Customer Obsession."
-
AWS Publishes Reference Architecture and Implementations for Deployment Pipelines
AWS recently released a reference architecture and a set of reference implementations for deployment pipelines. The recommended architectural patterns are based on best practices and lessons collected at Amazon and customer projects.
-
AWS Glue Now Supports Crawler History
AWS recently launched support for histories of AWS Glue Crawlers, which allows the interrogation of Crawler executions and associated schema changes for the last 12 months.
-
Data Collection, Standardization and Usage at Scale in the Uber Rider App
Uber Engineering recently published how it collects, standardises and uses data from the Uber Rider app. Rider data comprises all of a rider's interactions with the Uber app and accounts for billions of events from Uber's online systems every day. Uber uses this data to address key problem areas such as increasing funnel conversion and user engagement.
-
Designing for Failure in the BBC's Analytics Platform
Last week at InfoQ Live, Blanca Garcia-Gil, principal systems engineer at the BBC, gave a session on Evolving Analytics in the Data Platform. During this session, Garcia-Gil focused on how her team prepared and designed for two types of failure: "known unknowns" and "unknown unknowns."
-
PayPal Standardizes on Apache Airflow and Apache Gobblin for Its Next-Gen Data Movement Platform
PayPal recently described how it standardized on Apache Airflow and Apache Gobblin for implementing its next-gen data movement platform. In a recent blog post, PayPal engineers detail how the existing data movement platform had grown into a complex, unmanageable ecosystem of many tools and platforms, and describe their shift towards a new implementation.
-
Data Mesh Principles and Logical Architecture Defined
The concept of a data mesh provides new ways to address common problems around managing data at scale. Zhamak Dehghani has provided additional clarity around the four principles of a data mesh, with a corresponding logical architecture and organizational structure.
-
Accelerating Machine Learning Lifecycle with a Feature Store
A feature store is a core part of next-generation ML platforms that helps data scientists accelerate the delivery of ML applications. Mike Del Balso and Geoff Sims recently spoke at the Spark AI Summit 2020 conference about feature-store-driven ML development.
-
Amazon Introduces the New Streaming ETL Feature on AWS Glue
Amazon recently announced that AWS Glue now supports streaming ETL. With this new feature, customers can set up continuous ingestion pipelines that prepare streaming data on the fly and make it available for analysis in seconds.
-
KSQL Now Available on Confluent Cloud
KSQL is the streaming SQL engine for Apache Kafka, and it is now available as a fully managed service on the Confluent Cloud Platform for all customers on usage-based billing plans. Confluent announced the availability of Confluent Cloud KSQL in a recent blog post.
-
Michael Berthold on End-to-End Data Science Using KNIME Software
Michael Berthold, CEO and co-founder of the open source data analytics platform KNIME, gave the keynote presentation at the KNIME Fall Summit 2019 conference. He spoke about the end-to-end data science cycle, which mainly comprises two categories: create and productionize.
-
High-Performance Data Processing with Spring Cloud Data Flow and Geode
Cahlen Humphreys and Tiffany Chang spoke recently at the SpringOne Platform 2019 conference about data processing with the Spring Cloud Data Flow and Apache Geode frameworks.