Data Pipelines Content on InfoQ
-
AWS Publishes Reference Architecture and Implementations for Deployment Pipelines
AWS recently released a reference architecture and a set of reference implementations for deployment pipelines. The recommended architectural patterns are based on best practices and lessons learned at Amazon and on customer projects.
-
AWS Glue Now Supports Crawler History
AWS recently launched support for histories of AWS Glue Crawlers, which allows the interrogation of Crawler executions and associated schema changes for the last 12 months.
-
Shopify’s Practical Guidelines from Running Airflow for ML and Data Workflows at Scale
In a post on the company's blog, Shopify Engineering shared its experience scaling and optimizing Apache Airflow for running ML and data workflows. The post offers practical solutions to the challenges the team faced, including slow file access, insufficient control over DAGs, irregular traffic levels, and resource contention among workloads.
-
Data Collection, Standardization and Usage at Scale in the Uber Rider App
Uber Engineering recently described how it collects, standardises and uses data from the Uber Rider app. Rider data comprises all of a rider's interactions with the Uber app and accounts for billions of events from Uber's online systems every day. Uber uses this data to address key problem areas such as increasing funnel conversion and user engagement.
-
QCon Plus November 2021 is Now Hybrid. Attend Online and In-Person (NY & SF)
The QCon Plus software development conference will be back November 1-5, 2021 - online and in-person. Get the chance to engage and network with professionals driving change and innovation inside the world’s most innovative software organizations.
-
Airbnb Builds Himeji - a Scalable Centralized Authorization System
Airbnb recently described how it built Himeji, a scalable centralized authorization system. Himeji stores permissions data and performs permission checks as a central source of truth. It uses a sharded and replicated in-memory cache to improve performance and lower latencies and has served checks in production for about a year.
-
Designing for Failure in the BBC's Analytics Platform
Last week at InfoQ Live, Blanca Garcia-Gil, principal systems engineer at BBC, gave a session on Evolving Analytics in the Data Platform. During this session, Garcia-Gil focused on how her team prepared and designed for two types of failure - "known unknowns" and "unknown unknowns."
-
PayPal Standardizes on Apache Airflow and Apache Gobblin for Its Next-Gen Data Movement Platform
PayPal recently described how it standardized on Apache Airflow and Apache Gobblin for implementing its next-gen data movement platform. In a recent blog post, PayPal engineers detail how the existing data movement platform had evolved into a complex and unmanageable ecosystem of many tools and platforms, and how they shifted towards a new implementation.
-
Data Mesh Principles and Logical Architecture Defined
The concept of a data mesh provides new ways to address common problems around managing data at scale. Zhamak Dehghani has provided additional clarity around the four principles of a data mesh, with a corresponding logical architecture and organizational structure.
-
AWS Introduces Amazon Managed Workflows for Apache Airflow
Recently, AWS introduced Amazon Managed Workflows for Apache Airflow (MWAA), a fully managed service that simplifies running open-source versions of Apache Airflow on AWS and building workflows to execute extract-transform-load (ETL) jobs and data pipelines.
-
Accelerating Machine Learning Lifecycle with a Feature Store
A feature store is a core part of next-generation ML platforms that helps data scientists accelerate the delivery of ML applications. Mike Del Balso and Geoff Sims recently spoke at the Spark AI Summit 2020 Conference about feature store-driven ML development.
-
Amazon Introduces the New Streaming ETL Feature on AWS Glue
Recently, Amazon announced that AWS Glue now supports streaming ETL. With this new feature, customers can set up continuous ingestion pipelines that prepare streaming data on the fly and make it available for analysis in seconds.
-
KSQL Now Available on Confluent Cloud
KSQL is the streaming SQL engine for Apache Kafka, and it is now available as a fully managed service on the Confluent Cloud Platform for all customers on usage-based billing plans. In a recent blog post, Confluent announced the availability of Confluent Cloud KSQL.
-
Michael Berthold on End-to-End Data Science Using KNIME Software
Michael Berthold, CEO and co-founder of the open source data analytics platform KNIME, gave the keynote presentation at the KNIME Fall Summit 2019 Conference. He spoke about the end-to-end data science cycle, which mainly falls into two categories: create and productionize.
-
High-Performance Data Processing with Spring Cloud Data Flow and Geode
Cahlen Humphreys and Tiffany Chang spoke recently at the SpringOne Platform 2019 Conference about data processing with the Spring Cloud Data Flow and Apache Geode frameworks.