Data Pipelines Content on InfoQ
-
Airbnb Builds Himeji - a Scalable Centralized Authorization System
Airbnb recently described how it built Himeji, a scalable centralized authorization system. Himeji stores permissions data and performs permission checks as a central source of truth. It uses a sharded and replicated in-memory cache to improve performance and lower latencies and has served checks in production for about a year.
-
Designing for Failure in the BBC's Analytics Platform
Last week at InfoQ Live, Blanca Garcia-Gil, principal systems engineer at the BBC, gave a session on Evolving Analytics in the Data Platform. During this session, Garcia-Gil focused on how her team prepared and designed for two types of failure: "known unknowns" and "unknown unknowns."
-
PayPal Standardizes on Apache Airflow and Apache Gobblin for Its Next-Gen Data Movement Platform
PayPal recently described how it standardized on Apache Airflow and Apache Gobblin for implementing its next-gen data movement platform. In a recent blog post, PayPal engineers detail how the existing data movement platform evolved into a complex and unmanageable ecosystem of many tools and platforms, and how they shifted towards a new implementation.
-
Data Mesh Principles and Logical Architecture Defined
The concept of a data mesh provides new ways to address common problems around managing data at scale. Zhamak Dehghani has provided additional clarity around the four principles of a data mesh, with a corresponding logical architecture and organizational structure.
-
AWS Introduces Amazon Managed Workflows for Apache Airflow
Recently, AWS introduced Amazon Managed Workflows for Apache Airflow (MWAA), a fully managed service that simplifies running open-source versions of Apache Airflow on AWS and building workflows to execute extract-transform-load (ETL) jobs and data pipelines.
-
Accelerating Machine Learning Lifecycle with a Feature Store
A feature store is a core part of next-generation ML platforms that empowers data scientists to accelerate the delivery of ML applications. Mike Del Balso and Geoff Sims recently spoke at the Spark AI Summit 2020 conference about feature-store-driven ML development.
-
Amazon Introduces the New Streaming ETL Feature on AWS Glue
Recently, Amazon announced that AWS Glue now supports streaming ETL. With this new feature, customers can set up continuous ingestion pipelines that prepare streaming data on the fly and make it available for analysis in seconds.
-
KSQL Now Available on Confluent Cloud
KSQL is the streaming SQL engine for Apache Kafka, and it is now available as a fully managed service on the Confluent Cloud platform for all customers on usage-based billing plans. In a recent blog post, Confluent announced the availability of Confluent Cloud KSQL.
-
Michael Berthold on End-to-End Data Science Using KNIME Software
Michael Berthold, CEO and co-founder of the open-source data analytics platform KNIME, gave the keynote presentation at this year's KNIME Fall Summit 2019 conference. He spoke about the end-to-end data science cycle, which mainly involves two categories of work: create and productionize.
-
High-Performance Data Processing with Spring Cloud Data Flow and Geode
Cahlen Humphreys and Tiffany Chang spoke recently at the SpringOne Platform 2019 conference about data processing with the Spring Cloud Data Flow and Apache Geode frameworks.
-
Data Lakes and Modern Data Architecture in Clinical Research and Healthcare
Dr. Prakriteswar Santikary, chief data officer at ERT, spoke at the Data Architecture Summit 2018 conference last month about the data lake architecture his team developed at their clinical research organization. He discussed the data platform deployed in the cloud to streamline data collection, aggregation, clinical reporting, and analytics, using concepts like serverless computing and data services.
-
Confluent Cloud, Apache Kafka as a Service in AWS
Apache Kafka is a distributed, fault-tolerant pub/sub messaging solution, originally developed by LinkedIn and later open-sourced. Confluent was formed by former LinkedIn engineers from the Kafka development group and today announced Confluent Cloud, a fully hosted and managed Apache Kafka as a Service in AWS. We also take a look at Confluent's second annual Streaming Data report and its findings.
-
Yelp Open-Sources Latest in Data Pipeline Project, Data Pipeline Client Library
Yelp has open-sourced the latest component in its data pipeline initiative, a Python-based data pipeline client library.
-
Reactive Summit 2016 Conference: Reactive Microservices and Staging Data Pipelines
Reactive microservices, the data center scale operating system (DCOS), and staging reactive data pipelines were the highlighted topics at the Reactive Summit 2016 conference held this week. The InfoQ team attended the conference, and this post summarizes the first day's events.