Spark Content on InfoQ
From Spark to Elasticsearch and Back - Learning Large-Scale Models for Content Recommendation
Sonya Liberman shares an algorithmic architecture that enables running complex models under difficult scale constraints and shortens the cycle between research and production.
Streaming for Personalization Datasets at Netflix
Shriya Arora discusses challenges faced with stream processing unbounded datasets, comparing microbatch with event-based approaches using Spark and Flink.
When Streams Fail: Kafka Off the Shore
Anton Gorshkov discusses how to evaluate and architect a resilient streaming platform, focusing on Kafka and Spark streaming and sharing his experience of using them to process financial transactions.
Data Preparation for Data Science: A Field Guide
Casey Stella presents a utility written with Apache Spark that automates data preparation by discovering missing values, values with skewed distributions, and likely errors within data.
Real-Time Recommendations Using Spark Streaming
Elliot Chow discusses the data pipeline that they built with Kafka, Spark Streaming, and Cassandra to process Netflix user activities in real time for the Trending Now row.
Data Science in the Cloud @StitchFix
Stefan Krawczyk discusses how StitchFix used the cloud to enable over 80 data scientists to be productive and have easy access, covering prototyping, algorithms used, keeping schema in sync, etc.
Machine Learning and End-to-End Data Analysis Processes in Spark Using Python and R
Debraj GuhaThakurta discusses ML and data analysis processes in Spark using examples written in Python and R.
MLeap: Release Spark ML Models
Hollin Wilkins discusses the reasons behind MLeap, outlines the programming time saved by using it, shows benchmarks of several online models, and provides a demo and examples of using it in practice.
Hydrator: Open Source, Code-Free Data Pipelines
Jonathan Gray introduces Hydrator, an open source framework and user interface for creating data lakes and for building and managing data pipelines on Spark, MapReduce, Spark Streaming and Tigon.
Exploring Wikipedia with Apache Spark: A Live Coding Demo
Sameer Farooqui demos connecting to the live stream of Wikipedia edits, building a dashboard showing what’s happening with Wikipedia datasets and how people are using them in real time.
Ingest & Stream Processing - What Will You Choose?
Pat Patterson and Ted Malaska talk about current and emerging data processing technologies, and the various ways of achieving "at least once" and "exactly once" timely data processing.
Monitoring and Troubleshooting Real-Time Data Pipelines
Alan Ngai and Premal Shah discuss best practices on monitoring distributed real-time data processing frameworks and how DevOps can gain control and visibility over these data pipelines.