Big Data Content on InfoQ
-
Developing Real-time Data Pipelines with Apache Kafka
Joe Stein introduces developers to why and how to use Apache Kafka. Apache Kafka is a publish-subscribe messaging system rethought as a distributed commit log.
-
Apache Spark for Big Data Processing
Ilayaperumal Gopinathan and Ludwine Probst discuss Spark and its ecosystem, in particular Spark Streaming and MLlib, providing a concrete example, and showing how to use Spark with Spring XD.
-
The Lego Model for Machine Learning Pipelines
Leah McGuire describes the machine learning platform Salesforce wrote on top of Spark to modularize data cleaning and feature engineering.
-
Tuning Java for Big Data
Scott Seighman discusses causes of common performance issues in Big Data environments, heap size, garbage collection, JVM reuse tuning guidelines and Big Data performance analysis tools.
-
Ground-up Introduction to In-memory Data
Viktor Gamov covers In-Memory technology, distributed data topologies, making in-memory reliable, scalable and durable, when to use NoSQL, and techniques for Big In-Memory Data.
-
Pulsar: Real-time Analytics at Scale
Sharad Murthy & Tony Ng present Pulsar, a real-time streaming system which can scale to millions of events per second with high availability and 4GL language support.
-
Exploratory Data Analysis with R
Matthew Renze introduces the R programming language and demonstrates how R can be used for exploratory data analysis.
-
Spreadsheets for Developers
Felienne Hermans presents various algorithms that outline the power of Excel, showing that spreadsheets are fit for TDD and rapid prototyping.
-
The Many Faces of Apache Kafka: How is Kafka Used in Practice
Neha Narkhede discusses how companies are using Apache Kafka and where it fits in the Big Data ecosystem.
-
Financial Modeling with Apache Spark: Calculating Value at Risk
Sandy Ryza aims to give a feel for what it is like to approach financial modeling with modern big data tools, using the Monte Carlo method for a basic VaR calculation with Spark.
-
Lightning Fast Cluster Computing with Spark and Cassandra
Piotr Kołaczkowski discusses how Spark was integrated with Cassandra, how the integration works in practice, and why it is better than using a Hadoop intermediate layer.
-
Translating Imperative Code to MapReduce
The authors present an approach for automatic translation of sequential, imperative code into a parallel MapReduce framework using Mold, translating Java code to run on Apache Spark.