InfoQ

Cascading - Data Processing API for Hadoop MapReduce

Posted by R.J. Lorimer on Oct 10, 2008 01:09 PM

Community: Java
Topics: Cloud Computing
Tags: Hadoop, MapReduce

Cascading is a new API for data processing on Hadoop clusters. It supports building complex processing workflows through an expressive API, as opposed to implementing Hadoop MapReduce algorithms directly.
The processing API lets the developer quickly assemble complex distributed processes without having to "think" in MapReduce, and efficiently schedule them based on their dependencies and other available metadata.
The core concepts of the Cascading API are pipes and flows. A pipe is a series of processing steps (parsing, looping, filtering, etc.) that defines the data processing to be done, and a flow is the association of a pipe (or set of pipes) with a data source and a data sink. In other words, a flow is a pipe with data flowing through it. Going one step further, a cascade is the chaining, branching and grouping of multiple flows.
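
The pipe and flow distinction can be illustrated with a minimal sketch in plain Java. This is not the Cascading API: the class names PipeSketch and FlowSketch, and their methods, are hypothetical and exist only to show that a pipe describes processing steps without holding data, while a flow binds a pipe to a source and a sink. In-memory lists stand in for the data source and data sink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical model, NOT the Cascading API: a pipe is an ordered list of
// processing steps and holds no data of its own.
final class PipeSketch {
    private final List<Function<Stream<String>, Stream<String>>> steps = new ArrayList<>();

    // Append a step that transforms every record.
    PipeSketch map(Function<String, String> fn) {
        steps.add(s -> s.map(fn));
        return this;
    }

    // Append a step that keeps only records matching the predicate.
    PipeSketch filter(Predicate<String> keep) {
        steps.add(s -> s.filter(keep));
        return this;
    }

    // Run the steps, in the order they were added, over an input stream.
    Stream<String> apply(Stream<String> input) {
        Stream<String> current = input;
        for (Function<Stream<String>, Stream<String>> step : steps) {
            current = step.apply(current);
        }
        return current;
    }
}

// Hypothetical model: a flow associates a pipe with a data source and a
// data sink (both plain lists here, as an assumption for the sketch).
final class FlowSketch {
    private final List<String> source;
    private final PipeSketch pipe;
    private final List<String> sink;

    FlowSketch(List<String> source, PipeSketch pipe, List<String> sink) {
        this.source = source;
        this.pipe = pipe;
        this.sink = sink;
    }

    // Push every source record through the pipe and collect the results in the sink.
    void complete() {
        sink.addAll(pipe.apply(source.stream()).collect(Collectors.toList()));
    }
}
```

For instance, a pipe built as `new PipeSketch().map(String::trim).filter(line -> line.contains("GET"))` can be reused in several flows, each with its own source and sink, which is the sense in which a flow is a pipe with data flowing through it.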

There are a number of key features provided by this API:
  • Dependency-Based 'Topological Scheduler' and MapReduce Planning - Two key capabilities of the Cascading API are its scheduler and its planner. The scheduler invokes flows based on their dependencies, so execution order is independent of construction order, which often allows portions of flows and cascades to run concurrently. The planner converts the steps of the various flows into MapReduce invocations against the Hadoop cluster.
  • Event Notification - The various steps of a flow can send notifications via callbacks, allowing the host application to report on and respond to the progress of the data processing.
  • Scriptable - The Cascading API has scriptable interfaces for Jython, Groovy, and JRuby, making it readily accessible from popular dynamic JVM languages.
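
The idea behind dependency-based scheduling can be sketched in plain Java as a topological sort that groups flows into "waves". This is an illustration under stated assumptions, not Cascading's scheduler: the class FlowSchedulerSketch, its schedule method, and the representation of flows as names with dependency sets are all hypothetical. Flows in the same wave have no unmet dependencies and could, in principle, be invoked concurrently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, NOT Cascading's scheduler: orders flows by dependency
// rather than by the order in which they were declared.
final class FlowSchedulerSketch {

    // deps maps each flow name to the names of the flows whose output it reads.
    // Returns waves of flow names; every flow in a wave depends only on flows
    // from earlier waves.
    static List<List<String>> schedule(Map<String, Set<String>> deps) {
        for (Set<String> needs : deps.values()) {
            if (!deps.keySet().containsAll(needs)) {
                throw new IllegalArgumentException("dependency on an undeclared flow: " + needs);
            }
        }

        // Work on a copy so the caller's map is left untouched.
        Map<String, Set<String>> remaining = new LinkedHashMap<>();
        deps.forEach((flow, needs) -> remaining.put(flow, new TreeSet<>(needs)));

        List<List<String>> waves = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // A flow is ready once all of its dependencies have completed.
            List<String> ready = new ArrayList<>();
            for (Map.Entry<String, Set<String>> entry : remaining.entrySet()) {
                if (entry.getValue().isEmpty()) {
                    ready.add(entry.getKey());
                }
            }
            if (ready.isEmpty()) {
                throw new IllegalStateException("dependency cycle among flows: " + remaining.keySet());
            }
            Collections.sort(ready); // deterministic order within a wave

            for (String done : ready) {
                remaining.remove(done);
            }
            for (Set<String> needs : remaining.values()) {
                needs.removeAll(ready);
            }
            waves.add(ready);
        }
        return waves;
    }
}
```

Because readiness is derived from the dependency sets alone, declaring a downstream flow before the flows it reads from does not change the resulting waves, which mirrors the point that execution order is independent of construction order.
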
Several documents are available for learning the concepts and implementation of the Cascading API. An introductory overview presentation (PDF) walks through the core concepts at a high level. A 'gentle introduction' example walks through creating a simple Apache log parser. Lastly, there is a full Javadoc of the Cascading API.
