Twitter has open sourced their MapReduce streaming framework, called Summingbird. Available under the Apache 2 license, Summingbird is a large-scale data processing system enabling developers to uniformly execute code in either batch-mode (Hadoop/MapReduce-based) or stream-mode (Storm-based) or a combination thereof, called hybrid mode.
In order for Twitter to be able to keep up processing 500 millions tweets and growing, they had to find a replacement for their existing stack that required manually integrating MapReduce (Pig/Scalding) and streaming-based (Storm) code. The main motivation to create Summingbird, mentioned by the Twitter engineers, came from the realization that running a fully real-time system on Storm was difficult due to:
- Re-computation over months of historical logs must be coordinated with Hadoop or streamed through Storm with a custom log-loading mechanism.
- Storm is focused on message passing and random-write databases are harder to maintain.
This insight led to Summingbird, a flexible and general solution addressing the engineers’ practical issues with the existing approach:
- Two sets of aggregation logic have to be kept in sync in two different systems
- Keys and values must be serialized consistently between each system and the client
- The client is responsible for reading from both datastores, performing a final aggregation and serving the combined results
Summingbird is also one of the first openly available Lambda Architecture compliant systems. Similar projects include Yahoo’s Storm-YARN and a Spanish start-up’s upcoming Lambdoop, a Java framework for developing Big Data applications in a Lambda Architecture conformant way. The characteristics of the Lambda Architecture - immutable master dataset and the combination of batch, serving, and speed layer - enables people to build robust large-scale data processing systems that can deal with both batch and stream processing and has use cases from social media platforms (such as Twitter, LinkedIn, etc.) over the Internet of Things (smart city, wearables, manufacturing, etc.) to the financial sector (fraud detection, recommendations).
The main authors of Summingbird - Oscar Boykin, Sam Ritchie (nephew of computer science legend Dennis Ritchie) and Ashutosh Singhal - have further revealed the roadmap for Summingbird:
- support for Apache Spark and the columnar data storage format Parquet
- libraries of higher-level mathematics and machine learning code on top of Summingbird’s Producer primitives, and
- deeper integration with related open source projects such as Algebird or Storehaus.