Apache Hadoop is de facto standard for Big Data storage and batch processing, while Tweeter Storm is quickly becoming a standard for large-scale event processing implementations. Unfortunately, up until recently, Storm and Hadoop required two physically different clusters for their implementation. Last week Yahoo! announced open sourcing Storm running on a Hadoop cluster.
According to Yahoo!, collocating real-time processing (Storm) with batch processing offers a number of advantages over segregated clusters:
- It provides a huge potential for elasticity. Real-time processing will rarely produce a constant and predictable load. As such, Storm needs more resources to keep up with spikes in demand. Collocating Storm with batch processing allows Storm to steal resources from batch jobs when needed and give them back when demand subsides. The Storm-YARN effort lays the groundwork to make this possible.
- Many applications use Storm for low-latency processing and Map/Reduce for batch processing while sharing data between Storm and Map/Reduce. By placing Storm physically closer to the data source and/or other components in the same pipeline we can reduce network transfers and in turn the total cost of acquiring the data.
The Storm-Hadoop integration is leveraging Hadoop’s new resource manager YARN
Storm-on-YARN enables Storm applications to utilize the computational resources [of] tens of thousands of Hadoop computation nodes. YARN is used to launch the Storm application master (Nimbus) on demand, and enables Nimbus to request resources for Storm application slaves (Supervisors).
Storm-YARN integration provides a standard Storm configuration file including YARN specific parameters, enabling the configuration of the initial number of supervisors to be launched and the memory size of the container to be allocated for each supervisor.
In addition, Yahoo! has enhanced Storm to support Hadoop style security mechanisms, which enables Storm applications to directly access Hadoop data stored on both HDFS and HBase.
According to Loraine Lawson:
One of the more promising applications of Hadoop and other Big Data solutions is the ability to deliver information in real time. It’s not mentioned often, which is unfortunate, because it’s really a game-changer for some organizations, with far-reaching implications for the rest of us.
The integration of real-time event processing implemented by Storm plus Hadoop along with real-time Hadoop queries brings us one step closer to that vision.