Version 1.0 is "a major milestone in the evolution of Apache Storm", writes Apache Software Foundation VP for Apache Storm P. Taylor Goetz, and it includes many new features and improvements. In particular, Goetz claims a 3x–16x boost in performance.
Storm is an event processor that enables distributed processing of streaming data. A Storm application is made of “spouts” and “bolts” configured in a direct acyclic graph to represent information sources and data handlers. The main trait of Storm is its ability to process real time data, as opposed to batch processing as allowed, e.g., by Hadoop.
According to Goetz, Storm 1.0 is up to 16 times faster than previous versions, whereas for most use cases a 3x boost can be expected. In particular, it seems that a big performance improvements came from the following changes:
- reimplementing Clojure
reduce
function in Java inside of theSpoutOutputCollector.emit()
call; - introducing batching in
DisruptorQueue
instead of at the spout level, which provide huge throughput improvements at the cost of some more latency.
Performance, in particular, has traditionally been one of the main competitive edges of Storm against two other popular distributed processing frameworks such as Apache Flink and Apache Spark, as extensive benchmarking by Yahoo engineers showed.
Additionally, Storm 1.0 includes many notable new features, such as:
- Pacemaker, a heartbeat daemon to process worker heartbeats that should provide better performance than ZooKeeper thanks to its being in-memory;
- Distributed cache and its related API, which allows to share files among topologies. Files can be updated at any moment without requiring to redeploy the affected topologies. This is an improvement on current practice of including resource files within the topology jar, with the effect that a file update requires a redeploy.
- High Availability Nimbus, which replace the single Nimbus instance with a dynamic cluster of Nimbus nodes where a new “leader” is elected when the current one fails.
- Streaming Window API, which adds supports for the definition of windows to be applied to data processing, such as when computing the top trending topic in the last hour. Previously, developers had to build their own windowing logic’s.
- automatic backpressure, which will automatically slow down a topology’s spouts when a given limit expressed as a percentage of a task’s buffer size has been reached.
- Resource aware scheduler, a new scheduler implementation that will take into account memory and CPU resources available in a cluster to schedule tasks to the workers that best meet specified requirements.
- Dynamic worker profiling, aimed to make it possible to get worker profile data, such as heap dumps and JStack output, from Storm UI.
Apache Storm 1.0 can be downloaded from GitHub or in various packaged formats from the Storm download page.