Real-time processing of BigData is one of the hottest topics today. Nokia has just released a new open-source project, Dempsy. Dempsy is comparable to Storm, Esper, Streambase, HStreaming and Apache S4. The code is released under the Apache 2 license.
Dempsy is meant to solve the problem of processing large amounts of "near real time" stream data with the lowest lag possible; problems where latency is more important than "guaranteed delivery." This class of problems includes use cases such as:
- Real time monitoring of large distributed systems
- Processing complete rich streams of social networking data
- Real time analytics on log information generated from widely distributed systems
- Statistical analytics on real-time vehicle traffic information on a global basis
The important properties of Dempsy are:
- It is Distributed. That is to say a Dempsy application can run on multiple JVMs on multiple physical machines.
- It is Elastic. That is, it is relatively simple to scale an application to more (or fewer) nodes. This requires no code or configuration changes; it is done by dynamically inserting or removing processing nodes.
- It implements Message Processing. Dempsy is based on message passing. It moves messages between Message Processors, which act on the messages to perform simple atomic operations such as enrichment, transformation, etc. In general, an application is intended to be broken down into many small, simple processors rather than a few large, complex ones.
- It is a Framework. It is not an application container like a J2EE container, nor a simple library. Instead, like the Spring Framework, it is a collection of patterns, the libraries to enable those patterns, and the interfaces one must implement to use those libraries to implement the patterns.
Dempsy's programming model is based on message processors communicating via messages, and resembles a distributed actor framework. While not strictly speaking an actor framework in the sense of Erlang or Akka actors, where actors explicitly direct messages to other actors, Dempsy's Message Processors are "actor-like POJOs" similar to Processing Elements in S4 and, to some extent, Bolts in Storm. Message Processors are similar to actors in that they operate on a single message at a time and need not deal with concurrency directly. Unlike actors, Message Processors are also relieved of the need to know the destination(s) for their output messages, as this is handled internally by Dempsy based on the message properties.
In short Dempsy is a framework to enable the decomposing of a large class of message processing problems into flows of messages between relatively simple processing units implemented as POJOs.
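To make the "actor-like POJO" idea concrete, here is an illustrative sketch in plain Java. The class name and method are hypothetical, not the actual Dempsy API; the point is that the processor has no framework imports and no concurrency handling of its own, since the framework delivers one message at a time to each keyed instance.

```java
// Hypothetical message processor illustrating the "actor-like POJO" idea.
// Plain Java only: no framework dependency, no locking. The framework
// (not this class) decides which instance handles which word, based on
// the message's key.
import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    private final Map<String, Long> counts = new HashMap<>();

    // Handle a single message and return the updated count.
    public long handle(String word) {
        return counts.merge(word, 1L, Long::sum);
    }

    public static void main(String[] args) {
        WordCounter mp = new WordCounter();
        mp.handle("traffic");
        System.out.println(mp.handle("traffic")); // prints 2
    }
}
```

Because such a POJO has no framework dependency, it can be exercised in a unit test by simply instantiating it and calling `handle` directly.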
The Dempsy Tutorial contains more information.
InfoQ had a chance to discuss Dempsy with its creator, NAVTEQ Fellow Jim Carroll.
InfoQ: What is Nokia planning to use Dempsy for?
Jim: Dempsy has several potential use cases within Nokia, and several departments are considering it. It was originally built to implement the next-generation vehicle traffic processing engine, which is responsible for taking in billions of discrete raw traffic data points per day, including roadway sensor readings, traffic incidents, and GPS locations, and providing roadway traffic information to various end products such as in-vehicle navigation systems and web-based map displays, among many others.
We recognized that the analytics required to produce these end products from the enormous amount of raw input data in near-real-time with the least amount of lag constituted essentially a “real time BigData” problem.
InfoQ: With different existing implementations, why did you decide to write your own implementation?
Jim: We originally evaluated different approaches for our new traffic engine including CEP (which Streambase supports), and various actors models like S4 and Akka.
CEP is really trying to solve a different problem. If you have a large stream of data you want to mine by separating it into subsets, or isolating a particular subset, and then performing analytics on the result, then CEP solutions make sense. If, however, you’re going to do the same thing to every data-point in the stream, then you will be underutilizing the power of CEP. Underutilized functionality usually means an increased total cost of ownership, and Dempsy is all about reducing the total cost of ownership for systems that do this type of processing.
That leaves other distributed stream processing systems. At the time of our evaluation S4 and Akka were both available.
Akka appears to be focused on a different problem. The Akka team is really focused on being an implementation of a pure actors model for the JVM, an "Erlang for the JVM" if you will, rather than on a distributed stream processing engine. As a result, for example, the Akka team is focused more on resolving issues like Software Transactional Memory than on simplifying a distributed deployment, an effort that makes sense for pure actors-model implementations but wasn't what we were looking for.
That left S4. S4 was exactly what we were looking for as far as the processing model goes. The problem we found with S4 was that it was too immature for us to get consistently running in a large production environment. This view was confirmed when the S4 team themselves forked the Apache codebase and began building a new version of S4 (S4-Piper).
When Storm came out at the end of September 2011, we were well into development. We liked Storm. However, we found that our framework had some advantages over it.
First, Storm isn’t a “fine grained” actors model. Because of this, when implementing the same use case for both systems we found that the Dempsy implementation was smaller.
Also, because of Dempsy’s emphasis on “Inversion of Control” the resulting applications were easier to test. With the exception of annotations, Message Processors, which are the atomic unit of work in Dempsy, have no dependency on the framework itself. Bolts in Storm (and Processing Elements in S4) need to be written to use the framework’s APIs.
Also, in following the adage to never require the system to be told something that it can deduce, in Dempsy, the topology of an application’s pipeline is discovered at runtime and doesn’t need to be preconfigured. This is primarily a byproduct of the fact that Dempsy was designed from the ground up to be “elastic” and as a result, the topology can morph dynamically.
This means that applications with complicated topologies with many branches and merges can be trivially configured since the dependency relationship between stages is discovered at runtime by the framework.
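One way to picture this runtime discovery is the following sketch. It is an illustration of the idea, not Dempsy's implementation: if each stage declares only which message types it consumes, the framework can deduce the branch-and-merge graph by itself, with no preconfigured topology.

```java
// Illustrative sketch (not Dempsy's API): the pipeline graph is deduced
// from which message types each stage consumes, so branches and merges
// need no explicit wiring.
import java.util.*;

public class TopologySketch {
    // message-type name -> stages that consume it
    static final Map<String, List<String>> consumers = new HashMap<>();

    static void register(String stage, String consumedType) {
        consumers.computeIfAbsent(consumedType, k -> new ArrayList<>()).add(stage);
    }

    static List<String> route(String messageType) {
        return consumers.getOrDefault(messageType, List.of());
    }

    public static void main(String[] args) {
        register("enricher", "RawPoint");
        register("aggregator", "EnrichedPoint");
        register("alerter", "EnrichedPoint"); // a branch: two consumers
        System.out.println(route("EnrichedPoint")); // [aggregator, alerter]
    }
}
```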
InfoQ: In several places you describe Dempsy as intentionally not an application server. What do you mean by that?
Jim: Application servers, as you would expect, serve applications. You tend to “deploy” an application to an application server. An application server will then provide access to a set of cross-cutting services, maintain application isolation, and provide for communication between applications. A distributed application server may add computational resource management and support for fault tolerance.
But why would a framework that's been built following the old Unix adage to "do one thing, and do it well" implement these things, when they are already readily available, or when these features are likely to conflict with what's already available? Hasn't the industry noticed that the majority of these tasks have been handled for years by the operating system itself? And when the system is deployed into a cloud environment with a set of robust IaaS tools, who's responsible for the automatic provisioning of new resources or reprovisioning of failed resources? Should it be specialized software built into an application server, or the IaaS tool responsible for it across the enterprise?
Dempsy is therefore built to cooperate with these services through the implementation of "elasticity," so that as the IaaS tools provision new systems or reprovision old systems in response to load, the application responds automatically. This keeps Dempsy simple and focused while cooperating with, rather than conflicting or competing with, tools built for their specific purpose, in a synergy that ultimately produces a more robust ecosystem with a lower total cost of ownership.
InfoQ: What does Dempsy use as a messaging transport?
Jim: Dempsy is built on a set of abstractions that allow the framework itself to be easily extended and adapted. These include things like routing strategy, monitoring, serialization, and cluster management. It also includes message transport. Default implementations of these abstractions have been completed in the order the team deemed most important, and both serialization and message transport are toward the bottom of that list.
Therefore, while we plan on using Netty for the transport prior to a 1.0 release (and there has been discussion of using ZeroMQ), there is currently a simple TCP implementation in place.
InfoQ: How is routing implemented in Dempsy?
Jim: Dempsy routes messages to message processors based on a message key. The scheme for determining a message's key is specified by the application, and that key becomes the address of a specific message processor instance distributed across a cluster of machines. Routing is then a two-stage process. First, determine which node of the cluster the message processor lives in, using what we call a "routing strategy." Second, once the message is received at that node, find the specific message processor responsible for that message.
The default routing strategy divides a message's key space (that is, the collection of all possible keys for a particular message class) across a set of "bins." These bins, along with their meta information, are dynamically assigned to running nodes through a distributed negotiation scheme. That is, there's no central manager or broker involved. This is similar to the "leader election" negotiation use case for Apache ZooKeeper, but in the case of Dempsy it's "bin ownership" for each and every bin.
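A minimal sketch of the two-stage lookup might look like the following. The names, bin count, and hash-mod-bins scheme are illustrative assumptions standing in for the real routing strategy, and the bin-to-node map stands in for the state produced by the distributed negotiation.

```java
// Toy two-stage routing: key -> bin (fixed hash space), then
// bin -> owning node (a map that, in Dempsy, would be maintained
// by distributed negotiation rather than built by hand).
import java.util.*;

public class BinRouting {
    static final int NUM_BINS = 8; // real deployments would use many more bins

    // Stage 1: hash the message key into a fixed bin space.
    static int binFor(Object key) {
        return Math.floorMod(key.hashCode(), NUM_BINS);
    }

    // Stage 2: look up which node currently owns that bin.
    static String nodeFor(Object key, Map<Integer, String> binOwners) {
        return binOwners.get(binFor(key));
    }

    public static void main(String[] args) {
        Map<Integer, String> owners = new HashMap<>();
        for (int bin = 0; bin < NUM_BINS; bin++)
            owners.put(bin, "node-" + (bin % 2)); // two nodes split the bins
        System.out.println(nodeFor("sensor-42", owners));
    }
}
```

Because the bin space is fixed, elasticity only requires reassigning bin ownership; the key-to-bin mapping never changes.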
InfoQ: How does Dempsy support scaling of individual messaging processors?
Jim: Message processor granularity and independence is the key, which is really part of the application design. If the key-space is in the millions rather than in the teens, then Dempsy will be able to distribute it linearly among the set of computational resources to the point where the network infrastructure becomes the only limit to linear scalability.
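The effect of key-space size on balance can be seen with a toy simulation. This uses sequential integer keys and a trivial hash, so the large case comes out exactly even; real keys would hash approximately, not exactly, evenly.

```java
// Toy simulation: how evenly a key space spreads over a set of nodes.
public class KeySpaceBalance {
    // Count how many keys land on each of `nodes` nodes.
    static int[] distribute(int numKeys, int nodes) {
        int[] load = new int[nodes];
        for (int key = 0; key < numKeys; key++)
            load[Math.floorMod(Integer.hashCode(key), nodes)]++;
        return load;
    }

    public static void main(String[] args) {
        // A key space in the millions spreads evenly over ten nodes...
        System.out.println(java.util.Arrays.toString(distribute(1_000_000, 10)));
        // ...while a key space "in the teens" cannot balance well.
        System.out.println(java.util.Arrays.toString(distribute(13, 10)));
    }
}
```

With 13 keys over 10 nodes, some nodes get two keys and others one or none, so adding hardware stops helping almost immediately; with a million keys, load stays near-uniform as nodes are added.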
InfoQ: What mechanism does Dempsy use to redistribute message processors when the topology of a Dempsy cluster changes?
Jim: Dempsy relies on the notification capabilities of its cluster information management implementation. As mentioned, cluster information management is one of the abstractions within the Dempsy framework so the implementation can be swapped out. By default, when running in a distributed mode, the cluster information management is implemented leveraging Apache ZooKeeper.
When the topology changes, every upstream node that’s affected is notified of the change and adjusts its understanding of which keys correspond to which nodes. Other nodes within the affected cluster are notified and renegotiate for bin ownership.
InfoQ: What will happen when one of the message processing nodes fails?
Jim: Message processing node failures are a special case of physical topology changes. When a node fails, the other nodes in the same cluster are notified and renegotiate for the newly available bins. This negotiation leads to upstream notification of the physical topology change, and those upstream nodes adjust accordingly.
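A simplified sketch of the failover step: the failed node's bins are handed to the survivors. Here a round-robin reassignment stands in for Dempsy's distributed, broker-free negotiation, and the names are hypothetical.

```java
// Toy failover: redistribute a failed node's bins to the survivors.
// Round-robin reassignment stands in for Dempsy's distributed negotiation.
public class BinFailover {
    static String[] reassign(String[] binOwner, String failed, String[] survivors) {
        String[] result = binOwner.clone();
        int next = 0;
        for (int bin = 0; bin < result.length; bin++)
            if (result[bin].equals(failed))
                result[bin] = survivors[next++ % survivors.length];
        return result;
    }

    public static void main(String[] args) {
        String[] owners = {"node-0", "node-2", "node-1", "node-2"};
        String[] after = reassign(owners, "node-2", new String[]{"node-0", "node-1"});
        System.out.println(String.join(",", after)); // node-0,node-0,node-1,node-1
    }
}
```

Note that bins owned by surviving nodes never move, so only the keys that were addressed to the failed node are affected by the failure.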