DataTorrent is a real-time streaming and analytics platform that can process more than 1 billion events per second.
Compared to Twitter’s average of ~6,000 tweets/sec, the recently released DataTorrent 1.0, with its more than 1 billion real-time events/sec, seems to exceed what most applications need. In tests on a 37-node cluster, each node with 256GB of RAM and 12 hyper-threaded CPU cores, DataTorrent claims to have achieved linear scalability up to 1.6B events/sec, at which point the CPUs reached saturation. Phu Hoang, DataTorrent cofounder and CEO, told InfoQ that their solution is “orders of magnitude” more performant than Apache Spark on the same hardware.
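To put the quoted figures in perspective, the following back-of-the-envelope calculation uses only the numbers reported above. It assumes, purely for illustration, that the load was spread evenly across nodes and cores, which the test description does not state.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the article.
# Assumption (not from the source): load is spread evenly over nodes and cores.
PEAK_EVENTS_PER_SEC = 1.6e9   # reported CPU saturation point
NODES = 37                    # cluster size in the test
CORES_PER_NODE = 12           # hyper-threaded CPU cores per node
TWEETS_PER_SEC = 6_000        # Twitter's approximate average

per_node = PEAK_EVENTS_PER_SEC / NODES
per_core = per_node / CORES_PER_NODE
headroom = 1e9 / TWEETS_PER_SEC  # 1B events/sec relative to Twitter's average

print(f"events/sec per node: {per_node:,.0f}")
print(f"events/sec per core: {per_core:,.0f}")
print(f"1B events/sec vs. Twitter average: {headroom:,.0f}x")
```

Under that even-spread assumption, the peak works out to roughly 43 million events/sec per node and about 3.6 million per core.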
DataTorrent is a real-time, fault-tolerant data streaming and analysis platform built on Hadoop 2.x. Its applications are native Hadoop applications that can coexist with apps doing other tasks such as batch processing. The platform’s architecture is depicted in the following picture:
StrAM (Streaming Application Master) is a native YARN Application Master responsible for managing the logical DAG (Directed Acyclic Graph) that is to be executed on top of a Hadoop cluster, including resource allocation, partitioning, scalability, scheduling, web services, run time changes, statistics, SLA enforcement, security, etc.
User applications exist at the upper level of the diagram as connected Operators and/or application templates. Examples of operators are InputReceiver (simulates receiving input data), Average (calculates the data average for a key in a specified dimension), RedisAverageOutput (writes the calculated average to a Redis data store), and SmtpAvgOperator (sends emails on alerts). These operators are part of the Malhar library, which includes over 400 such items and is open sourced on GitHub. Users can write other operators as needed.
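The idea of an application as a chain of connected operators can be sketched in a few lines. The example below is a toy illustration in plain Python, not the Malhar or DataTorrent API: the class names mirror the operators mentioned above, but their methods (`emit`, `process`), the `run_pipeline` helper, and the `DictAverageOutput` stand-in for a Redis sink are all hypothetical.

```python
from collections import defaultdict

# Toy operators; names echo the article, but the interfaces are hypothetical.

class InputReceiver:
    """Simulates receiving input data by replaying a fixed list of events."""
    def __init__(self, events):
        self.events = events

    def emit(self):
        yield from self.events


class Average:
    """Keeps a running average per key, updated one event at a time."""
    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def process(self, event):
        key, value = event
        self.sums[key] += value
        self.counts[key] += 1
        return key, self.sums[key] / self.counts[key]


class DictAverageOutput:
    """Stand-in for RedisAverageOutput: keeps the latest average in a dict."""
    def __init__(self):
        self.store = {}

    def process(self, result):
        key, avg = result
        self.store[key] = avg


def run_pipeline(source, average, sink):
    # Linear DAG: source -> average -> sink, processed event by event.
    for event in source.emit():
        sink.process(average.process(event))
    return sink.store


store = run_pipeline(
    InputReceiver([("cpu", 10.0), ("cpu", 30.0), ("mem", 5.0)]),
    Average(),
    DictAverageOutput(),
)
print(store)
```

Each operator only knows about the events it receives, which is what makes it possible to rewire the same pieces into a different graph.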
We asked Hoang what makes DataTorrent faster than Spark:
PH: There are two important architectural differences that stem from DataTorrent’s focus on enabling enterprises to take action in real-time via streaming and Spark’s desire to generalize the Spark engine to process streams. The two key areas of focus are performance and stateful fault-tolerance.
1. Performance – DataTorrent RTS was designed and built from the ground up as a native Hadoop 2.0 offering focusing on performance with high availability and, as a result, does its processing event-by-event with sub-second latency. DataTorrent RTS schedules its application onto the Hadoop containers on start, and that mapping is fixed until the application needs to change, so it does not introduce any scheduling overhead. Spark, on the other hand, was built pre-Hadoop 2.0 and utilizes the Spark engine to effectively run many “map reduce” jobs as small batches or “mini-batches”. This design decision requires that Spark (via the application master) now has to schedule each mini-batch onto the cluster, which represents tremendous overhead and slows down the system.
2. Stateful fault tolerance – DataTorrent RTS was designed to enable complex, stateful computation at high performance, with fault tolerance. This is a key enterprise requirement, where the ability to recover from outages without any data loss, any state loss, is mandatory. The DataTorrent RTS design center here was to use Java for programming and to remove the “burden” of fault tolerant design from the enterprise developer/ISV (namely, it’s handled by DataTorrent RTS for the developer). Spark does provide fault tolerance, but only for stateless processing. The Spark design center was to use the Scala language, which is a functional language, and the Operators that process streams are stateless. If an enterprise wanted to add stateful processing to Spark, they would need to write that code as part of their application, which is difficult to do and would impact performance.
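What recovering "without any state loss" means for a stateful operator can be shown with a generic checkpoint-and-replay sketch. This is a common technique for stateful stream processing, not a description of how DataTorrent RTS implements it, which the interview does not detail. The `CountingOperator` class, the `run` function, and its `checkpoint_every` and `crash_at` parameters are hypothetical, and the sketch assumes the input can be replayed from the last checkpoint.

```python
import copy

class CountingOperator:
    """A stateful operator: counts how many times each key has been seen."""
    def __init__(self):
        self.counts = {}

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1


def run(events, checkpoint_every, crash_at=None):
    """Process events, checkpointing periodically; optionally crash once."""
    op = CountingOperator()
    checkpoint = (0, {})  # (index of next event to process, saved state)
    i = 0
    crashed = False
    while i < len(events):
        if crash_at is not None and i == crash_at and not crashed:
            # Simulated failure: in-memory state is lost, so start a fresh
            # operator, restore the last checkpoint, and replay from there.
            crashed = True
            op = CountingOperator()
            i, saved = checkpoint
            op.counts = copy.deepcopy(saved)
            continue
        op.process(events[i])
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = (i, copy.deepcopy(op.counts))
    return op.counts


events = ["a", "b", "a", "c", "a", "b", "c"]
clean = run(events, checkpoint_every=3)
recovered = run(events, checkpoint_every=3, crash_at=5)
print(clean == recovered)
```

The run that crashes ends with the same counts as the uninterrupted one because the events processed after the last checkpoint are replayed against the restored state. When the platform handles this, the operator author only writes the `process` logic.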
According to Hoang, DataTorrent is certified for “all major Hadoop distributions for both on-premise and cloud-based deployment (such as Cloudera, Hortonworks, and MapR and on the cloud such as Amazon AWS and Google Cloud), enabling Enterprises the flexibility to change both Hadoop vendors and deployment options frictionlessly.”
Although DataTorrent is a commercial application, it comes with a free tier for small to medium applications that includes all capabilities.