Datameer, a big data analytics application for the Hadoop ecosystem, introduced Datameer 5.0 with Smart Execution to dynamically select the optimal compute framework at each step in the big data analytics process.
Datameer Smart Execution selects the best compute framework for each step of an analytics job, orchestrating multiple compute frameworks as necessary. Once the selection is made, Smart Execution uses a snapshot of total and currently available system resources to allocate and run each concurrent workload for maximum performance.
With the introduction of new big data analytics frameworks like Apache Spark that extend analytics beyond Apache Hadoop's MapReduce paradigm, end-to-end data pipelines sometimes require switching between different computation frameworks. The Smart Execution engine works by examining data set characteristics, analytics tasks, and available system resources to determine the most efficient compute framework for the task at hand.
The main advantages of this approach are improved performance and better utilization of Hadoop hardware.
InfoQ spoke with Matt Schumpert, Director of Product Management at Datameer, about the new product and how it works to help with big data analytics needs.
InfoQ: How does the new Smart Execution engine work in terms of the technical architecture?
Matt Schumpert: It’s a 100% YARN-native Hadoop application that leverages multiple computation frameworks at runtime, with an intelligent decision engine that examines data set characteristics, analytics tasks and available system resources to match the analytics task to the optimum computation framework.
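Schumpert describes a decision engine that matches a task to a framework based on data size, task shape, and a snapshot of available resources. Datameer does not publish its selection algorithm, so the following Python sketch is purely illustrative: the candidate frameworks, input features, and thresholds are all hypothetical, not Datameer's actual logic.

```python
# Hypothetical sketch of a framework-selection heuristic. The framework
# names, features, and thresholds are illustrative assumptions only;
# Datameer's actual decision engine is proprietary and unpublished.

def select_framework(input_bytes, iterative, free_cluster_mem_bytes):
    """Pick a compute framework from data-set size, task shape, and a
    snapshot of currently available cluster resources."""
    GB = 1 << 30
    if input_bytes < 2 * GB and input_bytes < free_cluster_mem_bytes:
        # Small or highly aggregated data: run the whole step in a
        # single in-memory process on one node.
        return "single-node"
    if iterative and input_bytes < free_cluster_mem_bytes:
        # Iterative algorithms benefit most from in-memory execution.
        return "in-memory-cluster"
    # Large sequential scans fall back to a DAG engine such as Tez.
    return "tez"

print(select_framework(500 * (1 << 20), False, 64 * (1 << 30)))
print(select_framework(5 * (1 << 40), True, 256 * (1 << 30)))
```

The key idea from the interview survives even in this toy form: the selection is made per step, so one pipeline can mix frameworks as the data "cooks down" from big to small.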
More details on the architecture are in this article.
InfoQ: How does the execution engine determine the most efficient compute framework for a given task? Can you give a real-world example of a big data analytics use case and how Datameer's execution engine improves the efficiency of the process?
Matt: See above for part 1 of your question. A real-world analytics use case would be any kind of analysis that “cooks” big data down into consumable insights. For example, if you’re looking for a segment of customers who exhibit a particular behavior on your web site, the clickstream data for all your visitors is huge, but the final results may be a segment of a few thousand known visitors or customers (small data). You might do filtering, sessionization, path analysis, and even clustering along the way to build your segment. But the last few steps of that process typically work on highly aggregated data that can fit into Datameer’s single-node execution framework (meaning computation that would previously run as multiple MapReduce jobs now runs in a single process on a single machine).
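One of the pipeline steps Schumpert mentions is sessionization, i.e. grouping a visitor's raw click events into sessions. As a hedged illustration of that step (the field names and the 30-minute inactivity timeout are common conventions, not anything specified by Datameer), a minimal sketch might look like:

```python
# Illustrative sessionization of clickstream events. A 30-minute gap of
# inactivity closes a session; the timeout and tuple layout are
# assumptions for this sketch, not Datameer's implementation.
from itertools import groupby

SESSION_GAP = 30 * 60  # seconds of inactivity that end a session

def sessionize(events):
    """events: iterable of (visitor_id, unix_ts) tuples.
    Yields (visitor_id, [timestamps of one session]) groups."""
    events = sorted(events)  # order by visitor, then by time
    for visitor, group in groupby(events, key=lambda e: e[0]):
        session = []
        for _, ts in group:
            if session and ts - session[-1] > SESSION_GAP:
                yield visitor, session  # gap exceeded: close session
                session = []
            session.append(ts)
        if session:
            yield visitor, session

clicks = [("v1", 0), ("v1", 100), ("v1", 4000), ("v2", 50)]
print(list(sessionize(clicks)))
```

At web scale this grouping would run on the cluster, but the later steps of the pipeline (path analysis, clustering, segment building) operate on far smaller aggregates, which is exactly where a single-node framework can take over.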
A second example would be mapping the entire customer journey across different touch points (web, mobile, social media, call center, and retail store), using a variety of data sources. Defining certain behaviors based on a stream of complex interactions with the customer requires big data and a scheduled batch process (daily, for example). However, once you’ve homed in on a particular behavior and want to go deeper in analyzing that segment of your user population, the big data is suddenly small. A different type of analyst can work on that segment and apply machine learning techniques like decision trees in Datameer to help explain the "why" of these behaviors or outcomes, and for him/her, the entire process might run in-memory in just a few seconds, despite being executed on Hadoop.
Other examples include iterative algorithms like cross-correlation or product recommendation, which benefit even more from in-memory processing than other analytics tasks. The Smart Execution engine can again run these tasks within a single process when possible, but these algorithms were especially inefficient with traditional MapReduce, making the performance gains huge.
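The reason iterative algorithms suffer so much under classic MapReduce is that every iteration is a separate job that re-reads its input from and re-writes its output to HDFS. A toy power-iteration (PageRank-style) loop, written here purely to illustrate the in-memory alternative (it is not Datameer's code), keeps all state resident across iterations:

```python
# Toy power-iteration to show why iterative algorithms gain so much from
# in-memory execution: the rank vector and link structure stay resident
# across iterations, whereas each classic MapReduce pass would round-trip
# them through HDFS. Graph and damping factor are illustrative only.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

def pagerank(links, damping=0.85, iters=50):
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iters):  # all state stays in memory between passes
        new = {page: (1 - damping) / n for page in links}
        for page, outs in links.items():
            share = damping * ranks[page] / len(outs)
            for out in outs:
                new[out] += share
        ranks = new
    return ranks

r = pagerank(links)
print(max(r, key=r.get))
```

Fifty iterations here are fifty dictionary updates in one process; the same computation expressed as fifty MapReduce jobs would pay job-startup and disk I/O costs on every pass, which is the inefficiency Schumpert refers to.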
InfoQ: Can you talk about the new computation frameworks that are coming out to complement the traditional MapReduce and Hadoop data analytics solutions? Are there other frameworks, besides Apache Spark and Storm for example, that developers need to be aware of?
Matt: We're continually looking at new projects, most recently Apache Samza. Developers need to be aware of all of these frameworks, and Spark certainly has some momentum and the potential to be very influential in Hadoop long-term. That said, with this proliferation, people should take care to select frameworks that both match their use case and are tightly integrated with the rest of their technology stack. Hybrid architectures are brittle and can be difficult to support. One thing to look for in a computation framework is strong YARN support.
With Datameer, the user need not worry, and that’s precisely the point of Smart Execution: to take the guesswork out of picking the right framework.
InfoQ: Datameer's execution engine dynamically selects the compute frameworks for each step of an analytics job and orchestrates multiple compute frameworks as necessary. This sounds like an ESB (Enterprise Service Bus) for big data analytics. Can you explain how Datameer dynamically selects the better computing solution? What algorithms does it use?
Matt: There is a bit of that notion of an orchestration engine there, yes. That said, the flexibility to switch frameworks automatically, seamlessly and reliably WITHIN a single pipeline, and the fact that it uses Hadoop resources much more efficiently (resulting in better performance and infrastructure cost savings), is the bigger win here, more than the actual selection algorithm itself. We don’t publish the algorithm itself at this point. There are also intelligent, adaptive resource-utilization techniques at play within each execution framework (think pre-heating an oven) that make jobs run much faster. The beauty of YARN is it gives us that freedom.
InfoQ: Is the selection based on the type of the analytics requirement involved in each step of the process? Or are there other metrics the execution engine uses for the selection of computation frameworks?
Matt: There are other metrics, including data set characteristics, analytics tasks and total system resources.
InfoQ: How does it work with different types of data sets (e.g. Streaming Data or Graph Data) to decide which analytics solution to select for the data analysis steps?
Matt: Large event stream data (e.g. a clickstream) has always been well-suited for Hadoop, due to the size and sequential nature of processing it. Most of our customers have this type of data. However, that doesn’t mean it’s processed in a streaming fashion, or by using a graph database. Those are completely different architectures and usually require that you pre-position your data in a purpose-built stream processing solution (e.g. Storm/Spark for streaming data), or use a specific storage layer based on a graph layout (e.g. Neo4j). There's often a lot of a priori knowledge of the data required, which isn't the nature of exploratory analytics. Additionally, for those systems to work side-by-side with other compute frameworks, they need to be tightly woven into the storage and compute fabrics of Hadoop, and applied for a particular purpose that they’re good at. Finally, those systems often require copying your data around, and require that administrators deploy new infrastructure. At this point, it’s hard to think of spinning them up on the fly for an analytics sub-task, but it’s theoretically possible to do so.
InfoQ: Smart Execution will execute analyses on large data in a Hadoop cluster using Apache Tez, a framework that generalizes and optimizes the MapReduce paradigm. What is the rationale for choosing Apache Tez for the execution of analytics jobs? Did you consider other frameworks for this purpose?
Matt: We considered everything. Tez provided a few things that were important to us (not to say that others don’t):
A. Proven reliability at scale (we use Tez all the way to the top of the data/compute scales, to the petabyte level on the 1000+ node clusters of our largest customers)
B. A smooth transition from traditional MapReduce (it's much easier to port existing jobs and then take iterative steps to deliver on performance improvements without sacrificing the stability of your system)
C. A zero-install, zero-deployment solution. Tez is deployed at runtime by our system onto any Hadoop 2.2 or later cluster running YARN
D. Tight integration with YARN. This was a must, as resource management across the thousands of jobs our customers run daily must be a centralized, coordinated affair if we’re to deliver and guarantee reliably good performance across a large user base.
InfoQ: In the new execution engine, small data analysis will be executed on a single Hadoop node or using in-memory technology. What in-memory technology is used for the analysis of small data sets? Are you leveraging existing in-memory data grids like Hazelcast or do you have your own in-memory technology?
Matt: We have some technology from within Datameer we’ve been hardening for over four years for this. It requires no separate product, install, config, etc., runs within YARN, and safely and fairly shares resources with other workloads.
InfoQ: What level of monitoring and metrics support is available in Datameer? This is critical when different big data analytics solutions are used in a single analytics job.
Matt: We provide the job flow graph for each job (see screenshot here), and because we’ve built on YARN, administrators can continue to monitor jobs from their existing Hadoop administration console.
InfoQ: What is the best way for developers to learn more about Datameer Smart Execution engine and try it out in their applications to assess its capabilities?
Matt: The product will be available for use in Q4. There are also some details on our web site here and here.
It’s important to note that our primary audience is business analysts and the IT partners that support them in doing self-service analytics, not software developers building special-purpose custom applications at the code level. Though we do have an SDK to provide extensibility, a core part of Datameer’s mission is to avoid writing code whenever possible, because analytics is a naturally fluid area of work in any business.
Matt also talked about the current state of the big data analytics space and how technologies like Apache YARN help with performing data analytics on diverse workloads and data sets.
Matt: We’re really still at the beginning of the impact that Smart Execution and YARN are going to have on the big data analytics space. YARN opened the door architecturally for mixed workloads and mixed-size data sets to be supported well on Hadoop. That opens up Hadoop to a variety of traditional EDW workloads, which complement the machine learning, ETL, batch analytics and storage purposes Hadoop has traditionally served. That saves organizations millions of dollars in operational costs, gives them a flexible, elastic, scale-out model for their data infrastructure, and lets them deploy and govern a wide variety of data-centric applications centrally and close to the data itself, which is a key reason Hadoop was invented in the first place.
About the Interviewee
Matt Schumpert is Director of Product Management at Datameer. He has been working in enterprise software for over 10 years in various capacities, including sales engineering, strategic alliances and consulting, and holds a BS in Computer Science from the University of Virginia.