Hortonworks Data Platform (HDP) version 2.2 represents the latest wave of Hadoop innovation, with over a hundred new features, including vertical integration of new engines with Apache Hadoop YARN and horizontal integration of enterprise services such as governance, security and operations.
At the heart of Hadoop is YARN, the cluster resource management platform that enables engines like Apache Hive for enterprise SQL at scale, Apache Storm for real-time data processing, Apache Spark for iterative processing, Apache Kafka for pub-sub messaging, and so on.
HDP 2.2 was released as an early-access beta with widespread industry support and will be generally available shortly. A full list of features and a preview download are currently available.
InfoQ caught up with Vinod Kumar Vavilapalli, the YARN development lead at Hortonworks and the lead of the Apache Hadoop YARN project at the Apache Software Foundation.
InfoQ: Since the GA of Apache Hadoop 2.2.0 almost a year ago, how do you see the adoption of YARN by the developer community?
It’s been a fantastic year since the first GA release of Apache Hadoop 2 and YARN by the community. For those of us immersed in the daily evolution of Hadoop it seems like the initial release was ages ago, but for the general public I guess it seems like it was yesterday. The accelerating pace of innovation within the community is astonishing.
Over the course of the first year of Hadoop 2, we have seen a massive number of enterprises and other organizations migrate their Hadoop clusters to YARN.
Furthermore, we have seen broad adoption of YARN throughout the ecosystem of Hadoop-related projects. New frameworks such as Apache Tez and Apache Samza have sprung up that work solely on top of YARN. There has been tremendous activity over the past year in the framework layers that eventually expose a programming model or an API to end users while using the common resource management substrate. We have also seen power users going straight to the native YARN APIs to implement special-purpose applications that integrate directly with the resource management layer.
InfoQ: Can you summarize the advantage(s) of running an existing Map/Reduce program on YARN? Is there a performance benefit?
In order to understand the benefits of YARN, a short backgrounder on Map/Reduce on YARN is appropriate. When somebody is running a MapReduce application on YARN, they are really interacting with the MapReduce framework that is built to run on top of YARN – we refer to this as MRv2-on-YARN. Last year when we released YARN, we made sure that every MapReduce application that was already written against Hadoop 1’s public stable APIs could run with MR on YARN with absolutely no changes. Over the course of the past year, I’ve personally seen how much this backwards compatibility matters to our customers and existing users migrating to YARN.
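To make the compatibility point concrete, here is a minimal, pure-Python sketch of the MapReduce programming model itself (map, shuffle, reduce) – an illustration, not the actual Hadoop API. An application supplies only the `mapper` and `reducer` logic; because MRv2-on-YARN preserves that contract, the same application code runs unchanged while the runtime underneath is replaced:

```python
from collections import defaultdict

def mapper(record):
    # User-supplied map logic: emit (word, 1) pairs, WordCount-style.
    for word in record.split():
        yield word, 1

def shuffle(pairs):
    # The framework's job: group all values by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # User-supplied reduce logic: sum the counts for one key.
    return key, sum(values)

def run_job(records):
    # map -> shuffle -> reduce; in real Hadoop these phases run
    # distributed across the cluster, scheduled by the framework.
    mapped = (pair for record in records for pair in mapper(record))
    return dict(reducer(key, values) for key, values in shuffle(mapped))
```

Calling `run_job(["to be or", "not to be"])` yields `{"to": 2, "be": 2, "or": 1, "not": 1}`; only the framework half of this picture changed when MapReduce moved onto YARN.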
To quantify this, we ran Hadoop's standard benchmarks long before even the first alpha release of YARN. Almost all of these early benchmarks performed significantly better on YARN than on the latest stable Hadoop 1.0. Most ran twice as fast, hence our pseudo-math of "Hadoop 2.x >= 2 * Hadoop 1.x", meaning you can get 2x throughput on the same hardware. Even the tests that didn't improve significantly were at least on par with Hadoop 1.0.
It’s funny - a few days ago I was joking that the alpha and beta releases of YARN were similar to what Gmail went through – the platform had long gone to production despite its beta moniker at that time. We deliberately chose to delay the GA release till we could be absolutely sure about supporting paradigms beyond MapReduce by stabilizing the platform APIs. The API stabilization done by the community enabled us to confidently support users of the YARN APIs – essentially framework developers beyond just MapReduce – for a long, long time.
A final note on performance: these improvements by themselves give organizations plenty of incentive to upgrade to HDP 2, even if they don’t plan on taking advantage of the other dimensions the new platform unlocks – things like varied data access, scalability, agility and so on.
InfoQ: YARN aside, it looks like the Hadoop ecosystem has exploded in the past year or so. Can you mention a couple of projects from this ecosystem that are extremely relevant to developers and why they should pay attention?
Oh, yes, that also happened during the past activity-filled year for Hadoop! It’s becoming harder and harder, even for those of us who are active contributors to Hadoop-related projects and involved in the day-to-day evolution of the ecosystem, to assimilate the gamut of innovations happening across the board.
The one project that is close to my heart is Apache Tez – a distributed execution framework targeted at data-processing applications. It is built on top of YARN, takes advantage of all the resource management functionality already available to MapReduce applications, and extends it to use cases beyond batch processing. For such a young project, it has already proved its mettle – the big Hadoop ecosystem projects like Apache Hive and Apache Pig have already integrated with Tez to unlock faster execution and greater performance than Hadoop MapReduce. If you already have Hadoop 1 based SQL analytics or data pipelines, you can get a massive performance boost at the flip of a switch. If you are a developer looking to write a new DSL for data processing, you will want to look at Tez.
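To illustrate what moving beyond the fixed map-shuffle-reduce pipeline means, here is a small, hypothetical Python sketch – not the Tez API – of executing an arbitrary DAG of processing vertices in dependency order, the shape of plan an engine like Hive can hand to Tez instead of chaining separate MapReduce jobs:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def run_dag(vertices, edges):
    """Run each vertex function after all of its upstream vertices.

    vertices: name -> function taking a list of upstream outputs
    edges:    (upstream, downstream) pairs forming an acyclic graph
    """
    deps = {name: set() for name in vertices}
    for up, down in edges:
        deps[down].add(up)
    outputs = {}
    # static_order() yields each vertex only after its predecessors.
    for name in TopologicalSorter(deps).static_order():
        upstream_outputs = [outputs[u] for u in sorted(deps[name])]
        outputs[name] = vertices[name](upstream_outputs)
    return outputs

# A branching plan: one scan feeding two downstream vertices --
# something a single MapReduce job cannot express directly.
plan = {
    "scan":  lambda ups: [1, 2, 3, 4],
    "evens": lambda ups: [x for x in ups[0] if x % 2 == 0],
    "total": lambda ups: sum(ups[0]),
}
result = run_dag(plan, [("scan", "evens"), ("scan", "total")])
# result["evens"] == [2, 4], result["total"] == 10
```

The vertex names and functions here are invented for illustration; the point is only that the execution plan is a graph, so intermediate data flows directly between stages instead of being forced through a fresh map-reduce round trip each time.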
Another important project to me is Apache Slider, which was started to ease the process of running long-running, service-based distributed applications on YARN. Slider helps these apps integrate with YARN without modification. It is a layer that provides consistency across the ecosystem of YARN applications, which eases operations and resource management, and it lets YARN extend linearly scalable compute and storage to all sorts of existing enterprise apps today.
There is so much more happening in this space – the Stinger initiative continues to deliver mammoth gains in speed, scale and comprehensive SQL analytics for Apache Hive, Apache Spark on YARN brings fast, in-memory data processing for machine learning and data science, and so on.
In all, I would say the most exciting aspect of this “explosion” that you speak of is the level of involvement in all the open source Apache projects that have risen around Hadoop to make it an enterprise data platform. It is amazing that this small project we worked on for years is now the center of the universe of thousands of developers around the world.
InfoQ: You just named at least three favorite projects. What do you have to say to the skeptics of Hadoop who think that the ecosystem is getting too complex, with too many overlapping projects doing very similar things? Can you give some specific examples of how this might be resolved?
There is truth to the point about the growing complexity of the ecosystem, but there is also some misattribution of where that complexity comes from.
Unlike many unified single-stack architectures that came before, the Hadoop platform is built around individual layers with distinct responsibilities. This is the Unix philosophy; each of these layers is built to do one thing and do it well. This not only helps in delineating responsibilities, but also enables much faster evolution. Remember that several different open developer communities are working on each layer. Sometimes, this does mean there are two or more disjoint sets of developers working on the same layer, but that’s okay – either each of those projects carves out its niche or the single best project simply emerges. In a truly open community, a meritocracy, no single vendor ultimately decides the best approach.
InfoQ: There is a lot of horizontal growth in HDP 2.2, especially with respect to compliance, security, governance, operations and so on. How will these features help Hadoop gain more ground in the Enterprise?
We spend as much time on horizontal concerns of the platform as we do on enhancing vertical capabilities.
In this environment, one can easily see that security is no longer a side issue. With increased adoption of Hadoop, and its use as a shared, multi-tenant service within organizations, security has come to the forefront. Hadoop has historically had good authentication capabilities woven into all the layers, and we are now extending that with Apache Ranger as part of HDP 2.2. Ranger is a comprehensive approach to centralized security policy administration, addressing authorization and auditing. Administrators can easily set policies on files, tables, etc. for individual users and groups, and then audit access to specific data feeds, for example. Ultimately, security requires a comprehensive approach addressed at all layers of the stack, and we are working throughout the Hadoop ecosystem to weave a security fabric into it. It is imperative.
On the data governance and management side, Apache Falcon is making huge strides in HDP 2.2. You will see enhancements to the data-pipeline management UIs, security improvements, disaster recovery via data and metadata replication, and UI and API improvements that help users answer questions about data lineage, classification and audit.
Operations are getting better with dozens of new features in Apache Ambari. If I had to pick one important theme, it would be how Ambari is getting better at enabling third-party extensions to both its APIs and UIs. With Ambari Views, one can use the same underlying cluster information that Ambari uses and build UIs that expose domain- or site-specific visualizations. With Ambari Blueprints, one can specify a completely new stack definition and its component layout, and let Ambari provision and orchestrate the custom stack without relying on the usual install wizard. This essentially rips out the guts of Ambari and exposes them as clean APIs for third-party integration.
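As a rough illustration of the Blueprints idea: a blueprint is a JSON document that names a stack and lays components out onto host groups. A minimal single-node example might look like the following (the host-group layout here is illustrative, not a recommended deployment):

```json
{
  "Blueprints": {
    "stack_name": "HDP",
    "stack_version": "2.2"
  },
  "host_groups": [
    {
      "name": "master",
      "cardinality": "1",
      "components": [
        { "name": "NAMENODE" },
        { "name": "SECONDARY_NAMENODE" },
        { "name": "DATANODE" }
      ]
    }
  ]
}
```

Registering such a document through Ambari's REST API, and then submitting a separate cluster-creation request that maps real hosts onto the host groups, lets Ambari drive provisioning entirely through APIs instead of the install wizard.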
This is all happening in the open, with a broad community of developers from thousands of companies contributing to make sure the right requirements are delivered. Again, this is the power of the open community that only the Apache Software Foundation can foster.