Hadoop is an open source distributed computing platform that includes implementations of MapReduce and a distributed file system. Last month InfoQ covered Jeremy Zawodny's overview of the speed increases in Hadoop over the last year. InfoQ's lead Java editor, Scott Delap, recently caught up with Hadoop project lead Doug Cutting. In this special InfoQ interview Cutting discusses how Hadoop is used at Yahoo, the challenges of its development, and the future direction of the project.
Scott Delap (SD): Is Hadoop being used in production for any functions at Yahoo yet? If not, what is the plan for moving it from an experimental technology to a core infrastructure component?
Doug Cutting (DC): Yahoo! regularly uses Hadoop for research tasks to improve its products and services such as ranking functions, ad targeting, etc. There are also a few cases where data generated by Hadoop is directly used by products. Hadoop's longer-term goal is to provide world-class distributed computing tools that will support next-generation web-scale services such as analyzing web search crawl data.
SD: How large is the Hadoop team at Yahoo? How many active outside contributors does the project have?
DC: Yahoo! has a focused team of developers working directly on Hadoop, actively contributing to the Apache open source project as their primary job. There are also a handful of non-Yahoo!s who contribute to Hadoop on a daily, weekly, or monthly basis.
SD: Yahoo is taking a decidedly different approach to scalable infrastructure in comparison to Google. While Google has published numerous papers on its technology, the technology itself is not available for public use. Why do you feel that the open source direction is the right one to take?
DC: Open source works best when everyone agrees on what the product will produce and when there is a documented solution to a well-understood need. The open source development model works exceptionally well for infrastructure software that has a wide variety of applications in many fields. We've seen this with a lot of the software that Yahoo! uses and supports: FreeBSD, Linux, Apache, PHP, MySQL, and so on.
Making this technology available for anyone to use helps Yahoo! while also helping to advance the state of the art in building large distributed systems. The source code is only one piece of the puzzle. An organization still needs an incredibly talented team of engineers to put it to use by solving big problems. Having the ability to deploy and manage the right infrastructure is also incredibly important. Few companies have all the necessary resources today.
Finally, engineers love working on open source because it gives them visibility in a larger community of like-minded developers while letting them learn non-proprietary skills that they can apply to future projects. An environment like that also makes it easier to recruit great engineers.
Yahoo! and the Hadoop community both benefit from collaborating to understand the need for large-scale distributed computing, and from sharing our expertise and technology to create a solution that everyone can use and modify.
SD: Moving back to the technology itself, what have you learned about the factors influencing speed and reliability over the last year as Hadoop has evolved? I noticed that a sort benchmark on 500 nodes is 20x faster than a year ago. Has there been one magic bullet, or an aggregation of items that has influenced the speedup?
DC: In the process of building software infrastructure for its web-scale services, Yahoo! recognized that a growing number of other companies and organizations were likely to need similar capabilities. Yahoo! made the decision to take an open source approach rather than developing a proprietary solution, hired me to lead its efforts and began backing the project. To date, Yahoo! has contributed the majority of the project's code.
The speedup has been an aggregation of our work over the past few years, accomplished mostly by trial and error. We get things running smoothly on a cluster of a given size, then double the size of the cluster and see what breaks. We aim for performance to scale linearly as the cluster size increases. We learn from this process and then increase the cluster size again. Each time the cluster grows, reliability becomes a bigger challenge, since the number and kinds of failures increase.
Every time we do this, we are learning about what can be achieved and contributing to the common knowledge about open source grid computing. New failure modes emerge as the scale increases, problems that were rare become common and must be dealt with, and the lessons we learn influence the next iteration.
SD: Images allowing Hadoop to run on Amazon EC2 have begun to emerge in the last year. This allows developers to quickly start up their own mini clusters. Is there any additional work being done to simplify the management of a cluster, the HDFS, MapReduce processes, etc.?
DC: Yahoo! has a project called Hadoop on Demand (HOD) that allocates MapReduce clusters dynamically to users from a pool of machines. This is in the process of being contributed to the Hadoop open source project. Amazon EC2 is ideal for folks who are getting started with Hadoop, since operating a large cluster is complex and resource-consuming.
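For developers experimenting along these lines, the key point is that a client binds to a particular cluster through just two settings: the address of the HDFS namenode and that of the MapReduce jobtracker. Here is a minimal sketch; the host names and ports are hypothetical placeholders, not HOD or EC2 defaults:

    import org.apache.hadoop.mapred.JobConf;

    public class ClusterClient {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Address of the allocated cluster's HDFS namenode (placeholder).
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000/");
        // Address of the allocated cluster's jobtracker (placeholder).
        conf.set("mapred.job.tracker", "jobtracker.example.com:9001");
        // Jobs configured against this JobConf now run on that cluster.
      }
    }

Tools like HOD generate equivalent client-side configuration automatically when they allocate a cluster.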
SD: How would you compare the functionality of Hadoop to Google's published designs at this point, in terms of feature parity? Have features such as scheduling processes close to the data they operate on been tackled yet?
DC: Large scale distributed computing software has been developed internally by numerous companies (including Yahoo!) and in academic research labs in the last decade. Interest has accelerated recently as the computing economics have become more favorable and the applications more visible in consumer products. Unlike Google, Yahoo! has decided to develop Hadoop in an open and non-proprietary environment and the software is free for anyone to use and modify.
The goal for Hadoop extends far beyond cloning anything that exists today. We're focused on building a usable system that everyone will benefit from. We've implemented most of the optimizations Google has published, plus many others not mentioned. Yahoo! has taken a leadership role in the project because its goals match our own needs quite well and we see the benefit of sharing this technology with the rest of the world.
SD: The latest official release is 0.13.1. What are the major features on the horizon in the near term? What needs to be finished for a 1.0 release?
DC: The 0.14.0 release is just about out the door, with a list of 218 changes.
The biggest changes in 0.14.0 are to the filesystem. We greatly improved data integrity. This is a mostly invisible change to users, but it means that the filesystem can detect corruption earlier and more efficiently. This is critical, because, with the size of datasets and clusters we're dealing with, both memory and disk corruption are frequent. We also added modification times to files, added a C++ API for MapReduce, plus a host of other features, bug fixes and optimizations.
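For a concrete sense of how user code sees these filesystem features, here is a minimal sketch that reads an HDFS file and inspects its modification time. Method names follow the FileSystem API as it later stabilized; the exact calls shifted between releases, so treat this as illustrative rather than version-exact:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CatWithMetadata {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // resolves fs.default.name
        Path file = new Path(args[0]);

        // Modification time is part of the per-file metadata.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("modified at: " + status.getModificationTime());

        // Reads are checksummed by the client; detected corruption
        // surfaces as an exception rather than silently bad data.
        FSDataInputStream in = fs.open(file);
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
          System.out.write(buf, 0, n);
        }
        in.close();
      }
    }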
Hadoop 0.15.0 is still taking shape, with 88 changes planned so far.
This will add authentication and authorization to the filesystem, to make it harder for folks sharing a cluster to stomp on each other's files. We're also planning to revise a lot of the MapReduce APIs.
That's a tough one, since it will require users to change their applications, so we want to get it right the first time.
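For context, a job written against the current MapReduce API, the one under discussion for revision, looks roughly like the classic word-count sketch below, built on the org.apache.hadoop.mapred interfaces (details varied slightly between releases):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

      // Emits (word, 1) for every token in each input line.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Sums the counts emitted for each word.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

Because user code implements these interfaces directly, any revision to them ripples through every existing application, which is why the project wants to settle the APIs before 1.0.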
We currently hope that 0.15 will be the last pre-1.0 release. Once we make a 1.0 release, we will need to be much more conservative about making incompatible changes. We already pay a lot of attention to back-compatibility, but with 1.0 it will become even more important. The plan is that any user code written against 1.0 must continue to run unmodified for all subsequent 1.X releases. So we need to make sure we have APIs that we can live with for a while, or at least ones that we can easily extend back-compatibly. We're going to try to get those all into 0.15 and live with them for one release before we lock them down in the 1.0 release.