Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data... Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located...
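To make the model concrete, here is a minimal sketch of the MapReduce idea in plain Java: a map step emits (word, 1) pairs, a shuffle step groups pairs by key, and a reduce step sums each group. This is a conceptual illustration only, not the actual Hadoop API; in a real Hadoop job each map task would run on the node holding its block of input.

```java
import java.util.*;
import java.util.stream.*;

// Conceptual MapReduce sketch (plain Java, not the Hadoop API).
public class WordCountSketch {
    // Map phase: split one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle + reduce phase: group pairs by key, then sum each group.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    // Run the whole pipeline over a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            // In Hadoop, each line (block) would be mapped on the node storing it.
            emitted.addAll(map(line));
        }
        return reduce(emitted);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                wordCount(Arrays.asList("the quick brown fox", "the lazy dog"));
        System.out.println(counts.get("the")); // prints 2
    }
}
```

Because map emits independent pairs and reduce only needs all pairs sharing a key, both phases parallelize across machines with no shared state, which is what lets Hadoop scale the same program from 20 to 900 nodes.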
Hadoop, which is written in Java, has been backed extensively by Yahoo in recent years, with project lead Doug Cutting hired by Yahoo to work full time on the project in January 2006. The University of Washington has gone so far as to run a course on distributed computing using Hadoop as a base. The course material has been posted on Google Code for other developers interested in the technology.
Recently Yahoo's Jeremy Zawodny provided a status update of Hadoop:
For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges ... The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy ... To build the necessary software infrastructure, we could have gone off to develop our own technology, treating it as a competitive advantage, and charged ahead. But we've taken a slightly different approach. Realizing that a growing number of companies and organizations are likely to need similar capabilities, we got behind the work of Doug Cutting (creator of the open source Nutch and Lucene projects) and asked him to join Yahoo to help deploy and continue working on the [then new] open source Hadoop project...
Zawodny goes on to provide data sort benchmarks from the last year. In these tests each node sorts the same amount of input data, so the total workload grows with the cluster: for example, 20 nodes sorting 100 records each handle 2,000 records in total, while 100 nodes sorting 100 records each handle 10,000. Recent benchmarks are as follows:
Date | Nodes | Hours
April 2006 | 188 | 47.9
May 2006 | 500 | 42.0
December 2006 | 20 | 1.8
December 2006 | 100 | 3.3
December 2006 | 500 | 5.2
December 2006 | 900 | 7.8
July 2007 | 20 | 1.2
July 2007 | 100 | 1.3
July 2007 | 500 | 2.0
July 2007 | 900 | 2.5
Tim O'Reilly picked up on Zawodny's post and confirmed that support comes from the top of the Yahoo organization:
...Yahoo! had hired Doug Cutting, the creator of hadoop, back in January. But Doug's talk at Oscon was kind of a coming out party for Hadoop, and Yahoo! wanted to make clear just how important they think the project is. In fact, I even had a call from David Filo to make sure I knew that the support is coming from the top......why is Yahoo!'s involvement so important? First, it indicates a kind of competitive tipping point in Web 2.0, where a large company that is a strong #2 in a space (search) realizes that open source is a great competitive weapon against their dominant competitor ... Supporting Hadoop and other Apache projects not only gets Yahoo! deeply involved in open source software projects they can use, it helps give them renewed "geek cred." ... Second, and perhaps equally important, Yahoo! gives hadoop an opportunity to be tested out at scale...
Blogger John Munsch summed up the Yahoo involvement with a post titled "Hadoop And The Opposite Of The Not-Invented-Here Syndrome".
Microsoft's Sriram Krishnan shifts the discussion to the problem facing startups and developers as the industry moves toward such large-scale infrastructure and evolving solutions like Hadoop and Amazon EC2:
...Most of the value in Web 2.0 comes from data (generated by lots of users). E.g - del.icio.us, Digg, Facebook ... It is beyond the financial means of any one person to run large scale server software E.g Gmail, Google Search, Live, Y! Search ... Your typical long-haired geek probably can't get his hands on [# Large scale blob storage (S3, Google File System), Large scale structured storage ( Google's Bigtable), Tools to run code across such infrastructure (MapReduce, Dryad).]...I'm not sure how far along Doug Cutting's open source equivalents have come along for these - so that might be an answer...