The Hadoop Summit of 2010 included a keynote from Peter Sirota, General Manager of Amazon Elastic MapReduce (EMR), which is a hosted Hadoop offering from Amazon that includes web-based management tools. Sirota outlined the following use cases as common ones for their customers:
- data mining & business intelligence including log processing, clickstream analysis, similarity analysis, and targeted advertising (which he termed "a huge use case")
- data warehousing especially with Pig and Hive
- bio-informatics (genome analysis)
- financial simulation (e.g., Monte Carlo simulations)
- file processing (e.g., resizing JPEG images)
- web indexing
Sirota noted that customers can store hundreds of petabytes on Amazon's S3 storage system. He announced that Amazon now supports a new stack based on Hadoop 0.20 alongside the existing stack based on Hadoop 0.18, which they "won't retire any time soon." The Amazon EMR software is integrated with their management console and works natively with Amazon's S3 cloud storage facility.
New Stack | Old Stack |
Hadoop 0.20 | Hadoop 0.18 |
Pig 0.6 | Pig 0.3 |
Hive 0.5 | Hive 0.4 |
Cascading 1.1 | Cascading 1.1 |
Sirota noted that customers had asked for more flexibility in running clusters, better application development tools, improved analytics, and improved support options. He then announced new capabilities and partnerships in each area. First, customers can now add nodes to and remove nodes from running clusters, which changes the runtime of jobs already underway. For example, doubling the computing capacity of a job that is expected to take 6 more hours to complete could cut the time required to finish to 3 hours. He also noted that resizing lets customers keep the same EMR cluster up while changing its size: a smaller set of nodes can answer queries using Hive, and the cluster can be ramped up for larger batch processes that update a Hadoop system.
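The 6-hours-to-3-hours figure follows from simple proportional arithmetic. The sketch below is only an illustration of that reasoning, not anything Amazon presented: it assumes perfectly linear scaling (no scheduling or data-transfer overhead), and the function name and the node counts of 10 and 20 are hypothetical.

```python
def remaining_hours(hours_left: float, current_nodes: int, new_nodes: int) -> float:
    """Estimate remaining runtime after a resize.

    Assumes ideal linear scaling: the remaining work is spread evenly
    across nodes, with no overhead for rebalancing. Real jobs will
    usually do somewhat worse than this estimate.
    """
    if current_nodes <= 0 or new_nodes <= 0:
        raise ValueError("node counts must be positive")
    return hours_left * current_nodes / new_nodes


# Hypothetical cluster: doubling from 10 to 20 nodes with 6 hours left.
print(remaining_hours(6, 10, 20))  # 3.0
```

The same function also covers shrinking a cluster: halving the node count doubles the estimate under the same idealized assumption.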
Sirota also preannounced the coming availability of spot pricing for Elastic MapReduce, extending Amazon's market pricing for excess EC2 capacity to EMR. Customers will be able to bid a certain amount for additional nodes. The nodes will be added to the EMR cluster if capacity is available at the bid price, but they could be removed if the market price rises above the bid. He gave the example of a job that uses four on-demand nodes, with five additional spot nodes being added. This option can reduce costs in environments that have some flexibility in how quickly calculations must complete.
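The bidding rule described above can be sketched in a few lines. This is a simplified model, not Amazon's implementation: it ignores whether spare capacity is actually available, treats all spot nodes as added or removed together, and uses a hypothetical function name, bid, and hourly prices. Only the four on-demand and five spot nodes come from Sirota's example.

```python
def active_nodes(on_demand: int, spot_requested: int,
                 bid: float, market_price: float) -> int:
    """Return how many nodes are running at a given market price.

    On-demand nodes always run. Spot nodes run only while the market
    price does not exceed the bid (capacity limits are ignored here).
    """
    spot = spot_requested if market_price <= bid else 0
    return on_demand + spot


# Hypothetical bid and hourly market prices, in dollars per node-hour.
bid = 0.05
market_prices = [0.03, 0.04, 0.06, 0.04]

for hour, price in enumerate(market_prices):
    nodes = active_nodes(on_demand=4, spot_requested=5,
                         bid=bid, market_price=price)
    print(f"hour {hour}: price={price:.2f} nodes={nodes}")
```

In this run the cluster has nine nodes in every hour except the third, when the price of 0.06 exceeds the bid and only the four on-demand nodes remain. That is why the option suits jobs that can tolerate a variable completion time.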
Sirota also announced the availability of new silver and gold premium support levels for EMR. Gold support is available 24x7 and guarantees a 1-hour response time for urgent issues. Sirota then demonstrated Amazon's partnerships with Karmasphere for developer tools and monitoring, Datameer for business user analytics, and MicroStrategy, which is providing Hadoop support in general, including EMR support, by integrating its business intelligence tools through Hive.
Amazon hosted an Elastic MapReduce customer panel at the Hadoop Summit, which featured case studies from Razorfish, Netflix, Spiral Genetic, and Coldlight Solutions, as summarized by James Hamilton.
Amazon demonstrated significant continued investment in improving Elastic MapReduce and gave some interesting insights into the kinds of large-scale applications that are being built on the hosted offering.