Amazon Elastic MapReduce (Amazon EMR) simplifies running Apache Hadoop and related big data applications on AWS. It removes the cost and complexity of managing a Hadoop installation and the resulting cluster, enabling any developer or business to run analytics without a large capital expenditure. Users can spin up a performance-optimized Hadoop cluster in the AWS cloud within minutes, on current high-performance compute and network hardware, without buying hardware or patching a Hadoop distribution themselves.
A new Best Practices for Amazon EMR whitepaper explains best practices for moving data to AWS, strategies for collecting, compressing, and aggregating that data, and common architectural patterns for setting up and configuring Amazon EMR clusters for processing. It also provides examples of cost optimization that leverage the various Amazon EC2 purchase options, such as Reserved and Spot Instances.
The whitepaper starts by describing the approaches available for moving large amounts of data from local storage to Amazon Simple Storage Service (Amazon S3), or from Amazon S3 to Amazon EMR and the Hadoop Distributed File System (HDFS). When doing so, however, it is critical (according to the whitepaper) to use the available data bandwidth strategically. The paper further describes a number of optimizations that make it possible to upload several terabytes a day.
The main tools for data uploads described in the whitepaper include:
- DistCp (distributed copy) - a tool that can be used for large inter- or intra-cluster copying of data. It uses MapReduce for distribution, error handling, recovery, and reporting, expanding a list of files and directories into input to map tasks, each of which copies a portion of the files specified in the source list.
- S3DistCp - an extension of DistCp optimized to work with AWS, particularly Amazon S3. Adding S3DistCp as a step in an EMR job flow makes it possible to efficiently copy large amounts of data from Amazon S3 into HDFS, where it can be processed by the cluster (see the sketch after this list).
- JetS3t Java Library - an open-source Java toolkit for building applications that interact with Amazon S3. JetS3t provides both low-level APIs and command-line tools that allow working with Amazon S3 without writing Java code.
- Aspera Direct-to-S3 - a proprietary, UDP-based file transfer protocol that allows high-speed file transfers over the Internet. It is especially effective for transferring large amounts of data from a local data center to Amazon S3 for later processing.
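As a concrete illustration, the following is a minimal sketch of adding an S3DistCp step to a running EMR cluster with the AWS SDK for Java. The job flow ID and bucket name are hypothetical, and the jar path shown corresponds to older EMR AMIs, so it may differ on your cluster.

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class S3DistCpStep {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient();

        // S3DistCp ships with EMR; on older AMIs it lives under /home/hadoop/lib (assumption).
        HadoopJarStepConfig s3DistCp = new HadoopJarStepConfig()
                .withJar("/home/hadoop/lib/emr-s3distcp-1.0.jar")
                .withArgs("--src", "s3://my-log-bucket/raw/",     // hypothetical bucket
                          "--dest", "hdfs:///data/input/",
                          "--groupBy", ".*([0-9]{4}-[0-9]{2}).*", // combine small files by month
                          "--targetSize", "128");                 // target output file size in MB

        StepConfig step = new StepConfig()
                .withName("Copy logs from S3 into HDFS")
                .withActionOnFailure("CONTINUE")
                .withHadoopJarStep(s3DistCp);

        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowId("j-XXXXXXXXXXXXX")                 // hypothetical job flow ID
                .withSteps(step));
    }
}
```

The --groupBy and --targetSize options also anticipate the aggregation advice later in the paper: combining many small files into fewer, larger ones before processing.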
While the upload tools above are well suited for moving large amounts of data from one place to another, bringing together data from multiple sources typically requires a somewhat different approach: data collection. The main solutions for data collection described in the whitepaper include:
- Apache Flume - a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows, with tunable reliability and many failover and recovery mechanisms (see the sketch after this list).
- Fluentd - an open-source tool that collects events and logs and stores them in Amazon S3 or Hadoop (HDFS) for search, analytics, and archiving.
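As a small illustration of how an application hands events to Flume, here is a sketch using Flume's RPC client API. The host, port, and event body are assumptions, and a matching Avro source must be configured on the agent side.

```java
import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a Flume agent whose Avro source listens on this host/port (assumed).
        RpcClient client = RpcClientFactory.getDefaultInstance("collector.example.com", 41414);
        try {
            // Wrap a log line in a Flume event and send it to the agent, which can
            // then route it through channels to a sink such as HDFS or Amazon S3.
            Event event = EventBuilder.withBody("GET /index.html 200", Charset.forName("UTF-8"));
            client.append(event);
        } finally {
            client.close();
        }
    }
}
```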
Once the data has been collected, the next important task is to aggregate it. The whitepaper discusses several data aggregation best practices in detail, including appropriate aggregated file sizes, data partitioning, and data compression.
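To make the compression and partitioning advice concrete, the sketch below configures a standard Hadoop MapReduce job to write gzip-compressed output under a date-partitioned path (the bucket and path are hypothetical), so that downstream jobs can read only the partitions they need.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedPartitionedOutput {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "aggregate-logs");

        // Compress the aggregated output; gzip is a common choice, though a
        // splittable codec (e.g. indexed LZO) suits very large files better.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // Partition output by date in the path (hypothetical bucket), so later
        // jobs can skip irrelevant partitions entirely.
        FileOutputFormat.setOutputPath(job, new Path("s3://my-bucket/logs/year=2013/month=06/"));
    }
}
```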
The bulk of the whitepaper is devoted to data processing with Amazon EMR, providing recommendations for:
- Picking the right instance size when provisioning an Amazon EMR cluster.
- Picking the right number of instances for a cluster.
- Estimating the number of mappers and reducers for a cluster (see the back-of-the-envelope sketch after this list).
- Choosing between transient and persistent EMR clusters.
- Optimizing EMR cluster costs by using the appropriate EC2 purchase options: On-Demand, Reserved, and Spot Instances.
- Optimizing EMR cluster performance.
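To give a flavor of the sizing guidance, here is a back-of-the-envelope calculation of how many map tasks a data set produces and how many core nodes can work through them. The input size, split size, slot count, and number of "waves" are illustrative assumptions, not figures from the whitepaper.

```java
public class ClusterSizingSketch {
    public static void main(String[] args) {
        // Assumptions (illustrative only): 2 TB of input, 128 MB HDFS splits,
        // and an instance type offering 8 map slots per node.
        long inputBytes = 2L * 1024 * 1024 * 1024 * 1024;   // 2 TB
        long splitBytes = 128L * 1024 * 1024;               // 128 MB
        int mapSlotsPerNode = 8;

        // Hadoop schedules roughly one map task per input split.
        long mapTasks = (inputBytes + splitBytes - 1) / splitBytes;

        // Size the cluster to finish the maps in a few "waves" of slots,
        // rather than provisioning enough slots for a single pass.
        int waves = 4;
        long nodes = (mapTasks + (long) mapSlotsPerNode * waves - 1)
                   / ((long) mapSlotsPerNode * waves);

        System.out.printf("map tasks: %d, core nodes needed: %d%n", mapTasks, nodes);
    }
}
```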
The whitepaper also covers a number of EMR architectural patterns, with advice on:
- Using S3 instead of HDFS (see the sketch after this list)
- Using S3 and HDFS together
- Using elastic EMR clusters
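Pointing a job at S3 instead of HDFS is largely a matter of paths, which is what makes transient clusters practical: input and output survive the cluster. The sketch below (bucket names hypothetical) shows the idea; note that on EMR of that era the URI scheme may be s3n:// depending on the AMI version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3InsteadOfHdfs {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "s3-backed-job");

        // Read input directly from S3 and write results back to S3, so the
        // cluster can be shut down or resized without losing any data.
        FileInputFormat.addInputPath(job, new Path("s3://my-bucket/input/"));    // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("s3://my-bucket/output/")); // hypothetical
    }
}
```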
All in all, Best Practices for Amazon EMR is an invaluable guide for anyone seriously considering Amazon Elastic MapReduce.