Amazon recently announced EMR File System (EMRFS), an implementation of Hadoop Distributed File System (HDFS) that allows Amazon Elastic MapReduce (EMR) clusters to use Amazon Simple Storage (S3) with a stronger consistency model. When enabled, this new feature keeps track of operations performed on S3 and provides list consistency, delete consistency and read-after-write-consistency, for any cluster created with Amazon Machine Image (AMI) version 3.2.1 or greater.
EMRFS is particularly important for applications that rely on a chained series of MapReduce jobs because S3 does not guaranty when objects created by a job will be visible by the next one. Amazon explains that changes are typically visible in "tens or hundreds of milliseconds" but does not provide any upper-bound measure. In fact, Patrick Eaton, system architect at Stackdriver (a cloud monitoring service acquired last year by Google), explained in a blog article:
Under normal circumstances, [S3] objects are available just a few seconds after they are written, and the processing pipeline is very predictable. But (...) you cannot rely on such assumptions to hold always when you are using an eventually consistent system. In [one particular] case, it took almost 12 hours for several dozen operations to become consistent. For another ten objects, it took almost a full 24 hours.
Similarly, Netflix engineers noticed that the inconsistency window for list operations can occasionally last more than "a few minutes". As for read-after-write consistency, researchers from the Karlsruhe Institute of Technology in Germany measured a peak at 48 seconds. Running a pipeline of MapReduce jobs on S3 without EMRFS is for this reason very likely to lead to data corruption, data loss and job failures. It is also greatly discouraged by the official Hadoop wiki.
In order to overcome the problem of consistency and allow MapReduce pipelines to run safely on S3, Amazon extended HDFS with mechanisms for detecting inconsistent listings and implementing retry strategies. It uses a DynamoDB table to keep track of all changes made to S3 from EMR, and use this information to check the results of list operations. In case of inconsistency, EMRFS retries five times by default using an exponential back-off retry policy before throwing an exception. Users can also opt for a fixed retry period or log the error without throwing an exception.
As Werner Vogels, chief technology officer at Amazon, explained in an article published 2009 in the Communication of the ACM, building a reliable distributed storage system on a worldwide scale is particularly challenging, and trade-off are unavoidable. In fact, according to the CAP theorem, any shared data system can provide only two of the three properties: consistency (C), high availability (A), and tolerance to network partitions (P). Since availability and network partition are two strong requirements for an Internet-scale computing platform, S3 designers were only left with the option of trading-off consistency.
Seventeen years after the introduction of this theorem by Eric Brewer, professor of computer science at the University of California, Berkeley, researchers continue to investigate how to improve performances of distributed systems. Among the most recent publications are the work of Bailis et al. (2014) on PBS (Probabilistically Bounded Staleness), a statistical model for quantifying eventual consistency, and Bermbach et al. (2014) on building a standard consistency benchmark for cloud-hosted data storage services.
Shortly after the announcement of EMRFS, Jonathan Fritz, senior product manager for Amazon EMR, published a detailed article on the AWS Big Data Blog explaining how to use EMRFS in an ETL pipeline. The article also mentions that while EMRFS is available for free in all AWS regions, the meta-data stored in the DynamoDB table are charged by hour.