MapR Technologies released a big data toolkit, based on
Apache Hadoop with their own distributed storage alternative to
HDFS. The software is commercial, with MapR offering both a free version, M3, and a paid version, M5. M5 includes snapshots and mirroring for data, Job Tracker recovery, and commercial support. MapR's M5 edition will form the basis of EMC Greenplum's upcoming
HD Enterprise Edition, whereas EMC Greenplum's HD Community Edition will be based on Facebook's Hadoop distribution rather than MapR technology.
At the
Hadoop Summit last week, MapR Technologies announced the general availability of their "Next Generation Distribution for Apache Hadoop." InfoQ interviewed CEO John Schroeder and VP Marketing Jack Norris to learn more about their approach. MapR claims to improve MapReduce and HBase performance by a factor of 2-5, and to eliminate single points of failure in Hadoop. Schroeder says that they measure performance against competing distributions by timing benchmarks such as DFSIO, Terasort, YCSB, Gridmix, and Pigmix. He also said that customers testing MapR's technology are seeing a 3-5 times improvement in performance against previous versions of Hadoop that they use. Schroeder reports that they had 35 beta testers and that they showed linear scalability in clusters of up to 160 nodes. MapR reports that several of the beta test customers now have their technology in production - including one that has a 140 node cluster in production, and another that "is looking at deploying MapR on 2000 nodes." By comparison, Yahoo is believed to run the largest Hadoop clusters, comprised of
4000 nodes running Apache Hadoop, and competitor
Cloudera claimed to have more than 80 customers running Hadoop in production in March 2011, with
22 clusters running Cloudera's distribution that are over a petabyte as of July 2011.
MapR's distributed file system supports full random access read and write within files and provides NFS gateways to support traditional POSIX filesystem access in addition to the Hadoop FileSystem API. The MapR file system works on raw disk (rather than running atop file systems such as ext4), so it requires a separately formatted volume for use. MapR supports compression in the file system layer and makes multiple copies of metadata across the cluster for availability. MapR distributes metadata across nodes and doesn't require it to be held in RAM, which the company claims will allow a single cluster to support a trillion files. This differs from HDFS, which currently keeps all file metadata in RAM on a single machine. Cloudera and
Hortonworks have both
identified removing the single point of failure for HDFS as a top priority for the Hadoop community, and Hortonworks has also
identified HDFS file scalability as a top priority for 2012. The MapR file system is implemented in C and routes data using a state machine instead of a multi-threaded locking scheme. MapR uses its distributed file system to implement the Hadoop shuffle (instead of HTTP), and it multiplexes traffic between any pair of nodes over a single connection, which allows a wider fan-in for large sorts.
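To make the random-access claim concrete, here is a minimal sketch of what "read and write within files" through a POSIX path looks like from an application's point of view. It uses only standard Python file calls. The file name `events.log` and the helper names are hypothetical, and the demo runs against a local temporary directory. On a real cluster the base directory would be wherever the NFS gateway is mounted, which this article does not specify.

```python
import os
import tempfile


def overwrite_in_place(path: str, offset: int, payload: bytes) -> None:
    """Overwrite bytes at `offset` in an existing file without rewriting the rest."""
    with open(path, "r+b") as f:  # "r+b" opens for read/write without truncating
        f.seek(offset)
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the write to the backing store


def read_range(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def demo(base_dir: str) -> bytes:
    """Create a small file, patch three bytes in the middle, and read it back."""
    path = os.path.join(base_dir, "events.log")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"0123456789")
    overwrite_in_place(path, 3, b"XYZ")
    return read_range(path, 0, 10)


if __name__ == "__main__":
    # Assumption: a temporary local directory stands in for an NFS mount point.
    with tempfile.TemporaryDirectory() as base_dir:
        print(demo(base_dir))
```

The in-place overwrite is the operation that an append-only store cannot offer through an ordinary file path. Whether a given mount honors it depends on the file system behind it.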
The paid M5 version of MapR's product costs $4000 per node per year and supports replication, snapshots, and mirroring for files, as well as commercial 24x7 support. MapR's commercial M5 distribution also includes a facility to restart a JobTracker within seconds of a failure, and a means for TaskTrackers to reconnect. This means that there can be a delay in completing jobs in this case, but jobs that are in progress will continue to execute and complete, rather than failing as in stock Apache Hadoop. In the event of a file system master process crashing, another replica takes over immediately and transparently without interruption of service.
MapR recently
announced they would contribute enhancements they've made back to open source projects. We asked what technologies they will contribute. Schroeder highlighted fixes in
Zookeeper,
HBase, and
Mahout. Schroeder says that they are considering open sourcing additional technologies if there are clear benefits to customers. He adds, however, that the customers he speaks to are not concerned that some technologies will remain closed-source. Schroeder says that they do want applications to work and to run on standard APIs so they will continue to run in the future.
InfoQ asked Schroeder about Hadoop governance. Schroeder said MapR wants to be a part of the Apache Hadoop community and "we are a part of it by default. What's published by Hadoop becomes a de facto standard." He would like to see a consortium to standardize APIs and offer a certification environment like ANSI SQL or NFS. InfoQ asked Schroeder about the risk of some of the key technologies in Hadoop forking. Schroeder felt that the term fork is a loaded word, but that to him the risk is fracturing the community. He asked "How do you improve a platform that needs innovation without changing it?" To him, the API layers are really important. He asked rhetorically "If the NameNode is a Single Point of Failure or if it does not perform well, if it can't scale, is fixing those problems not allowed?" He said that by extension, you could mandate that you have to use function calls with Hadoop MapReduce, so you can't use
Datameer (a graphical BI tool for Hadoop). Schroeder argues that Hadoop is an open source project that requires a lot of innovation and that a lot of engineering resources are required to mature it, because unlike Linux and MySQL, Hadoop has become popular before its technology class has matured.
MapR's distribution bundles many of the same Hadoop ecosystem components as
Cloudera's Distribution including Apache Hadoop, such as HBase,
Flume,
Sqoop and
Oozie. In addition it includes Mahout and
Cascading, although it doesn't include
Hue. MapR offers its own set of management tools and APIs for installation, configuration, data placement, and monitoring. Norris reports that MapR supports native Linux security, including any PAM authentication approach and delegation of authority, instead of the
Hadoop Security approach that the Apache Hadoop project has adopted.
Cloudera has announced eleven integration partnerships for database and BI tool integration, including with Quest, Teradata, Netezza, Vertica, and MicroStrategy. MapR says that all the Sqoop database connectors, such as Quest's Oracle-Hadoop connector, work with MapR. MapR favors the use of database NFS clients as an integration approach, and says it is working on enhanced integration with EMC Greenplum. MapR also claims that all the Cloudera BI connectors work with MapR, but that it prefers to support ODBC, JDBC, or NFS access from BI tools to data in its environment, such as by generating CSV data to be read through JDBC.
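As a rough illustration of the NFS-style integration described above, the sketch below shows an external tool aggregating a CSV file through an ordinary file path, with no Hadoop client library involved. Everything specific here is an assumption for illustration: the file name `part-00000.csv`, the column names, and the helper functions are hypothetical, and a local temporary directory stands in for an NFS-mounted cluster path.

```python
import csv
import os
import tempfile


def total_by_key(csv_path: str, key_col: str, value_col: str) -> dict:
    """Sum `value_col` grouped by `key_col` using only the standard csv module."""
    totals = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            key = row[key_col]
            totals[key] = totals.get(key, 0.0) + float(row[value_col])
    return totals


def demo_csv(base_dir: str) -> dict:
    """Write a tiny sample file (standing in for job output) and aggregate it."""
    path = os.path.join(base_dir, "part-00000.csv")  # hypothetical output file
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "sales"])  # hypothetical columns
        writer.writerows([["east", "10.5"], ["west", "4"], ["east", "2"]])
    return total_by_key(path, "region", "sales")


if __name__ == "__main__":
    # Assumption: base_dir would be an NFS mount of the cluster in practice.
    with tempfile.TemporaryDirectory() as base_dir:
        print(demo_csv(base_dir))
```

The point of the sketch is that the reading side needs nothing beyond standard file I/O, which is the appeal of exposing cluster data over NFS to existing BI and database tools.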