MapR recently announced that it is including Apache Drill in the latest release of its MapR distribution. Apache Drill is an open source project inspired by Google’s Dremel, the infrastructure on which BigQuery is based. Drill offers a low-latency SQL-on-Hadoop interface. While this puts it in the same space as several other technologies around Hadoop, Drill has some unique characteristics that set it apart from other SQL-on-Hadoop technologies.
For one, Drill is fully ANSI SQL compliant. But more importantly, Drill is built around the “data exploration first” principle. Out of the box, Drill supports several SQL and NoSQL data sources, self-describing data such as Avro, Parquet or HBase tables, and even complex, nested JSON data sources. And if the SQL interface is not enough, users can connect analytics software to a Drill data source through the ODBC connector it provides.
“Data exploration first” means that Drill can query complex, unstructured JSON without flattening it or converting it into a fixed schema. In contrast to schema-before-read SQL-on-Hadoop technologies such as Apache Hive, Apache Drill lets users explore schema-less data. Drill’s query engine can discover the data schema and prepare query plans matching the SQL queries applied to the dataset.
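To make the idea of discovering a schema at read time more concrete, here is a minimal Python sketch. It is not Drill's implementation and does not use any Drill API; the sample records and the helper names (`walk`, `discover_schema`, `select`) are hypothetical and exist only for illustration. It walks nested JSON records, collects the types seen at each path, and reads a nested field directly without first flattening the data into a fixed table.

```python
import json

# Hypothetical sample data: nested records whose fields vary from one to the next.
RAW_RECORDS = [
    '{"id": 1, "user": {"name": "ann", "tags": ["a", "b"]}}',
    '{"id": 2, "user": {"name": "bob", "age": 31}, "active": true}',
]


def walk(value, path, schema):
    """Record the type name of every leaf value under its dotted path."""
    if isinstance(value, dict):
        for key, child in value.items():
            walk(child, f"{path}.{key}" if path else key, schema)
    elif isinstance(value, list):
        # List elements share one path, marked with a "[]" suffix.
        for item in value:
            walk(item, path + "[]", schema)
    else:
        schema.setdefault(path, set()).add(type(value).__name__)


def discover_schema(raw_records):
    """Build a path -> set-of-type-names mapping by reading the data itself."""
    schema = {}
    for line in raw_records:
        walk(json.loads(line), "", schema)
    return schema


def select(raw_records, path):
    """Read one dotted path from every record; a missing field yields None."""
    results = []
    for line in raw_records:
        node = json.loads(line)
        for key in path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
        results.append(node)
    return results


if __name__ == "__main__":
    for field, types in sorted(discover_schema(RAW_RECORDS).items()):
        print(field, sorted(types))
    print(select(RAW_RECORDS, "user.age"))
```

The point of the sketch is the order of operations: nothing is declared up front, and a field that appears in only some records simply comes back as empty for the others. A real engine would of course do this in a distributed, columnar fashion rather than record by record.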
Apache Drill can be used alongside MapReduce jobs, complementing them rather than replacing them. When there is a need for quick data exploration and interactive analysis of unstructured data, Drill can definitely help. Hadoop MapReduce is still the platform of choice for batch processing in the Big Data world.
MapR has offered a developer preview of Apache Drill in earlier versions of its Hadoop distribution, but Drill 0.5.0, shipped on September 12, is the first beta-quality release. Apart from Drill, several other technologies offer SQL-on-Hadoop, each with its own strengths and weaknesses. The MapR 4.0.1 release offers four more options in this area: Apache Hive, Apache Spark SQL, Cloudera Impala and Vertica integration.