EMC Greenplum has announced Pivotal HD, a new Hadoop distribution that includes a fully SQL-compliant MPP database running on HDFS, claimed to be “hundreds of times faster than Hive”.
Pivotal HD contains the usual suspects of a standard Hadoop distribution – HDFS, Pig, Hive, Mahout, MapReduce, etc. – but adds a number of other components, shown in the architectural snapshot below:
The main component of Pivotal HD is HAWQ, an MPP (Massively Parallel Processing) relational database running directly on HDFS through a dynamic pipelining mechanism and featuring:
- SQL compliance – supports SQL-92, SQL-99, SQL-2003 OLAP extensions, etc., and is 100% compatible with PostgreSQL 8.2
- Row- or column-oriented data storage (see the DDL sketch after this list)
- Query Optimizer – queries can be run on hundreds of thousands of nodes
- Fully ODBC/JDBC compliant
- Interactive querying – complex queries on large data sets are answered in seconds or even sub-second time
- Data management – provides table statistics, table security
- Supports data stored in HDFS, Hive, HBase, Avro, ProtoBuf, Delimited Text and Sequence Files
- Deep analytics – including data mining and machine learning algorithms
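Because HAWQ is derived from the Greenplum Database, its storage and distribution choices can be expressed through PostgreSQL-style DDL. The following is a minimal, illustrative sketch using Greenplum-style table options (appendonly, orientation, DISTRIBUTED BY); the exact option names and defaults in Pivotal HD are an assumption and may differ:

-- Illustrative only: Greenplum-style DDL, assumed to carry over to HAWQ
CREATE TABLE retail.order (
    order_id    bigint,
    customer_ID bigint,
    amount      numeric(10,2),
    order_date  date
)
WITH (appendonly = true, orientation = column)  -- column-oriented storage
DISTRIBUTED BY (customer_ID);                   -- rows hashed across segment servers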
Gavin Sherry, Sr. Director of Engineering at Greenplum, demoed (see video at ~42’42”) the following SQL SELECT statement running over 1B rows – several TB of data – on a 60-node HDFS cluster in ~13 seconds, providing close to real-time querying capabilities:
SELECT gender, count(*)
FROM retail.order JOIN customers ON retail.order.customer_ID = customers.customer_ID
GROUP BY gender;
According to Donald Miner, Solutions Architect at EMC Greenplum, “HAWQ is hundreds of times faster than Hive”, as shown in the next graphic from Greenplum (PDF):
HAWQ solves queries with “sub-second response time, while at the same time running over much larger datasets and processing with the full expressiveness of SQL, in the same engine.” Miner explains how they made it possible:
We have what we call “segment servers” that manage a shard of each table. Several segment servers run on each data node of your cluster. This shard of data, however, is completely stored within HDFS. We have a “master” node that has the job of storing the top-level metadata, as well as building the query plan and pushing the node-local queries down to the segment servers.
When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine. HAWQ follows the MPP architecture, streaming data through stages in a pipeline instead of spilling and checkpointing to disk (like MapReduce). Also, the segment servers are always running, so there is no spin-up time.
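A rough way to picture this push-down, pipelined execution is the distributed query plan itself. The sketch below shows what an EXPLAIN of the demo query might look like on a Greenplum-derived MPP engine; the plan shape, motion operators and segment counts are illustrative assumptions, not output captured from HAWQ:

EXPLAIN
SELECT gender, count(*)
FROM retail.order JOIN customers ON retail.order.customer_ID = customers.customer_ID
GROUP BY gender;

-- Hypothetical plan (60 segments assumed, one per data node):
-- Gather Motion 60:1                      -- master gathers final results from the segments
--   -> HashAggregate (group by gender)    -- aggregation runs on each segment
--     -> Hash Join                        -- join executed locally against each segment's shard
--       -> Seq Scan on retail.order
--       -> Hash
--         -> Broadcast Motion 60:60       -- smaller customers table streamed to every segment
--           -> Seq Scan on customers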
Pivotal HD comes in three flavors (PDF): Enterprise, Database Services and a Community Edition for evaluation purposes.