Apache Parquet, the open-source columnar storage format for Hadoop, recently graduated from the Apache Software Foundation Incubator and became a top-level project. Initially created by Cloudera and Twitter in 2012 to speed up analytical processing, Parquet is now openly available for Apache Spark, Apache Hive, Apache Pig, Impala, native MapReduce, and other key components of the Hadoop ecosystem. It is notably used by Cloudera, Twitter, Netflix, Stripe, Criteo and AppNexus.
The idea behind Parquet - and columnar storage in general - is simple and not new: storing data by columns to optimize analytic query performance and decrease storage cost. The first mentions of transposed files can be traced back to the 1970s, and commercial solutions started to appear in the 2000s with the growth of big data analytics. Parquet is, in this sense, comparable to ORC (Optimized Row Columnar) and RCFile (Record Columnar File), two other open-source initiatives aiming to bring columnar storage to the Hadoop ecosystem.
More concretely, columnar storage allows query engines to read only the columns used in a query (projection push-down), resulting in faster scans. Each column chunk can also store statistics to speed up record filtering (predicate push-down). For example, if the dictionary of a dictionary-encoded column does not contain the desired value, the entire chunk can be discarded without reading further. An added advantage is the ability to process larger chunks of data and leverage vectorization techniques to optimize the CPU pipeline, as described by Boncz et al. in their presentation of MonetDB/X100. Finally, since each column holds values of the same type, tailored encoding and compression schemes can be applied to maximize the compression ratio and reduce disk space usage.
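As a quick illustration, the short Python sketch below shows what projection and predicate push-down look like from a reader's point of view. It uses the pyarrow library, which is not part of the announcement, and the file and column names are made up for the example.

```python
import pyarrow.parquet as pq

# Projection push-down: only the requested columns are read from disk,
# so the scan skips every other column chunk in the file.
# "events.parquet" and the column names are illustrative only.
table = pq.read_table("events.parquet", columns=["user_id", "event_type"])

# Predicate push-down: row groups whose column statistics (min/max values,
# dictionary) cannot match the filter are skipped without being decoded.
filtered = pq.read_table(
    "events.parquet",
    columns=["user_id", "event_type"],
    filters=[("event_type", "=", "click")],
)
print(filtered.num_rows)
```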
Beyond query and space efficiency, Parquet is also designed to provide a high level of interoperability. First, the file format itself is language-agnostic: it is defined with the Apache Thrift IDL and can therefore be serialized and deserialized from multiple programming languages, including C++, Java, and Python. Second, Parquet's internal data structures and algorithms are data-model-agnostic; they are meant to be used through external converters such as parquet-hive, parquet-pig, and parquet-avro, which raise the level of abstraction and ease the integration effort.
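Because the footer metadata is Thrift-defined, any language binding can decode it regardless of which tool produced the file. The sketch below (again assuming pyarrow and an illustrative file name) inspects the schema and row-group layout of a file that could just as well have been written by a Java or C++ producer.

```python
import pyarrow.parquet as pq

# Open a Parquet file and decode its Thrift-encoded footer metadata.
# "events.parquet" is a hypothetical file name.
pf = pq.ParquetFile("events.parquet")

print(pf.schema_arrow)  # the logical schema of the file

meta = pf.metadata
print(meta.num_row_groups, meta.num_rows)

# Per-column statistics (min/max, null count) for the first column
# of the first row group, as used by predicate push-down.
print(meta.row_group(0).column(0).statistics)
```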
It is also worth noting that Parquet is built from the ground up to support complex nested data structures and uses the repetition/definition level approach described in Google's Dremel research paper.
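To make this concrete, the sketch below (same assumptions: pyarrow, a made-up schema and file name) writes records containing a repeated, nullable field. Parquet stores only the leaf values of such a column, together with repetition and definition levels that allow the nested structure, including empty and missing lists, to be rebuilt on read.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# An illustrative nested schema: each order carries a nullable list of item
# names. Parquet stores the leaf values plus repetition/definition levels
# that encode where each list starts and which values are missing.
schema = pa.schema([
    ("order_id", pa.int64()),
    ("items", pa.list_(pa.string())),
])

table = pa.table(
    {"order_id": [1, 2, 3], "items": [["a", "b"], None, ["c"]]},
    schema=schema,
)

pq.write_table(table, "orders.parquet")

# The nested structure round-trips intact, null list included.
print(pq.read_table("orders.parquet").to_pydict())
```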
Three years after its inception, Parquet has gained a significant role in the Hadoop ecosystem.
At Netflix, Parquet is the primary storage format for data warehousing. More than seven petabytes of our 10-plus petabyte warehouse are Parquet-formatted data that we query across a wide range of tools including Apache Hive, Apache Pig, Apache Spark, PigPen, Presto, and native MapReduce. The performance benefit of columnar projection and statistics is a game changer for our Big Data platform.
— Daniel Weeks, technical lead for the Big Data Platform team at Netflix
According to the official blog post, the community is currently working on significant improvements for future releases, including:
- integrating with the new HDFS zero-copy API to increase the reading speed
- adding statistics to speed up the record filtering
- adding a vectorized read API to reduce the processing time
- adding POJO support to generate Parquet schemas from existing Java classes
By donating Parquet to the Apache Software Foundation, Twitter and Cloudera intended to attract external developers and help Parquet become a community-driven standard. One year later, Parquet already counts more than 50 contributors, 18 of them with more than 1,000 lines of code. The community holds regular Google Hangouts that everyone is welcome to join; meeting dates are published on the project's Apache mailing list.