Julien Le Dem, co-author of Apache Parquet and a PMC member of the Apache Arrow project, presented at Data Eng Conf NY on the future of column-oriented data processing.
Apache Arrow is an open-source standard for columnar in-memory execution that originated from Apache Drill's own in-memory columnar data structures. It aims to become the de facto way of efficiently holding data in memory and exchanging it between different execution engines, thereby avoiding serialization. Apache Arrow is backed by key developers of 13 open source projects, mainly from Apache, including Calcite, Drill, Pandas, HBase, Spark, and Storm.
InfoQ interviewed Le Dem to find out how Arrow differs from Parquet, a columnar on-disk storage format, and how both can enable more efficient computation across execution engines.
InfoQ: Do you believe Apache Arrow, like Parquet, will be commoditized across execution engines like Apache Spark? Do you think it will narrow the performance difference between engines?
Le Dem: As pioneered by MonetDB, vectorized execution is the state of the art for efficient query processing. Many open source query engines are moving to this model, and we believe it makes sense to standardize the in-memory columnar representation to provide extremely efficient interoperability. What Parquet did for columnar storage, Arrow provides for columnar in-memory processing and interchange.
These standardization efforts greatly simplify integration between storage layers, query engines, DSLs and UDFs, and provide a much more efficient communication layer by removing the need for serialization. Standardization makes it easier, cheaper, and faster for all systems to interoperate by removing common bottlenecks. However, there is still plenty of room for each execution engine to innovate by providing specialized techniques that further improve performance, such as operating on compressed vectors or having a smarter query optimizer.
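The interview does not include code, but the shared columnar representation is easiest to see with a small sketch using the pyarrow Python bindings (column names and values are invented for illustration). The record batch below uses the standard Arrow layout, so any Arrow-aware engine, DSL, or UDF could consume its buffers directly instead of converting or re-serializing rows.

```python
import pyarrow as pa

# Two columns held as contiguous Arrow vectors (columnar layout).
ids = pa.array([1, 2, 3, 4], type=pa.int64())
labels = pa.array(["a", "b", "c", "d"], type=pa.string())

# A record batch in the standard Arrow format: the same in-memory
# representation every Arrow-aware system understands, so it can be
# handed between them without a serialization step.
batch = pa.RecordBatch.from_arrays([ids, labels], ["id", "label"])
print(batch.schema)
print(batch.num_rows)
```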
InfoQ: Apache Parquet supports predicate pushdown, avoiding reading data from disk whenever a page doesn’t contain data that matches the predicate. Do Apache Arrow data structures include similar functionality?
Le Dem: The tradeoffs between reading data from disk and from memory are different. Currently it is up to the engines to implement predicate pushdown. Eventually Apache Arrow will provide fast vectorized operations that can be reused across engines, although this effort has not yet started.
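For comparison, here is a hedged sketch (not from the interview) of what Parquet-side predicate pushdown looks like through the pyarrow bindings: the filters argument lets the reader skip row groups whose column statistics cannot match the predicate, so their pages are never read from disk. The file and column names are hypothetical.

```python
import pyarrow.parquet as pq

# Only row groups whose statistics may contain matching values are read;
# the rest are skipped on disk.
table = pq.read_table(
    "events.parquet",
    filters=[("event_type", "=", "click")],
)
print(table.num_rows)
```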
InfoQ: One of Arrow's goals is to provide constant-time access to data in memory and to enable vectorized operations using SIMD instructions. Does Arrow also provide in-memory data compression like Parquet?
Le Dem: Arrow supports dictionary encoding that can provide significant compression and faster execution for operations such as aggregations and joins. There is also an ongoing discussion to provide generic buffer compression using general purpose algorithms such as snappy or gzip.
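As a rough illustration of dictionary encoding (again, not from the interview), this is what it looks like through the pyarrow bindings; the values are invented:

```python
import pyarrow as pa

# A plain string column with many repeated values.
countries = pa.array(["us", "fr", "us", "de", "us", "fr"])

# Dictionary-encoded: the column becomes small integer indices into a
# dictionary of unique values, which shrinks memory use and lets
# aggregations and joins operate on the integer codes instead of strings.
encoded = countries.dictionary_encode()
print(encoded.dictionary)  # unique values: ["us", "fr", "de"]
print(encoded.indices)     # codes:         [0, 1, 0, 2, 0, 1]
```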
In this initial release, Arrow does not yet support other compression techniques such as bit packing. However, the intent is for execution engines to be able to define custom vectors, provided that standard vectors are used for interchange. This will allow more advanced techniques such as operating on compressed vectors directly; an example that comes to mind is the BitWeaving project from the University of Wisconsin. In the future, the set of standard vectors will be expanded.
The first release of Arrow has native C++-backed integration between the Pandas library, Arrow, and Parquet, allowing Arrow's Record Batches to be manipulated as Pandas dataframes and exposed to SQL-on-Hadoop engines such as Apache Drill.
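A minimal sketch of that integration, using today's pandas and pyarrow APIs (which have evolved since that first release) and invented data:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.9, 0.1]})

# pandas DataFrame -> Arrow table (columnar, C++-backed)
table = pa.Table.from_pandas(df)

# Arrow table -> Parquet file on disk, and back into a DataFrame
pq.write_table(table, "scores.parquet")
roundtrip = pq.read_table("scores.parquet").to_pandas()
```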
InfoQ: Apache Arrow supports interoperability, allowing data to be transferred between processes without serialization. Can you comment on the capabilities of Arrow's IPC layer?
Le Dem: The IPC layer is still experimental, but it is a true zero-copy layer. Once an Arrow Record Batch is finalized it becomes immutable. In this state it can be shared with other processes in read-only mode using shared memory, without worrying about concurrent access. The vector representation is independent of its memory address (no absolute pointers are required) and is safe to use in shared memory where each process sees a different address for the buffers.
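The exchange can be sketched with the Arrow IPC stream format in pyarrow. Note that this uses the current API, which postdates the experimental layer discussed here, and writes to an in-memory buffer rather than actual shared memory; the idea is the same, in that the consumer reads the vectors in place rather than decoding values one by one.

```python
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["x"])

# Write the finalized (immutable) batch in the Arrow IPC stream format.
# The bytes use the same columnar layout as in memory, so a reader can
# map them and use the vectors directly.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# The consumer side: open the buffer and iterate over record batches.
reader = pa.ipc.open_stream(buf)
for received in reader:
    print(received.to_pydict())
```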
InfoQ: Like Parquet, Apache Arrow supports nested data types. What data types are currently supported and which are on the roadmap?
Le Dem: Arrow supports all the common data types, and the list is already fairly exhaustive. Some of the more recently added types include SQL's Timestamp and Interval types.
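As a rough illustration of the type system (not from the interview), a schema mixing flat, nested, and timestamp types can be declared through the pyarrow bindings; the field names are invented:

```python
import pyarrow as pa

# Flat columns, a timestamp, and nested list / struct columns.
schema = pa.schema([
    ("id", pa.int64()),
    ("name", pa.string()),
    ("created_at", pa.timestamp("ms")),
    ("tags", pa.list_(pa.string())),
    ("location", pa.struct([("lat", pa.float64()),
                            ("lon", pa.float64())])),
])
print(schema)
```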