Google recently announced the general availability of the Hive-BigQuery Connector, simplifying integration and migrations between Apache Hive and Google BigQuery. The open-source connector is a Hive storage handler that enables Hive to interact with BigQuery's storage layer.
The new option supports queries in Hive using the SQL-like query language HiveQL to read from and write to BigQuery. Data engineers can access and query BigQuery datasets without data movement and BigQuery users can leverage Hive's tools, libraries, and frameworks for data processing and analysis. Julien Phalip, solutions architect at Google Cloud, writes:
The Hive-BigQuery connector implements the Hive StorageHandler API to allow Hive workloads to integrate with BigQuery and BigLake tables. While Hive's execution engine still handles all compute operations such as aggregates and joins, the connector manages all interactions with the data layer in BigQuery, whether the underlying data is stored in BigQuery native storage or in Cloud Storage buckets via a BigLake connection.
Apache Hive, a popular distributed data warehouse option built on top of Hadoop, allows users to perform queries on large datasets. BigQuery, the serverless data warehouse on Google Cloud, provides scalable queries on massive datasets. The open-source connector ensures data consistency and reliability by using Hive's metadata to represent tables stored in BigQuery.
The connector supports queries with MapReduce and Tez execution engines, creating and deleting BigQuery tables from Hive, and joining BigQuery and BigLake tables with Hive tables. It also supports fast reads from BigQuery tables using the Storage Read API streams and the Apache Arrow format.
Source: Google Data Analytics Blog
According to the cloud provider, the Hive-BigQuery Connector can help companies in the following scenarios: ensure continuity of operations during migration, use BigQuery for a subset of data warehouse needs, or maintain a full open-source software stack.
With the BigQuery Migration Service, Google already supported the BigQuery batch SQL translator and the interactive SQL translator, translating Hive queries to BigQuery's own ANSI-compliant SQL syntax. Phalip explains:
The new Hive-BigQuery connector offers one additional option: You can keep your original queries in their HiveQL dialect, continue to run those queries with the Hive execution engine on your cluster, but let those queries access data migrated to BigQuery and BigLake tables.
This is not the first open-source connector released by Google to reduce data transformations, enabling the analysis of diverse datasets: the Cloud Storage Connector implements the Hadoop Compatible File System (HCFS) API to store and access data files in Cloud Storage, while the Apache Spark SQL connector for BigQuery implements the Spark SQL Data Source API to read BigQuery tables into Spark's dataframes and write DataFrames back into BigQuery.
The Hive-BigQuery connector supports Dataproc 2.0 and 2.1. Google has outlined some limitations regarding partitioning. Since Hive and BigQuery partitioning operate differently, the Hive PARTITIONED BY clause is not supported. However, developers can still use the time-unit column and ingestion time partitioning options supported by BigQuery.
The connector is available on GitHub.