Google Cloud has recently announced the preview of a global queries feature for BigQuery. The new option lets developers run SQL queries across data stored in different geographic regions without first moving or copying the data to aggregate the results.
Removing the need for ETL pipelines, global queries automatically fetch and combine results from each region, simplifying analytics for companies with distributed datasets while letting them control where the query runs. Wawrzek Hyska, product manager at Google, and Oleh Khoma, software engineer at Google, write:
In the background, BigQuery automatically handles the data movement required to execute the query, giving you a seamless, zero-ETL experience for multi-location analytics. BigQuery identifies different parts of the query that must be executed in different regions and runs them accordingly. Next, results of these partial queries are transferred to a selected location when the main query is run (with an optimization attempt to minimize the size of transfer). Finally all parts are combined and the whole query returns the results.
For example, using a standard SQL query, developers can combine transaction data from Europe and Asia with customer data from the US. While similar results could be achieved on BigQuery using ETL pipelines to copy and centralize data before running the SQL statement, the engine can now run the query across data in different regions, making data analysis simpler and faster.
SET @@location = 'US';

WITH transactions AS (
  SELECT customer_id, transaction_amount FROM `eu_transactions.sales_2024`
  UNION ALL
  SELECT customer_id, transaction_amount FROM `asia_transactions.sales_2024`
)
SELECT
  c.customer_name,
  SUM(t.transaction_amount) AS total_sales
FROM
  hq_customers.customer_list AS c
LEFT JOIN transactions AS t
  ON c.id = t.customer_id
GROUP BY
  c.customer_name
ORDER BY
  total_sales DESC;
Global queries incur higher latency than single-region queries due to the time required to transfer data between regions. As the new feature brings additional costs and challenges, and regulations might even prohibit data leaving the original location, developers must explicitly opt in by specifying the location where a global query is executed. The new feature allows data engineers to control where data is processed, aligning with their data residency and compliance requirements. Hyska and Khoma add:
What’s different is that BigQuery now executes it across datasets that are thousands of miles apart. This both dramatically simplifies your architecture and accelerates your time to insight.
Google Cloud is not the only provider offering options to query distributed data with a single SQL statement. For example, AWS provides cross-region data sharing for Amazon Redshift, and Athena can query data across regions, but it does not automatically coordinate distributed execution across regions the way BigQuery global queries do.
To enable global queries, data engineers must update the project or organization configuration, setting enable_global_queries_execution to true in the region where the query runs and enable_global_queries_data_access to true in the regions where the data is stored. Queries can run in one project and access data from regions in another project; no cache is used, so data does not need to be transferred between regions ahead of time.
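As a sketch of what this configuration might look like, assuming BigQuery's existing ALTER PROJECT SET OPTIONS DDL with region-qualified settings applies to these keys (the option names come from the announcement; the exact statement form and region qualifiers are assumptions and may differ in the preview):

```sql
-- Hypothetical sketch: allow global queries to run in the US region
-- (ALTER PROJECT syntax and region-qualified form are assumptions).
ALTER PROJECT `my-project`
SET OPTIONS (`region-us.enable_global_queries_execution` = true);

-- Allow data stored in the EU region to be accessed by global queries.
ALTER PROJECT `my-project`
SET OPTIONS (`region-eu.enable_global_queries_data_access` = true);
```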
The cost of a global query includes the compute cost of each subquery in remote regions based on local pricing, the cost of the final query in the region where it runs, the cost of copying data between regions under data replication pricing, and the cost of storing copied data in the primary region for eight hours.
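The four components above can be summed as in the following sketch, using made-up prices purely for illustration; actual rates depend on the regions involved and the pricing model in use:

```python
# Illustrative cost model for a global query, with hypothetical prices.
# The four components mirror the breakdown above: per-region subquery
# compute, final-query compute, cross-region replication, and eight
# hours of temporary storage in the primary region.
def global_query_cost(remote_compute, final_compute, gb_copied,
                      replication_price_per_gb, storage_price_per_gb_hour):
    """Sum the four billed components of a global query.

    remote_compute: list of compute costs, one per remote subquery,
                    each billed at that region's local pricing
    final_compute:  compute cost of the final query in the run location
    gb_copied:      data copied between regions (data replication pricing)
    """
    transfer = gb_copied * replication_price_per_gb
    temp_storage = gb_copied * storage_price_per_gb_hour * 8  # eight hours
    return sum(remote_compute) + final_compute + transfer + temp_storage

# Example with hypothetical numbers:
total = global_query_cost(
    remote_compute=[0.10, 0.08],   # e.g. EU and Asia subqueries
    final_compute=0.05,            # final query in the US
    gb_copied=2.0,
    replication_price_per_gb=0.02,
    storage_price_per_gb_hour=0.001,
)
```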
The new feature is currently in preview.