Evolution of Azure Synapse: Apache Spark 3.0, GPU Acceleration, Delta Lake, Dataverse Support

Key Takeaways

  • Azure Synapse is a big data analytics service, supporting Apache Spark, Dedicated SQL, and Serverless SQL processing engines
  • Azure Synapse now supports Apache Spark 3.0 runtime and enables the latest Spark features in Synapse
  • Graphics Processing Unit (GPU) acceleration is available in Azure Synapse, which can lower costs and increase efficiency for highly parallel workloads
  • Azure Synapse has enhanced its Delta Lake querying capabilities by adding Serverless SQL support for Delta Lake
  • Azure Synapse Link for Dataverse removes the barrier to large-scale analytics on data from Microsoft business services

At Microsoft Build 2021, Azure Synapse announced significant improvements to its Apache Spark pools, performance, and data querying and integration capabilities.

Azure Synapse is a large-scale analytics service, cohesively combining the convenience of data integration, data warehousing, and big data analytics capabilities. Azure Synapse enables developers, data engineers, and data scientists to work with large volumes of data on multiple levels and helps ingest, explore, prepare, manage, and serve data.

The service offers several data processing engines: Apache Spark pool, Serverless SQL pool, and Dedicated SQL pool.

Azure Synapse separates storage and compute in its architecture to improve the flexibility and scalability of cloud workloads. The service also supports no-copy data sharing from Azure Cosmos DB via Azure Synapse Link, using its cloud-native hybrid transactional and analytical processing (HTAP) capability.

Users can perform machine learning tasks on Azure Synapse with Azure Machine Learning, Cognitive Services, Apache Spark MLlib, or Microsoft Machine Learning for Apache Spark (mmlspark).

For handling data governance and for the purposes of data cataloging, Azure Synapse works with Azure Purview. It’s also well-integrated with Azure Data Factory for end-to-end workflows, data movement, and creating automatable data pipelines.

Picture Credit: What is Azure Synapse Analytics?

Apache Spark 3.0 runtime in Azure Synapse

Azure Synapse now offers Apache Spark 3.0 runtime in public preview. It is based on the open-source version of Apache Spark and includes optimizations added by Microsoft. Apache Spark 3.0 support enables Adaptive Query Execution, Dynamic Partition Pruning, ANSI SQL compliance option, Pandas User Defined Functions (UDFs) APIs and types, accelerator-aware scheduling, and the most recent version of Delta Lake:

  • With Adaptive Query Execution enabled, Spark SQL uses runtime statistics to choose the most efficient query execution plan. It optimizes joins (handling skewed joins and converting sort-merge joins to broadcast joins) and helps with tuning the number of shuffle partitions.
  • Thanks to Dynamic Partition Pruning, join queries between large fact tables and smaller dimension tables can be optimized to reduce the number of rows scanned and improve performance.
  • The option for better compliance with the ANSI SQL standard in Apache Spark 3.0 improves the developer experience for data engineers familiar with standard SQL variants, and ensures Spark SQL follows the standard for arithmetic functions, type conversion, SQL functions, and SQL parsing.
  • Python type hints supported in Apache Spark 3.0 simplify expressing Pandas UDFs and Pandas Function APIs.
  • With accelerator-aware scheduling support in Apache Spark 3.0, GPU becomes a resource that can be scheduled, allowing users to request resources at multiple levels (e.g. Executor, Driver, Task).
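Several of the features above are controlled by Spark SQL settings. The following is a minimal sketch, not a Synapse-specific recipe: the configuration keys are standard open-source Spark 3.0 names, while the `apply_settings` helper and the chosen values are assumptions made for illustration. Defaults in the Azure Synapse runtime may differ from open-source Spark.

```python
# Spark 3.0 SQL settings for the features listed above.
# The keys are standard open-source Spark names; the values are illustrative.
SPARK3_SQL_SETTINGS = {
    # Adaptive Query Execution (off by default in open-source Spark 3.0)
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Dynamic Partition Pruning (on by default in open-source Spark 3.0)
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
    # ANSI SQL compliance option (off by default in open-source Spark 3.0)
    "spark.sql.ansi.enabled": "true",
}


def apply_settings(spark, settings=SPARK3_SQL_SETTINGS):
    """Apply each setting through spark.conf.set and return what was applied.

    `spark` is expected to be a SparkSession (hypothetical helper, not a Spark API).
    """
    for key, value in settings.items():
        spark.conf.set(key, value)
    return dict(settings)
```

In a notebook where a `spark` session already exists, calling `apply_settings(spark)` would set these options for the current session only.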

The enhancements added by Microsoft help the Apache Spark runtime in Azure Synapse provide faster processing speeds than standard open-source Spark. For example, Microsoft earlier open-sourced Hyperspace for Apache Spark, an indexing subsystem that brings index-based query acceleration to Apache Spark and big data workloads. Other performance improvements included in the Azure Synapse Spark runtime are query and cluster optimization, autoscaling, intelligent caching, and indexing.

Another way to improve the performance of Apache Spark pools in Azure Synapse is to run them on GPUs.

GPU Acceleration for Apache Spark in Azure Synapse

GPU acceleration in Azure Synapse is now available in a request-based private preview.

While a CPU consists of a few cores optimized for sequential processing, a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed to handle multiple tasks simultaneously. GPUs suit a wide range of scenarios, such as high-performance data pre-processing, querying, and model training. They provide distinct benefits for parallel data operations and machine learning computation in areas such as recommendation systems, computer vision, and natural language processing.

After Azure recently announced support for NVIDIA’s T4 Tensor Core GPUs, NVIDIA and Azure Synapse collaborated to bring GPU acceleration to data scientists and data engineers. The goal is to democratize machine learning and AI for data scientists on Azure using unified data analytics in Azure Synapse.

Picture Credit: NVIDIA GPU Acceleration for Apache Spark™ in Azure Synapse Analytics

While Apache Spark already supports GPUs, Azure Synapse now simplifies the developer experience by managing and configuring the required hardware and low-level libraries:

  • Azure Synapse takes care of pre-installing hardware libraries like NVIDIA CUDA and setting up all the complex networking for the compute nodes to offer GPU Apache Spark pools.
  • Azure Synapse has collaborated with NVIDIA to establish optimized configurations for GPU-enabled Apache Spark pools to save operational costs.
  • Azure Synapse provides built-in open-source data preparation and machine learning libraries, such as RAPIDS and Hummingbird, resulting in much faster data processing and model training. Without code changes and with built-in support for NVIDIA’s RAPIDS acceleration, the Azure Synapse version of GPU-enabled Apache Spark offers 3.3x better performance and ~1.5x lower cost for standard analytical benchmarks compared to running on CPUs. With the built-in Hummingbird support leveraging the GPUs in Azure Synapse, traditional machine learning workloads are accelerated by up to three orders of magnitude.
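For context on the kind of configuration that a managed GPU pool takes off the user's hands, the sketch below shows the standard open-source Spark 3.0 resource-scheduling keys that a self-managed cluster would set, and estimates how many tasks can share one executor. The values and the `concurrent_tasks_per_executor` helper are assumptions for illustration, and the calculation is a simplified model for an executor with a single GPU; it does not describe the configuration Azure Synapse actually applies.

```python
import math

# Standard Spark 3.0 resource-scheduling keys; the values are illustrative assumptions.
GPU_SETTINGS = {
    "spark.executor.resource.gpu.amount": "1",  # GPUs assigned to each executor
    "spark.task.resource.gpu.amount": "0.25",   # share of a GPU requested by each task
    "spark.executor.cores": "8",                # CPU cores per executor
    "spark.task.cpus": "1",                     # CPU cores requested by each task
}


def concurrent_tasks_per_executor(settings):
    """Estimate tasks that can run at once on one executor.

    Simplified model: the result is the tighter of the CPU limit and the GPU limit.
    """
    cpu_slots = int(settings["spark.executor.cores"]) // int(settings["spark.task.cpus"])
    gpu_slots = math.floor(
        float(settings["spark.executor.resource.gpu.amount"])
        / float(settings["spark.task.resource.gpu.amount"])
    )
    return min(cpu_slots, gpu_slots)


print(concurrent_tasks_per_executor(GPU_SETTINGS))  # prints 4
```

With these example values the GPU is the limiting resource: eight CPU slots are available, but only four tasks fit on the GPU at a quarter share each.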

Querying Delta Lake from Azure Synapse

Azure Synapse has enhanced its support for querying data in Delta Lake format. These enhancements are now available in public preview.

Delta Lake is an open-source storage layer that adds capabilities like ACID transactions, scalable metadata handling, time travel, schema enforcement and evolution, unified batch and streaming experience, and more.

Azure Synapse users could already work with a Delta Lake through Apache Spark pools, as Delta Lake is supported by Apache Spark. With this update, Azure Synapse adds the ability for its Serverless SQL pools to query a Delta Lake using T-SQL.

Picture Credit: Query Delta Lake files using T-SQL language in Azure Synapse Analytics

Using the serverless query endpoint in Azure Synapse, users can create a relational layer on top of Delta Lake files that directly references the location where Azure Synapse and Azure Databricks are used to modify data. This allows for real-time analytics on top of the Delta Lake data set without any need to wait for a pipeline to copy and prepare data.
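As a rough sketch of what such a query could look like, the helper below builds a T-SQL statement that reads a Delta Lake folder through `OPENROWSET` with the Delta format. The `delta_openrowset_query` function, the storage account, the container, and the folder name are all hypothetical; only the general `OPENROWSET(BULK ..., FORMAT = ...)` shape is taken from serverless SQL pool syntax.

```python
def delta_openrowset_query(delta_folder_url, top=10):
    """Build a T-SQL query that reads a Delta Lake folder through OPENROWSET.

    Hypothetical helper for illustration; it only assembles the query text.
    """
    if "'" in delta_folder_url:
        # Refuse input that would break out of the T-SQL string literal.
        raise ValueError("URL must not contain single quotes")
    return (
        f"SELECT TOP {int(top)} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{delta_folder_url}',\n"
        "    FORMAT = 'DELTA'\n"
        ") AS [result];"
    )


# Hypothetical storage account, container, and folder.
print(delta_openrowset_query(
    "https://myaccount.dfs.core.windows.net/mycontainer/sales_delta/"
))
```

The generated statement could then be run against the serverless SQL endpoint from any T-SQL client, or wrapped in a view to expose the Delta Lake folder as a relational object.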

The benefits of using Azure Synapse Serverless pools to query the data in a Delta Lake are:

  • Convenience of sharing data between Azure Synapse and other tools, such as Azure Databricks, without the need to move or transfer the data, or incur additional costs.
  • Ability to use available integrations provided by the Azure Synapse reporting and analytics ecosystem (e.g. Power BI). The serverless endpoint in Azure Synapse represents a bridge between your data stored in Delta Lake format and a reporting and analytics layer where you could use Power BI or Azure Analysis Services. This enables a variety of tools that work on T-SQL endpoints to access Delta Lake data.
  • Freedom and flexibility to organize the work according to the preferred expertise of big data personas in an organization and in alignment with different types of data engineering tasks. For example, data scientists could use T-SQL via Azure Synapse Serverless SQL pools or Azure Databricks notebooks for exploratory data analysis, data engineers could use Azure Synapse Apache Spark pools for pre-processing after ingestion, and data analysts could use Power BI for reporting and visualization.
  • Availability of pay-for-what-you-use model when working with Azure Synapse Serverless SQL pools, eliminating the need to plan the workloads and pre-provision resources.

Azure Synapse Link for Dataverse

Azure Synapse Link for Dataverse is now available in public preview.

Dataverse is a universal data store for all Microsoft business services, such as Dynamics 365, Power Apps, Power Automate, and more. Previously, it could take a lot of effort to use big data analytics tools to work directly with Dataverse data.

Azure Synapse Link for Dataverse lowers the barrier to large-scale analytics for data within Dataverse. A customer links their data lake and connects it to the Dataverse environment. With the new link feature, the data is replicated and synchronized with the data lake associated with Azure Synapse, and the metadata is pushed to the Azure Synapse metastore, supporting eventual consistency. Azure Synapse Link for Dataverse supports initial and incremental writes for copy, update, and delete operations on data and metadata, continuous export of data and metadata without any manual intervention, and continuous snapshot updates for large analytics scenarios.
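The idea of an initial write followed by incremental writes can be pictured with a toy model. The sketch below is purely illustrative and does not reflect how Azure Synapse Link for Dataverse is implemented: the `apply_changes` function, the change-record shape, and the operation names are all hypothetical.

```python
def apply_changes(snapshot, changes):
    """Return a new snapshot after applying create/update/delete records in order.

    Toy model only: `snapshot` maps a row id to a dict of column values.
    """
    result = dict(snapshot)
    for change in changes:
        op, row_id = change["op"], change["id"]
        if op in ("create", "update"):
            # Merge new column values over whatever the row already holds.
            result[row_id] = {**result.get(row_id, {}), **change["data"]}
        elif op == "delete":
            result.pop(row_id, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


# Initial write: every source row is copied into an empty lake snapshot.
lake = apply_changes({}, [
    {"op": "create", "id": "acct-1", "data": {"name": "Contoso", "city": "Redmond"}},
    {"op": "create", "id": "acct-2", "data": {"name": "Fabrikam", "city": "Austin"}},
])

# Incremental write: only the rows that changed since the last sync are sent.
lake = apply_changes(lake, [
    {"op": "update", "id": "acct-1", "data": {"city": "Seattle"}},
    {"op": "delete", "id": "acct-2"},
])

print(lake)  # {'acct-1': {'name': 'Contoso', 'city': 'Seattle'}}
```

Because only changed rows travel after the initial copy, the lake snapshot converges on the source over time, which is the sense in which such a replica is eventually consistent.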

The new feature also allows visualizing Dataverse data in Power BI through DirectQuery or Import mode, using the Azure Synapse Analytics connector in Get Data in Power BI and specifying the Serverless SQL endpoint as the server. This makes it straightforward to visualize Dataverse data from an Azure Synapse workspace.

Conclusion

Azure Synapse Analytics is a service providing a unified experience for large-scale data processing, analytics, machine learning, and data visualization tasks. The recent updates introduced at Microsoft Build 2021 improve its Apache Spark and Delta Lake support, performance, integration with business systems, and support for accelerated hardware.

About the Author

Lena Hall is a Director of Engineering at Microsoft working on Azure, where she focuses on large-scale distributed systems and modern architectures. She is leading a team and technical strategy for product improvement efforts across Big Data services at Microsoft. Lena is the driver behind engineering initiatives and strategies to advance, facilitate and push forward further acceleration of cloud services. Lena has more than 10 years of experience in solution architecture and software engineering with a focus on distributed cloud programming, real-time system design, highly scalable and performant systems, big data analysis, data science, functional programming, and machine learning. Previously, she was a Senior Software Engineer at Microsoft Research. She co-organizes a conference called ML4ALL, and is often an invited member of program committees for conferences like Kafka Summit, Lambda World, and others. Lena holds a master’s degree in computer science. Twitter: @lenadroid. LinkedIn: Lena Hall.
