Amazon Athena now supports the open-source distributed processing system Apache Spark to run fast analytics workloads. Data analysts and engineers can use Jupyter Notebook in Athena to perform data processing and programmatically interact with Spark applications.
For interactive Spark workloads that require low-latency queries, customers can query data from various sources and visualize the results of their analyses, with Athena starting applications in under a second.
Source: https://aws.amazon.com/blogs/aws/new-amazon-athena-for-apache-spark/
Adding to the existing SQL capabilities of Amazon Athena, Apache Spark on Athena provides on-demand scaling to meet changing data volumes and processing requirements. Donnie Prakoso, principal developer advocate at AWS, explains the main benefit of the new serverless option based on Spark 3.2:
Building the infrastructure to run Apache Spark for interactive applications is not easy. Customers need to provision, configure, and maintain the infrastructure on top of the applications. Not to mention performing optimal tuning resources to avoid slow application starts.
Apache Spark is an open-source distributed processing system designed for fast analytics workloads; it is used across industries for complex data analysis and often to explore data lakes and derive insights. With the new Athena feature, data engineers can build Apache Spark applications using notebooks from the AWS console or programmatically using the Athena APIs. Prakoso adds:
Amazon Athena integrates with AWS Glue Data Catalog, which helps customers to work with any data source in AWS Glue Data Catalog, including data in Amazon S3. This opens possibilities for customers in building applications to analyze and visualize data to explore data to prepare data sets for machine learning pipelines.
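As an illustration of the programmatic option mentioned above, the following is a minimal sketch of submitting a Spark calculation through the Athena session APIs with boto3. It assumes a Spark-enabled Athena workgroup, hypothetically named spark-demo, already exists, and the database and table names are invented for the example.

```python
import time

import boto3

# Assumes a Spark-enabled Athena workgroup, hypothetically named "spark-demo",
# already exists; database and table names below are illustrative.
athena = boto3.client("athena", region_name="us-east-1")

# Start an Apache Spark session in the workgroup.
session = athena.start_session(
    WorkGroup="spark-demo",
    EngineConfiguration={"MaxConcurrentDpus": 4},
)
session_id = session["SessionId"]

# Wait for the session to leave the provisioning states before submitting code.
while True:
    state = athena.get_session_status(SessionId=session_id)["Status"]["State"]
    if state in ("IDLE", "FAILED", "TERMINATED", "DEGRADED"):
        break
    time.sleep(5)

# Submit a PySpark code block; inside Athena for Apache Spark sessions a
# SparkSession named `spark` is pre-initialized.
code_block = """
df = spark.sql("SELECT * FROM demo_db.demo_table LIMIT 10")
df.show()
"""
calc = athena.start_calculation_execution(SessionId=session_id, CodeBlock=code_block)
calc_id = calc["CalculationExecutionId"]

# Poll until the calculation reaches a terminal state, then print where the
# standard output was written in Amazon S3.
while True:
    result = athena.get_calculation_execution(CalculationExecutionId=calc_id)
    state = result["Status"]["State"]
    if state in ("COMPLETED", "FAILED", "CANCELED"):
        break
    time.sleep(5)
print(state, result.get("Result", {}).get("StdOutS3Uri"))

# Terminate the session when finished to stop consuming DPUs.
athena.terminate_session(SessionId=session_id)
```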
Apache Spark workloads are already supported on AWS using Jupyter notebooks with AWS Glue or Amazon EMR Serverless, leaving some developers dubious about the benefits of the new option. Views created by Athena SQL are not supported by Athena for Spark, so cross-engine queries are not possible. AWS dedicated a podcast episode to providing more details about the new feature.
In a demo showing how to explore and derive insights from a data lake using Athena for Apache Spark, Pathik Shah, senior big data architect at AWS, and Raj Devnath, product manager at AWS, write:
You can now use the expressive power of Python and build interactive Apache Spark applications using a simplified notebook experience on the Athena console or through Athena APIs. (...) For performing interactive data explorations on a data lake, you can now use the instant-on, interactive, and fully managed Apache Spark engine.
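To give a sense of what such an interactive exploration could look like, here is a rough sketch of a notebook cell reading a table registered in the AWS Glue Data Catalog; the database, table, and column names are invented for illustration, and the snippet relies on the pre-initialized spark session that Athena notebooks provide.

```python
# Inside an Athena for Apache Spark notebook a SparkSession named `spark` is
# already initialized; database, table, and column names here are illustrative.

# Read a table registered in the AWS Glue Data Catalog (backed by data in Amazon S3).
trips = spark.sql("SELECT * FROM demo_db.yellow_trips")

# Quick interactive exploration: schema, row count, and a simple aggregation.
trips.printSchema()
print("rows:", trips.count())

daily = trips.groupBy("pickup_date").count().orderBy("pickup_date")
daily.show(10)

# A small aggregate can be pulled into pandas for plotting in the notebook,
# assuming matplotlib is available in the notebook environment.
daily.limit(100).toPandas().plot(x="pickup_date", y="count")
```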
Apache Spark code executions are charged at 0.35 USD per data processing unit (DPU) per hour, billed per second, with Athena notebooks provided at no additional cost. A single DPU provides 4 vCPUs and 16 GB of memory.
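As a back-of-the-envelope illustration of the pricing model, the session size and duration in the following sketch are hypothetical:

```python
# Rough cost estimate for the published pricing: 0.35 USD per DPU-hour,
# billed per second. Session size and duration below are hypothetical.
PRICE_PER_DPU_HOUR = 0.35

def estimated_cost(dpus: int, seconds: float) -> float:
    """Approximate charge for a Spark session using `dpus` DPUs for `seconds` seconds."""
    return dpus * (seconds / 3600.0) * PRICE_PER_DPU_HOUR

# Example: a 4-DPU interactive session (16 vCPUs, 64 GB of memory) running for
# 15 minutes would cost roughly 4 * 0.25 * 0.35 = 0.35 USD.
print(f"{estimated_cost(dpus=4, seconds=15 * 60):.2f} USD")
```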
Athena for Apache Spark is available in a limited number of AWS regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland).