The Apache Spark team has integrated the Pandas API into the product's latest 3.2 release. With this change, dataframe processing can be scaled out across the nodes of a Spark cluster, or across multiple processors on a single machine, using the PySpark execution engine.
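As a brief illustration of the idea, the sketch below runs the same dataframe logic first with plain Pandas and then with the new pyspark.pandas module; the column names and values are made up, and the example assumes a Spark 3.2+ environment with PySpark installed.

```python
import pandas as pd
import pyspark.pandas as ps

# Plain Pandas: runs on a single machine, entirely in memory.
pdf = pd.DataFrame({"city": ["Ankara", "Izmir", "Ankara"], "sales": [10, 20, 30]})
print(pdf.groupby("city")["sales"].sum())

# Pandas API on Spark: the same instructions, executed by the PySpark engine,
# so the computation can be distributed across the nodes of a Spark cluster.
psdf = ps.DataFrame({"city": ["Ankara", "Izmir", "Ankara"], "sales": [10, 20, 30]})
print(psdf.groupby("city")["sales"].sum())
```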
Accelerating the transformation and analysis of multidimensional (e.g. NumPy arrays) and tabular (e.g. Pandas dataframes) data in Python is a fast-growing field with many ongoing projects. Two main strategies have emerged to address the scaling problem: leveraging the parallelization capabilities of GPUs (e.g. CuPy for arrays, RAPIDS cuDF for dataframes) and distributing the work across multiple processors (e.g. Spark, Dask, Ray).
Processing large amounts of data in a distributed fashion on CPU nodes offers an economical option for analytics compared to GPUs, which are limited by available memory and carry a relatively high price. As one of the most popular engines for cluster computing, Apache Spark has been aiming to become more Pythonic and capture more of the Python data-science ecosystem with its Project Zen. The development of a Pandas-compatible API was an important part of this initiative. Similar efforts to expose a Pandas API can also be seen in alternative distributed computing libraries such as the Ray/Dask-based Modin and the Apache Arrow-based Vaex.
The original Pandas library was not designed with scaling in mind. As its creator has stated, several considerations, such as memory-mapped data storage options, were left out of its design. Yet Pandas is still the second-most-loved library within the Python numerical computation ecosystem in the Stack Overflow Developer Survey (2021). This suggests that, as its adoption grows, Pandas may lead the standardization of the tabular data processing API.
Fig-1: Overview of the PySpark components with the addition of the Pandas API.
Development of the API had been ongoing for several years in the separate Koalas project. Koalas was designed as an API bridge on top of PySpark dataframes, utilizing the same execution engine by converting Pandas instructions into a Spark SQL plan (Fig-1). By merging Koalas into the main PySpark codebase, the project now targets simpler porting of existing Pandas code to Spark clusters and quick transitions between the PySpark and Pandas APIs, as illustrated below.
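A minimal sketch of such a transition, assuming Spark 3.2+ and hypothetical column names: Pandas-style instructions on a pandas-on-Spark dataframe are executed by the Spark engine, and to_spark() / pandas_api() move between the two APIs.

```python
import pyspark.pandas as ps

# Start with a pandas-on-Spark dataframe; the Pandas-style operations
# below are translated into a Spark SQL plan and run on the Spark engine.
psdf = ps.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})
psdf["value_doubled"] = psdf["value"] * 2

# Switch to the native PySpark DataFrame API when needed ...
sdf = psdf.to_spark()
sdf = sdf.filter(sdf.value_doubled > 1.0)

# ... and back to the Pandas API on Spark.
psdf2 = sdf.pandas_api()
print(psdf2.head())
```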
For the next releases, the Spark team aims to raise the current 83% coverage of the Pandas API to 90%, increase type-annotation coverage in the codebase, and further improve the performance and stability of the API.