Apache Spark's integration with the deep learning library TensorFlow, online learning with Structured Streaming, and GPU hardware acceleration were the highlights of Spark Summit EU 2016, held last week in Brussels.
The first day featured a walkthrough of the innovations introduced by Spark 2.0. The API was simplified to a single interface for DataFrames and Datasets, making it easier to develop big data applications. The second generation of the Tungsten engine takes processing closer to the hardware by applying ideas from MPP databases to data processing queries: the generated bytecode keeps intermediate data in CPU registers, and in-memory data is stored in a space-efficient column-oriented format.
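A minimal PySpark sketch of the unified entry point introduced in 2.0 (in Scala and Java a DataFrame is now simply an untyped Dataset of rows; the file name and column names below are hypothetical):

```python
from pyspark.sql import SparkSession

# SparkSession is the single entry point introduced in Spark 2.0,
# replacing the separate SQLContext and HiveContext.
spark = SparkSession.builder.appName("spark-2.0-demo").getOrCreate()

# Hypothetical JSON file with 'userId' and 'action' columns.
events = spark.read.json("events.json")

# The same declarative query API is used for batch and, as shown further
# below, streaming data; Tungsten compiles the query into code that keeps
# intermediate values in CPU registers and stores data column by column.
clicks = events.filter(events.action == "click").groupBy("userId").count()
clicks.show()
```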
Regardless of the API used, the graph of data operations is optimised by the Catalyst optimizer, which generates the plan for executing the computation across the cluster and optimized bytecode for each operation.
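The plans Catalyst produces can be inspected with explain(); continuing the hypothetical query above:

```python
# Prints the parsed and analysed logical plans, the Catalyst-optimised
# logical plan, and the physical plan chosen for execution on the cluster.
clicks.explain(True)
```

The same planning step applies whether the query is written against DataFrames, Datasets or SQL.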
Structured Streaming, a new high-level API for streaming released as an alpha, was also covered at the conference. The API is integrated into Spark's Dataset and DataFrame APIs and lets developers express reads from and writes to external systems in much the same way as with Spark's batch APIs. It provides strong consistency by compiling the streaming computation as a batch computation over the data received so far, and it allows transactional integration with storage systems (such as HDFS and AWS S3).
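A minimal sketch of the API, reading lines from a socket source and writing running word counts to the console (the host and port are placeholders, and the socket source is intended only for testing):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

# Read a stream of text lines as an unbounded DataFrame.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The streaming DataFrame is queried with the same operations as a batch one.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously write the updated counts; the engine tracks the state needed
# to keep the output consistent with the input seen so far.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```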
On the second day, Databricks CEO Ali Ghodsi pictured Spark as a tool to democratize AI by simplifying data preparation for machine learning algorithms and the management of the computation infrastructure. Earlier this year, the deep learning library TensorFlow was integrated to run on Spark through a library called TensorFrames, which allows data to be passed between Spark DataFrames and the TensorFlow runtime.
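A small sketch along the lines of the TensorFrames documentation, assuming the library has been added to the cluster as a Spark package (the column name and computation are arbitrary, and the API may have changed since):

```python
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("tensorframes-demo").getOrCreate()

# A one-column DataFrame of doubles to feed into a TensorFlow graph.
df = spark.createDataFrame([Row(x=float(i)) for i in range(10)])

with tf.Graph().as_default():
    # Placeholder bound to the DataFrame column 'x'.
    x = tfs.block(df, "x")
    # TensorFlow computation: add 3 to every element.
    z = tf.add(x, 3, name="z")
    # Run the graph over the DataFrame and append the result as column 'z'.
    df2 = tfs.map_blocks(z, df)

df2.show()
```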
The data science track had a session on how Structured Streaming enables resilience for machine learning and opens the door to online learning: it will be possible to update some machine learning models as the data arrives, rather than training the model in an offline batch job.
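The Structured Streaming version of this is not yet available, but Spark's existing DStream-based MLlib API already illustrates the online-learning pattern of refining a model with each micro-batch; a rough sketch, with placeholder paths, record format and feature dimension:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

sc = SparkContext(appName="online-learning-demo")
ssc = StreamingContext(sc, batchDuration=5)

def parse(line):
    # Hypothetical record format: "label,feat1 feat2 feat3"
    label, features = line.split(",")
    return LabeledPoint(float(label),
                        Vectors.dense([float(v) for v in features.split()]))

# New labelled records landing in this directory update the model each batch.
training = ssc.textFileStream("hdfs:///streams/train").map(parse)

model = StreamingLinearRegressionWithSGD()
model.setInitialWeights(Vectors.zeros(3))  # assuming 3 features

# The weights are refined incrementally as each micro-batch arrives,
# instead of being fit once in an offline batch job.
model.trainOn(training)

ssc.start()
ssc.awaitTermination()
```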
The last highlight was the announcement of GPU support on the Databricks platform and the integration of more deep learning libraries. GPU support is provided via hardware libraries such as CUDA, and having them pre-installed on Databricks is said to lower the cluster setup cost.