Two years after the first release of Apache Spark, Databricks announced the technical preview of Apache Spark 2.0, based on the upstream 2.0.0-preview branch. The preview is not production ready, in terms of either stability or API, but is intended to gather feedback from the community ahead of the release's general availability.
This new release focuses on feature improvements driven by community feedback, and there are two main areas of improvement in how Spark applications are developed.
One of the most used interfaces for Apache Spark-based applications is SQL. Spark 2.0 supports all 99 TPC-DS queries, which are largely based on the SQL:2003 specification. This alone can help port existing data workloads to a Spark backend with minimal rewriting of the application stack.
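As a rough illustration of what running plain SQL on Spark looks like, the sketch below registers a DataFrame as a temporary view and queries it with standard SQL; the table and column names are hypothetical, and it uses the unified SparkSession entry point described further below.

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  def main(args: Array[String]): Unit = {
    // Single entry point introduced in Spark 2.0 (replaces SQLContext/HiveContext)
    val spark = SparkSession.builder()
      .appName("Spark2SqlSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sales data standing in for an existing SQL workload
    val sales = Seq(("books", 12.0), ("music", 7.5), ("books", 3.25))
      .toDF("category", "amount")
    sales.createOrReplaceTempView("sales")

    // Standard SQL aggregation executed by Spark's SQL engine
    spark.sql(
      "SELECT category, SUM(amount) AS total FROM sales GROUP BY category"
    ).show()

    spark.stop()
  }
}
```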
The second aspect is the programming APIs. Machine learning receives a major emphasis in this new release. The new spark.ml package, based on DataFrames, will replace the existing spark.mllib once it reaches feature parity, which may happen within the next eight months, according to Databricks engineer Xiangrui Meng. spark.mllib remains available but is in maintenance mode, as explained on the mailing list. Machine learning pipelines and models can now be persisted across all languages supported by Spark. K-Means, Generalized Linear Models (GLM), Naive Bayes and Survival Regression are now supported in R.
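The following is a minimal sketch of the DataFrame-based spark.ml API, including the pipeline persistence mentioned above; the pipeline stages, the toy data and the save path are illustrative assumptions, not taken from the release notes.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object MlPersistenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MlPersistenceSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical labelled text data in a DataFrame
    val training = Seq(
      (0L, "spark streaming rocks", 1.0),
      (1L, "batch jobs are slow", 0.0)
    ).toDF("id", "text", "label")

    // A simple text-classification pipeline built from spark.ml stages
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    val model = pipeline.fit(training)

    // Persist the fitted model; it can later be reloaded from any supported language
    model.write.overwrite().save("/tmp/spark-ml-model")
    val reloaded = PipelineModel.load("/tmp/spark-ml-model")
    reloaded.transform(training).select("id", "prediction").show()

    spark.stop()
  }
}
```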
DataFrames and Datasets are now unified for the Scala and Java programming languages under the new Dataset class, which also serves as the abstraction for Structured Streaming. In languages that do not support compile-time type safety this does not apply, and DataFrames remain the primary abstraction. SQLContext and HiveContext are replaced by the unified SparkSession. Finally, the new Accumulator API has a simpler type hierarchy and supports specialization for primitive types. The old APIs are deprecated but remain for backwards compatibility.
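The snippet below sketches how these pieces fit together, assuming a simple case class for the typed records: a SparkSession is created as the single entry point, a typed Dataset is derived from a DataFrame, and the new accumulator API's built-in long accumulator is used from the driver.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used for the typed Dataset
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession replaces SQLContext and HiveContext as the single entry point
    val spark = SparkSession.builder()
      .appName("DatasetSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // In Scala a DataFrame is just Dataset[Row]; .as[T] adds compile-time types
    val people = Seq(Person("Ada", 36), Person("Linus", 28)).toDF()
    val typed = people.as[Person]

    // New accumulator API: a specialized long accumulator from the SparkContext
    val adults = spark.sparkContext.longAccumulator("adults")
    typed.foreach { p => if (p.age >= 18) adults.add(1) }

    println(s"adults = ${adults.value}")
    spark.stop()
  }
}
```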
The new Structured Streaming API aims to let programmers manage streaming data sets without added complexity, in the same way they and existing machine learning algorithms deal with batch-loaded data sets. Performance has also improved with the second-generation Tungsten engine, which allows for up to 10 times faster execution.
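A minimal sketch of that idea, assuming a local socket source for input: the same DataFrame/Dataset operators used on batch data are applied to an unbounded stream, and the engine keeps the aggregated result up to date as new data arrives.

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamingWordCountSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Unbounded input: lines arriving on a local socket (e.g. started with `nc -lk 9999`)
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same operators that would be used on a batch Dataset
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // Continuously update the result table and print it to the console
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```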
The technical preview release is available on Databricks.
UPDATE: The original article has been slightly modified regarding the state of the current spark.mllib package, which has been put in maintenance mode until the new spark.ml package reaches feature parity, thanks to Databricks engineer Xiangrui Meng.