Microsoft announced the release of .NET for Apache Spark, adding new high-performance C# and F# language bindings to the big-data computation engine.
At the recent Spark + AI summit, Microsoft announced the preview release of .NET for Apache Spark. Apache Spark is written in Scala, and so has always had native support for that language. It has also long had API bindings for Java, as well as popular data-science languages Python and R. The new language bindings for C# and F# are written on a new Spark interop layer. In tests on the TPC-H benchmark, .NET performance was comparable to other languages, and in some cases was "2x faster than Python."
Developers can reuse existing .NET Standard compatible code and libraries, and "can access all aspects of Apache Spark including Spark SQL, DataFrames, Streaming, MLLib." Cloud developers can deploy .NET for Apache Spark in Microsoft's Azure using Azure HDInsight and Azure Databricks, or in Amazon Web Services using Amazon EMR Spark and AWS Databricks.
Running .NET apps on Spark requires installation of the Microsoft.Spark.Worker binaries as well as a JDK and the standard Apache Spark binaries. Developing an app requires installation of the Microsoft.Spark nuget package. The Microsoft devs have submitted Spark Project Improvement Proposals (SPIPs) to include the C# language extension and a generic interop layer in Spark itself. However, Sean Owen, an Apache Spark committer, commented that it would be "highly unlikely" that the work would be merged into Spark.
Commenters on Hacker News contrasted .NET for Apache Spark with pre-cursor project Mobius, which has provided C# support on Apache Spark since 2016, pointing out several improvements. First, Mobius was written using Mono, an older open-source version of the .NET framework, whereas .NET for Apache Spark uses .NET Core, which provides better cross-platform support and performance improvements. Further, Mobius only supports versions of Spark up to 2.0; the latest version of Spark is up to 2.4. Finally, .NET for Apache Spark has incorporated many lessons learned and user feedback from Mobius.
One commenter pointed out that while the new .NET library performs as well as or better than the other languages:
"[P]erformance is not the only thing - there is also ability to debug issues. For this you still need to dig into Apache core which is in Scala."
The .NET for Apache Spark roadmap lists several improvements to the project already underway, including support for Apache Spark 3.0, support for .NET Core 3.0 vectorization, and VS Code support. .NET for Apache Spark source code is available on Github.