The Apache Flink project has released version 0.8.0. Besides the usual performance, compatibility, and stability improvements, the release adds a Scala API for streaming, a capability the Scala API had so far been missing. Flink also recently graduated to an Apache top-level project, roughly nine months after joining the Apache Incubator.
Apache Flink is an open source project with goals similar to those of Apache Spark. It runs on top of the Hadoop stack and aims to make it easier to write scalable data processing systems by providing more powerful data operations than the map and reduce operations of the original Hadoop system.
Kostas Tzoumas, a Flink committer and co-founder of data Artisans, a Berlin-based startup built around Apache Flink, recently discussed the Flink framework and outlined the roadmap for 2015.
One key point of Flink, which also sets it apart from the current version of Apache Spark, is that it takes an approach similar to query optimization in SQL databases. It can apply global optimizations to a query to obtain better performance. For example, Flink is able to reorder operations, or to select among different implementations of a given operator based on properties of the data sets involved.
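To make the idea of choosing an operator implementation from data set properties more concrete, here is a minimal sketch in Python. It is not Flink code and does not reflect Flink's actual optimizer; the names `DataSetStats`, `choose_join_strategy`, and the `BROADCAST_LIMIT` threshold are hypothetical and exist only for illustration. The sketch assumes the only property considered is an estimated row count.

```python
from dataclasses import dataclass

# Hypothetical threshold: inputs at or below this many rows are treated as
# small enough to ship to every worker.
BROADCAST_LIMIT = 1_000


@dataclass
class DataSetStats:
    """Illustrative stand-in for the statistics an optimizer might keep."""
    name: str
    estimated_rows: int


def choose_join_strategy(left: DataSetStats, right: DataSetStats) -> str:
    """Pick a join implementation from the estimated input sizes."""
    smaller = min(left, right, key=lambda s: s.estimated_rows)
    if smaller.estimated_rows <= BROADCAST_LIMIT:
        # Ship the small side to all workers and build a hash table from it.
        return f"broadcast-hash-join(build={smaller.name})"
    # Both sides are large: repartition both by key and merge sorted runs.
    return "repartition-sort-merge-join"


if __name__ == "__main__":
    users = DataSetStats("users", 500)
    clicks = DataSetStats("clicks", 50_000_000)
    print(choose_join_strategy(users, clicks))
    print(choose_join_strategy(clicks, clicks))
```

The point of the sketch is that the user writes the same join in both cases, and the system decides how to execute it. A real optimizer would weigh many more properties, such as existing partitioning or sort order.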
As discussed in the presentation, this allows Flink to execute a sequence of operations in a pipelined fashion, where Spark would execute the different steps one after another.
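The difference between pipelined and step-by-step execution can be illustrated with a small Python sketch. This is a toy model built on lists and generators, not a description of how Flink or Spark are implemented; the function names and the `log` parameter are assumptions introduced for illustration. The log records the order in which the two steps touch each record.

```python
from typing import Iterable, Iterator, List


def staged(records: List[int], log: List[str]) -> List[int]:
    """Run step 1 over all records, materialize, then run step 2."""
    doubled = []
    for r in records:
        log.append(f"double {r}")
        doubled.append(r * 2)
    kept = []
    for d in doubled:
        log.append(f"filter {d}")
        if d > 2:
            kept.append(d)
    return kept


def double(records: Iterable[int], log: List[str]) -> Iterator[int]:
    for r in records:
        log.append(f"double {r}")
        yield r * 2


def keep_large(records: Iterable[int], log: List[str]) -> Iterator[int]:
    for d in records:
        log.append(f"filter {d}")
        if d > 2:
            yield d


def pipelined(records: List[int], log: List[str]) -> List[int]:
    """Each record flows through both steps before the next one is read."""
    return list(keep_large(double(records, log), log))


if __name__ == "__main__":
    staged_log: List[str] = []
    pipelined_log: List[str] = []
    print(staged([1, 2], staged_log), staged_log)
    print(pipelined([1, 2], pipelined_log), pipelined_log)
```

Both variants return the same result, but the pipelined one never holds the full intermediate list in memory, because the second step consumes each record as soon as the first step produces it.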
Flink also provides higher-level operators such as iterations, which give the optimizer more room for global optimization. Instead of issuing queries from a for loop, users can formulate the iteration as part of the query itself.
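The contrast between a loop in the driver program and an iteration that is part of the query can be sketched as follows. This is a toy model in Python, not the Flink API; `ToyEngine`, `run`, `run_iteration`, and `halve_distance_to_ten` are hypothetical names. The sketch assumes that each call into the engine corresponds to one submitted job, and it only counts submissions rather than performing any real optimization.

```python
from typing import Callable, List

Step = Callable[[List[float]], List[float]]


class ToyEngine:
    """Hypothetical engine that counts how many separate jobs it receives."""

    def __init__(self) -> None:
        self.jobs_submitted = 0

    def run(self, step: Step, data: List[float]) -> List[float]:
        # One job: apply a single step to the data.
        self.jobs_submitted += 1
        return step(data)

    def run_iteration(self, step: Step, data: List[float], rounds: int) -> List[float]:
        # One job: the loop is part of the submitted plan, so the engine
        # sees every round up front instead of one isolated step at a time.
        self.jobs_submitted += 1
        for _ in range(rounds):
            data = step(data)
        return data


def halve_distance_to_ten(values: List[float]) -> List[float]:
    """Illustrative step function: move each value halfway towards 10."""
    return [v + (10 - v) / 2 for v in values]


if __name__ == "__main__":
    start = [0.0, 20.0]

    # Variant 1: the driver program loops and issues one query per round.
    loop_engine = ToyEngine()
    data = start
    for _ in range(5):
        data = loop_engine.run(halve_distance_to_ten, data)

    # Variant 2: the iteration is expressed inside a single query.
    plan_engine = ToyEngine()
    result = plan_engine.run_iteration(halve_distance_to_ten, start, rounds=5)

    print(loop_engine.jobs_submitted, plan_engine.jobs_submitted, data == result)
```

In the second variant the engine knows the whole loop in advance, which is what would let a real system plan across rounds, for example by keeping loop-invariant data in place between them.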
Roadmap features for 2015 include improvements to memory management and fault tolerance, support for interactive use, unified batch and stream processing, and integration with the machine learning library Mahout, among other things.
Flink originally started as part of the ongoing research project Stratosphere. It is also the main platform for the Berlin Big Data Center, a research initiative funded by the German government with the goal of bringing together researchers in machine learning and scalable data processing.