Databricks recently announced that it is open sourcing Delta Lake, its proprietary storage layer that brings ACID transactions to Apache Spark and big data workloads. Databricks is the company founded by the creators of Apache Spark, and Delta Lake is already in use at companies such as McGraw Hill, McAfee, Upwork and Booz Allen Hamilton.
Delta Lake addresses the heterogeneous data problem that data lakes often have. Ingesting data from multiple pipelines means that engineers must enforce data integrity manually across all data sources. Delta Lake brings ACID transactions to the data lake, with the strongest isolation level, serializability.
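As an illustration, the following is a minimal PySpark sketch of a transactional write to a Delta table; the table path and column name are hypothetical, and it assumes the delta-core package is available to the Spark session.

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake package is on the Spark classpath,
# e.g. started with: pyspark --packages io.delta:delta-core_2.11:0.1.0
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Writing with format("delta") goes through Delta Lake's transaction log,
# so the write either commits fully or not at all; concurrent readers
# never observe partial results. The path below is illustrative.
events = spark.range(0, 1000).withColumnRenamed("id", "event_id")
events.write.format("delta").mode("overwrite").save("/data/delta/events")
```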
Delta Lake provides time travel: every version of the data is retained and can be queried later, a feature that is useful for GDPR compliance and other audit-related requests. Metadata is stored through exactly the same process as the data itself, so it benefits from the same level of processing and feature richness.
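A short sketch of how time travel can be used from Spark: the versionAsOf and timestampAsOf reader options select an earlier snapshot of the table (the path and date here are placeholders).

```python
# Read the table as it existed at a specific commit version...
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/data/delta/events"))

# ...or as it existed at a point in time (placeholder timestamp).
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2019-04-01")
            .load("/data/delta/events"))
```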
Delta Lake also provides schema enforcement capabilities. Data types and the presence of fields can be checked and enforced, keeping the data clean. Schema changes, on the other hand, don't require DDL and can be applied automatically.
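For example, appending a DataFrame whose schema does not match the table is rejected by default; the mergeSchema option, sketched below with a hypothetical extra column, lets compatible new columns be added without a DDL statement.

```python
from pyspark.sql import functions as F

# A new column ("source") that is not in the existing table schema.
enriched = (spark.range(0, 10)
            .withColumnRenamed("id", "event_id")
            .withColumn("source", F.lit("backfill")))

# Without mergeSchema this append would fail schema enforcement;
# with it, the new column is added to the table schema automatically.
(enriched.write.format("delta")
 .mode("append")
 .option("mergeSchema", "true")
 .save("/data/delta/events"))
```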
Delta Lake is deployed on top of an existing data lake, is compatible with both batch and streaming data, and can be plugged into an existing Spark job as a new data source. Data is stored in the familiar Apache Parquet format.
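The same table can act as a streaming source and sink through Structured Streaming, while the underlying files remain plain Parquet plus a transaction log; the paths below are illustrative.

```python
# Continuously read new commits from one Delta table and write them to another.
query = (spark.readStream.format("delta")
         .load("/data/delta/events")
         .writeStream.format("delta")
         .option("checkpointLocation", "/data/delta/_checkpoints/events_copy")
         .start("/data/delta/events_copy"))
```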
Delta Lake is also compatible with MLflow, the open source machine learning platform that Databricks launched last year. The code is available on GitHub.