Databricks recently made MLflow integration with Databricks notebooks generally available for its data engineering and higher subscription tiers. The integration combines the features of MLflow with those of Databricks notebooks and jobs. Databricks originally authored MLflow as an open-source project in June 2018, and MLflow has always been usable as a standalone command-line tool.
MLflow provides three main capabilities: experiment tracking, projects, and models. Each capability is available with or without the Databricks online service, though each behaves differently when integrated with Databricks than when used standalone.
MLflow experiment tracking requires a location to store MLflow runs. The MLflow command-line tool includes a built-in tracking server for storing runs, and MLflow can also store runs on the local file system; in either case, the user of the command-line tool is responsible for maintaining that storage. Databricks provides an experiment tracking server integrated with Databricks notebooks, removing the need for users to manage run storage themselves. Additionally, Databricks stores a version of the notebook each time a run is recorded for an experiment. Finally, Databricks provides a user interface for exploring MLflow experiments and runs, similar to the standalone UI accessible from the MLflow command-line tool.
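As a rough illustration, a minimal tracking sketch with the MLflow Python API might look like the following; the experiment path, parameter, and metric are hypothetical placeholders:

```python
import mlflow

# Point MLflow at the Databricks-hosted tracking server; outside
# Databricks, this could instead be a local path or the URI of a
# self-managed tracking server.
mlflow.set_tracking_uri("databricks")

# Experiment paths on Databricks map to workspace locations
# (this path is a placeholder).
mlflow.set_experiment("/Users/someone@example.com/demo-experiment")

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)   # hypothetical hyperparameter
    mlflow.log_metric("rmse", 0.78)  # hypothetical evaluation metric
```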
MLflow provides a structured, configuration-driven way of making repeatable runs of code, known as a project. MLflow turns a Git repository into a project through the inclusion of appropriate configuration files, and supports Conda, Docker, and system environments. Databricks adds the ability to run projects as jobs on Databricks clusters. Users must first create an experiment in their Databricks account; they can then run a project from the MLflow command line targeting a Databricks job and experiment.
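The same submission can be expressed through MLflow's Python API. The sketch below assumes a hypothetical project repository, experiment path, and a `cluster-spec.json` file describing the target Databricks cluster:

```python
import mlflow

# Submit a Git-hosted project as a Databricks job. The repository
# URI, parameter, experiment name, and cluster-spec file are all
# placeholders; the experiment must already exist in the Databricks
# account.
mlflow.projects.run(
    uri="https://github.com/example/example-project",
    parameters={"alpha": "0.5"},
    backend="databricks",
    backend_config="cluster-spec.json",
    experiment_name="/Users/someone@example.com/demo-experiment",
)
```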
MLflow stores models as artifacts of runs in experiments. Databricks experiments allow specifying external storage for experiments that produce large models. Models from specific runs can be recalled using the MLflow API from within a Databricks notebook or job. Users can then make predictions with the recalled model through an Apache Spark UDF, or deploy the model to external services such as AWS SageMaker and Microsoft Azure ML.
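A minimal sketch of recalling a logged model and scoring a DataFrame with it as a Spark UDF might look like the following; the run ID, feature columns, and input path are hypothetical:

```python
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recall the model logged under a specific run; <run-id> is a
# placeholder for an actual MLflow run ID.
model_uri = "runs:/<run-id>/model"
predict = mlflow.pyfunc.spark_udf(spark, model_uri)

# Apply the model to a Spark DataFrame; the path and column names
# are placeholders.
df = spark.read.parquet("/path/to/features")
scored = df.withColumn("prediction", predict("feature1", "feature2"))
```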
Matei Zaharia, chief technologist at Databricks, announced two new features coming in MLflow 1.0: multistep workflows and a model registry. The announcement included a demo showing a user interface for visualizing multistep workflows and then registering a resulting model. After a model is registered, it can be deployed and tracked via a new user interface.