Developing, deploying, and keeping machine learning models running in production is a complex and iterative process with many challenges. MLOps combines the development of ML models, and especially of ML systems, with the operation of those systems. To make MLOps work, we need to balance the iterative and exploratory components of data science with the more linear components of software engineering.
Hauke Brammer, a senior software engineer at finpair GmbH, spoke about continuous delivery of machine learning systems and MLOps at DevOpsCon Berlin 2021.
According to Brammer, MLOps looks similar to DevOps, but for machine learning:
The goals of MLOps are quite simple: in the steps before production, we want to generate the best and most reproducible results with our machine learning models with as little work as possible. And when we get to production, we want to know how our model is doing and if it is doing its job well.
One might say, "Let’s just take our DevOps toolbox and make it available to our data scientists". But in fact, it’s not quite that simple, as Brammer explained:
We don’t just have code, but more importantly lots of data and lots of configuration parameters. That’s why we need specialized tools for an MLOps pipeline, which are primarily determined by the size of the team. If I have only two data scientists, I need less complex tools than if I have a team of 50 people.
According to Brammer, the MLOps pipeline can be divided into three sections:
First, data pipelines, where we pull our raw data from its sources and clean it up. Of course, we don’t always want to do this by hand, but in a reproducible way. So we will use a dataflow tool like Apache Airflow or Prefect here. We also need a tool for versioning our data; DVC is your best friend for this.
The next step is to build the best possible model with the data. For that we need infrastructure for distributed training. But most importantly, we need a way to manage experiments, because every training of a model is an experiment. To manage the results I can use a tool like MLflow or Sacred. This saves me double work compared to managing my experiments in Excel files or even in my head; results I keep in my head are not easily accessible to my colleagues.
When we have successfully trained a model, it makes a lot of sense to capture and version this finished model. For the step into production I have to decide if I use a ready-made model server or if I build one myself. Here the possibilities are really unlimited. And then I need tools for monitoring in production.
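The first stage Brammer describes, a reproducible data pipeline combined with data versioning, could look roughly like the sketch below. It uses Prefect for orchestration and pandas for the clean-up step; the file paths, the clean-up logic, and the DVC commands in the comments are illustrative assumptions rather than anything prescribed in the talk.

```python
import pandas as pd
from prefect import flow, task


@task
def extract(path: str) -> pd.DataFrame:
    # Pull raw data from its source; here simply a CSV file (path is an assumption).
    return pd.read_csv(path)


@task
def clean(df: pd.DataFrame) -> pd.DataFrame:
    # A reproducible clean-up step instead of ad-hoc manual edits.
    return df.dropna().drop_duplicates()


@flow
def data_pipeline(raw_path: str = "data/raw.csv", out_path: str = "data/clean.csv") -> None:
    df = clean(extract(raw_path))
    df.to_csv(out_path, index=False)
    # One way to version the cleaned dataset with DVC afterwards:
    #   dvc add data/clean.csv
    #   git add data/clean.csv.dvc && git commit -m "New dataset version"


if __name__ == "__main__":
    data_pipeline()
```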
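For the experiment-management and model-versioning stages, a minimal MLflow sketch might look like the following; the experiment name, model type, and metric are assumptions chosen for illustration, not part of Brammer’s talk.

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Every training run is an experiment, so record its parameters, metrics, and artifacts.
mlflow.set_experiment("user-churn")  # assumed experiment name

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 100, "max_depth": 5}

with mlflow.start_run():
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

    # Capture and version the trained model itself; with a registry-backed
    # tracking server you could additionally pass registered_model_name="user-churn".
    mlflow.sklearn.log_model(model, "model")
```

Results logged this way can be browsed through the MLflow UI or API, which is what makes them accessible to colleagues instead of living in Excel files or in someone’s head.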
Brammer mentioned that for successful MLOps, we need a good cross-functional team of software engineers and data scientists. In the end, it’s still the team that has the biggest impact on the success or failure of a project, he said.
Although there are no proven and standardized processes for MLOps, we can adopt many experiences from "traditional" software engineering, Brammer said. There is still a lot of room for innovation and for new and interesting tools. But someone has to use these tools. And that’s a very important point that you can’t neglect in either DevOps or MLOps, Brammer concluded.
InfoQ interviewed Hauke Brammer about applying continuous delivery for data science and monitoring the performance of ML systems.
InfoQ: What challenges do we face when we want to do Continuous Integration, Continuous Delivery, and Continuous Deployment in a data science environment?
Hauke Brammer: The overall problem in a data science environment is that we have to deal with all the problems and challenges of "traditional" software. On top of that, we get the additional problems that come with machine learning.
For example, data science is incredibly driven by experimentation, while software engineering is much more linear and predictable. But then we also have the problems that models face in production: when we train an ML model on data, that data is always an incomplete snapshot of the world in the past. It is not easy to recognize when it is actually time to deploy a new model.
You also have challenges that are more organizational: if I divide my ML project team into data scientists on one side and software engineers on the other, I won’t get a meaningful DevOps flow.
InfoQ: What can be done to monitor the performance of ML systems?
Brammer: When monitoring ML systems, we actually just use the classic DevOps tools: Prometheus, Grafana and the Elastic Stack. The difference is in what we monitor.
We also have to keep an eye on the classic metrics. Latency, for example: how long does our model take to predict a new value? Or resource consumption: do I need to provision a third or fourth instance of my model server to process all the requests?
But then I also have some things that I need to monitor specifically for ML models.
I definitely want to capture the predictions of my model. For one thing, the single value is critical. Is the prediction still in a reasonable range? Does my model predict "-15" users for my app tomorrow? Then something is very wrong.
The distribution of the predictions is also very important. Does the distribution match the values I expect? If I have a big deviation between predicted and observed classes, then I have a prediction bias. That would then be a clear sign that I need to re-train my model.
Also, I have to answer the question: what input data does my model get? What if I’m processing particularly sensitive business data or personal data? Of course, you want to capture as little sensitive data as possible. On the other hand, we also need to collect data to improve and debug the models. That’s a very, very exciting field of challenges and I think a lot will happen there in the next few years.
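As a rough illustration of these points, the sketch below exposes both a classic latency metric and the ML-specific signals Brammer mentions (prediction distribution and implausible values) via the Prometheus Python client. The metric names, bucket boundaries, and the dummy prediction function are assumptions made for the sake of a runnable example, not part of the talk.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTION_LATENCY = Histogram(
    "model_prediction_latency_seconds",
    "Time spent computing a single prediction",
)
PREDICTED_USERS = Histogram(
    "model_predicted_users",
    "Distribution of predicted values, for comparing against observed values",
    buckets=(0, 10, 50, 100, 500, 1000, float("inf")),
)
IMPLAUSIBLE_PREDICTIONS = Counter(
    "model_implausible_predictions_total",
    "Predictions outside the plausible range, e.g. negative user counts",
)


@PREDICTION_LATENCY.time()
def predict(features: dict) -> float:
    # Stand-in for a real model call; a model server would run inference here.
    value = random.gauss(200, 80)
    PREDICTED_USERS.observe(value)
    if value < 0:  # "-15 users tomorrow" is clearly wrong
        IMPLAUSIBLE_PREDICTIONS.inc()
    return value


if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes /metrics from this port
    while True:
        predict({"day": "monday"})
        time.sleep(1)
```

Dashboards and alerts in Grafana built on top of such series can then surface prediction bias, for instance when the predicted distribution drifts away from the observed one, which is a sign that the model needs to be retrained.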