Agile Development Applied to Machine Learning Projects

Key Takeaways

  • Agile methods can be used for Machine Learning projects.
  • Many standard software development practices are still evolving to support AI development.
  • Reproducibility is a critical component of software systems, and it faces new challenges in AI systems.
  • ML systems have dependencies not only in code but potentially in data as well; track them carefully.
  • ML system development is still a work in progress: new needs are surfacing, but best practices and tools are still needed.

Machine Learning (ML) has become a formidable technology that has opened new solutions and opportunities across many industries. These systems make predictions in areas as varied as insurance, finance, medicine, personal assistants, and self-driving cars. Building these systems has relied on many established Software Engineering practices, but many teams find the need to grow these practices to support these new applications. In this article, we'll look at a few fundamental practices for Agile software development, and the challenges that Machine Learning applications present.

What is Machine Learning?

Machine Learning uses an algorithm and data to create a model. The algorithm is code written in Python, R, or your language of choice, and it describes how the computer will learn from the training data. In supervised learning, the training data consists of inputs paired with labels, and the system is trained to predict a label from an input. The predictive power of a production model depends on how closely the distribution of the production data matches the distribution of the training data. As these distributions move apart, or drift, the effectiveness of the model decays and the predictions become less accurate. For example, if you train a tree identification system on data from the summertime, when the leaves are full and green, the system will become less accurate as the color of the leaves changes or as the trees lose their leaves. To keep your system accurate, you will need to update your model as the seasons change to keep the data distributions in sync.
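One way to notice drift is to compare a feature's distribution in the training data against the same feature in recent production data. The sketch below uses the Population Stability Index (PSI), a common heuristic for this; it is an illustration rather than a method prescribed here, and the function name, the bin count, and any alerting threshold are assumptions you would tune for your own data.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a training sample (expected) and a production sample (actual)."""
    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 if every training value is identical.
    width = (hi - lo) / bins or 1.0

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp values outside the training range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

In the tree example, `expected` might hold a "greenness" score for each summer training image and `actual` the same score for this week's production images. A PSI near zero suggests the distributions still match; a rising value is a signal to look at the data and consider retraining.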

Comparing Lifecycles for Software and ML

The modern software development lifecycle, as encapsulated in the agile methodology, has an objective determined by the Product Owner. The development team examines the requirements and turns them into agile features to be designed, built, and tested. For this article, an ‘agile feature’ is a set of code that delivers functionality. An agile feature provides a solution based on logic that the developer has constructed. To create a feature, the team must have a mental model of the problem and the solution that they can encode for the computer.

A typical agile workflow using Git starts by establishing a new branch. Code is developed on the working branch and then inspected before merging with the release branch. An agile feature can be iterated on multiple times before it is finally released. A branch includes code to solve the problem, as well as unit and integration tests to verify that the criteria for the feature have been reached and maintained.

An ML component follows a slightly different path to realization. Product Owners still determine a business objective. The development team will determine if this is a place where an ML solution makes sense. Like any decision on adopting a new technology, there are costs and benefits to be weighed. Is the benefit of an ML system worth the complexity of a component that requires the collection and curation of data along with the design and training of a model? Is it something that could be accomplished with less risk with a traditional solution made from code? If the solution can be described in a flowchart or with a set of simple heuristics that you could explain on a PowerPoint slide, you’re likely going to be better off writing code. Companies like Google and Duolingo have found that using simple algorithms or heuristics as a starting point can provide a partial solution while starting to collect data that can then be used to train models with better solutions. In our tree identification example, we might start with a rule that trees with needles are evergreen trees and trees with wide leaves are deciduous trees. Over time, as we gather more pictures of trees, we can have an expert label the data we’ve collected and train a model to identify trees by species.
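The heuristic-first approach from the tree example could be sketched as follows. The class name, the `leaf_shape` values, and the `review_queue` field are all hypothetical; the point is only that the rule answers today while every request is kept for an expert to label later.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class HeuristicTreeClassifier:
    """Rule-based starting point that also collects samples for later labeling."""

    # (image_id, leaf_shape, heuristic_label) tuples awaiting expert review.
    review_queue: List[Tuple[str, str, str]] = field(default_factory=list)

    def predict(self, image_id: str, leaf_shape: str) -> str:
        # The rule from the text: needles mean evergreen, otherwise deciduous.
        label = "evergreen" if leaf_shape == "needle" else "deciduous"
        self.review_queue.append((image_id, leaf_shape, label))
        return label
```

Once enough reviewed samples exist, the same `predict` interface can be backed by a trained model without changing its callers.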

If the development team decides to pursue an ML solution, the objective is turned over to a Data Scientist to determine if the objective is achievable with the existing data. If the data isn't available, then it will need to be collected. Without examples or data, there are no ML systems. Collection of data has a variety of challenges that are better addressed in a separate document, but a few issues include: data cleanliness, bias, and robustness. Building automation and infrastructure to capture, maintain, and audit data are critical for success with ML applications. These tools are also code that needs to be maintained.

Once the data is available, an algorithm is selected. The learning process fits the algorithm to the data, starting from a random set of parameters and performing iterations or epochs until the algorithm has determined a set of parameters that make acceptable predictions. While a few systems like AlphaGo have achieved better than human performance at a task, the performance of most AI systems is capped at roughly what an expert human can achieve.
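To make the learning process concrete, here is a minimal sketch that fits a straight line with gradient descent. It starts from random parameters and repeats one update per epoch. The function name, learning rate, and epoch count are illustrative assumptions; a real project would normally rely on a library such as PyTorch or scikit-learn.

```python
import random
from typing import Sequence, Tuple


def train_linear(
    xs: Sequence[float],
    ys: Sequence[float],
    epochs: int = 1000,
    lr: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Fit y ~ w * x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    # Start from a random set of parameters.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Calling `train_linear([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])` should recover parameters close to `w = 2` and `b = 1`, since the data lies exactly on that line.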

The trained model is often then embedded into a traditional software system for deployment in the real world. From an integration perspective, a model is opaque, in that it takes inputs and returns outputs but the internal state of the model is not exposed to the operator. We can only detect errors or problems based on the outputs and not on logs or exceptions thrown inside the model.
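Because the model is opaque, the surrounding software can only inspect what comes out of it. A minimal sketch of that idea, assuming a classifier that returns a dictionary of label probabilities (the `Model` type and `checked_predict` name are hypothetical):

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed interface: features in, one probability per label out.
Model = Callable[[Sequence[float]], Dict[str, float]]


def checked_predict(
    model: Model, features: Sequence[float]
) -> Tuple[Dict[str, float], List[str]]:
    """Call an opaque model and report problems visible in its output."""
    scores = model(features)
    problems: List[str] = []
    if not scores:
        problems.append("empty output")
    for label, p in scores.items():
        if math.isnan(p) or not 0.0 <= p <= 1.0:
            problems.append(f"invalid score for {label!r}: {p}")
    if scores and abs(sum(scores.values()) - 1.0) > 1e-6:
        problems.append("scores do not sum to 1")
    return scores, problems
```

The caller decides what to do with a non-empty problem list, for example logging it or falling back to a default answer.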

This section describes ML development as a straightforward linear process, but in reality it is the result of many iterations of exploring the data, testing different algorithms and architectures, and training and testing several models!

Reproducibility

One of the bedrocks of modern software development is our ability to manage source files in tools like Git, Mercurial, or Perforce. Source control gives us the ability to trace the introduction of errors and fixes and is the start of a repeatable software development process. Teams can rely on source control to support strong collaboration between engineers. The ability to trace changes allows downstream team members to understand what's in the current release, or to trace the source of a new issue or error. Canary builds, deploying branches, and hot fixes are all made possible by the judicious use of branches and source control.

Reproducibility of an ML model is a requirement for a well-engineered system. ML Engineers should be able to recreate the results of the Data Scientist and build pipelines to move the model to production. Just storing the raw data does not provide the whole story of what went into the model. The intermediate transformations and results are also important. Provenance, or the history of a data item, is necessary for audit purposes, as well as for understanding the behavior of a model.
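One lightweight way to capture provenance is to fingerprint the data after every transformation, so the raw input and each intermediate result can be identified later. This is a sketch under the assumption that the rows are JSON-serializable; the function names are hypothetical, and dedicated data versioning tools do much more than this.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Sequence, Tuple

Rows = List[Dict[str, Any]]
Step = Tuple[str, Callable[[Rows], Rows]]


def fingerprint(rows: Rows) -> str:
    """Stable SHA-256 digest of JSON-serializable rows."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_with_provenance(
    raw_rows: Rows, steps: Sequence[Step]
) -> Tuple[Rows, List[Dict[str, str]]]:
    """Apply named steps in order, recording a digest after each one."""
    record = [{"step": "raw", "sha256": fingerprint(raw_rows)}]
    rows = raw_rows
    for name, transform in steps:
        rows = transform(rows)
        record.append({"step": name, "sha256": fingerprint(rows)})
    return rows, record
```

Storing the returned record next to the trained model gives an auditor a trail from the raw data to what the algorithm actually consumed.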

Data can be in any form, from structured data in a tightly defined schema to unstructured data like images, video, or audio. Often this data cannot be used in a raw form; it must be transformed before the algorithm can consume it. The transformed data is often referred to as the feature, a vastly different use of the term than we see in traditional software development. In ML, features are the properties of the data that are used to make predictions.

Data pipelines are the tools used to transform the data. These transformations happen during training and also during inference. Training is when the model is fit to data so that it can make predictions. Inference is when the trained model makes a prediction. During training, it is important to track what data was used to build the model. Tracking the data is useful for understanding the distributions of the original data features. In regulated environments, this may be critical for audit requirements. Even in non-regulated environments, the provenance of the model is useful. During inference, if the source data format changes or if the transformations are not equivalent to those used in training, the model may see the sample in a different way and make poor predictions based on the incorrect format.
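A simple way to keep training and inference transformations equivalent is to fit the transformation once on the training data and reuse that same fitted object at inference time. The `Scaler` class below is a hypothetical, minimal illustration using standardization; it is not a substitute for a full pipeline tool.

```python
import statistics
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Scaler:
    """Standardization parameters learned from the training data only."""

    mean: float
    std: float

    @classmethod
    def fit(cls, training_values: Sequence[float]) -> "Scaler":
        # Guard against a zero standard deviation for constant features.
        std = statistics.pstdev(training_values) or 1.0
        return cls(mean=statistics.fmean(training_values), std=std)

    def transform(self, value: float) -> float:
        # The identical formula is applied in training and in inference.
        return (value - self.mean) / self.std
```

The fitted `Scaler` would be saved alongside the model, so the serving code never recomputes the mean and standard deviation from production data.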

Incorrect formats can range from as broad as providing a model with a black and white image when it expects full color, to as subtle as returning a null in a data flow where only integers were expected. As more companies start to deploy multiple ML pipelines that share common features, feature stores are seeing wider adoption. Feature stores use the term ‘store’ less like ‘storage’ and more like ‘market’. A feature store offers a common place to keep transformations and feature extractions that can be used across multiple ML applications.
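The subtle case, a null where an integer was expected, can be caught with a check that runs before a record reaches the model. The schema format and function name below are assumptions made for illustration.

```python
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a list of format problems; an empty list means the record is usable."""
    errors: List[str] = []
    for name, expected_type in schema.items():
        value = record.get(name)
        if value is None:
            errors.append(f"{name}: missing or null")
        elif type(value) is not expected_type:
            # Strict type match, so True is not silently accepted as an int.
            errors.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors
```

For example, with the schema `{"height_cm": int, "species_code": int}`, a record whose `species_code` is `None` is reported instead of being passed to the model.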

In modern software development, we need to share different artifacts between team members and between teams to communicate. The range of material that needs to be shared to be successful in ML goes beyond what traditional software development provides. For instance, in training, it is important to capture the hyperparameters used. Without these starting points, it will be difficult for new team members to make progress on improving the model over time. While objectives change all the time in business, machine learning systems are often more sensitive to changes in data, and the quality of prediction can quickly degrade in the face of new data. For this reason, training data may need to be refreshed often. Systems or controls need to be in place to make sure that production data is sampled and captured for retraining.
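Capturing hyperparameters can start as simply as writing them to a small, versionable record for every training run. The fields below are examples only, and the names are hypothetical; experiment tracking tools capture far more.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingRun:
    """The starting points a teammate needs to reproduce a training run."""

    run_id: str
    learning_rate: float
    epochs: int
    batch_size: int
    random_seed: int
    training_data_sha256: str


def to_json(run: TrainingRun) -> str:
    # Sorted keys keep the output stable, which makes diffs readable.
    return json.dumps(asdict(run), indent=2, sort_keys=True)


def from_json(text: str) -> TrainingRun:
    return TrainingRun(**json.loads(text))
```

The resulting JSON text can be committed next to the training code or attached to the model artifact.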

Unlike source code, this data can get large, both in the number of records and in the size of each one. For example, satellite or medical imagery can easily stretch into gigabytes per file. Online transactions can run from millions to billions of rows per day. Managing data at this scale presents new challenges.

There is a rich history of ETL tools that have been used for moving and transforming data. While these can be used in ML domains, new tools are also appearing. These tools span a wide range of abilities: DVC augments existing Git workflows; Pachyderm reinvents source control for data in a Kubernetes context; and Disdat augments Luigi (an existing data pipeline tool from Spotify) to version bundles of files as a data product.

New tools for tracking experiments and training are also emerging, both as software as a service and on-premises. Weights & Biases and ClearML are two examples of tools for tracking experiments over time.

Dependency Tracking

Tracking the libraries and dependencies of an application is still a challenging problem in modern software development. Managing the supply chain for an application involves looking at dependencies from local libraries, open source, or other third-party resources. While tooling continues to improve, vigilance is important.

Tracking the dependencies for a Machine Learning system is both simpler and more complex. There are several well-known libraries that are used for Machine Learning applications. Developers are by no means limited to tools like PyTorch, TensorFlow, or scikit-learn, but they do provide a solid base to choose from. On the other hand, the models themselves have strong dependencies on the examples and data used to train the model. In transfer learning applications, we use a pretrained model so that we can then train further against our specific objective. For example, let’s say we want to train a model to identify dinosaurs but we don’t have a lot of labelled images of dinosaurs. We can use a model trained on ImageNet to learn the features of animals or birds and then use transfer learning to fine-tune it on our limited sample data set for dinosaur detection. This saves us time and cost in training to get to a solution, but it also potentially introduces hidden dependencies in the data.
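One way to make these hidden dependencies visible is to record them in a manifest that covers the libraries, the pretrained base model, and the training data together. The sketch below uses only the Python standard library; the manifest layout and the `base_model` fields are assumptions, not a standard format.

```python
import hashlib
import platform
from importlib import metadata
from typing import Any, Dict, Optional, Sequence


def library_versions(names: Sequence[str]) -> Dict[str, Optional[str]]:
    """Installed version of each named package, or None if it is not installed."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(
    libraries: Sequence[str], base_model: Dict[str, str], training_data: bytes
) -> Dict[str, Any]:
    """Describe the code, model, and data dependencies of one trained model."""
    return {
        "python": platform.python_version(),
        "libraries": library_versions(libraries),
        # Hypothetical description of the pretrained model we fine-tuned from.
        "base_model": base_model,
        "training_data_sha256": hashlib.sha256(training_data).hexdigest(),
    }
```

In the dinosaur example, `base_model` might name the ImageNet-trained model and its checksum, so a later change to that upstream model shows up as a difference in the manifest.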

Continuous Integration/Continuous Deployment

In modern software applications, we can build and deploy new applications at a very rapid pace. Large, complex systems can be built in hours. We leverage this in practice by using CI/CD to pull in new changes, test them against a set of unit and integration tests, and then deploy them to production. Code is deterministic and our tests provide good guides to let us know if we've actually built quality code. New defects or errors are still possible (and probable!) but these systems give us confidence that our code is working as we expect under a variety of conditions.

Machine Learning components may take significantly longer to build. It may take many hours or days of iterations before training is completed. More importantly, ML models are not deterministic: random initialization and data shuffling mean that two training runs on the same data can produce different models. It takes a variety of training and testing runs to validate a model. As mentioned before, models are sensitive to their environments. Changing inputs may require new training.
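Because a single training run is not a reliable signal, a CI gate for a model can train with several random seeds and judge the spread of results instead of one number. In this sketch, `train_and_score` is a hypothetical callable that trains with a given seed and returns a validation score, and the thresholds are placeholders.

```python
import statistics
from typing import Callable, Dict, Sequence


def evaluate_across_seeds(
    train_and_score: Callable[[int], float], seeds: Sequence[int]
) -> Dict[str, float]:
    """Train once per seed and summarize the resulting validation scores."""
    scores = [train_and_score(seed) for seed in seeds]
    return {
        "mean": statistics.fmean(scores),
        "min": min(scores),
        "stdev": statistics.pstdev(scores),
    }


def passes_gate(
    summary: Dict[str, float], min_mean: float = 0.9, min_worst: float = 0.85
) -> bool:
    # Require a good average and no unacceptably bad run.
    return summary["mean"] >= min_mean and summary["min"] >= min_worst
```

Recording the seeds with each result also makes any individual run easier to reproduce.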

Building health checks for online applications is a standard practice in traditional applications. The best practices for building health checks and monitoring for ML components or ML systems are still being realized. We are still figuring out the methods and requirements for trusting new applications at scale.
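Since practices here are still emerging, the following is only one possible shape for an ML health check: replay a small set of "golden" examples with known answers and report how often the deployed model still agrees. The function name, the example format, and the agreement threshold are all assumptions.

```python
from typing import Any, Callable, Dict, Sequence, Tuple


def health_check(
    model: Callable[[Any], Any],
    golden_examples: Sequence[Tuple[Any, Any]],
    min_agreement: float = 0.9,
) -> Dict[str, Any]:
    """Compare model outputs against known-good answers for fixed inputs."""
    if not golden_examples:
        raise ValueError("at least one golden example is required")
    hits = sum(1 for features, expected in golden_examples if model(features) == expected)
    agreement = hits / len(golden_examples)
    return {"healthy": agreement >= min_agreement, "agreement": agreement}
```

A check like this only confirms that the model behaves as before on known inputs; it does not detect drift in live data, so it complements rather than replaces monitoring of production inputs and outputs.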

Summary

Machine Learning applications often leverage modern software engineering practices, tools, and techniques for developing and deploying new applications. The current state of the art of ML component development is still working through gaps in tooling and techniques that will need to be addressed as we look at building and scaling new applications. Finding the right tools to help data science and engineering teams collaborate better will decrease the time to deploy new applications while increasing their quality. Some of these tools will be extensions of existing tools and workflows, but new tools will also emerge as new patterns of development are implemented.

Agile was intended to help product teams deal with changing circumstances and build tools in a robust, repeatable, and predictable process. Agile-like techniques can work for Machine Learning by supporting better communication and a shared understanding of objectives and concerns.

About the Author

Jay Palat is a Software Engineer at the Carnegie Mellon University Software Engineering Institute Emerging Technology Center. He leads the engineering efforts for customers and research projects. Prior to the ETC, Jay ran his own consulting company, accelerating engineering teams with a mix of management, best practices and product engineering. He was the Director of Engineering and Sr. Director of Data Engineering at Rhiza (now Nielsen). He has successfully led teams to deliver products and services in retail, healthcare, finance and analytics at companies ranging from IBM to Series A startups.

 

 
