Key Takeaways
- Real-world machine learning isn’t simply about training a model once. Getting training data is often a complicated problem, and even after the first deployment, continuous monitoring and retraining will be needed.
- In order to get training data, you often need a large group of human workers to label and annotate data for you. But this introduces a quality control problem, and you may need statistical monitoring to catch unreliable workers.
- Model selection and feature selection are important but are often constrained by the amount of data you have available. Even if a model or feature doesn’t work now, it may work later on as you get more data.
- Your users and your products will change, and the performance of your machine learning models will change with them. You'll need to regather training data, reevaluate the algorithms and features you chose, and retrain your models, so try to automate these steps as much as possible.
Machine learning has long powered many products we interact with daily–from "intelligent" assistants like Apple's Siri and Google Now, to recommendation engines like Amazon's that suggest new products to buy, to the ad ranking systems used by Google and Facebook. More recently, machine learning has entered the public consciousness because of advances in "deep learning"–these include AlphaGo's defeat of Go grandmaster Lee Sedol and impressive new products around image recognition and machine translation.
In this series, we'll give an introduction to some powerful but generally applicable techniques in machine learning. These include deep learning but also more traditional methods that are often all the modern business needs. After reading the articles in the series, you should have the knowledge necessary to embark on concrete machine learning experiments in a variety of areas on your own.
This InfoQ article is part of the series "An Introduction To Machine Learning".
The previous articles in this series focused on the algorithmic part of machine learning: training simple classifiers, pitfalls in classification, and the basics of neural nets. But the algorithmic part of machine learning is just one small part of the process of deploying a model to solve a real-world problem.
In this article, we'll talk about the end-to-end flow of developing machine learning models: where you get training data, how you pick the ML algorithm, what you must address after your model is deployed, and so forth.
End-to-end model deployment
There are many machine learning classification problems where using log data is standard, essentially giving you labels for free. For example, ad click prediction models are typically trained on which ads users click on, video recommendation systems make heavy use of which videos you’ve watched in the past, etc.
However, even these systems need to move beyond simple click data once they reach large enough scale and sophistication; for instance, because they’re heavily biased towards clicks, it can be difficult to tune the systems to show new ads and new videos to users, and so explore-exploit algorithms become necessary.
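To make the explore-exploit idea concrete, here is a minimal epsilon-greedy sketch in Python; the ads, the click-through-rate estimates, and the helper name are all hypothetical:

```python
import random

def epsilon_greedy_pick(candidates, predicted_ctr, epsilon=0.1):
    """Mostly show the highest-scoring item (exploit), but occasionally
    show a random one (explore) so new items get a chance to gather clicks."""
    if random.random() < epsilon:
        return random.choice(candidates)           # explore: surface new items
    return max(candidates, key=predicted_ctr.get)  # exploit: best known item

# A brand-new ad starts with no click history, but the 10% exploration
# slots let it accumulate the data needed to ever be ranked highly.
ads = ["ad_a", "ad_b", "new_ad"]
chosen = epsilon_greedy_pick(ads, {"ad_a": 0.04, "ad_b": 0.03, "new_ad": 0.0})
```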
What's more, many of these systems eventually incorporate explicitly human-generated labels as well. Netflix, for instance, employs over 40 people to hand-tag movies and TV shows in order to make better recommendations and generate labels like "Foreign movies featuring a strong female lead." YouTube hand-labels every ad so that its ad click predictions have better features. And Google trains its search algorithm in part on the scores that a large, internal team of dedicated raters assigns to query-webpage pairs.
As another example, suppose you're an e-commerce site like eBay or Etsy. You're starting to see a lot of spammy profiles selling Viagra, drugs, and other blacklisted products, so you want to fight the problem with machine learning. But how do you do this?
1. First, you're going to need people to label training data. You can't use logs: your users aren't flagging spam for you, and even if they were, their flags would surely be wildly biased (and spammers themselves would abuse the system).
But gathering training data is a generally difficult problem in and of itself. You'll need hundreds of thousands of labels, requiring thousands of hours of work, so where will you get these?
2. Next, you'll need to build and deploy an actual ML algorithm. Even with ML expertise, this is a difficult and time-consuming process: how do you choose an algorithm, how do you choose the features to input into the algorithm, and how do you do this in a repeatable manner, so that you can easily experiment with different models and parameters?
3. Even after you’ve deployed your first spam classifier, you can't stop there. As you get new sources of users, or spammers get more creative, the types of spam appearing on your website will quickly change.
So you'll need to continually rerun steps 1 and 2—which is a surprisingly difficult process to automate, especially while maintaining the accuracy levels you need.
4. And even with a working, fully fledged machine learning pipeline, you're not finished. You don't want to accidentally flag and remove legitimate users, so there will always be cases near the ML decision boundary that a human needs to review. But how do you build a scalable human labor pipeline that integrates seamlessly into your ML system and returns results in real time?
At Hybrid, our platform for machine learning and large-scale human labor, we realized that we were building this complicated pipeline over and over again for many of our problems, so we built a way to abstract all the complexity behind a single API call:
# Create your classifier
curl https://www.hybridml.com/api/classifiers \
  -u API_KEY: \
  -d categories="spam, not spam" \
  -d accuracy=0.99

# Start classifying
curl https://www.hybridml.com/api/classify \
  -u API_KEY: \
  -d classifier_id=ABCDEFG \
  -d text="Come buy the latest Viagra at 50% off."
Behind the scenes, the same call automatically and invisibly decides whether a machine learning classifier is reliable enough to classify the example on its own, or whether human intervention is needed. Models get built automatically, they’re continually retrained, and the caller never has to worry whether more data is needed.
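We won't reproduce Hybrid's internals here, but the core routing decision can be sketched in a few lines of Python. The 0.99 threshold and the two queue helpers are hypothetical stand-ins for illustration, not real API calls:

```python
def send_to_human_queue(text):
    """Hypothetical helper: post the example to a human-labeling queue
    and wait for (or poll for) the answer."""
    raise NotImplementedError

def record_training_example(text, label):
    """Hypothetical helper: store the human label for future retraining."""
    raise NotImplementedError

def classify_or_escalate(model, text, threshold=0.99):
    """Answer with the model when it's confident; otherwise escalate to a
    human, and feed the human's answer back in as new training data."""
    probs = model.predict_proba([text])[0]  # e.g., a scikit-learn text classifier
    if probs.max() >= threshold:
        return model.classes_[probs.argmax()], "machine"
    human_label = send_to_human_queue(text)
    record_training_example(text, human_label)
    return human_label, "human"
```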
In the rest of this article, we’ll go into more detail on the problems we described above—problems that are common to all efforts to deploy machine learning to solve real-world problems.
Getting labels for training
In order to train any spam classifier, you’ll first need a training set of “spam” and “not spam” labels. How do you form it?
One way could be to use your site’s visitors and logs. Just add a button that allows visitors to mark profiles as spam, and use this as a training set.
However, this can be problematic for several reasons:
- Most of your visitors will ignore the button, so your training set is likely to be very small.
- It's easily gamed: spammers can simply start marking legitimate profiles as spammy.
- It's likely to be biased in unknown ways (after all, plenty of people are fooled by spammy Nigerian emails).
Another way could be to label a bunch of profiles yourself. But this is almost certainly a waste of time and resources: spam probably constitutes less than 1-2% of all profiles, so you’d need hundreds of thousands of profile classifications (and thousands of hours of your own time) in order to form a reasonable training set.
What you need, then, is a large group of workers to comb through a large set of profiles, and mark them as “spam” or “not spam” according to a set of instructions. Common ways to find workers to perform these types of tasks include hiring off of Craigslist or using online crowdsourcing platforms like Amazon Mechanical Turk, Crowdflower, or Hybrid.
However, the work generated by Craigslist or Mechanical Turk workers is often low quality; we've often seen spam rates (workers clicking labels at random) as high as 80-90%. So you'll also need to monitor worker output for accuracy.
One common monitoring technique is to use gold standards: you label a number of profiles into spam or not spam yourself, and randomly send them to your workers in order to see if they agree with you.
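As a rough sketch of how gold-standard monitoring might work (the data structures here are our own invention, not any particular platform's API):

```python
def flag_unreliable_workers(worker_labels, gold, min_accuracy=0.9):
    """worker_labels: {worker_id: {item_id: label}}; gold: {item_id: true_label}.
    Flag any worker whose accuracy on the gold items falls below threshold."""
    flagged = []
    for worker, answers in worker_labels.items():
        gold_items = [item for item in answers if item in gold]
        if not gold_items:
            continue  # this worker hasn't seen any gold items yet
        correct = sum(answers[item] == gold[item] for item in gold_items)
        if correct / len(gold_items) < min_accuracy:
            flagged.append(worker)
    return flagged
```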
Another potential approach is to use statistical distribution tests to catch “outlier” workers. For example, imagine a simple image labeling task: if most workers label 80% of the images as “cat” and 20% as “not cat,” then a worker who labels only 45% of images as “cat” should probably be flagged.
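One simple way to implement such a check is a chi-squared goodness-of-fit test against the pooled label distribution. Here's a sketch using SciPy, with the counts from the example above:

```python
from scipy.stats import chisquare

def is_distribution_outlier(worker_counts, pool_fractions, alpha=0.01):
    """Does this worker's label distribution deviate significantly from
    the distribution produced by the worker pool as a whole?"""
    categories = sorted(worker_counts)
    observed = [worker_counts[c] for c in categories]
    total = sum(observed)
    expected = [pool_fractions[c] * total for c in categories]
    _, p_value = chisquare(observed, f_exp=expected)
    return p_value < alpha  # True => flag this worker for review

# The 45%-"cat" worker from the example gets flagged:
print(is_distribution_outlier({"cat": 45, "not cat": 55},
                              {"cat": 0.8, "not cat": 0.2}))
```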
One difficulty, though, is that workers deviate from each other in completely legitimate ways. For example, imagine that people tend to upload more cat images during the day, or that spammers tend to operate at night. Then daytime workers will produce higher rates of "cat" and "not spam" labels than workers on the night shift. To account for this kind of natural deviation, a more sophisticated approach is to apply non-parametric Bayesian techniques to cluster worker output, and then measure each worker's deviation from their cluster.
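As one illustration of this idea (a sketch, not the exact technique any production system uses), scikit-learn's BayesianGaussianMixture can fit a Dirichlet-process mixture to per-worker label-fraction vectors, letting the data decide how many legitimate work patterns exist; the worker profiles below are made up:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Each row is one worker's label fractions, e.g. [frac_cat, frac_spam].
worker_profiles = np.array([
    [0.80, 0.05], [0.78, 0.06], [0.82, 0.04],  # daytime workers
    [0.55, 0.20], [0.57, 0.22], [0.53, 0.19],  # nighttime workers
    [0.45, 0.50],                              # suspected random clicker
])

# A Dirichlet-process mixture infers the number of clusters from the data,
# so legitimate day/night patterns each get their own cluster.
dpgmm = BayesianGaussianMixture(
    n_components=5, weight_concentration_prior_type="dirichlet_process",
    random_state=0).fit(worker_profiles)

# Flag workers whose profiles are unlikely under every inferred cluster.
log_likelihood = dpgmm.score_samples(worker_profiles)
flagged = np.where(log_likelihood < np.percentile(log_likelihood, 10))[0]
print(flagged)
```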
Model selection
Once we have enough training labels, we can start building our machine learning models. Which algorithm should we use? In the text classification world, for example, three common algorithms are naive Bayes, logistic regression, and deep neural networks.
We won't go deeply into the question of how to choose a machine learning classifier, but one way to think about the differences between algorithms is in terms of the bias-variance tradeoff. Put simply: when you have large amounts of data, simpler models tend to perform worse than complex models (they aren't powerful enough to model the problem, so they have high bias), but when data is limited, simpler models often perform better (complex models can easily overfit and are sensitive to small changes in the data, so they exhibit high variance).
As a result, if you’re limited by the number of labels you have, it’s fine—often, actually better—to start with a simpler model or fewer features, and to add more sophistication as you get more data.
What this also means is that just because a more powerful algorithm was less accurate early on, it doesn’t mean that it shouldn’t be re-evaluated later.
This approach mirrors what we do at Hybrid. Our goal is to always have the most accurate machine learning possible, whether we have only 500 data points or 500,000. As a result, we automatically transition between different algorithms: we usually start with the simpler models that perform better with limited amounts of data and switch to more powerful models as more and more data comes in, depending on how different algorithms perform on an out-of-sample test set.
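A minimal version of this kind of automatic model selection might look like the sketch below, which trains two candidate text classifiers and keeps whichever scores better on a held-out split (the setup is ours for illustration, not Hybrid's actual pipeline):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def pick_best_model(texts, labels):
    """Fit each candidate on the same training split and keep whichever
    scores best out-of-sample. Rerun as more labels arrive: the winner
    often shifts from the simpler model to the more complex one."""
    X = CountVectorizer().fit_transform(texts)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=0)
    candidates = [MultinomialNB(), LogisticRegression(max_iter=1000)]
    scored = [(m.fit(X_train, y_train).score(X_test, y_test), m)
              for m in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```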
Feature selection
In terms of feature selection—choosing which features to use in the machine learning algorithm—there are two types of approaches.
The more manual approach is to think of feature selection as a pre-processing step: each feature is scored independently of the machine learning model, and only the top N features or the features passing some threshold are kept. For example, a common feature selection algorithm is to score whether the feature has a different distribution under each class (e.g., when considering whether the word “Viagra” should be kept as a feature in an email spam classifier, we can compare whether “Viagra” appears significantly more often in spam vs. non-spam emails), and to take the features where the distributions change as much as possible between classes.
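Here's a small sketch of this pre-processing style of feature selection using scikit-learn's chi-squared scorer on a toy spam corpus (the corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

emails = ["buy viagra now", "meeting at noon",
          "cheap viagra deals", "lunch tomorrow again"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Score each word by how differently it is distributed across the two
# classes, and keep only the top-scoring features.
selector = SelectKBest(chi2, k=3).fit(X, labels)
kept = [word for word, keep in zip(vectorizer.get_feature_names_out(),
                                   selector.get_support()) if keep]
print(kept)  # "viagra" should rank near the top
```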
Another, increasingly common approach is to let feature selection be performed automatically by the machine learning algorithm itself. For example, logistic regression models can take a regularization parameter that effectively controls whether coefficients in the model are biased towards zero. By experimenting with different values of this parameter and monitoring accuracy on a test set, the model automatically decides which features to zero out (i.e., throw away), and which features to keep.
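A quick sketch of this approach, sweeping the regularization strength of an L1-penalized logistic regression on synthetic data and counting how many coefficients survive:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 100 features, only 10 of which actually matter.
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Smaller C means stronger L1 regularization, which drives more
# coefficients exactly to zero, i.e., discards those features.
for C in (0.01, 0.1, 1.0):
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X_train, y_train)
    n_kept = int((model.coef_ != 0).sum())
    print(f"C={C}: kept {n_kept} features, "
          f"test accuracy = {model.score(X_test, y_test):.3f}")
```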
One more note about features: it’s often very useful to add crossed features. For example, suppose teenagers in general and Londoners in general tend to click on ads, but London teenagers do not. Then an ad click prediction model with a “user is teenager AND user lives in London” feature would likely perform better than a model that only contains separate teenager and Londoner features. This is one of the advantages of deep neural networks: because of their fully-connected, hierarchical structure, they can automatically discover feature crosses on their own, whereas models like logistic regression need feature crosses fed into them.
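A crossed feature is straightforward to construct by hand; here's a toy sketch for the teenager/Londoner example (the feature names and age cutoff are arbitrary):

```python
def make_features(user):
    """Build base features plus their cross for an ad click model."""
    is_teen = 1 if user["age"] < 20 else 0
    in_london = 1 if user["city"] == "London" else 0
    return {
        "is_teen": is_teen,
        "in_london": in_london,
        "is_teen_x_in_london": is_teen * in_london,  # the crossed feature
    }

print(make_features({"age": 16, "city": "London"}))
# {'is_teen': 1, 'in_london': 1, 'is_teen_x_in_london': 1}
```

With the cross included, a linear model can assign London teenagers their own (negative) weight instead of being forced to average the separate teenager and Londoner effects.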
Adapting to changes
The final question we’ll look at is how to take care of changes in the data distribution. For example, suppose you’ve built a spam classifier, but suddenly you experience a spurt of user growth in a new country, or a new spammer has decided to target your website. Because these are new sources of data, your existing classifier is unlikely to be able to accurately handle these cases.
One common practice is to gather new labels on a regular and frequent basis, but this can be inefficient: how do you know how many new labels you need to gather, and what if the data distribution hasn't actually changed?
As a result, another common practice is to gather new labels and retrain models only every few months or so. But this is problematic too, since quality may severely degrade in the meantime.
One solution is to randomly send a small fraction of examples off for human labeling (e.g., less than 1% of the time, once a model has reached high enough accuracy). This gives you an unbiased sample to monitor accuracy against, so you can quickly detect if something has changed. You can also monitor the ML scores returned by the algorithm: if the distribution of these scores changes, that's another indication that the underlying data has shifted and the models need retraining.
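For the score-distribution check, a two-sample Kolmogorov-Smirnov test is one reasonable choice; here's a sketch using SciPy (the retraining trigger is hypothetical):

```python
from scipy.stats import ks_2samp

def scores_have_drifted(baseline_scores, recent_scores, alpha=0.01):
    """Has the distribution of model scores shifted since the baseline
    window? A significant shift suggests the underlying data has changed
    and the model should be retrained."""
    _, p_value = ks_2samp(baseline_scores, recent_scores)
    return p_value < alpha

# Hypothetical usage: compare last week's scores against this week's.
# if scores_have_drifted(last_week_scores, this_week_scores):
#     trigger_relabeling_and_retraining()
```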
About the Authors
Edwin Chen works at Hybrid, a platform for machine learning and human labor. He previously built machine learning systems at Google, Twitter, and Dropbox, and worked in quantitative finance.
Justin Palmer is founder of topfunnel, software for recruiters, and works on Hybrid. He was most recently VP Data at LendingHome, and has built ML products for speech recognition and natural language processing at Basis Technology and MITRE.