Key Takeaways
- Bias in data has created a bottleneck in enterprise AI that cannot be solved by further optimizing machine learning algorithms or inventing new ones.
- Hindsight bias is the accidental presence of information in the training data that will never legitimately be available in production. In layman's terms, it is like Marty McFly (from Back to the Future) traveling to the future, getting his hands on the Sports Almanac, and using it to bet on the games of the present.
- There is no silver bullet that solves it, but a combination of statistical methods and feature engineering can help detect and fix it.
- Features that exhibit such bias need to be distinguished from true predictors, and determining the right threshold is key.
- At Salesforce Einstein, building awareness of such bias with our customers was the first hurdle we had to clear before we could resolve it.
Once upon a time, there was a sales executive who tracked incoming sales leads by entering the minimal data needed to insert a lead record. Data entry is a pain, we all know that! As he worked through the process of converting the leads, some of them turned into purchases. At the time of conversion, he filled in additional information, but only for the leads that had the positive outcome of converting to purchases. If you train a machine learning algorithm on years of such labeled data, it will correlate those features with a positive label, even though they would never actually be available before the conversion. The business process built bias into the data from the ground up.
This story repeats itself across different enterprise use cases, users, and data. Machine learning algorithms often assume that a mythical “perfect dataset” is fed into them to predict the target label. In reality, there is often a lot of noise in the data. The Achilles’ heel in this domain is hindsight bias (also known as label leakage or data leakage). It is the accidental presence of information in the training data that will never legitimately be available in production, causing unrealistically good results in the research environment and poor results in the production environment.
Albert Einstein said: “If I had an hour to solve a problem, I'd spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.” So let's uncover the problem a bit more, with an example:
Demystifying hindsight bias with the Titanic
In the machine learning community, the Titanic passenger survivability prediction is pretty well known. The lack of sufficient lifeboats was responsible for many lost lives in the aftermath of the shipwreck. Specific groups of passengers such as women, children and the upper class were more likely to survive than others. Machine learning is used to learn such signals and predict which passengers survived the tragedy.
What many do not know is that the data used in the Kaggle challenge is a filtered, cleaned-up version. The original data had additional features, two of which were particularly problematic: the Boat and Body fields. In the aftermath of the shipwreck, passengers were assigned a boat number if they safely made it to a lifeboat, or a body number if they were eventually found dead. Well, of course! If there is a body number, the passenger is dead. You do not need a fancy machine learning algorithm to tell you that.
When using the original dataset, information about the target label crept into the training data. Boat and body numbers are only known in the future, after the event has already occurred. They are not known in the present, when the prediction is made. If we train the model on such data, it will perform poorly in the present, as that piece of information would not legitimately be available.
This problem is known formally as hindsight bias, and it is prevalent in real-world data, which we have witnessed first-hand while building predictive applications at Salesforce Einstein. Here is an actual example in the context of predicting sales lead conversion: the data had a field called deal value which was populated intermittently, when a lead was converted or was close to being converted (similar to the Boat and Body fields in the Titanic story).
In layman's terms, it is like Marty McFly (from Back to the Future) traveling to the future, getting his hands on the Sports Almanac, and using it to bet on the games of the present. Since time travel is still a few years away, hindsight bias is a serious problem today.
Hindsight bias versus the modeling algorithm
Machine learning algorithms take center stage today in artificial intelligence applications. There is a race to gain a fraction of a percentage point of improvement in model accuracy by optimizing modeling algorithms or inventing new ones. While that is useful, you can get a bigger bang for your buck by focusing on where the bottleneck is for applied machine learning, specifically with enterprise data. Hindsight bias is one such mostly unexplored area. So, how can we address this problem?
Mitigation strategies
1. Statistical analysis of input features
There is a suite of statistical tests that we can run on the input features to detect strong association between a feature and the target label. Pearson correlation provides a numeric measure in the range [-1, 1] that expresses both the strength and the direction of the association between a feature and the label. While it works great for numeric features, it can also work for categorical features once they are vectorized. However, if a categorical feature has a large number of unique values (e.g. cities of the world), correlation misses the association with the label because the feature gets diluted across many columns during vectorization. Cramér's V addresses this, which makes it the preferred statistical test for categorical features.
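Below is a minimal sketch of both tests in Python, using pandas and scipy; the DataFrame and its column names (amount, stage, converted) are hypothetical stand-ins for real enterprise fields.

```python
import numpy as np
import pandas as pd
from scipy import stats

def pearson_with_label(df: pd.DataFrame, feature: str, label: str) -> float:
    """Pearson correlation between a numeric feature and a binary label."""
    r, _ = stats.pearsonr(df[feature], df[label])
    return r

def cramers_v(df: pd.DataFrame, feature: str, label: str) -> float:
    """Cramér's V between a categorical feature and the label, in [0, 1]."""
    table = pd.crosstab(df[feature], df[label])
    # correction=False: skip the Yates correction so V stays comparable
    # across table sizes.
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Toy data: "stage" leaks the label, "amount" does not.
df = pd.DataFrame({
    "amount": [100, 250, 80, 300, 120, 90],
    "stage": ["open", "won", "open", "won", "lost", "open"],
    "converted": [0, 1, 0, 1, 0, 0],
})
print(pearson_with_label(df, "amount", "converted"))
print(cramers_v(df, "stage", "converted"))  # 1.0 flags a leakage suspect
```

A Cramér's V near 1.0, as the stage field produces here, is a strong hint that the feature encodes the outcome rather than predicting it.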
The impact of such biased features is trickier to spot when the bias affects only a small fraction of the examples. Imagine a global geographic dataset: the portion of rows in which City = San Francisco might be one in a thousand. Lift is an alternative measure that catches such sparse hindsight bias.
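Here is a hedged sketch of per-value lift, defined as P(label = 1 | feature = value) / P(label = 1); the city field and the data are illustrative. A value with lift far above 1.0 stands out even when it covers only a handful of rows, which is exactly where a global correlation measure stays flat.

```python
import pandas as pd

def lift_per_value(df: pd.DataFrame, feature: str, label: str) -> pd.DataFrame:
    """Lift of each feature value: P(label=1 | value) divided by P(label=1)."""
    base_rate = df[label].mean()  # overall P(label = 1)
    grouped = df.groupby(feature)[label].agg(["mean", "count"])
    grouped["lift"] = grouped["mean"] / base_rate
    return grouped.sort_values("lift", ascending=False)

df = pd.DataFrame({
    "city": ["New York"] * 6 + ["San Francisco"] * 2,
    "converted": [0, 1, 0, 0, 1, 0, 1, 1],
})
print(lift_per_value(df, "city", "converted"))  # San Francisco: lift 2.0
```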
2. Statistical analysis of derived features
One strategy that has turned out to be useful is to perform some preliminary feature engineering before running statistical tests on the input features.
For instance, many categorical features with hindsight bias follow the pattern of being null until the target label is determined. They tend to have some value filled in close to when the label is specified. The boat and body fields from the Titanic data are examples of this pattern. The way to bust them is to add a derived null-indicator (isNull) feature and test it with Cramér's V, as sketched below.
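A minimal, self-contained sketch; the six-row dataset is synthetic, with the boat field filled in only for survivors, mirroring the Titanic pattern.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic Titanic-style data: "boat" is only populated for survivors.
df = pd.DataFrame({
    "boat": ["13", None, "5", None, "9", None],
    "survived": [1, 0, 1, 0, 1, 0],
})
df["boat_isNull"] = df["boat"].isna()  # derived null-indicator feature

table = pd.crosstab(df["boat_isNull"], df["survived"])
chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
v = np.sqrt(chi2 / (len(df) * (min(table.shape) - 1)))
print(v)  # 1.0: the null pattern perfectly encodes the label
```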
Correlation does not always catch numeric features with hindsight bias. For example, in the context of predicting whether a sales opportunity will be won or lost, there was a feature called expected revenue. The system filled in the value after the salesperson closed the opportunity: when the opportunity was lost, the system set expected revenue to 0 or 1; otherwise, it calculated a large number. A decision tree can be used to discover the two bins, [0, 1] and [2, infinity). Once you bin a numeric feature, you can treat it as a categorical feature, and a statistical test like Cramér's V can then reveal the strong association between the specific bin and the label, thus exposing the bias.
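Here is a sketch of tree-based binning with scikit-learn; the expected_revenue values are synthetic, and a depth-1 tree simply finds the single split that best separates the label.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({
    # Lost deals: expected revenue 0 or 1; won deals: large values.
    "expected_revenue": np.concatenate([rng.integers(0, 2, size=50),
                                        rng.integers(10_000, 500_000, size=50)]),
    "won": np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)]),
})

# A depth-1 tree learns the threshold that separates [0, 1] from the rest.
tree = DecisionTreeClassifier(max_depth=1).fit(df[["expected_revenue"]], df["won"])
threshold = tree.tree_.threshold[0]
df["revenue_bin"] = df["expected_revenue"] > threshold

# Cramér's V on the binned feature exposes the leak.
table = pd.crosstab(df["revenue_bin"], df["won"])
chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
print(np.sqrt(chi2 / (len(df) * (min(table.shape) - 1))))  # 1.0: biased
```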
The other noteworthy pattern that we observed: categorical features disguised as text. For instance, while predicting whether a deal would be lost or won, there was a feature called Lost in stage. Clearly heavily biased, it was defined as a text feature but had only three possible values. A cardinality check on such features, converting them to categoricals, and then applying a statistical test like Cramér's V can reveal hindsight bias.
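A hedged sketch of such a cardinality check; the max_unique cutoff of 10 is an arbitrary illustration, not a recommended value.

```python
import pandas as pd

def low_cardinality_text_features(df: pd.DataFrame, max_unique: int = 10) -> list:
    """Text columns with few distinct values: categoricals in disguise,
    worth converting and testing with Cramér's V."""
    return [col for col in df.select_dtypes(include="object").columns
            if df[col].nunique(dropna=True) <= max_unique]
```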
3. Training versus scoring distribution
Sometimes, the most elusive hindsight bias is not exposed by the techniques above, which look only at the training data. One of the principal assumptions behind training a machine learning algorithm is that the data used for training is similar to the data used for scoring.
Since features with hindsight bias contain information about the label at, or right before, the moment the actual label is determined, we can compare the distribution of each feature in the training data against the scoring data (before the actual label is known). If any feature shows a statistically significant gap between the two distributions, it is a candidate for hindsight bias.
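One way to sketch this check is with the two-sample Kolmogorov-Smirnov test for numeric features (a chi-square test would play the same role for categoricals); the significance level alpha here is an illustrative choice, not a recommendation.

```python
import pandas as pd
from scipy import stats

def drifted_numeric_features(train: pd.DataFrame, scoring: pd.DataFrame,
                             alpha: float = 0.01) -> list:
    """Flag numeric columns whose training and scoring distributions differ
    significantly; each flagged column is a hindsight-bias candidate."""
    flagged = []
    for col in train.select_dtypes(include="number").columns:
        stat, p = stats.ks_2samp(train[col].dropna(), scoring[col].dropna())
        if p < alpha:  # statistically significant gap between distributions
            flagged.append((col, stat))
    return flagged
```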
Temporal or timestamp cutoff is a related technique. Here, we determine a cutoff timestamp as the moment in time when the prediction event is supposed to occur, based on current and past records. We then exclude all data recorded after that cutoff, so we are not using any data collected close to the prediction or after it, i.e., in the future.
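As a minimal sketch, assuming each record carries a last_modified timestamp (a hypothetical field name), the cutoff is just a filter:

```python
import pandas as pd

def apply_cutoff(df: pd.DataFrame, cutoff: pd.Timestamp,
                 ts_col: str = "last_modified") -> pd.DataFrame:
    """Keep only records stamped strictly before the prediction-time cutoff."""
    return df[df[ts_col] < cutoff]
```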
4. Cross-validation folds and data preparation
It is crucial to perform all data preparation and feature engineering within each cross-validation fold. For example, if we use label information in any feature engineering step, such as binning, over the full dataset, we inherently introduce hindsight bias. The same applies to feature selection, outlier removal, encoding, feature scaling, and dimensionality reduction. If we perform any of these on the entire dataset before cross-validation, then the test data in each fold has played a role in choosing the features, which introduces hindsight bias.
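With scikit-learn, the usual way to respect this rule is to put every preprocessing step inside a Pipeline, so that each step is re-fit on the training portion of each fold; here is a minimal sketch on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),    # scaling fit on each training fold only
    ("select", SelectKBest(k=10)),  # feature selection per training fold only
    ("model", LogisticRegression()),
])

# cross_val_score clones and re-fits the whole pipeline inside each fold,
# so held-out data never influences scaling or feature selection.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```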
Is it hindsight bias or a true predictor?
Across all the methods discussed so far, the hardest aspect is discovering the right threshold for your data and use case, the one that reveals hindsight bias. What correlation value should mark a feature as biased? Is 0.9 a good threshold, or should it be 0.75? At what point do you say a feature is biased rather than a true predictor? You need to make the same decision for every other statistical measure, including the gap between the training and scoring distributions, and so forth.
At Salesforce Einstein, our experience building models for a wide variety of use cases and data of different shapes and sizes helps inform acceptable thresholds. However, they are far from set in stone; we continuously iterate on them to reflect real-world data and problems.
Conclusion
Hindsight bias is a more prevalent problem in enterprise AI than in consumer AI or academia. The most significant challenge we faced was building awareness of it among our customers. Once we got past that, understanding the business processes and data patterns that introduce such bias was crucial. This journey helped us develop solutions that automate hindsight bias detection and mitigation. The result: more reliable machine learning predictions.
About the Author
Mayukh Bhaowal is a Director of Product Management at Salesforce Einstein, working on automated machine learning and the data science experience. Mayukh received his Master's in Computer Science from Stanford University. Prior to Salesforce, Mayukh worked at startups in the domain of machine learning and analytics. He served as Head of Product at an ML platform startup, Scaled Inference, backed by Khosla Ventures, and led product at an ecommerce startup, Narvar, backed by Accel. He was also a Principal Product Manager at Yahoo and Oracle.