At QCon San Francisco last week, Leslie Miley gave a keynote talk in which he explained how inherent bias in data sets has affected things from the 2016 Presidential race to criminal sentencing in the United States.
He started by emphasizing that 2017 has been an unprecedented year because social media has been overwhelmed with bias powered by fake news, machine learning and AI. He recounted the numbers Facebook has acknowledged: in 2016 the company claimed that fake news was not a problem; in October 2017 it stated that 10 million people had seen fake ads; and in November it revised that number to 126 million, and climbing. Twitter identified 6,000 Russian-linked bots which generated 131,000 tweets between September and November 2016, viewed by 288 million people globally and 68 million in the USA. He asked: how could this happen?
He explained that while he was at Twitter he ran abuse, safety and security in the accounts team. They identified hundreds of millions of accounts which had been created in Ukraine and Russia; he does not know whether they were removed. Facebook has stated that up to 200 million of its accounts may be fake or compromised. There is a significant problem which is not being addressed.
He explained that in 2016 Twitter released their algorithmic timeline, which is designed to ensure that you see more tweets from the people you interact with the most and that it:
Ensured the most popular tweets are far more widely seen than they used to be, enabling them to go viral on an unprecedented scale.
They have achieved this goal very effectively; however, there is a problem when the most popular tweets and posts are falsified news. He said the system didn't deliver news, it delivered propaganda; it didn't deliver your cat video, it delivered biased information. Fake posts told people to go out and protest against Black Lives Matter; they told someone to go and shoot up a pizza parlour in the middle of the country, and someone did that because of fake information they received from social media.
He maintains that Facebook and Twitter are publishers, media companies; however, they are not held to account as media companies are, because they are treated as a "platform". There is extensive ongoing debate about the role of Facebook and Twitter as media companies or platforms.
There could be as many as nearly one billion false accounts on social media generating fake posts and exploiting the algorithmic timeline to spread their content very widely, playing on bias and altering people's moods and behaviours. He cited a Facebook experiment which showed how inserting different posts into people's timelines altered their mood and actions. He asked whether, having published that this is possible, Facebook did anything to prevent others from using the same techniques; his contention is that it did not.
The false data becomes a part of the training data which determines what the timeline algorithms present to people.
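To make that feedback loop concrete, here is a minimal sketch of how engagement-driven ranking can launder bot activity into the next round of training data. This is not Twitter's actual ranking code; the posts, accounts and popularity threshold are invented purely for illustration:

```python
from collections import Counter

# Hypothetical engagement log: (post_id, account_id) pairs.
# At this point in the pipeline, bot accounts look exactly like real ones.
engagements = [
    ("local_news", "alice"), ("local_news", "bob"),
    ("fake_story", "bot_001"), ("fake_story", "bot_002"),
    ("fake_story", "bot_003"), ("fake_story", "carol"),
]

def rank_posts(engagements):
    """Rank posts purely by raw engagement volume."""
    counts = Counter(post for post, _ in engagements)
    return counts.most_common()

ranking = rank_posts(engagements)
print(ranking)  # [('fake_story', 4), ('local_news', 2)]

# The bot-inflated ranking is then logged as "what users engaged with" and
# fed back as training data for the next model update, so the fabricated
# popularity compounds on every iteration.
next_training_batch = [{"post": post, "label_popular": count >= 3}
                       for post, count in ranking]
```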
He related this to the 2008 mortgage crisis: information was collected and presented with very little control, and without understanding how the system worked or why it behaved the way it did.
He explained why this concerns him: he is sure that the "next big thing" will be an AI/ML company, and he asked whether it will repeat the mistakes of the past. Without conscious care and effort, this is a very likely outcome.
There is a growing and thriving industry emerging around the use of algorithms in a wide range of areas. He gave the example of ride-sharing: what happens if the algorithm determines that in a particular area most rides are under $5.00? Will the service still send drivers to pick up in those areas, or will it send lower-rated drivers? What is the impact on the people who live in those areas? This is already happening, and there is no visibility into what is going on.
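Miley's ride-sharing scenario is hypothetical, but a naive revenue-optimising dispatch rule would produce exactly the outcome he described. The sketch below is illustrative only; the areas, fares and ratings are made up and do not reflect any real service's logic:

```python
# Illustrative only: a naive dispatch heuristic that optimises for expected
# fare. Nobody ever decides to neglect a neighbourhood, but the lowest-fare
# area quietly gets the lowest-rated driver.
areas = {
    "downtown":   {"median_fare": 14.50},
    "suburb":     {"median_fare": 9.75},
    "low_income": {"median_fare": 4.25},
}

drivers = [
    {"id": "d1", "rating": 4.9},
    {"id": "d2", "rating": 4.6},
    {"id": "d3", "rating": 3.8},
]

def dispatch(areas, drivers):
    """Pair the best-rated available drivers with the highest-fare areas."""
    by_fare = sorted(areas, key=lambda a: areas[a]["median_fare"], reverse=True)
    by_rating = sorted(drivers, key=lambda d: d["rating"], reverse=True)
    return dict(zip(by_fare, by_rating))

for area, driver in dispatch(areas, drivers).items():
    print(area, "->", driver["id"], driver["rating"])
# downtown -> d1 4.9
# suburb -> d2 4.6
# low_income -> d3 3.8
```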
It is also happening in sentencing guidelines, where an algorithm resulted in African-Americans being 45% more likely to be sent to prison for the same crime because the dataset used to train the model was inherently biased. The algorithm has been deployed in 25 US states without being changed.
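He did not walk through the mechanics, but they are easy to demonstrate. The following sketch uses entirely made-up numbers to show how a risk model that simply reproduces historical rates carries the disparity forward even when two defendants' offences are identical:

```python
# Made-up records: the same offence, but group B was historically imprisoned
# more often - the bias baked into the data before any model sees it.
history = [
    {"group": "A", "offence": "theft", "imprisoned": False},
    {"group": "A", "offence": "theft", "imprisoned": False},
    {"group": "A", "offence": "theft", "imprisoned": True},
    {"group": "B", "offence": "theft", "imprisoned": True},
    {"group": "B", "offence": "theft", "imprisoned": True},
    {"group": "B", "offence": "theft", "imprisoned": False},
]

def risk_score(records, group, offence):
    """'Train' by simply reproducing the historical imprisonment rate."""
    matching = [r for r in records if r["group"] == group and r["offence"] == offence]
    return sum(r["imprisoned"] for r in matching) / len(matching)

print(round(risk_score(history, "A", "theft"), 2))  # 0.33
print(round(risk_score(history, "B", "theft"), 2))  # 0.67 - same crime, double the "risk"
```

Real risk-assessment tools do not use group membership directly; they rely on proxies such as prior arrests or zip code, but when those proxies encode historically unequal enforcement the effect is the same.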
There is no transparency around how the algorithms are put together and trained, yet these algorithms are making more and more life-and-death decisions in society – around employment, health care, mortgage rates and many other areas of our lives.
When these problems become apparent and come crashing down, the public will be left to pick up the pieces.
He then identified concrete things we can do to prevent these problems. It starts with having the discussion about where the training data comes from: is it over-sampled or under-sampled, and how are the algorithms built? Be transparent about what information is collected, how it is used, and which elements are taken into account in the calculations.
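One concrete way to start that discussion is a simple representation audit before any model is trained. The sketch below is a generic example; the groups, reference shares and tolerance are placeholders rather than anything from the talk:

```python
from collections import Counter

# Placeholder training set and reference population shares - the groups,
# counts and tolerance are arbitrary, chosen only to show the check.
training_samples = ["group_a"] * 700 + ["group_b"] * 250 + ["group_c"] * 50
population_share = {"group_a": 0.60, "group_b": 0.30, "group_c": 0.10}

def audit_representation(samples, reference, tolerance=0.25):
    """Flag groups whose share of the data drifts far from the reference."""
    counts = Counter(samples)
    total = len(samples)
    report = {}
    for group, expected in reference.items():
        observed = counts.get(group, 0) / total
        drift = (observed - expected) / expected
        if drift > tolerance:
            status = "over-sampled"
        elif drift < -tolerance:
            status = "under-sampled"
        else:
            status = "ok"
        report[group] = (round(observed, 2), expected, status)
    return report

for group, row in audit_representation(training_samples, population_share).items():
    print(group, row)
# group_a (0.7, 0.6, 'ok')
# group_b (0.25, 0.3, 'ok')
# group_c (0.05, 0.1, 'under-sampled')
```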
He presented some actionable steps we can take:
- Seek out people outside of your circle for your training-data experiments – widen your datasets
- Practice radical transparency in what data is being used – ensure the data set identification and algorithm is peer reviewed
- Hire more women in engineering – just do it; engineering teams with more women in them produce better results
- Work on empathy and self-awareness – every day try to wring a little bit of the bias out of yourself (referencing President Obama); refactor your empathy and self-awareness
He ended by providing a list of sources for the audience to delve further into these topics:
- Elite Data Science on Bias-Variance Tradeoff
- Algorithm Watch
- Algorithmic Justice League
- The European Union General Data Protection Regulation
- Federica Pelzel
He encouraged the audience:
Let’s not build an ML weapon of mass destruction and then stand back after five or so years and try to say “but we’re just a platform”.
He closed by saying that in technology we have worked with little oversight and regulation – let's aim to be self-regulated rather than waiting for these kinds of problems to bring about government regulation. Think about the impact of what we're building on people who are less privileged than those who build the systems.