Bias in machine learning models

By Sonia Roberts

This is the second post on the big idea of confirmation bias.

“Confirmation bias” refers to a characteristically human weakness: The tendency to favor information that supports what we already believe. But if we aren’t careful, biases can also pop up in systems built with machine learning tools. Let’s say you want to develop a tool that predicts where in a city more crimes are committed, so that the police can send more officers to those locations. The tool is trained on existing police reports. More crimes are reported in the locations where there are more officers, because officers are the ones filing reports. Thus, the tool learns that more crimes are committed in the places that already have a greater density of police officers, and keeps advising the police force to send officers to those same locations. There could be other parts of the city with lots of crimes that are not being reported, but the tool will never learn about those crimes -- they simply don’t appear in the dataset. This could create a feedback loop that looks and behaves a lot like confirmation bias does in humans. 
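To make that feedback loop concrete, here is a small simulation sketch in Python. It is not any real predictive-policing system, and every number in it is made up: two districts have identical true crime rates, crimes are only counted when an officer is nearby to report them, and the tool re-allocates patrols each round based on the reports.

```python
import random

# Toy sketch of the feedback loop described above -- not a real
# predictive-policing system. Both districts have the SAME true crime
# rate; a crime only makes it into the data if an officer observes it.

random.seed(0)

TRUE_CRIMES_PER_ROUND = 100              # identical in both districts
officers = {"north": 60, "south": 40}    # "north" happens to start with more patrols

for round_number in range(10):
    reports = {}
    for district, n_officers in officers.items():
        # The chance a crime gets reported grows with patrol density.
        detection_rate = min(1.0, n_officers / 100)
        reports[district] = sum(
            1 for _ in range(TRUE_CRIMES_PER_ROUND)
            if random.random() < detection_rate
        )

    # "Retrain" the tool: next round's 100 officers are allocated in
    # proportion to reported crime -- which reflects where the patrols
    # were at least as much as where the crime actually is.
    total_reports = sum(reports.values())
    officers = {
        district: round(100 * count / total_reports)
        for district, count in reports.items()
    }
    print(round_number, reports, officers)
```

Even though the two districts are identical, the reports mirror the patrol allocation, so the tool never “sees” the crimes it isn’t positioned to observe and never corrects the initial imbalance.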

This particular example about predictive policing shows the high stakes of getting bias in machine learning right: In real-world policing, the areas with more officers are not randomly distributed, but are often rapidly gentrifying low-income communities of color. This could contribute to people of different races and socioeconomic backgrounds having different experiences with police. It’s tempting to believe that machine learning tools should be free of the biases we humans are so vulnerable to, but machine learning has its own biases that we have to keep in mind when developing and using these tools. To learn more about some of the biases that are particular to artificial intelligences, I spoke with Kristian Lum, a professor at Penn who studies algorithmic bias.

Prof. Lum was hesitant to give a simple definition of this term. “It can mean a lot of things,” she said. “When we’re talking about algorithmic bias, we’re not talking about algorithms predicting the width of a screw in a manufacturing plant. We’re talking about algorithms that are used to make decisions or predictions about people, and are in some way skewed. It’s become sort of a catch-all phrase for algorithms that impact humans badly.” So “algorithmic bias” describes the feedback loop in the predictive policing example, but it can also describe many other ways that an automated system -- often developed using machine learning techniques -- produces different outcomes for different groups of people. This bias can have many sources, and Lum took me through a few examples. 

One source is measurement bias, which is any systematic error in data collection. “An algorithm can be biased because the method you’re using to estimate its parameters is biased,” Lum said. Say that you want to measure the heights of Penn students, so you go from department to department and ask all the majors to stand next to a big ruler. The students in the math department get the idea to stand on their Intro Calculus textbooks, so their measured heights come out a bit taller, on average, than their true heights. “If you were to use a model to predict those heights, your model would inherit that bias,” Lum said. “And that can particularly become a problem if there’s some sort of systematic rule for who’s more likely to get that book to stand on.”
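Here is a minimal sketch, with invented numbers, of how a model “inherits” that kind of measurement bias: every student is drawn from the same height distribution, but the math majors’ measurements include a hypothetical 5 cm textbook.

```python
import random
import statistics

# Minimal sketch of measurement bias, following the textbook example.
# The 5 cm book and the department sizes are made up for illustration.

random.seed(1)

def true_height():
    # Every student comes from the same distribution: ~170 cm on average.
    return random.gauss(170, 8)

# Measured heights: math majors stand on a 5 cm textbook, so the
# *measurements* are biased even though the underlying heights are not.
measured = {
    "math":    [true_height() + 5 for _ in range(50)],
    "biology": [true_height() for _ in range(50)],
}

# A "model" that predicts height from department simply memorizes the
# average measurement per department -- and so inherits the 5 cm bias
# that was baked in during data collection.
model = {dept: statistics.mean(heights) for dept, heights in measured.items()}
print(model)  # math comes out roughly 5 cm "taller" than biology
```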

Checking whether someone is standing on a thick book is an easy way to determine whether our measurements are biased, but this kind of bias can be much harder to spot in real-world situations. For example, a study from the 1990s showed that children who were able to resist the temptation of eating one marshmallow until the researcher returned to reward them with a second marshmallow went on to perform better on standardized tests. More recent research that controls for factors like household income has revealed that passing the “marshmallow test,” as it’s called, reflects the child’s socioeconomic background more than it predicts the child’s long-term success. Socioeconomic factors such as food security act like the thick Intro Calculus book, giving these kids a boost on the marshmallow test. 

Bias can also come from sampling in an unrepresentative way. For example, let’s say that you only sample the heights of students in the locker room that the basketball team uses. As Professor Lum put it: “Not everybody in the population of students at Penn would be equally likely to be in that sample, and it would be unrepresentative. Again, if you fit some sort of machine learning model to predict those heights, it’s going to be biased upwards: You misrepresented the composition of the population.” 
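A minimal sketch of that sampling problem, again with invented numbers: the whole student body averages about 170 cm, but a sample taken mostly in the basketball locker room tells a very different story.

```python
import random
import statistics

# Minimal sketch of sampling bias, following the locker-room example.
# Population sizes and heights are invented for illustration only.

random.seed(2)

# The full student population: most students, plus a small basketball
# team whose members are taller on average.
general_students = [random.gauss(170, 8) for _ in range(9_900)]
basketball_team  = [random.gauss(195, 6) for _ in range(100)]
population = general_students + basketball_team

# Sampling in the basketball locker room over-represents the team.
sample = random.sample(basketball_team, 50) + random.sample(general_students, 10)

print(round(statistics.mean(population), 1))  # ~170 cm: the truth
print(round(statistics.mean(sample), 1))      # far taller: the biased estimate
```

Any model fit to the locker-room sample starts from an estimate that is biased upwards, exactly as Lum describes.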

Sampling bias could be one of the reasons that face recognition tools developed using machine learning perform differently on different demographics, as MIT researcher Joy Buolamwini and her colleagues showed in a groundbreaking study. They tested whether the face recognition algorithms developed by IBM, Microsoft, and Face++ -- leading companies in this area -- could guess the gender of faces in a diverse dataset with a broad range of skin tones and a near-even gender split. All of the algorithms performed best on light-skinned men and worst on dark-skinned women. In response to these results, IBM pulled its face recognition technology off the market and began working on developing more diverse datasets. This is an excellent first step, but sampling bias may not have been the only cause of these disparate outcomes, so it is important to keep testing, as Buolamwini and her colleagues did, how these tools perform across different demographic groups. 
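The core of such an audit is simple to sketch: rather than reporting a single overall accuracy number, break the results down by subgroup. The records below are invented placeholders, not the Gender Shades data.

```python
from collections import defaultdict

# Disaggregated evaluation: accuracy per demographic subgroup.
# The (subgroup, model guess, true label) records are invented examples.

predictions = [
    ("lighter-skinned male",   "male",   "male"),
    ("lighter-skinned female", "female", "female"),
    ("darker-skinned male",    "male",   "male"),
    ("darker-skinned female",  "male",   "female"),   # an error
    ("darker-skinned female",  "female", "female"),
]

correct = defaultdict(int)
total = defaultdict(int)
for group, guessed, actual in predictions:
    total[group] += 1
    correct[group] += int(guessed == actual)

for group in total:
    print(f"{group}: {correct[group] / total[group]:.0%} accuracy "
          f"({total[group]} examples)")
```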

Possibly the hardest form of bias to detect is the kind that results from historical causes. Even if your data collection methods are fair and unbiased, and you have an accurate, representative sample of the populations you’re interested in, “you might still end up with disparities between different groups because the world as it is, perfectly measured, is not a fair place,” said Professor Lum. “There are certain groups that have been historically marginalized or oppressed, and that turns up in the data. You can fit a model to make predictions or learn things about that data, but that might still perpetuate some of the past issues, either by reinforcing those stereotypes or making the problem that already exists worse.” 

For example, men far outnumber women at tech companies, and one suggested reason for this is that hiring teams assume that women won’t be a good “cultural fit” with the company. When Amazon tried to build a machine learning tool to scan and score applicants’ resumes, the tool picked up the strong correlation between being male and working at Amazon and turned it into a preference. It docked points from resumes for including the word “women’s,” as in “women’s chess club captain,” and for graduating from women’s colleges. The team at Amazon couldn’t figure out how to counteract the bias, and ended up scrapping the project. 
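A toy example shows how this can happen even though the model never sees a “gender” field. The sketch below is not Amazon’s system; it just scores words by how often they co-occurred with hiring decisions in a tiny, deliberately skewed training set.

```python
from collections import Counter

# Toy sketch of a resume screener trained on historically skewed hiring
# data. Not Amazon's system -- the resumes and labels are invented.

training_data = [
    # (words appearing in the resume, was the candidate hired?)
    ({"python", "chess", "captain"},            True),
    ({"java", "robotics"},                      True),
    ({"python", "women's", "chess", "captain"}, False),
    ({"java", "women's", "robotics"},           False),
]

# "Train" a naive scorer: each word's weight is the hire rate among
# resumes containing it, minus the overall hire rate.
overall_rate = sum(label for _, label in training_data) / len(training_data)

word_counts, word_hires = Counter(), Counter()
for words, label in training_data:
    for word in words:
        word_counts[word] += 1
        word_hires[word] += int(label)

weights = {
    word: word_hires[word] / word_counts[word] - overall_rate
    for word in word_counts
}
print(weights["women's"])  # negative: the word gets docked points
print(weights["python"])   # about zero: carries no signal either way
```

The model has no concept of gender; it has simply learned that, in the biased historical data, one word co-occurred with rejection.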

It’s tempting to believe that automated systems should be free from bias because they are free from human emotions and our lived experiences. But unfortunately, algorithms are trained on real-world data by fallible humans, and they have their own forms of bias that we have to be aware of -- just like people do.