Machine Learning Times
Moving Beyond “Algorithmic Bias is a Data Problem”

Originally published in Patterns, April 9, 2021.
A surprisingly sticky belief is that a machine learning model merely reflects existing algorithmic bias in the dataset and does not itself contribute to harm. Why, despite clear evidence to the contrary, does the myth of the impartial model still hold allure for so many within our research community? Algorithms are not impartial, and some design choices are better than others. Recognizing how model design impacts harm opens up new mitigation techniques that are less burdensome than comprehensive data collection.

In the absence of intentional interventions, a trained machine learning model can and does amplify undesirable biases in the training data. A rich body of work to date has examined these forms of problematic algorithmic bias, finding disparities—relating to race, gender, geo-diversity, and more—in the performance of machine learning models.
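As a toy illustration of this amplification effect (my own construction, not drawn from the article): a classifier that simply learns each group's majority label turns a modest skew in the training data into a total skew in its predictions.

```python
# Toy example: a deterministic "majority label per group" classifier
# amplifies a 70/30 skew in the data into a 100/0 skew in predictions.
# The groups and rates here are hypothetical, chosen only for illustration.
from collections import Counter

# Synthetic training data as (group, label) pairs:
# group "a" is 70% positive, group "b" is 30% positive.
data = [("a", 1)] * 70 + [("a", 0)] * 30 + [("b", 1)] * 30 + [("b", 0)] * 70

# "Train": record the majority label observed for each group.
majority = {}
for group in {g for g, _ in data}:
    labels = [y for g, y in data if g == group]
    majority[group] = Counter(labels).most_common(1)[0][0]

# "Predict": every example gets its group's majority label.
preds = [majority[g] for g, _ in data]

# Positive rate for group "a" in the data vs. in the predictions.
data_rate = sum(y for g, y in data if g == "a") / 100   # 0.70 in the data
pred_rate = sum(p for (g, _), p in zip(data, preds) if g == "a") / 100  # 1.00 predicted
```

Even this trivial model does not "merely reflect" the data: its design (predicting the argmax) pushes a 70% base rate to a 100% prediction rate, which is precisely the kind of amplification the literature documents in larger models.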

However, a surprisingly prevalent belief is that a machine learning model merely reflects existing algorithmic bias in the dataset and does not itself contribute to harm. Here, we start out with a deceptively simple question: how does model design contribute to algorithmic bias?

A more nuanced understanding of what contributes to algorithmic bias matters because it also dictates where we spend effort mitigating harm. If algorithmic bias is merely a data problem, the often-touted solution is to de-bias the data pipeline. However, data “fixes” such as re-sampling or re-weighting the training distribution are costly and hinge on (1) knowing a priori what sensitive features are responsible for the undesirable bias and (2) having comprehensive labels for protected attributes and all proxy variables.
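To make concrete why such data fixes hinge on those two conditions, here is a minimal sketch (my own, with hypothetical group labels) of inverse-frequency re-weighting. Note that it requires a complete, labeled sensitive attribute for every training example, exactly the prerequisite the passage flags as costly.

```python
# Sketch of inverse-frequency re-weighting: give each example a weight of
# n_total / (n_groups * n_group) so every group contributes equally to the
# training loss. Assumes the sensitive attribute is known and fully labeled.
from collections import Counter

def reweight(groups):
    """Return one weight per example, inversely proportional to group size."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

# Hypothetical attribute labels: group "a" has 3 examples, group "b" has 1.
weights = reweight(["a", "a", "a", "b"])
# Group "a" examples each get 4/(2*3) ~= 0.667; the lone "b" example gets 2.0,
# so both groups carry equal total weight (2.0 each).
```

Such weights are typically passed to a learner's loss (e.g., a `sample_weight` argument); the point of the passage stands, though: if the attribute driving the bias is unknown, or proxies for it are unlabeled, there is nothing to compute these weights over.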

To continue reading this article, click here.
