Catherine Yeo

New Way to Measure Crowdsourcing Bias in Machine Learning

An overview of how to use counterfactual fairness to quantify the social bias of crowd workers

Photo by Edwin Andrade on Unsplash

Crowdsourcing is widely used in machine learning as an efficient way to annotate datasets. Platforms like Amazon Mechanical Turk allow researchers to collect data from, or outsource the labelling of training data to, individuals all over the world.

However, crowdsourced datasets often contain significant social biases, such as gender or racial preferences and prejudices. Algorithms trained on these datasets then produce biased decisions as well.

In a short paper (cited at the end of this post), researchers from Stony Brook University and IBM Research propose a novel method to quantify bias in crowd workers.

One Line Summary

Integrating counterfactuals into the crowdsourcing process is a new way to measure crowd workers’ bias and to make the machine learning pipeline fairer at the preprocessing stage.

Motivation and Background

Crowdsourcing is used a ton in ML to label training datasets. However, crowd workers’ biases can end up embedded in the labels they produce, which makes the datasets biased. Algorithms trained on these datasets then inherit that bias too.

Previous research on measuring social biases in crowd workers has centered on self-reported surveys or Implicit Association Tests (IATs), but both methods are separate from the labelling task itself.

Furthermore, both methods may make crowd workers aware that they are being judged, which introduces social desirability bias: they may give the answers they believe are more socially acceptable rather than the ones they genuinely believe.

A better method, this paper proposes, is to use the idea of counterfactual fairness.

What is Counterfactual Fairness?

The word “counterfactual” refers to statements or situations that did not happen — “If I had arrived there on time…”, “If I had bought that instead…”.

In explainable machine learning, counterfactuals represent the same idea. An individual’s counterfactual is the same individual in a world where their sensitive attribute has been changed.

A machine learning model is then fair under counterfactual fairness if it produces the same prediction for an individual and for their counterfactual.

For example, suppose we have a model predicting whether an individual will receive a bank loan or not. Let us choose the sensitive attribute (usually a demographic group) here to be whether a person has curly hair or not, and keep other features the same (or similar). Then, a counterfactually fair model would produce the same decision for a person with curly hair and for a person with straight hair: either both receive the loan or neither does.
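To make the loan example concrete, here is a minimal Python sketch of the check. Everything in it is an assumption for illustration: the two toy models, the `income` and `hair` fields, and the helper names `make_counterfactual` and `is_counterfactually_fair` are hypothetical, not something taken from the paper.

```python
from typing import Any, Callable, Dict, Iterable

Person = Dict[str, Any]


def make_counterfactual(person: Person, attribute: str, new_value: Any) -> Person:
    """Return a copy of `person` in which only the sensitive attribute differs."""
    twin = dict(person)
    twin[attribute] = new_value
    return twin


def is_counterfactually_fair(
    model: Callable[[Person], bool],
    person: Person,
    attribute: str,
    alternatives: Iterable[Any],
) -> bool:
    """True if the model's decision is identical for every value of the attribute."""
    baseline = model(person)
    return all(
        model(make_counterfactual(person, attribute, value)) == baseline
        for value in alternatives
    )


# Two hypothetical toy "models" for the bank-loan example.
def fair_model(p: Person) -> bool:
    return p["income"] >= 50_000  # ignores hair entirely


def biased_model(p: Person) -> bool:
    return p["income"] >= 50_000 and p["hair"] != "curly"  # penalises curly hair


applicant = {"income": 60_000, "hair": "curly"}
fair_result = is_counterfactually_fair(fair_model, applicant, "hair", ["curly", "straight"])
biased_result = is_counterfactually_fair(biased_model, applicant, "hair", ["curly", "straight"])
```

Note that this only checks one individual at a time; a model could pass for one applicant and fail for another.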


In the crowdsourcing setting, a crowd worker is considered fair if they provide the same label for any query and its counterfactual.

Specifically, a counterfactual query could be generated by replacing the demographic with an alternative option. For example, to measure binary gender bias, we could simply flip the gender in the same statement so that the crowd worker must evaluate both “Women are such hypocrites” and “Men are such hypocrites”.
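As a rough sketch of how such a flip could be automated for text queries, the snippet below swaps a handful of gendered words in a single pass. The swap table `GENDER_SWAPS` and the function `counterfactual_query` are hypothetical and deliberately tiny; the paper does not prescribe this implementation, and real text needs more care (for example, "her" can map to either "him" or "his").

```python
import re

# Hypothetical, deliberately small swap table for binary gender terms.
GENDER_SWAPS = {
    "women": "men",
    "men": "women",
    "woman": "man",
    "man": "woman",
    "she": "he",
    "he": "she",
}

# Match whole words only, so "he" inside "the" is left alone.
_PATTERN = re.compile(r"\b(" + "|".join(GENDER_SWAPS) + r")\b", re.IGNORECASE)


def counterfactual_query(text: str) -> str:
    """Return `text` with each gendered term replaced by its counterpart."""

    def swap(match: "re.Match[str]") -> str:
        word = match.group(0)
        replacement = GENDER_SWAPS[word.lower()]
        # Preserve the capitalisation of the first letter.
        return replacement.capitalize() if word[0].isupper() else replacement

    # A single substitution pass means swapped words are never swapped back.
    return _PATTERN.sub(swap, text)


query = "Women are such hypocrites"
cq = counterfactual_query(query)  # "Men are such hypocrites"
```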

This paper presents a simplified counterfactual view, where one only changes the sensitive attribute and keeps all other features constant (or perturbed with low levels of noise).

Figure 1 of Paper

In the recidivism prediction example above (Figure 1 of the paper), the sensitive attribute is race. In the query (Q), the race is “Black”; in the counterfactual query (CQ), the race is “White”. Using each query’s data points, a crowd worker would predict the likelihood of the person committing another crime within two years.
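For tabular queries like this one, building the Q/CQ pair could look like the sketch below. The record fields (`age`, `prior_offenses`), the function `make_query_pair`, and the multiplicative noise scheme are all assumptions made for illustration; the figure's actual fields and the paper's noise model may differ.

```python
import random
from typing import Any, Dict, Optional, Tuple

Record = Dict[str, Any]


def make_query_pair(
    record: Record,
    attribute: str,
    alternative: Any,
    noise: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Tuple[Record, Record]:
    """Return (query, counterfactual_query).

    The counterfactual copies `record`, changes only `attribute`, and, if
    `noise` > 0, scales each float feature by a random factor in
    [1 - noise, 1 + noise]. Non-float features are left untouched.
    """
    rng = rng or random.Random()
    counterfactual = dict(record)
    counterfactual[attribute] = alternative
    if noise > 0:
        for key, value in record.items():
            if key != attribute and isinstance(value, float):
                counterfactual[key] = value * (1 + rng.uniform(-noise, noise))
    return dict(record), counterfactual


# Hypothetical record; the field names are made up for this example.
query = {"race": "Black", "age": 30, "prior_offenses": 2}
q, cq = make_query_pair(query, "race", "White")
```

With the default `noise=0.0`, the two records differ only in the sensitive attribute, which matches the simplified view described above.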

Once the crowd worker has labelled every pair, we calculate the mean absolute difference between the labels they gave to each query and to its counterfactual query. A higher score indicates more bias.

Then, a threshold can be set to filter out biased crowd workers.
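The score and the filtering step can be sketched in a few lines of Python. This assumes labels are numeric (here, a likelihood on a 0 to 100 scale) and that each worker's answers are already paired up as (query label, counterfactual label); the worker IDs, the sample labels, and the threshold of 10 are hypothetical values chosen for the example.

```python
from statistics import mean
from typing import Dict, List, Tuple

LabelPair = Tuple[float, float]  # (label for Q, label for CQ)


def bias_score(pairs: List[LabelPair]) -> float:
    """Mean absolute difference between labels for queries and their counterfactuals."""
    if not pairs:
        raise ValueError("need at least one (query, counterfactual) label pair")
    return mean(abs(q_label - cq_label) for q_label, cq_label in pairs)


def filter_workers(responses: Dict[str, List[LabelPair]], threshold: float) -> List[str]:
    """Keep workers whose bias score does not exceed `threshold`."""
    return [worker for worker, pairs in responses.items() if bias_score(pairs) <= threshold]


# Hypothetical responses: likelihood of reoffending on a 0-100 scale.
responses = {
    "worker_a": [(70, 70), (40, 40)],  # identical labels for Q and CQ
    "worker_b": [(80, 40), (60, 30)],  # labels shift when race is flipped
}

scores = {worker: bias_score(pairs) for worker, pairs in responses.items()}
kept = filter_workers(responses, threshold=10)
```

Here `worker_a` scores 0 and `worker_b` scores 35, so only `worker_a` passes the threshold. Where to set the threshold is a judgment call that depends on the label scale and the task.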

In theory, this approach should work better than previous methods for measuring crowd workers’ biases, because counterfactual queries are presented in the same format as any regular query, so crowd workers are unlikely to realize they are being evaluated.

My Final Thoughts

This paper presents a new method for measuring the biases of crowd workers, based on counterfactual fairness, that I find really promising. I’m excited to see their empirical results in the future and how they compare to results from other fairness metrics.

For more information, check out the original paper on arXiv, cited below.

Bhavya Ghai, Q. Vera Liao, Yunfeng Zhang, and Klaus Mueller. “Measuring Social Biases of Crowd Workers using Counterfactual Queries”, CHI 2020 Workshop on Fair and Responsible AI.
