• Tanishq Sandhu

StereoSet: Combatting Inherently Biased Linguistic Models

Exploring a dataset that measures bias in AI language models

Photo by Markus Spiske on Unsplash


Vinodkumar Prabhakaran started off his presentation on Bias and Fairness in NLP at the EMNLP-IJCNLP 2019 with a well-known riddle for the audience:


“A man and his son are in a terrible accident and are rushed to the hospital in critical care. The doctor looks at the boy and exclaims “I can’t operate on this boy, he’s my son!” How could this be?”

The answer is quite simple: the doctor is the mother.

The riddle draws on the fact that humans generally associate a doctor with being male, and so many people are stumped trying to understand how the doctor can be the child’s parent while the father is also known to be injured.

Natural Language Processing (NLP) is a subset of machine learning dealing with the manipulation and analysis of human language.

When people use their own biases to choose to write on certain topics (such as the riddle above) rather than others, they inherently present the world with their own biases and develop a frequency which is inconsistent with reality. This phenomenon is called human reporting bias.

The issue at heart is that these misleadingly generated writings are the same texts that NLP models operate on.

In 2020, researchers Moin Nadeem (MIT), Anna Bethke (formerly Intel), and Siva Reddy (Facebook CIFAR) developed StereoSet: a data set and metrics system to measure the level of stereotyping in NLP models.

How Does StereoSet Work?

StereoSet measures bias along 4 categories: gender, race, profession, and religion. It uses three main metrics for measuring bias:

1) Language Modeling Score (LMS)

The language modeling score assesses the baseline performance of the model in basic language modeling tasks. An ideal model would have a score of 100 (for every word or phrase it is able to make the correct and meaningful association).

2) Stereotype Score (SS)

The stereotype score determines the inclination of the model towards a stereotype or anti-stereotype term. The ideal score for this metric would be a 50 (no inherent bias for a stereotypical term).

3) Idealized Context Association Test (ICAT)

The idealized context association test uses the stereotyping score and language modeling score to determine the efficacy of a model against its bias. The ideal model with a lms of 100 and ss of 50 would have an icat score of 100. On the other hand, a fully stereotypical model with a lms of 0 and ss of 0 or 100 would have an icat score of 0.

These metrics are powered by Amazon Mechanical Turk, which crowdsources the very definition of bias to individuals across the USA who are tasked to construct sentences and phrases which, relative to their opinion, are stereotypical or anti-stereotypical.


Example of a CAT (Source: paper)

(If you’re interested in reading more about crowdsourcing in machine learning and overcoming biases in it, read more here).

The Findings

The results were astonishing: all of the popular models the StereoSet team tested (Facebook’s RoBERTa, Google’s BERT and XLNet, and OpenAI’s GPT-2) were found to be guilty of above-average levels of stereotyping.

Of the four models, a small GPT-2 model scored the best (combination of best performance and lowest bias) with an icat score of 73.0, and the overall least-biased was the base RoBERTa model with a stereotype score of 50.5 (only 0.5 above the ideal score of 50).

Intuitively, it makes sense why GPT-2 performed well. GPT-2 is trained on articles from Reddit which are already split by subreddits related to target categories in StereoSet such as religion or gender. These categorical subreddits lead to the ability to form correct associations. (If you would like to read about a similar language model, GPT-3, and its bias, read this article here.)

What Now?

As the authors of StereoSet have suggested, the StereoSet is indeed a large step forward in working towards de-biasing NLP models. Although the crowdsourced StereoSet data is arguably subjective and still doesn’t change any of the existing stereotypes, the initiative and efforts to identify and acknowledge existing stereotyping and bias is the first step towards universal fairness and unbiased NLP models.

The next step after detection is correction. The detection of stereotypes in linguistic models should be used to fuel efforts to retroactively create unbiased models. One solution is to recreate training sets so that they minimize the stereotype score while maximizing the language modeling score and icat score for the model that trains on them — mainly by making them as diverse and representative of the real world as possible. When models have datasets that truly are representative of our real world, free from any stereotypes or bias, only then can the models follow suit and remain unbiased.

For more information, check out the original paper on arXiV here and their website here.


Tanishq Sandhu is pursuing his Bachelors Degree in Computer Science at Georgia Tech passionate about full stack development and artificial intelligence. To connect, make a suggestion, or learn more, visit Tanishq’s website at www.tanishqsandhu.com