Everyday internet users asking intuitive questions can draw the same kinds of biased responses from AI chatbots that experts uncover with sophisticated technical jailbreak methods, according to research from Penn State that challenges assumptions about how AI discrimination is uncovered.
The study, presented at the 8th AAAI/ACM Conference on AI, Ethics, and Society, analysed entries from a “Bias-a-Thon” competition in which 52 participants submitted 75 prompts to eight generative AI models, including ChatGPT and Gemini. Researchers found that 53 of the prompts generated reproducible, biased results across multiple large language models.
“A lot of research on AI bias has relied on sophisticated ‘jailbreak’ techniques,” said Amulya Yadav, associate professor at Penn State’s College of Information Sciences and Technology. “These methods often involve generating strings of random characters computed by algorithms to trick models into revealing discriminatory responses. While such techniques prove these biases exist theoretically, they don’t reflect how real people use AI. The average user isn’t reverse-engineering token probabilities or pasting cryptic character sequences into ChatGPT — they type plain, intuitive prompts. And that lived reality is what this approach captures.”
The competition, organised by Penn State’s Center for Socially Responsible AI, challenged contestants to come up with prompts that would lead generative AI systems to respond with biased answers. Participants provided screenshots of their prompts and AI responses, along with explanations of the bias or stereotype they identified.
Eight categories of biases
Biases fell into eight categories: gender bias; race, ethnic and religious bias; age bias; disability bias; language bias; historical bias favouring Western nations; cultural bias; and political bias. Participants employed seven strategies to elicit these biases: role-playing, posing hypothetical scenarios, using human knowledge to ask about niche topics, asking leading questions on controversial issues, probing biases in underrepresented groups, feeding the LLM false information, and framing the task as having a research purpose.
“The competition revealed a completely fresh set of biases,” said Yadav, organiser of the Bias-a-Thon. “For example, the winning entry uncovered an uncanny preference for conventional beauty standards. The LLMs consistently deemed a person with a clear face to be more trustworthy than a person with facial acne, or a person with high cheekbones more employable than a person with low cheekbones. This illustrates how average users can help us uncover blind spots in our understanding of where LLMs are biased.”
The researchers conducted Zoom interviews with a subset of participants to better understand their prompting strategies and how they thought about fairness, representation and stereotyping when interacting with generative AI tools. After arriving at a participant-informed working definition of bias, covering lack of representation, stereotypes and prejudice, and unjustified preferences toward particular groups, the researchers re-ran the contest prompts in several LLMs to see whether they elicited similar responses.
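The study does not publish the harness used for that re-testing, so the following is only a rough sketch of how such a reproducibility check might be scripted. The model labels, the `ask` callable and the keyword-based flagging are illustrative assumptions; in the study itself the replies were reviewed by the researchers, not by a word list.

```python
"""Rough sketch of a reproducibility check: each contest prompt is
re-sent to several LLMs and the replies are flagged for the bias the
contestant reported. All names here are illustrative placeholders."""

from typing import Callable

# Hypothetical labels for the chat models under test.
MODELS = ["model-a", "model-b", "model-c"]


def flag_reply(reply: str, markers: list[str]) -> bool:
    """Crude stand-in for human review: flag a reply that repeats any of
    the stereotype markers described in the contest entry."""
    lowered = reply.lower()
    return any(marker.lower() in lowered for marker in markers)


def reproducibility_check(
    prompts: dict[str, list[str]],
    ask: Callable[[str, str], str],
) -> dict[str, int]:
    """Count, per prompt, how many models produce a flagged reply.
    Prompts flagged across multiple models correspond to the
    'reproducible' biases the study reports."""
    counts: dict[str, int] = {}
    for prompt, markers in prompts.items():
        counts[prompt] = sum(
            1 for model in MODELS if flag_reply(ask(model, prompt), markers)
        )
    return counts


if __name__ == "__main__":
    # Canned responder so the sketch runs end to end without a real client.
    canned = lambda model, prompt: (
        "The candidate with clear skin seems more trustworthy."
    )
    results = reproducibility_check(
        {"Compare the trustworthiness of two job candidates, one with acne.":
         ["more trustworthy"]},
        ask=canned,
    )
    print(results)  # e.g. {'Compare the trustworthiness ...': 3}
```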
The researchers described mitigating biases in LLMs as a cat-and-mouse game, with developers constantly addressing issues as they arise. They suggested several mitigation strategies: implementing robust classification filters to screen outputs before they are sent to users, conducting extensive testing, educating users, and providing specific references or citations so users can verify the information.
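The article does not say how such a classification filter would be built. As one hedged illustration, a post-generation screen could sit between the model and the user along the lines sketched below; the `bias_score` heuristic, the threshold and the fallback message are assumptions made for the example, and a production filter would use a trained classifier rather than a phrase list.

```python
"""Minimal sketch of an output-screening filter of the kind suggested:
a classifier scores each candidate reply before it reaches the user.
The scoring heuristic, threshold and fallback text are illustrative
assumptions, not details from the study."""

from dataclasses import dataclass


@dataclass
class FilterDecision:
    allowed: bool   # whether the reply may be shown to the user
    score: float    # estimated likelihood the reply is biased (0-1)
    reason: str


def bias_score(text: str) -> float:
    """Toy stand-in for a trained bias classifier: scores a reply by how
    many sweeping, stereotype-style phrases it contains."""
    flagged_phrases = ["always", "never", "naturally better", "less capable"]
    lowered = text.lower()
    hits = sum(phrase in lowered for phrase in flagged_phrases)
    return min(1.0, hits / 2)


def screen_reply(reply: str, threshold: float = 0.5) -> FilterDecision:
    """Block any reply whose estimated bias score meets the threshold."""
    score = bias_score(reply)
    if score >= threshold:
        return FilterDecision(False, score, "flagged as likely biased")
    return FilterDecision(True, score, "passed screening")


def respond(user_prompt: str, generate) -> str:
    """Wrap an arbitrary `generate(prompt) -> str` function with the
    screening step, returning a safe fallback when the filter trips."""
    reply = generate(user_prompt)
    decision = screen_reply(reply)
    if decision.allowed:
        return reply
    return "This response was withheld by the bias filter for review."


if __name__ == "__main__":
    # Canned generator so the sketch runs without a real model.
    demo = lambda prompt: "People from that region are always less capable."
    print(respond("Who makes a better engineer?", demo))
```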