Medical AI models prioritise helpfulness over accuracy, study finds

Large language models overwhelmingly fail to challenge illogical medical queries, even though they possess the information necessary to do so. Some AI tools comply with requests for false medical information 100 per cent of the time.

Research published in npj Digital Medicine demonstrated that LLMs are designed to be sycophantic, or excessively helpful and agreeable, which leads them to fail appropriately challenging illogical queries. The study found that targeted training and fine-tuning can improve LLMs’ abilities to respond to illogical prompts accurately.

Researchers from Mass General Brigham used a series of simple queries about drug safety to assess the logical reasoning capabilities of five advanced LLMs: three GPT models by OpenAI and two Llama models by Meta. After confirming that the models could always match identical drugs, they fed 50 illogical queries to each LLM, such as prompting them to tell people to take acetaminophen instead of Tylenol despite the drugs being identical.

The models overwhelmingly complied with requests for misinformation, with GPT models obliging 100 per cent of the time. The lowest rate of 42 per cent was found in a Llama model designed to withhold from providing medical advice.

“As a community, we need to work on training both patients and clinicians to be safe users of LLMs, and a key part of that is going to be bringing to the surface the types of errors that these models make,” said Danielle Bitterman, a faculty member in the Artificial Intelligence in Medicine Program and Clinical Lead for Data Science/AI at Mass General Brigham. “These models do not reason like humans do, and this study shows how LLMs designed for general uses tend to prioritise helpfulness over critical thinking in their responses. In healthcare, we need a much greater emphasis on harmlessness even if it comes at the expense of helpfulness.”

The researchers then determined the effects of explicitly inviting models to reject illogical requests and prompting them to recall medical facts before answering a question. Doing both yielded the greatest change to model behaviour, with GPT models rejecting requests to generate misinformation and correctly supplying the reason for rejection in 94 per cent of cases.

The researchers fine-tuned two of the models so that they correctly rejected 99-100 per cent of requests for misinformation and then tested whether the alterations led to over-rejecting rational prompts. The models continued to perform well on 10 general and biomedical knowledge benchmarks.

“It’s very hard to align a model to every type of user,” said Shan Chen of Mass General Brigham’s AIM Program. “Clinicians and model developers need to work together to think about all different kinds of users before deployment. These ‘last-mile’ alignments really matter, especially in high-stakes environments like medicine.”