[Image: AI doctor. Photo credit: theFreesheet/Google Flow]

Generative AI might be able to pass complex medical exams, but it still lacks the fundamental human reasoning needed to safely diagnose patients in real-world settings.

According to a major new study from Mass General Brigham, publicly available chatbots like ChatGPT and Gemini consistently fail at “differential diagnoses” — the crucial early stage of medical investigation where doctors must narrow down a list of potential ailments using incredibly limited information.

Published in the journal JAMA Network Open, the research tested 21 large language models (LLMs) to determine whether they could successfully play doctor across 29 published clinical scenarios.

The human art of medicine

To simulate a real-world hospital visit, the researchers fed the AI models patient data incrementally. The chatbots were first given basic details such as age, gender, and initial symptoms, before gradually receiving physical examination findings and laboratory results.
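This staged-disclosure setup can be pictured as a prompting loop in which the model must commit to a working differential after each new tranche of information. The sketch below is a minimal illustration only: the clinical details and the query_model stub are assumptions for demonstration, not the study’s actual data or code.

```python
# Minimal sketch of a staged-disclosure evaluation loop.
# All clinical details and the query_model stub are illustrative
# assumptions, not the study's actual materials.

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a chatbot API."""
    return "ranked differential: ..."  # placeholder response

# Information is revealed in the order a clinician would encounter it.
stages = [
    "Presentation: 54-year-old woman with acute chest pain.",    # history only
    "Physical exam: BP 90/60, diaphoretic, lungs clear.",        # exam findings
    "Results: troponin elevated, ECG shows ST-segment changes.", # labs/imaging
]

context = ""
for stage in stages:
    context += stage + "\n"
    # The model must commit to a ranked differential before it
    # sees the next tranche of information.
    answer = query_model(context + "\nList the most likely diagnoses, ranked.")
    print(f"After '{stage.split(':')[0]}': {answer}")
```

The study’s finding, put in these terms, is that answers from the early iterations of such a loop were unreliable, while answers from the final iteration were usually correct.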

The study revealed that when the AI was eventually provided with all the necessary medical data, it arrived at the correct final diagnosis more than 90 per cent of the time. During the initial stages, however, when information was scarce, every single model failed to produce an appropriate differential diagnosis in more than 80 per cent of cases.

Dr Marc Succi, executive director of the MESH Incubator at Mass General Brigham and the study’s corresponding author, warned that the technology is far from ready for unsupervised deployment.

Dr Succi said: “Differential diagnoses are central to clinical reasoning and underlie the ‘art of medicine’ that AI cannot currently replicate. The promise of AI in clinical medicine continues to lie in its potential to augment, not replace, physician reasoning, provided all the relevant data is available – not always the case.”

Benchmarking the chatbots

To better measure this blind spot, the research team developed PrIME-LLM, a novel scoring system designed to evaluate an AI’s competency across the entire diagnostic process rather than just grading its final answer.

Testing the latest major models available, the researchers found PrIME-LLM scores ranged from 64 per cent for Gemini 1.5 Flash up to 78 per cent for Grok 4 and GPT-5.
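The article does not spell out how PrIME-LLM weights each stage, but a stepwise metric of this general shape can be pictured as spreading credit across the whole case rather than concentrating it on the endpoint. The toy function below is an assumption about that general shape only, not the study’s actual formula.

```python
# Toy illustration of stepwise scoring in the spirit of PrIME-LLM.
# Equal weighting per stage is an assumption; the paper's actual
# rubric is not described in this article.

def stepwise_score(stage_correct: list[bool]) -> float:
    """Average credit across all diagnostic stages, not just the final answer."""
    return sum(stage_correct) / len(stage_correct)

# A model that only nails the final diagnosis scores poorly...
print(stepwise_score([False, False, False, True]))  # 0.25
# ...while one that reasons well at every stage scores highly.
print(stepwise_score([True, True, True, True]))     # 1.0
```

Under any such metric, a model that is strong only once all the data has arrived cannot hide that weakness behind a correct final answer.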

Arya Rao, the study’s lead author and an MD-PhD student at Harvard Medical School, noted that evaluating the models step by step exposes fundamental weaknesses in their reasoning.

Rao explained: “By evaluating LLMs in a stepwise fashion, we move past treating them like test-takers and put them in the position of a doctor. These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information.”

The research team concluded that any integration of AI into modern healthcare must keep a “human in the loop” to guarantee patient safety and provide the vital reasoning that machines still lack.
