May 5, 2026 · Updated 11:26 PM UTC
Health

AI models struggle with clinical reasoning despite high diagnostic accuracy

A study of 21 large language models found that while AI can reach correct diagnoses with complete data, it fails to perform the critical reasoning required for differential diagnosis.

Lucía Paredes

2 min read

A physician analyzing clinical data with artificial intelligence assistance.

Researchers at Mass General Brigham have found that even the most advanced large language models (LLMs) lack the clinical reasoning necessary for unsupervised medical use.

The study, published in JAMA Network Open, tested 21 different AI models including versions of ChatGPT, Claude, Gemini, and Grok by asking them to navigate complex clinical scenarios.

While the models achieved a correct final diagnosis more than 90% of the time when provided with all patient information, they consistently failed at the early stages of the diagnostic process. Specifically, the models struggled to develop a "differential diagnosis," the list of potential conditions that require further testing.

"Differential diagnoses are central to clinical reasoning and underlie the ‘art of medicine’ that AI cannot currently replicate," said Marc Succi, MD, executive director of the MESH Incubator at Mass General Brigham.

The PrIME-LLM Benchmark

To measure these shortcomings, the research team developed a new metric called PrIME-LLM. Unlike previous assessments that focused solely on accuracy, this new measure evaluates competency across the entire diagnostic workflow, including identifying potential diagnoses, ordering appropriate tests, and managing treatment.

Researchers simulated real-world medical encounters by feeding models information incrementally. They began with basic patient demographics and symptoms before adding physical exam findings and laboratory results.
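This staged-disclosure design can be illustrated with a short sketch. The code below is not the study's actual benchmark; it is a hypothetical mock-up in which a stand-in function (`toy_model`) plays the role of an LLM, and information is revealed one stage at a time while checking whether the correct diagnosis remains in the model's differential.

```python
# Illustrative sketch only: evaluating a model's differential diagnosis
# as patient information is revealed incrementally, mirroring the
# staged-disclosure design described in the article. The model and
# diagnoses here are invented stand-ins, not the study's data.

STAGES = [
    "demographics and presenting symptoms",
    "physical exam findings",
    "laboratory results",
]

def toy_model(information_so_far):
    """Stand-in for an LLM: returns a differential diagnosis.
    A real benchmark would call a model API at this point."""
    # The stub simply narrows its differential as more data arrives.
    differentials = {
        1: ["pneumonia", "heart failure", "pulmonary embolism"],
        2: ["pneumonia", "heart failure"],
        3: ["pneumonia"],
    }
    return differentials[len(information_so_far)]

def staged_evaluation(stages, model, true_diagnosis):
    """Reveal information stage by stage; record whether the true
    diagnosis appears in the model's differential at each step."""
    revealed, hits = [], []
    for stage in stages:
        revealed.append(stage)
        differential = model(revealed)
        hits.append(true_diagnosis in differential)
    return hits

print(staged_evaluation(STAGES, toy_model, "pneumonia"))
```

Scoring the differential at every stage, rather than only the final answer, is what distinguishes this kind of workflow evaluation from accuracy-only testing.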

Dr. Succi noted that the technology is currently better suited for augmenting doctors rather than replacing them. "The promise of AI in clinical medicine continues to lie in its potential to augment, not replace, physician reasoning, provided all the relevant data is available—not always the case," he said.

The study concludes that because models perform poorly when information is missing or incomplete, they are not yet ready for clinical-grade deployment without human oversight.
