May 5, 2026 · Updated 11:26 PM UTC
Health

AI models struggle with clinical reasoning despite high diagnostic accuracy

A study of 21 large language models found that while AI can reach correct diagnoses with complete data, it fails to perform the critical reasoning required for differential diagnosis.

Lucía Paredes

2 min read

A physician analyzing clinical data with artificial intelligence assistance.

Researchers at Mass General Brigham have found that even the most advanced large language models (LLMs) lack the clinical reasoning necessary for unsupervised medical use.

The study, published in JAMA Network Open, tested 21 different AI models including versions of ChatGPT, Claude, Gemini, and Grok by asking them to navigate complex clinical scenarios.

While the models achieved a correct final diagnosis more than 90% of the time when provided with all patient information, they consistently failed at the early stages of the diagnostic process. Specifically, the models struggled to develop a "differential diagnosis," the list of potential conditions that require further testing.

"Differential diagnoses are central to clinical reasoning and underlie the ‘art of medicine’ that AI cannot currently replicate," said Marc Succi, MD, executive director of the MESH Incubator at Mass General Brigham.

The PrIME-LLM Benchmark

To measure these shortcomings, the research team developed a new metric called PrIME-LLM. Unlike previous assessments that focused solely on accuracy, this new measure evaluates competency across the entire diagnostic workflow, including identifying potential diagnoses, ordering appropriate tests, and managing treatment.

Researchers simulated real-world medical encounters by feeding models information incrementally. They began with basic patient demographics and symptoms before adding physical exam findings and laboratory results.
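This staged-disclosure design can be illustrated with a short sketch. The code below is not the study's actual benchmark; it is a hypothetical mock-up in which a stand-in function (`toy_model`) plays the role of an LLM, and information is revealed one stage at a time while checking whether the correct diagnosis remains in the model's differential.

```python
# Illustrative sketch only: evaluating a model's differential diagnosis
# as patient information is revealed incrementally, mirroring the
# staged-disclosure design described in the article. The model and
# diagnoses here are invented stand-ins, not the study's data.

STAGES = [
    "demographics and presenting symptoms",
    "physical exam findings",
    "laboratory results",
]

def toy_model(information_so_far):
    """Stand-in for an LLM: returns a differential diagnosis.
    A real benchmark would call a model API at this point."""
    # The stub simply narrows its differential as more data arrives.
    differentials = {
        1: ["pneumonia", "heart failure", "pulmonary embolism"],
        2: ["pneumonia", "heart failure"],
        3: ["pneumonia"],
    }
    return differentials[len(information_so_far)]

def staged_evaluation(stages, model, true_diagnosis):
    """Reveal information stage by stage; record whether the true
    diagnosis appears in the model's differential at each step."""
    revealed, hits = [], []
    for stage in stages:
        revealed.append(stage)
        differential = model(revealed)
        hits.append(true_diagnosis in differential)
    return hits

print(staged_evaluation(STAGES, toy_model, "pneumonia"))
```

Scoring the differential at every stage, rather than only the final answer, is what distinguishes this kind of workflow evaluation from accuracy-only testing.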

Dr. Succi noted that the technology is currently better suited for augmenting doctors rather than replacing them. "The promise of AI in clinical medicine continues to lie in its potential to augment, not replace, physician reasoning, provided all the relevant data is available—not always the case," he said.

The study concludes that because models perform poorly when information is missing or incomplete, they are not yet ready for clinical-grade deployment without human oversight.
