From the Journals

AI Falls Short on Differential Dx

New PrIME-LLM benchmark shows strong diagnostic accuracy but persistent gaps in clinical reasoning across 21 large language models

By Kathryn Wighton
April 13, 2026

Conexiant

A cross-sectional study published in JAMA Network Open evaluated the diagnostic accuracy of 21 large language models (LLMs) using the Proportional Index of Medical Evaluation for LLMs (PrIME-LLM). The study found that although models such as GPT-5 and Grok 4 achieved high accuracy on final diagnoses, they struggled with differential diagnoses, where the failure rate exceeded 80%. The researchers emphasized the limitations of current LLMs in handling clinical uncertainty and called for evaluation methods that better reflect real-world clinical decision-making.

