Just add humans: Oxford medical study underscores the missing link in chatbot testing

Jun 13, 2025 | Technology


Headlines have been blaring it for years: Large language models (LLMs) can not only pass medical licensing exams but also outperform humans. GPT-4 could correctly answer U.S. medical licensing exam questions 90% of the time, even in the prehistoric AI days of 2023. Since then, LLMs have gone on to best both the residents taking those exams and licensed physicians.

Move over, Doctor Google, make way for ChatGPT, M.D. But you may want more than a diploma from the LLM you deploy for patients. Like an ace medical student who can rattle off the name of every bone in the hand but faints at the first sight of real blood, an LLM’s mastery of medicine does not always translate directly into the real world.

A paper by researchers at the University of Oxford found that while LLMs could correctly identify relevant conditions 94.9% of the time when directly presented with test scenarios, human participants using LLMs to diagnose the same scenarios identified the correct conditions less than 34.5% of the time.

Perhaps even more notably, patients using LLMs performed even worse than a control group that was merely instructed to diagnose themselves using “any methods they would typically employ at home.” The group left to their own devices was 76% more likely to identify the correct conditions than the group assisted by LLMs.

The Oxford study raises questions about the suitability of LLMs for medical advice and the benchmarks we use to evaluate chatbot deployments for various applications.

Guess your malady

Led by Dr. Adam Mahdi, researchers at Oxford recruited 1,298 participants to present themselves as patients to an LLM. They were tasked with both attempting to figure out what ailed them and the appropriate level of care to seek for it, ranging from self-care to calling an ambulance.

Each participant received a detailed scenario, representing conditions from pneumonia to the common cold, along with general life details and medical history. For instance, one scenario describes a 20-year-old engineering student who develops a crippling headache on a night out with friends. It includes import …
