They know. But they don't know how to talk to people. That, in essence, is the finding of a ground-breaking Oxford University study into how large language models (LLMs) perform on medical problems. When questioned in a direct, structured manner, models such as GPT-4o correctly identify the pathology in 95% of cases. But as soon as a human enters the loop - an average patient with his or her own words and hesitations - the results plummet. What does this mean for the health sector, and more broadly for the nutrition sector? Here is an overview.
A large-scale study: 1,298 participants, 10 medical scenarios, 3 AIs tested
The study, entitled "Clinical knowledge in LLMs does not translate to human interactions", was conducted by the Oxford Internet Institute in partnership with several British and American health institutions. It had a clear objective: to test the ability of LLMs to help non-expert patients make the right decisions in ten common medical situations (from migraine to pneumothorax).
The participants (representative of the British population) were divided into four groups:
- a control group, using the usual methods (Google searches, intuition, personal experience),
- three test groups, each using an LLM (GPT-4o from OpenAI, LLaMA 3 from Meta, or Command R+ from Cohere).
Each person had to answer two questions for a given scenario:
- What decision should be taken (emergency, primary care, self-treatment, etc.)?
- What condition(s) do you think are involved?
Results: models shine... without humans
When we give scenarios directly to AI (without a human intermediary), the results are impressive:
- a 94.7% correct response rate for GPT-4o in pathology identification (99.2% for LLaMA 3).
- approximately 64.7% correct recommendations on the appropriate course of action for GPT-4o.
But when real participants interact with the AI:
- fewer than 34.5% of them correctly identify at least one relevant pathology,
- and fewer than 44.2% choose an appropriate course of action, which is no better than the control group (without AI).
In other words: AI knows, but is not understood.
Where do things go wrong? Language, prompts, trust
Analysis of the exchanges between participants and the LLMs reveals several major obstacles:
- Poorly worded prompts: participants often left out key information (location of the pain, context, duration, etc.), limiting the relevance of the AI's response.
- Unclear or incomplete answers: some models replied vaguely, even when the correct suggestions appeared somewhere in the dialogue.
- Inefficient sorting: users don't always know how to extract the right information from the answers provided.
In 65% to 73% of cases, the AIs suggested at least one correct pathology in the dialogue... but this information was not retained in the user's final response.
And yet... AI outperforms existing benchmarks
The researchers compared the models' results on the simulated cases with their scores on questions from MedQA, a standard benchmark derived from American medical examinations. The result:
- LLMs comfortably exceed 80% correct answers on MedQA.
- But in the experiment with humans, these scores correlate poorly with real users' ability to put the models to good use.
Similarly, tests carried out with "simulated patients" (other AIs playing the role of the patient) gave much higher scores than those of real participants. Conclusion: benchmarks and simulated-user tests are not enough to predict actual performance in real-life situations.
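To make that gap concrete, here is a minimal sketch in Python (with invented data, not the study's own code) of the difference between scoring the model's own answer on a benchmark-style question and scoring what the human user ultimately concludes after the dialogue:

```python
# Hypothetical illustration: benchmark accuracy vs. end-user outcome.
# The case, condition and answers below are invented for the example.
from dataclasses import dataclass

@dataclass
class Case:
    scenario: str          # vignette given to the model or the participant
    gold_condition: str    # reference diagnosis
    gold_disposition: str  # reference course of action (e.g. "emergency")

def benchmark_score(model_answers, cases):
    """MedQA-style scoring: compare the model's own answers to the reference."""
    hits = sum(a == c.gold_condition for a, c in zip(model_answers, cases))
    return hits / len(cases)

def interaction_score(user_final_answers, cases):
    """Deployment-style scoring: compare what the user concluded after the dialogue."""
    hits = sum(a == c.gold_condition for a, c in zip(user_final_answers, cases))
    return hits / len(cases)

cases = [Case("Sudden chest pain and breathlessness", "pneumothorax", "emergency")]

print(benchmark_score(["pneumothorax"], cases))     # 1.0: asked directly, the model is right
print(interaction_score(["muscle strain"], cases))  # 0.0: the user walked away with something else
```

The two scores measure different things: the first only proves the model holds the knowledge; the second measures whether that knowledge actually reached the person who needed it.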
What lessons for health... and beyond?
This study highlights an essential truth: the raw performance of an LLM is not enough. The value lies in the interaction between human and machine.
This raises crucial issues for companies in the health sector, and in particular for the functional and nutritional ingredients sector:
- Educating users: if professionals (and soon patients) have to interact with AI to obtain a diagnosis, they need to be taught how to formulate the right prompts. This also involves nutritional and medical literacy.
- Controlling use: the tools must incorporate safeguards against misinterpretation, offering to rephrase and even asking additional questions to clarify symptoms (a minimal sketch follows this list).
- Focusing on intelligent interfaces: future B2C self-assessment tools (diets, functional recommendations, monitoring of digestive or immune symptoms, etc.) will need to incorporate a layer of educational and even emotional guidance.
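As an illustration of the safeguards mentioned above, here is a minimal sketch, with invented keyword lists and no real medical logic, of a pre-check that asks a follow-up question instead of answering a prompt that omits the details the study found were most often missing (location, duration, context):

```python
# Illustrative sketch of a clarifying-question safeguard.
# The keyword lists are invented and far too crude for real triage.
REQUIRED_DETAILS = {
    "location": ["head", "chest", "stomach", "back", "arm", "leg"],
    "duration": ["minute", "hour", "day", "week", "since"],
    "context":  ["after", "during", "while", "when"],
}

def missing_details(user_message):
    """Return the key details that seem absent from the user's description."""
    text = user_message.lower()
    return [detail for detail, cues in REQUIRED_DETAILS.items()
            if not any(cue in text for cue in cues)]

def call_llm(prompt):
    """Placeholder for the actual model call."""
    return "(model answer)"

def respond(user_message):
    gaps = missing_details(user_message)
    if gaps:
        # Ask a follow-up instead of answering a vague prompt.
        return "Could you tell me more about the " + ", ".join(gaps) + " of your symptoms?"
    return call_llm(user_message)

print(respond("I have a really bad pain"))
# -> asks about the location, duration and context before answering
```

In a real product the model itself, or a validated triage protocol, would decide what to ask; the point is simply that the interface should not treat a vague prompt as if it were complete.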
And what about ingredients?
In the world of health ingredients, where promises often revolve around prevention, well-being or support for chronic illnesses, this study serves as a reminder of a fundamental principle: user-perceived understanding is more important than scientific accuracy alone.
So:
- Brands developing nutritional chatbots or interactive tools must test their solutions with real users, not just simulated scenarios.
- The vocabulary used (e.g. microbiota, inflammation, intestinal permeability) must be adapted to real levels of understanding, even if this means simplifying it without betraying the science (a minimal sketch follows this list).
- LLMs can become a powerful advisory tool for healthcare professionals (pharmacists, dieticians) if they are trained and used in a supporting role, not for self-diagnosis.
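On the vocabulary point, here is a minimal sketch (the glossary entries are illustrative, not validated phrasing) of how a tool could systematically swap technical terms for plainer wording before showing an answer to a consumer:

```python
import re

# Illustrative glossary only: real wording should be validated with users
# and reviewed for scientific accuracy.
PLAIN_LANGUAGE = {
    "intestinal permeability": "a 'leaky' gut lining",
    "microbiota": "gut bacteria",
    "inflammation": "irritation in the body",
}

def simplify(text):
    """Swap technical terms for plainer wording, case-insensitively."""
    for term, plain in PLAIN_LANGUAGE.items():
        text = re.sub(term, plain, text, flags=re.IGNORECASE)
    return text

print(simplify("This fibre supports the microbiota and may reduce intestinal permeability."))
# -> "This fibre supports the gut bacteria and may reduce a 'leaky' gut lining."
```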
Conclusion: AI + human is more complex than expected
This Oxford study is a strong signal: LLMs are not (yet) pocket doctors. Their use must be supported, contextualised and tested in real-life conditions. For the health and nutrition industries, the challenge is not just what the AI knows, but how it says it, to whom, and with what effect.
The right answer is not enough. You have to ask the right question.
Source:
Bean, A. M., et al. (2025). Clinical knowledge in LLMs does not translate to human interactions. arXiv:2504.18919v1 - https://arxiv.org/abs/2504.18919
