{"id":577,"date":"2025-06-19T10:59:00","date_gmt":"2025-06-19T08:59:00","guid":{"rendered":"https:\/\/www.nutrimedia.info\/?post_type=news&#038;p=577"},"modified":"2025-06-16T15:22:42","modified_gmt":"2025-06-16T13:22:42","slug":"oxford-lia-medical-study-fails-when-the-patient-speaks","status":"publish","type":"news","link":"https:\/\/www.nutrimedia.info\/en\/news\/etude-oxford-lia-medicale-echoue-quand-cest-le-patient-qui-parle\/","title":{"rendered":"When the right prompt saves the day: what an Oxford study on AI and medical diagnosis (really) reveals"},"content":{"rendered":"<p><strong>They know how. But they don't know how to talk to people. That, in essence, is the finding of a ground-breaking study by Oxford University into the performance of large language models (LLMs) in dealing with medical problems. When questioned in a direct and structured manner, models such as GPT-4o correctly diagnose pathology in 95 % of cases. But as soon as you introduce a human into the loop - an average patient with his or her words and hesitations - the results plummet. What does this mean for the health sector, and more broadly for the nutrition sector? Here's how.<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><br><strong>A large-scale study: 1298 participants, 10 medical scenarios, 3 AI tested<\/strong><\/h2>\n\n\n\n<p>The study entitled\u00ab\u00a0<em>Clinical knowledge in LLMs does not translate to human interactions<\/em>\u00a0\u00bbwas conducted by the<strong><a href=\"https:\/\/www.ox.ac.uk\/\" target=\"_blank\" rel=\"noreferrer noopener\">\u2019Oxford Internet Institute,<\/a><\/strong> in partnership with several British and American health institutions. It had a clear objective:<strong> test the ability of LLMs to help non-expert patients make the right decisions in ten common medical situations (from migraine to pneumothorax).<\/strong><\/p>\n\n\n\n<p>The participants (representative of the British population) were divided into four groups:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>a control group<\/strong>, using the usual methods (Google searches, intuition, personal experience),<br><\/li>\n\n\n\n<li><strong>three test groups<\/strong>, each using an LLM (GPT-4o from OpenAI, LLaMA 3 from Meta, or Command R+ from Cohere).<br><\/li>\n<\/ul>\n\n\n\n<p>Each person had to answer <strong>two questions for a given scenario<\/strong> :<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>What decision should be taken (emergency, primary care, self-treatment, etc.)?<br><\/li>\n\n\n\n<li>What condition(s) do you think are involved?<br><\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Results: models shine... 
without humans<\/strong><\/h2>\n\n\n\n<p>When the <strong>scenarios are given directly to the AI<\/strong> (without a human intermediary), <strong>the results are impressive:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>94.7%<\/strong> correct identification of the pathology for GPT-4o (99.2% for LLaMA 3).<br><\/li>\n\n\n\n<li>Approximately <strong>64.7%<\/strong> of correct care recommendations for GPT-4o.<br><\/li>\n<\/ul>\n\n\n\n<p><strong>But when patients interact with the AI themselves:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>fewer than 34.5%<\/strong> of participants correctly identify at least one relevant pathology.<br><\/li>\n\n\n\n<li>And fewer than <strong>44.2%<\/strong> choose an appropriate course of care, which is <strong>no better<\/strong> than the control group (without AI).<br><\/li>\n<\/ul>\n\n\n\n<p>In other words: <strong>the AI knows, but is not understood.<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Where do things go wrong? Language, prompts, trust<\/strong><\/h2>\n\n\n\n<p>Analysis of the exchanges between participants and the AIs reveals several major obstacles:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Poorly worded prompts:<\/strong> participants often omit key information (location of the pain, context, duration, etc.), limiting the relevance of the AI's response.<br><\/li>\n\n\n\n<li><strong>Unclear or incomplete answers<\/strong> from certain models, even when correct suggestions appeared in the dialogue.<br><\/li>\n\n\n\n<li><strong>Inefficient triage:<\/strong> users don't always know how to extract the right information from the answers provided.<br><\/li>\n<\/ul>\n\n\n\n<p>In <strong>65% to 73%<\/strong> of cases, the AIs suggested at least one correct pathology during the dialogue... but this information was not carried over into the user's final answer.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>And yet... AI outperforms existing benchmarks<\/strong><\/h2>\n\n\n\n<p>The researchers compared the models' results on the simulated cases with their performance on <a href=\"https:\/\/paperswithcode.com\/dataset\/medqa-usmle\" target=\"_blank\" rel=\"noreferrer noopener\">MedQA<\/a>, a standard benchmark derived from American medical licensing examinations. The result:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The LLMs comfortably exceed 80% correct answers on MedQA.<br><\/li>\n\n\n\n<li>But in the human experiment, <strong>these scores correlate poorly<\/strong> with the ability of real users to put the models to good use.<br><\/li>\n<\/ul>\n\n\n\n<p>Similarly, tests carried out with \"simulated patients\" (other AIs playing the role of the patient) gave much higher scores than those of real participants. Conclusion: <strong>benchmarks and simulated-patient tests are not enough<\/strong> to predict actual performance in real-life situations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What lessons for health... and beyond?<\/strong><\/h2>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>This study puts its finger on an essential truth: <strong>the raw performance of an LLM is not enough<\/strong>. 
<strong>The value lies in the interaction between humans and machines.<\/strong><\/p>\n<\/blockquote>\n\n\n\n<p>This raises crucial issues for companies in the health sector, and in particular for the functional and nutritional ingredients sector:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Educating users:<\/strong> if professionals (and soon patients) have to interact with AI to obtain a diagnosis, <strong>they need to be taught how to formulate the right prompts<\/strong>. This also involves nutritional and medical literacy.<br><\/li>\n\n\n\n<li><strong>Controlling use:<\/strong> the tools must incorporate safeguards against misinterpretation, offering to rephrase and even asking additional questions to clarify symptoms.<br><\/li>\n\n\n\n<li><strong>Investing in intelligent interfaces:<\/strong> future B2C self-assessment tools (diets, functional recommendations, monitoring of digestive or immune symptoms, etc.) will need to incorporate a layer of educational and even emotional guidance.<br><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>And what about ingredients?<\/strong><\/h2>\n\n\n\n<p>In the world of health ingredients, where promises often revolve around prevention, well-being or support for chronic illnesses, this study serves as a reminder of a fundamental principle: <strong>the understanding perceived by the user matters more than scientific accuracy alone<\/strong>.<\/p>\n\n\n\n<p>So:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Brands developing nutritional chatbots or interactive tools must <strong>test their solutions with real users<\/strong>, not just simulated scenarios.<br><\/li>\n\n\n\n<li>The <strong>vocabulary used<\/strong> (e.g. microbiota, inflammation, intestinal permeability) must be adapted to real levels of understanding, even if this means simplifying it without betraying the science.<br><\/li>\n\n\n\n<li>LLMs can become <strong>a powerful advisory tool for healthcare professionals<\/strong> (pharmacists, dieticians) if they are trained and used in a context of support, not self-diagnosis.<br><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Conclusion: AI + human is more complex than expected<\/strong><\/h2>\n\n\n\n<p>This Oxford study sends a strong signal: <strong>LLMs are not (yet) pocket doctors<\/strong>. Their use must be supported, contextualised and tested in real-life conditions. For the health and nutrition industries, the challenge is not just to know what the AI knows, but <strong>how it says it, to whom, and to what effect<\/strong>.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>The right answer is not enough. You have to ask the right question.<\/strong><\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Source:<\/h3>\n\n\n\n<p>Bean, A. M. et al. (2025). <em>Clinical knowledge in LLMs does not translate to human interactions<\/em>. 
arXiv:2504.18919v1 -<a href=\"https:\/\/arxiv.org\/abs\/2504.18919\" target=\"_blank\" rel=\"noreferrer noopener\"> https:\/\/arxiv.org\/abs\/2504.18919<\/a><br><\/p>","protected":false},"template":"","meta":{"_acf_changed":true,"_et_pb_use_builder":"","_et_pb_old_content":"","_et_gb_content_width":"","inline_featured_image":false},"class_list":["post-577","news","type-news","status-publish","hentry"],"acf":[],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.nutrimedia.info\/en\/wp-json\/wp\/v2\/news\/577","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.nutrimedia.info\/en\/wp-json\/wp\/v2\/news"}],"about":[{"href":"https:\/\/www.nutrimedia.info\/en\/wp-json\/wp\/v2\/types\/news"}],"wp:attachment":[{"href":"https:\/\/www.nutrimedia.info\/en\/wp-json\/wp\/v2\/media?parent=577"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}