Helen Salisbury: AI medical chatbots—more hype than help

15 May 2026

The BMJ

Artificial intelligence has been shown to outperform doctors both in medical exams, with ChatGPT achieving more than 95% accuracy in the US Medical Licensing Exam, and in correctly identifying conditions from written scenarios. A recent study has shown, however, that when chatbots are required to interact with real people, they arrive at the right diagnosis in only 35% of cases.

This is an important finding, as people are increasingly turning to AI for medical advice. As doctors, we’re familiar with the difference between the effects of a drug “in vitro” and “in vivo,” as some treatments work well in a test tube or petri dish but not in humans, and it seems that we now have to add “in silico” to our vocabulary. If you test chatbots with AI-simulated patients (a common way of testing them), they are, unsurprisingly, much better at reaching the correct diagnosis than when interacting with real people. It’s easy to imagine why: patients don’t necessarily tell you all their relevant symptoms in the opening sentence, and they may include all kinds of distracting details because they don’t know what is and isn’t important. The doctor’s skill in judging what to include and what to set aside requires a level of clinical reasoning that’s not available in a large language model (LLM).

The problem isn’t solely—or even mostly—about reaching a correct diagnosis but also about judging risk and acuity. An important finding of this AI study was the alarmingly low rate (43%) of correct advice about what the patient should do next. When choosing options ranging from self-care to calling an ambulance, the chatbots tended to underestimate the potential seriousness of symptoms and the need for action.

When AI is trained and tested on constructed medical cases and exam questions, there’s usually a right answer. In real life we occasionally stumble upon a “textbook example” of a condition that exactly mirrors what we were taught in medical school—but much more often we’re offered a collection of symptoms that point to several different diagnoses, as well as some that don’t fit any condition we know. This sifting and sorting and setting-aside-for-now is an integral part of our consultations, followed by decisions about what to investigate and when it’s safe to leave symptoms without a medical explanation.

The conclusion of this AI research was that, although the chatbots were successful at identifying conditions and appropriate actions, patients struggled to interact with them. Tellingly, one of the study authors’ takeaway questions is about identifying “why humans fail when interacting with LLM based tools.” Is it really the humans who are failing?
