In today’s interconnected world, where information is just a few clicks away, generative AI chatbots built on large language models (LLMs) have emerged as powerful tools, revolutionizing how we access and consume information across sectors including research, education, business, marketing, and even medicine. These programs are designed to understand and generate human-like text, making them remarkably versatile: from drafting emails and articles to providing customer support and assisting with complex coding tasks, LLMs have streamlined workflows, enhanced productivity, and put information within easy reach. Many people have adopted them as a primary source for quick answers, often treating them as advanced search engines, even for sensitive queries about health and medical advice. The appeal lies in their accessibility, speed, and the perception that they offer comprehensive responses, making them a go-to resource for everyday questions. However, as with any rapidly evolving technology, particularly in a critical domain like healthcare, their widespread adoption necessitates a thorough examination of their accuracy, reliability, and potential impact on public well-being.
Recent research has cast a critical eye on the reliability of medical information disseminated by these popular AI chatbots, uncovering some concerning limitations. A study published in the open-access journal BMJ Open found that a substantial portion of the medical information provided by five widely used chatbots (Gemini from Google, DeepSeek from High-Flyer, Meta AI from Meta, ChatGPT from OpenAI, and Grok from xAI) was inaccurate or incomplete. The researchers posed a series of clear, evidence-based questions and found that a staggering half of the answers were “somewhat” or “highly” problematic: responses that, if followed without professional guidance, could steer users toward ineffective treatments or even cause harm. The study highlights a critical gap between the perceived authority of AI-generated responses and their actual scientific validity. The researchers warn that continued deployment of these chatbots without adequate public education and robust oversight mechanisms risks amplifying misinformation, with potentially serious consequences for public health.
To understand the extent of this issue, the researchers designed their investigation to probe areas of health and medicine that are particularly susceptible to misinformation and that directly influence everyday health behaviors. They challenged each of the five chatbots with 10 questions, a mix of open-ended and closed prompts, in each of five categories: cancer, vaccines, stem cells, nutrition, and athletic performance. The questions were crafted to mimic common “information-seeking” health and medical queries frequently encountered online, and to reflect misinformation tropes prevalent in both online discussion and academic discourse. This design was intended to “stress test” the AI models, pushing them toward generating potentially misleading or contraindicated advice, a strategy increasingly used to identify behavioral vulnerabilities in AI systems. Closed prompts demanded specific, factual answers aligned with scientific consensus, often with a single correct response. Open-ended prompts, in contrast, required the chatbots to generate multiple answers, typically in list form, allowing a broader range of potential inaccuracies to surface.
The evaluation process categorized chatbot responses as “non-problematic,” “somewhat problematic,” or “highly problematic,” based on objective, pre-defined criteria. A response was deemed problematic if it could plausibly direct a lay user toward ineffective treatment, or lead to harm if acted upon without the oversight of a healthcare professional. Beyond accuracy, the study also scrutinized the completeness of the information and paid particular attention to false balance: instances where a chatbot weighed scientifically proven facts against unproven or non-scientific claims, or presented conflicting information without distinguishing the strength of the supporting evidence. Each response was also graded for readability using the Flesch Reading Ease score, which assesses whether language is plain and easy to read or closer to difficult, academic prose. This comprehensive approach allowed a multifaceted understanding of the quality and potential impact of the information these AI systems generate.
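The Flesch Reading Ease score used for this grading is a simple formula over sentence and word lengths: 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word), where lower scores indicate harder text (roughly 30–50 corresponds to “difficult,” college-level prose). A minimal sketch of how it can be computed, using a crude vowel-group heuristic for syllable counting (real implementations use dictionaries or more careful heuristics):

```python
import re

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores mean easier text; 30-50 is 'difficult' (college level).
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0

    def count_syllables(word):
        # Crude heuristic: count vowel groups, ignoring a trailing silent 'e'.
        groups = re.findall(r"[aeiouy]+", word.lower().rstrip("e"))
        return max(1, len(groups))

    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))
```

Short, monosyllabic sentences score above 90 (“very easy”), while long sentences full of polysyllabic words drop well below 50, the band the study’s responses fell into.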
The study unearthed some genuinely concerning statistics: 50% of the responses were problematic overall, with 30% categorized as “somewhat problematic” and a further 20% as “highly problematic.” The type of prompt strongly influenced the quality of the responses. Open-ended questions produced a disproportionately high number of “highly problematic” responses (40) and significantly fewer “non-problematic” answers, while closed prompts tended to elicit more accurate information. Although overall quality did not vary drastically among the five chatbots, Grok generated the most “highly problematic” responses (58%), whereas Gemini performed best, producing the fewest such errors and the most accurate answers. Interestingly, the chatbots performed relatively well on vaccines and cancer but struggled considerably with stem cells, athletic performance, and nutrition, areas often rife with complex and evolving science. A particularly troubling observation was the chatbots’ consistent confidence and certainty, with very few caveats or disclaimers. Out of 250 questions, only two refusals to answer were recorded, both from Meta AI in response to queries about anabolic steroids and alternative cancer treatments, highlighting a general reluctance to decline even when the information might be dubious. Moreover, the quality of references was abysmal, with an average completeness score of only 40%; the study exposed rampant “chatbot hallucinations” and fabricated citations, and not a single chatbot managed to provide a fully accurate reference list. Compounding these issues, all responses were graded as “difficult,” comparable in complexity to college-level text, raising concerns about their accessibility to the general public.
While acknowledging the limitations of their study—namely, the assessment of only five chatbots and the rapid evolution of commercial AI—the researchers firmly assert that their findings underscore critical behavioral limitations that demand a reevaluation of how AI chatbots are integrated into public-facing health and medical communication. They emphasize that chatbots, by their very nature, do not access real-time data or engage in genuine reasoning or evidence weighing. Instead, they generate outputs by recognizing statistical patterns from their training data and predicting likely word sequences. This fundamental operational model means they are incapable of making ethical or value-based judgments, leading to a propensity to “reproduce authoritative-sounding but potentially flawed responses.” The data informing these chatbots often includes less-vetted sources like Q&A forums and social media, and scientific content is typically restricted to open-access or publicly available articles, which represent only a fraction (30-50%) of published studies. This selective data diet, while enhancing conversational fluency, comes at the significant cost of scientific accuracy. The researchers conclude with a resolute call to action: as AI chatbot utilization continues to expand, there is an urgent need for widespread public education, comprehensive professional training, and robust regulatory oversight. These measures are crucial to ensure that generative AI truly supports and enhances public health, rather than inadvertently undermining it by propagating misinformation.
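The “predicting likely word sequences” behavior the researchers describe can be illustrated with a deliberately tiny toy model (purely illustrative, nothing like a production LLM): a bigram counter that continues a sentence with whichever word most often followed the previous one in its training text, regardless of whether the resulting claim is true.

```python
from collections import Counter, defaultdict

# Toy bigram "language model": it emits whichever word most often
# followed the previous word in its (made-up) training text. The point
# is that frequency in the data, not truth, drives the output.
training_text = (
    "vitamin c cures colds . vitamin c boosts immunity . "
    "vitamin d supports bones ."
).split()

follows = defaultdict(Counter)
for prev, nxt in zip(training_text, training_text[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    # The most frequent continuation wins (ties broken by first seen).
    return follows[word].most_common(1)[0][0]

print(predict_next("vitamin"))  # "c": it appeared most often after "vitamin"
```

If the training data repeats a claim often enough, the model will confidently continue with it; a neural LLM does the same thing at vastly larger scale and with far richer context, which is why fluent, authoritative-sounding output is no guarantee of accuracy.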

