It’s becoming increasingly common for us to turn to artificial intelligence, and chatbots in particular, when we’re worried about our health. These digital assistants are always at our fingertips, ready to answer questions like “Should I be concerned about this rash?”, “What if this insect bite gets infected?” or “Could this pain be a sign of something more serious?” When it comes to our well-being, a quick response is not enough; it has to be the right one. This growing reliance on AI for medical advice raises a critical question: how do we ensure these systems provide trustworthy information when a wrong answer could have serious consequences? That question has spurred research into how these systems work and, more importantly, how to make them safer and more dependable.
In 2023, researchers at Binghamton University put OpenAI’s ChatGPT to the test. It showed a remarkable ability to correctly identify medical terms, drug names, and genetic information, but there was a significant snag: it also produced many “hallucinations,” confident yet entirely false statements. A medical chatbot that tells you something plausible but untrue is a major hurdle to deploying AI in sensitive applications like healthcare. To address that flaw, a follow-up study was launched with a $100,000 grant from New York state’s Empire AI Consortium, with the goal of eliminating these misleading hallucinations. The research, led by Ahmed Abdeen Hamed, a research fellow at Binghamton University’s Thomas J. Watson College of Engineering and Applied Science, in collaboration with Professor Luis M. Rocha, produced a new verification method. The findings were recently published in the journal STAR Protocols.
The new protocol takes advantage of the growing number of open-source AI options, each with its own approach to processing questions and formulating responses. Hamed and Rocha selected seven prominent large language models (LLMs) from this pool and required every one of them to use Retrieval-Augmented Generation (RAG). This technique makes each chatbot consult an authoritative, verified database of medical terminology before it generates a response, much like a student who must show their sources before answering a question. The researchers then ran more than 10,000 experiments. In each one, the seven chatbots received identical plain-language descriptions of symptoms, and each was asked to identify the corresponding medical terms along with their official identification numbers. The final step was a “vote” among the chatbots, in which the proposed answers were cross-referenced against one another. In the results, 76.85% of the answers were corroborated by at least four of the LLMs, and the remaining 23.15% were supported by at least two. There were no unmatched terms and, most importantly, no “hallucinations.” “The new workflow is incredible because it can verify anything from a biomedical point of view – biological knowledge with disease and genetics, translational knowledge from diseases to treatments and clinical trials, and also from a healthcare point of view with symptoms and treatments,” Hamed said. In these experiments, the combination of grounded retrieval and cross-model voting produced only answers that matched verified terms, a significant step toward making AI a trustworthy partner in healthcare.
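The voting step described above can be illustrated with a small Python sketch. It is not the authors’ implementation: the function names, the placeholder IDs such as `TERM:0001`, and the idea of checking proposals against an in-memory set of verified IDs are all assumptions made for illustration. The thresholds of four and two votes simply mirror the support levels reported in the results.

```python
from collections import Counter


def tally_votes(answers, verified_ids):
    """Count how many models proposed each term ID found in the reference vocabulary.

    answers: dict mapping a model name to the term IDs that model proposed.
    verified_ids: set of IDs in the trusted vocabulary (assumed to be given).
    """
    counts = Counter()
    for proposed in answers.values():
        # set() so a model that repeats an ID still casts only one vote for it
        for term_id in set(proposed):
            if term_id in verified_ids:
                counts[term_id] += 1
    return counts


def bucket_by_support(counts, strong=4, minimum=2):
    """Split terms into strongly supported (>= strong votes) and weakly
    supported (>= minimum but < strong) sets; anything below is dropped."""
    strong_terms = {t for t, n in counts.items() if n >= strong}
    weak_terms = {t for t, n in counts.items() if minimum <= n < strong}
    return strong_terms, weak_terms
```

Calling `tally_votes` with each model’s proposed IDs and then `bucket_by_support` on the result separates terms that most of the panel agrees on from those with only partial support. A term proposed by a single model, or one missing from the verified vocabulary, never reaches either set.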
One of the most compelling aspects of the protocol is its reproducibility: the experiment can be rerun with a very large number of model combinations to keep improving accuracy and reliability. Hamed points to the large and growing number of open-source large language models available. “There can be 100 large language models that are open source, and every time we can perform an experiment with seven LLMs selected at random from that list,” he notes. The power of the approach lies in repetition: “When we perform the experiment many, many times, we increase the confidence in the voting.” Because each run draws a different random panel, repeated runs act as a check on the quirks of any single set of models. Beyond medical diagnostics, Professor Rocha sees the protocol as a crucial step toward bolstering confidence in complex multiscale network models of disease, a key area of focus for his Complex Adaptive Systems and Computational Intelligence Lab at Binghamton. His lab’s research includes the development of “digital twins” for precision medicine: dynamic, virtual replicas of biological processes that are continually updated using AI and real-time data. The aim is to create precise, predictive simulations of how people react to treatments, allowing healthcare providers to optimize outcomes before real-world testing. “For instance, the protocol can extract and provide multi-agent verification of evidence for an adverse drug reaction for a given medication that is available in clinical trials, the scientific literature, pharmacological databases, and even social media discourse,” Rocha explains.
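Hamed’s idea of repeating the experiment with randomly chosen panels can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the published protocol: `ask` is a hypothetical callable standing in for whatever code queries a model, and taking the majority answer within each panel is one simple way to model the vote.

```python
import random
from collections import Counter


def repeated_panel_vote(pool, ask, question, panel_size=7, rounds=100, seed=0):
    """Repeat a majority vote over randomly drawn panels of models.

    pool: list of model names (for example, 100 open-source LLMs).
    ask: hypothetical callable (model, question) -> answer string.
    Returns a dict mapping each winning answer to the share of rounds it won.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    wins = Counter()
    for _ in range(rounds):
        panel = rng.sample(pool, panel_size)  # draw models without replacement
        votes = Counter(ask(model, question) for model in panel)
        winner, _count = votes.most_common(1)[0]
        wins[winner] += 1
    return {answer: n / rounds for answer, n in wins.items()}
```

The share of rounds an answer wins gives a simple confidence signal: an answer that prevails across many independently drawn panels is less likely to be an artifact of one particular group of models.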
Rocha adds that the protocol’s reach extends further: “And it can assist in the extraction of evidence at multiple scales, from multiomics to epidemiological and behavioral data sources, which we have already started to pilot by building multi-layer models of ER+ breast cancer.” Hamed credits Rocha’s guidance as essential to their success: “The guidance from Professor Rocha was huge, from securing the grant to helping to decide the direction of where this research would go and coaching us to develop the protocols needed to make it all work.” While the immediate focus was biomedical, the Binghamton team’s method has broader potential. It could help curb or eliminate other forms of LLM hallucinations, such as fabricated legal citations, fake academic references, or historical inaccuracies. Hamed summarizes the broader impact: “This protocol is a big step toward the democratization of knowledge verification.” In other words, reliable information, validated through a robust AI-driven process, could become more accessible to everyone across many fields of knowledge.
As this research concludes at Binghamton, Ahmed Abdeen Hamed is moving to a new role as a research associate professor at the University of Nebraska-Lincoln. His mentor, Professor Rocha, reflects with pride on a productive collaboration: “Dr. Hamed’s period in our lab was most productive, not only in the rapid development of AI-driven workflows and publications, but in catalyzing new, creative ideas for all lab members.” Rocha looks forward to Hamed’s future work: “I cannot wait to see the amazing new research he will produce at the University of Nebraska-Lincoln.” Hamed, for his part, expresses deep gratitude for the opportunities provided by Binghamton University’s Watson College: “Watson College provided an exceptional environment where I could fully develop and implement the forward-looking research agenda I began during my time in Europe.” The direction he envisioned was still nascent during his time abroad, and the fellowship at Binghamton offered the right setting to move it forward. He remains optimistic about the broader impact of the work: “I’m hopeful that the resulting peer-reviewed publications can help shift perspectives and demonstrate how GenAI and LLMs can be used responsibly, constructively, and with genuine innovation.” The collaboration between Hamed and Rocha has advanced AI verification in medical contexts and offers a model for how large language models can be used to deliver reliable insights in other domains.

