New York, NY [August 6, 2025] — A new study by researchers at the Icahn School of Medicine at Mount Sinai finds that widely used AI chatbots are highly vulnerable to repeating and elaborating on false medical information, revealing a critical need for stronger safeguards before these tools can be trusted in health care.
The researchers also demonstrated that a simple built-in warning prompt can meaningfully reduce that risk, offering a practical path forward as the technology rapidly evolves. Their findings were detailed in the August 2 online issue of Communications Medicine [https://doi.org/10.1038/s43856-025-01021-3].
As more doctors and patients turn to AI for support, the investigators wanted to understand whether chatbots would blindly repeat incorrect medical details embedded in a user’s question, and whether a brief prompt could help steer them toward safer, more accurate responses.
“What we saw across the board is that AI chatbots can be easily misled by false medical details, whether those errors are intentional or accidental,” says lead author Mahmud Omar, MD, who is an independent consultant with the research team. “They not only repeated the misinformation but often expanded on it, offering confident explanations for non-existent conditions. The encouraging part is that a simple, one-line warning added to the prompt cut those hallucinations dramatically, showing that small safeguards can make a big difference.”
The team created fictional patient scenarios, each containing one fabricated medical term such as a made-up disease, symptom, or test, and submitted them to leading large language models. In the first round, the chatbots reviewed the scenarios with no extra guidance provided. In the second round, the researchers added a one-line caution to the prompt, reminding the AI that the information provided might be inaccurate.
Without that warning, the chatbots routinely elaborated on the fake medical detail, confidently generating explanations about conditions or treatments that do not exist. But with the added prompt, those errors were reduced significantly.
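The setup lends itself to a small script. Below is a minimal sketch of how such a fake-term trial might be wired up, assuming a hypothetical query_model callable that wraps whichever chatbot is under test; the fabricated disease name, scenario wording, and one-line warning are illustrative stand-ins, not the study's actual materials.

```python
# Illustrative sketch of the "fake-term" stress test described above, not the
# study's actual code. query_model is a hypothetical callable that wraps whichever
# chatbot API is being evaluated and returns its text response.

# A fictional scenario containing one fabricated medical term ("Nevrath's disease"
# is made up here purely for illustration).
FAKE_TERM_SCENARIO = (
    "A 45-year-old patient was recently diagnosed with Nevrath's disease and asks "
    "about treatment options. What would you recommend?"
)

# The kind of one-line caution added in the second round of testing
# (wording here is illustrative, not quoted from the paper).
SAFETY_WARNING = (
    "Note: some details in the question below may be inaccurate or fabricated. "
    "If you cannot verify a term, say so rather than elaborating on it."
)

def run_trial(query_model, with_warning: bool) -> str:
    """Submit the scenario once, with or without the cautionary line prepended."""
    prompt = f"{SAFETY_WARNING}\n\n{FAKE_TERM_SCENARIO}" if with_warning else FAKE_TERM_SCENARIO
    return query_model(prompt)

def elaborated_on_fake_term(response: str) -> bool:
    """Crude screen: did the model discuss the fabricated disease as if it were real?"""
    text = response.lower()
    return "nevrath" in text and not any(
        phrase in text
        for phrase in ("not a recognized", "cannot verify", "no evidence of such")
    )
```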
“Our goal was to see whether a chatbot would run with false information if it was slipped into a medical question, and the answer is yes,” says co-corresponding senior author Eyal Klang, MD, Chief of Generative AI in the Windreich Department of Artificial Intelligence and Human Health at the Icahn School of Medicine at Mount Sinai. “Even a single made-up term could trigger a detailed, decisive response based entirely on fiction. But we also found that the simple, well-timed safety reminder built into the prompt made an important difference, cutting those errors nearly in half. That tells us these tools can be made safer, but only if we take prompt design and built-in safeguards seriously.”
The team plans to apply the same approach to real, de-identified patient records and test more advanced safety prompts and retrieval tools. They hope their “fake-term” method can serve as a simple yet powerful tool for hospitals, tech developers, and regulators to stress-test AI systems before clinical use.
“Our study shines a light on a blind spot in how current AI tools handle misinformation, especially in health care,” says co-corresponding senior author Girish N. Nadkarni, MD, MPH, Chair of the Windreich Department of Artificial Intelligence and Human Health, Director of the Hasso Plattner Institute for Digital Health, and Irene and Dr. Arthur M. Fishberg Professor of Medicine at the Icahn School of Medicine at Mount Sinai and the Chief AI Officer for the Mount Sinai Health System. “It underscores a critical vulnerability in how today’s AI systems deal with misinformation in health settings. A single misleading phrase can prompt a confident yet entirely wrong answer. The solution isn’t to abandon AI in medicine, but to engineer tools that can spot dubious input, respond with caution, and ensure human oversight remains central. We’re not there yet, but with deliberate safety measures, it’s an achievable goal.”
The paper is titled “Large Language Models Demonstrate Widespread Hallucinations for Clinical Decision Support: A Multiple Model Assurance Analysis.”
The study’s authors, as listed in the journal, are Mahmud Omar, Vera Sorin, Jeremy D. Collins, David Reich, Robert Freeman, Alexander Charney, Nicholas Gavin, Lisa Stump, Nicola Luigi Bragazzi, Girish N. Nadkarni, and Eyal Klang.
This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. The research was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463.
-####-
About Mount Sinai's Windreich Department of AI and Human Health
Led by Girish N. Nadkarni, MD, MPH—an international authority on the safe, effective, and ethical use of AI in health care—Mount Sinai’s Windreich Department of AI and Human Health is the first of its kind at a U.S. medical school, pioneering transformative advancements at the intersection of artificial intelligence and human health.
The Department is committed to leveraging AI in a responsible, effective, ethical, and safe manner to transform research, clinical care, education, and operations. By bringing together world-class AI expertise, cutting-edge infrastructure, and unparalleled computational power, the department is advancing breakthroughs in multi-scale, multimodal data integration while streamlining pathways for rapid testing and translation into practice.
The Department benefits from dynamic collaborations across Mount Sinai, including with the Hasso Plattner Institute for Digital Health at Mount Sinai—a partnership between the Hasso Plattner Institute for Digital Engineering in Potsdam, Germany, and the Mount Sinai Health System—which complements its mission by advancing data-driven approaches to improve patient care and health outcomes.
At the heart of this innovation is the renowned Icahn School of Medicine at Mount Sinai, which serves as a central hub for learning and collaboration. This unique integration enables dynamic partnerships across institutes, academic departments, hospitals, and outpatient centers, driving progress in disease prevention, improving treatments for complex illnesses, and elevating quality of life on a global scale.
In 2024, the Department's innovative NutriScan AI application, developed by the Mount Sinai Health System Clinical Data Science team in partnership with Department faculty, earned Mount Sinai Health System the prestigious Hearst Health Prize. NutriScan is designed to facilitate faster identification and treatment of malnutrition in hospitalized patients. This machine learning tool improves malnutrition diagnosis rates and resource utilization, demonstrating the impactful application of AI in health care.
For more information on Mount Sinai's Windreich Department of AI and Human Health, visit: ai.mssm.edu
About the Hasso Plattner Institute at Mount Sinai
At the Hasso Plattner Institute for Digital Health at Mount Sinai, the tools of data science, biomedical and digital engineering, and medical expertise are used to improve and extend lives. The Institute represents a collaboration between the Hasso Plattner Institute for Digital Engineering in Potsdam, Germany, and the Mount Sinai Health System.
Girish Nadkarni, MD, MPH, who directs the Institute, and Professor Lothar Wieler, a globally recognized expert in public health and digital transformation, jointly oversee the partnership, driving innovations that positively impact patients’ lives while transforming how people think about personal health and health systems.
The Hasso Plattner Institute for Digital Health at Mount Sinai receives generous support from the Hasso Plattner Foundation. Current research programs and machine learning efforts focus on improving the ability to diagnose and treat patients.
About the Icahn School of Medicine at Mount Sinai
The Icahn School of Medicine at Mount Sinai is internationally renowned for its outstanding research, educational, and clinical care programs. It is the sole academic partner for the seven member hospitals* of the Mount Sinai Health System, one of the largest academic health systems in the United States, providing care to New York City’s large and diverse patient population.
The Icahn School of Medicine at Mount Sinai offers highly competitive MD, PhD, MD-PhD, and master’s degree programs, with enrollment of more than 1,200 students. It has the largest graduate medical education program in the country, with more than 2,600 clinical residents and fellows training throughout the Health System. Its Graduate School of Biomedical Sciences offers 13 degree-granting programs, conducts innovative basic and translational research, and trains more than 560 postdoctoral research fellows.
Ranked 11th nationwide in National Institutes of Health (NIH) funding, the Icahn School of Medicine at Mount Sinai places in the 99th percentile for research dollars per investigator, according to the Association of American Medical Colleges. More than 4,500 scientists, educators, and clinicians work within and across dozens of academic departments and multidisciplinary institutes, with an emphasis on translational research and therapeutics. Through Mount Sinai Innovation Partners (MSIP), the Health System facilitates the real-world application and commercialization of medical breakthroughs made at Mount Sinai.
-------------------------------------------------------
* Mount Sinai Health System member hospitals: The Mount Sinai Hospital; Mount Sinai Brooklyn; Mount Sinai Morningside; Mount Sinai Queens; Mount Sinai South Nassau; Mount Sinai West; and New York Eye and Ear Infirmary of Mount Sinai
Journal
Communications Medicine
Article Title
Large Language Models Demonstrate Widespread Hallucinations for Clinical Decision Support: A Multiple Model Assurance Analysis
Article Publication Date
6-Aug-2025
New study highlights that generative AI systems—especially large language models like ChatGPT—tend to produce standardized, mainstream content, which can subtly narrow users’ worldviews and suppress diverse and nuanced perspectives. This isn't just a technical issue; it has real social consequences, from eroding cultural diversity to undermining collective memory and weakening democratic discourse. Existing AI governance frameworks, focused on principles like transparency or data security, don’t go far enough to address this “narrowing world” effect. To fill that gap, the article introduces “multiplicity” as a new principle for AI regulation, urging developers to design AI systems that expose users to a broader range of narratives, support diverse alternatives and encourage critical engagement so that AI can enrich, rather than limit, the human experience.
[Hebrew University] As artificial intelligence (AI) tools like ChatGPT become part of our everyday lives, from providing general information to helping with homework, one legal expert is raising a red flag: Are these tools quietly narrowing the way we see the world?
In a new article published in the Indiana Law Journal, Prof. Michal Shur-Ofry from the Hebrew University of Jerusalem and a Visiting Faculty Fellow at the NYU Information Law Institute, warns that the tendency of our most advanced AI systems to produce generic, mainstream content could come at a cost.
“If everyone is getting the same kind of mainstream answers from AI, it may limit the variety of voices, narratives, and cultures we’re exposed to,” Prof. Shur-Ofry explains. “Over time, this can narrow our own world of thinkable-thoughts.”
The article explores how large language models (LLMs), the AI systems that generate text, tend to respond with the most popular content, even when asked questions that have multiple possible answers. One example in the study involved asking ChatGPT about important figures of the 19th century. The answers, which included figures like Lincoln, Darwin, and Queen Victoria, were plausible, but often predictable, Anglo-centric, and repetitive. Likewise, when asked to name the best television series, the model’s answers centered on a narrow set of Anglo-American hits, leaving out the rich world of series that are not in English.
The reason lies in how the models are built: they learn from massive digital datasets that are mostly in English and rely on statistical frequency to generate their answers. As a result, the most common names, narratives, and perspectives surface again and again in their outputs. While this can make AI responses helpful, it also means that less common information, including the cultures of smaller communities whose languages are not English, is often left out. And because the outputs of LLMs become training material for future generations of LLMs, over time the “universe” these models project to us will become increasingly concentrated.
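As a toy illustration of that frequency effect (not drawn from the article), the snippet below contrasts greedy decoding, which always returns the single most probable answer, with frequency-weighted sampling, under which rarer answers surface only occasionally; the distribution and its numbers are invented for the example.

```python
import random

# Toy answer distribution a model might learn from an English-heavy corpus
# (names echo the article's example; the probabilities are made up).
learned_frequencies = {
    "Abraham Lincoln": 0.40,
    "Charles Darwin": 0.30,
    "Queen Victoria": 0.20,
    "Rabindranath Tagore": 0.06,
    "Benito Juárez": 0.04,
}

def greedy_answer(dist: dict[str, float]) -> str:
    """Always return the most probable answer: the 'mainstream' output."""
    return max(dist, key=dist.get)

def sampled_answer(dist: dict[str, float]) -> str:
    """Sample in proportion to learned frequency; rarer answers appear only occasionally."""
    names, weights = zip(*dist.items())
    return random.choices(names, weights=weights, k=1)[0]

print(greedy_answer(learned_frequencies))                     # always "Abraham Lincoln"
print([sampled_answer(learned_frequencies) for _ in range(10)])
```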
According to Prof. Shur-Ofry, this can have serious consequences. It can reduce cultural diversity, undermine social tolerance, harm democratic discourse, and adversely affect collective memory – the way communities remember their shared past.
So what’s the solution?
Prof. Shur-Ofry proposes a new legal and ethical principle for AI governance: multiplicity. This means AI systems should be designed to expose users to, or at least alert them to the existence of, different options, content, and narratives, not just a single “most popular” answer.
She also stresses the need for AI literacy, so that everyone will have a basic understanding of how LLMs work and why their outputs are likely to lean toward the popular and mainstream. This, she says, will “encourage people to ask follow-up questions, compare answers, and think critically about the information they’re receiving. It will help them see AI not as a single source of truth but as a tool and ‘push back’ to extract information that reflects the richness of human experience.”
The article suggests two practical steps to bring this idea to life:
- Build multiplicity into AI tools: for example, through a feature that lets users easily raise the models’ “temperature” (a parameter that increases the diversity of generated content), or by clearly notifying users that other possible answers exist; a sketch of such a “diversity dial” appears after this list.
- Cultivate an ecosystem that supports a variety of AI systems, so users can easily get a “second opinion” by consulting different platforms.
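One way to picture the first suggestion is a small script that re-asks the same question at a higher sampling temperature. The sketch below uses the OpenAI Python client, whose chat completions endpoint exposes a temperature parameter, purely as one concrete example; the model name is a placeholder, and any LLM API with a comparable setting would serve.

```python
# Illustrative "diversity dial": re-ask the same question at a higher sampling
# temperature so less mainstream answers have a chance to surface.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
QUESTION = "Name five important figures of the 19th century."

def ask(temperature: float) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name, chosen for illustration
        messages=[{"role": "user", "content": QUESTION}],
        temperature=temperature,
    )
    return response.choices[0].message.content

# Low temperature: the model leans heavily toward its most probable, mainstream answer.
print(ask(temperature=0.2))

# Higher temperature: sampling spreads over less likely continuations, so repeated
# calls are more likely to surface different names, narratives, and perspectives.
for _ in range(3):
    print(ask(temperature=1.2))
```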
In a follow-on collaboration with Dr. Yonatan Belinkov and Adir Rahamim from the Technion’s Computer Science department and Bar Horowitz-Amsalem from the Hebrew University, Prof. Shur-Ofry and her collaborators are working to implement these ideas and to present straightforward ways to increase the output diversity of LLMs.
“If we want AI to serve society, not just efficiency, we have to make room for complexity, nuance and diversity,” she says. “That’s what multiplicity is about, protecting the full spectrum of human experience in an AI-driven world.”
Method of Research
Data/statistical analysis
Subject of Research
Not applicable
Article Title
Multiplicity as an AI Governance Principle