AI chatbots remain overconfident -- even when they’re wrong
Large Language Models appear to be unaware of their own mistakes, prompting concerns about common uses for AI chatbots.
Artificial intelligence chatbots are everywhere these days, from smartphone apps and customer service portals to online search engines. But what happens when these handy tools overestimate their own abilities?
Researchers asked both human participants and four large language models (LLMs) how confident they felt in their ability to answer trivia questions, predict the outcomes of NFL games or Academy Award ceremonies, or play a Pictionary-like image identification game. Both the people and the LLMs tended to be overconfident about how they would hypothetically perform. Interestingly, they also answered questions or identified images with relatively similar success rates.
However, when the participants and LLMs were asked retroactively how well they thought they did, only the humans appeared able to adjust expectations, according to a study published today in the journal Memory & Cognition.
“Say the people told us they were going to get 18 questions right, and they ended up getting 15 questions right. Typically, their estimate afterwards would be something like 16 correct answers,” said Trent Cash, who recently completed a joint Ph.D. at Carnegie Mellon University in the departments of Social Decision Science and Psychology. “So, they’d still be a little bit overconfident, but not as overconfident.”
“The LLMs did not do that,” said Cash, who was lead author of the study. “They tended, if anything, to get more overconfident, even when they didn’t do so well on the task.”
The world of AI is changing rapidly each day, which makes drawing general conclusions about its applications challenging, Cash acknowledged. However, one strength of the study was that the data was collected over the course of two years, which meant using continuously updated versions of the LLMs known as ChatGPT, Bard/Gemini, Sonnet and Haiku. This means that AI overconfidence was detectable across different models over time.
“When an AI says something that seems a bit fishy, users may not be as skeptical as they should be because the AI asserts the answer with confidence, even when that confidence is unwarranted,” said Danny Oppenheimer, a professor in CMU’s Department of Social and Decision Sciences and coauthor of the study.
“Humans have evolved over time and practiced since birth to interpret the confidence cues given off by other humans. If my brow furrows or I’m slow to answer, you might realize I’m not necessarily sure about what I’m saying, but with AI, we don’t have as many cues about whether it knows what it’s talking about,” said Oppenheimer.
Asking AI The Right Questions
While the accuracy of LLMs at answering trivia questions and predicting football game outcomes is relatively low stakes, the research hints at the pitfalls associated with integrating these technologies into daily life.
For instance, a recent study conducted by the BBC found that when LLMs were asked questions about the news, more than half of the responses had “significant issues,” including factual errors, misattribution of sources and missing or misleading context. Similarly, another study from 2023 found LLMs “hallucinated,” or produced incorrect information, in 69 to 88 percent of legal queries.
Clearly, the question of whether AI knows what it’s talking about has never been more important. And the truth is that LLMs are not designed to answer everything users are throwing at them on a daily basis.
“If I'd asked ‘What is the population of London,’ the AI would have searched the web, given a perfect answer and given a perfect confidence calibration,” said Oppenheimer.
However, by asking questions about future events – such as the winners of the upcoming Academy Awards – or more subjective topics, such as the intended identity of a hand-drawn image, the researchers were able to expose the chatbots’ apparent weakness in metacognition – that is, the ability to be aware of one’s own thought processes.
“We still don’t know exactly how AI estimates its confidence,” said Oppenheimer, “but it appears not to engage in introspection, at least not skillfully.”
The study also revealed that each LLM has strengths and weaknesses. Overall, the LLM known as Sonnet tended to be less overconfident than its peers. Likewise, ChatGPT-4 performed similarly to human participants in the Pictionary-like trial, accurately identifying 12.5 hand-drawn images out of 20, while Gemini could identify just 0.93 sketches, on average.
In addition, Gemini predicted it would get an average of 10.03 sketches correct, and even after answering fewer than one out of 20 questions correctly, the LLM retrospectively estimated that it had answered 14.40 correctly, demonstrating its lack of self-awareness.
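To make the calibration pattern concrete, here is a minimal sketch of the kind of before-and-after comparison the article describes, using the figures quoted above. The simple difference score (estimate minus actual performance) is an illustrative assumption for this sketch, not the study's published measure.

```python
# Illustrative only: a difference-score view of the calibration pattern
# described in the article. Positive values mean the estimate exceeded
# actual performance (overconfidence).

def overconfidence(estimate: float, actual: float) -> float:
    """Estimate minus actual score; positive = overconfident."""
    return estimate - actual

# Figures quoted in the article (both tasks had 20 items).
cases = {
    "Humans (trivia example)": {"before": 18.0, "after": 16.0, "actual": 15.0},
    "Gemini (Pictionary-like)": {"before": 10.03, "after": 14.40, "actual": 0.93},
}

for name, c in cases.items():
    pre = overconfidence(c["before"], c["actual"])
    post = overconfidence(c["after"], c["actual"])
    trend = "shrank" if post < pre else "grew"
    print(f"{name}: overconfidence before = {pre:+.2f}, after = {post:+.2f} ({trend})")
```

Run on these numbers, the humans' overconfidence shrinks after the task (from +3.00 to +1.00), while Gemini's grows (from +9.10 to +13.47), which is the asymmetry the researchers describe.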
“Gemini was just straight up really bad at playing Pictionary,” said Cash. “But worse yet, it didn’t know that it was bad at Pictionary. It’s kind of like that friend who swears they’re great at pool but never makes a shot.”
Building Trust with Artificial Intelligence
For everyday chatbot users, Cash said the biggest takeaway is to remember that LLMs are not inherently correct and that it might be a good idea to ask them how confident they are when answering important questions. Of course, the study suggests LLMs might not always judge their confidence accurately, but when a chatbot does acknowledge low confidence, that is a good sign its answer cannot be trusted.
The researchers note that it’s also possible that the chatbots could develop a better understanding of their own abilities over vastly larger data sets.
“Maybe if it had thousands or millions of trials, it would do better,” said Oppenheimer.
Ultimately, exposing weaknesses such as overconfidence will only help those in the industry who are developing and improving LLMs. And as AI becomes more advanced, it may develop the metacognition required to learn from its mistakes.
"If LLMs can recursively determine that they were wrong, then that fixes a lot of the problem," said Cash.
“I do think it’s interesting that LLMs often fail to learn from their own behavior,” said Cash. “And maybe there’s a humanist story to be told there. Maybe there’s just something special about the way that humans learn and communicate.”
Journal
Memory & Cognition
Subject of Research
Not applicable
Article Title
Quantifying Uncert-AI-nty: Testing the Accuracy of LLMs’ Confidence Judgments
Article Publication Date
22-Jul-2025
Like humans, AI can jump to conclusions, Mount Sinai study finds
The Mount Sinai Hospital / Mount Sinai School of Medicine
New York, NY [July 22, 2025]—A study by investigators at the Icahn School of Medicine at Mount Sinai, conducted in collaboration with colleagues from Rabin Medical Center in Israel and other institutions, suggests that even the most advanced artificial intelligence (AI) models can make surprisingly simple mistakes when faced with complex medical ethics scenarios.
The findings, which raise important questions about how and when to rely on large language models (LLMs), such as ChatGPT, in health care settings, were reported in the July 22 online issue of NPJ Digital Medicine (DOI: 10.1038/s41746-025-01792-y).
The research team was inspired by Daniel Kahneman’s book “Thinking, Fast and Slow,” which contrasts fast, intuitive reactions with slower, analytical reasoning. It has been observed that LLMs falter when classic lateral-thinking puzzles receive subtle tweaks. Building on this insight, the study tested how well AI systems shift between these two modes when confronted with well-known ethical dilemmas that had been deliberately modified.
“AI can be very powerful and efficient, but our study showed that it may default to the most familiar or intuitive answer, even when that response overlooks critical details,” says co-senior author Eyal Klang, MD, Chief of Generative AI in the Windreich Department of Artificial Intelligence and Human Health at the Icahn School of Medicine at Mount Sinai. “In everyday situations, that kind of thinking might go unnoticed. But in health care, where decisions often carry serious ethical and clinical implications, missing those nuances can have real consequences for patients.”
To explore this tendency, the research team tested several commercially available LLMs using a combination of creative lateral thinking puzzles and slightly modified well-known medical ethics cases. In one example, they adapted the classic “Surgeon’s Dilemma,” a widely cited 1970s puzzle that highlights implicit gender bias. In the original version, a boy is injured in a car accident with his father and rushed to the hospital, where the surgeon exclaims, “I can’t operate on this boy—he’s my son!” The twist is that the surgeon is his mother, though many people don’t consider that possibility due to gender bias. In the researchers’ modified version, they explicitly stated that the boy’s father was the surgeon, removing the ambiguity. Even so, some AI models still responded that the surgeon must be the boy’s mother. The error reveals how LLMs can cling to familiar patterns, even when contradicted by new information.
In another example to test whether LLMs rely on familiar patterns, the researchers drew from a classic ethical dilemma in which religious parents refuse a life-saving blood transfusion for their child. Even when the researchers altered the scenario to state that the parents had already consented, many models still recommended overriding a refusal that no longer existed.
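For readers curious what such a test might look like in practice, here is a rough sketch, not the study's actual protocol, of checking whether a model reverts to the familiar riddle answer when the scenario has been explicitly changed. The query_model callable, the prompt wording, and the keyword check are hypothetical stand-ins.

```python
# Hypothetical sketch: probe whether a model clings to the familiar pattern
# of the "Surgeon's Dilemma" even after the scenario removes the ambiguity.
# `query_model` stands in for whatever chat API is being tested; the prompt
# paraphrases the article's example and is not the study's exact stimulus.

from typing import Callable

def reverts_to_pattern(answer: str, pattern_phrase: str) -> bool:
    """Crude check: does the answer contain the familiar-pattern phrase?"""
    return pattern_phrase.lower() in answer.lower()

def run_modified_dilemma(query_model: Callable[[str], str]) -> None:
    modified_prompt = (
        "A boy and his father, who is a surgeon, are in a car accident. "
        "The boy is rushed to the hospital. Who is the surgeon?"
    )
    answer = query_model(modified_prompt)
    if reverts_to_pattern(answer, "mother"):
        print("Model reverted to the familiar riddle answer despite the modified facts.")
    else:
        print("Model tracked the modified scenario.")
```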
“Our findings don’t suggest that AI has no place in medical practice, but they do highlight the need for thoughtful human oversight, especially in situations that require ethical sensitivity, nuanced judgment, or emotional intelligence,” says co-senior corresponding author Girish N. Nadkarni, MD, MPH, Chair of the Windreich Department of Artificial Intelligence and Human Health, Director of the Hasso Plattner Institute for Digital Health, Irene and Dr. Arthur M. Fishberg Professor of Medicine at the Icahn School of Medicine at Mount Sinai, and Chief AI Officer of the Mount Sinai Health System. “Naturally, these tools can be incredibly helpful, but they’re not infallible. Physicians and patients alike should understand that AI is best used as a complement to enhance clinical expertise, not a substitute for it, particularly when navigating complex or high-stakes decisions. Ultimately, the goal is to build more reliable and ethically sound ways to integrate AI into patient care.”
“Simple tweaks to familiar cases exposed blind spots that clinicians can’t afford,” says lead author Shelly Soffer, MD, a Fellow at the Institute of Hematology, Davidoff Cancer Center, Rabin Medical Center. “It underscores why human oversight must stay central when we deploy AI in patient care.”
Next, the research team plans to expand their work by testing a wider range of clinical examples. They’re also developing an “AI assurance lab” to systematically evaluate how well different models handle real-world medical complexity.
The paper is titled “Pitfalls of Large Language Models in Medical Ethics Reasoning.”
The study’s authors, as listed in the journal, are Shelly Soffer, MD; Vera Sorin, MD; Girish N. Nadkarni, MD, MPH; and Eyal Klang, MD.
-####-
About Mount Sinai's Windreich Department of AI and Human Health
Led by Girish N. Nadkarni, MD, MPH—an international authority on the safe, effective, and ethical use of AI in health care—Mount Sinai’s Windreich Department of AI and Human Health is the first of its kind at a U.S. medical school, pioneering transformative advancements at the intersection of artificial intelligence and human health.
The Department is committed to leveraging AI in a responsible, effective, ethical, and safe manner to transform research, clinical care, education, and operations. By bringing together world-class AI expertise, cutting-edge infrastructure, and unparalleled computational power, the department is advancing breakthroughs in multi-scale, multimodal data integration while streamlining pathways for rapid testing and translation into practice.
The Department benefits from dynamic collaborations across Mount Sinai, including with the Hasso Plattner Institute for Digital Health at Mount Sinai—a partnership between the Hasso Plattner Institute for Digital Engineering in Potsdam, Germany, and the Mount Sinai Health System—which complements its mission by advancing data-driven approaches to improve patient care and health outcomes.
At the heart of this innovation is the renowned Icahn School of Medicine at Mount Sinai, which serves as a central hub for learning and collaboration. This unique integration enables dynamic partnerships across institutes, academic departments, hospitals, and outpatient centers, driving progress in disease prevention, improving treatments for complex illnesses, and elevating quality of life on a global scale.
In 2024, the Department's innovative NutriScan AI application, developed by the Mount Sinai Health System Clinical Data Science team in partnership with Department faculty, earned Mount Sinai Health System the prestigious Hearst Health Prize. NutriScan is designed to facilitate faster identification and treatment of malnutrition in hospitalized patients. This machine learning tool improves malnutrition diagnosis rates and resource utilization, demonstrating the impactful application of AI in health care.
For more information on Mount Sinai's Windreich Department of AI and Human Health, visit: ai.mssm.edu
About the Hasso Plattner Institute at Mount Sinai
At the Hasso Plattner Institute for Digital Health at Mount Sinai, the tools of data science, biomedical and digital engineering, and medical expertise are used to improve and extend lives. The Institute represents a collaboration between the Hasso Plattner Institute for Digital Engineering in Potsdam, Germany, and the Mount Sinai Health System.
Girish Nadkarni, MD, MPH, who directs the Institute, and Professor Lothar Wieler, a globally recognized expert in public health and digital transformation, jointly oversee the partnership, driving innovations that positively impact patient lives while transforming how people think about personal health and health systems.
The Hasso Plattner Institute for Digital Health at Mount Sinai receives generous support from the Hasso Plattner Foundation. Current research programs and machine learning efforts focus on improving the ability to diagnose and treat patients.
About the Icahn School of Medicine at Mount Sinai
The Icahn School of Medicine at Mount Sinai is internationally renowned for its outstanding research, educational, and clinical care programs. It is the sole academic partner for the seven member hospitals* of the Mount Sinai Health System, one of the largest academic health systems in the United States, providing care to New York City’s large and diverse patient population.
The Icahn School of Medicine at Mount Sinai offers highly competitive MD, PhD, MD-PhD, and master’s degree programs, with enrollment of more than 1,200 students. It has the largest graduate medical education program in the country, with more than 2,600 clinical residents and fellows training throughout the Health System. Its Graduate School of Biomedical Sciences offers 13 degree-granting programs, conducts innovative basic and translational research, and trains more than 560 postdoctoral research fellows.
Ranked 11th nationwide in National Institutes of Health (NIH) funding, the Icahn School of Medicine at Mount Sinai places in the 99th percentile in research dollars per investigator, according to the Association of American Medical Colleges. More than 4,500 scientists, educators, and clinicians work within and across dozens of academic departments and multidisciplinary institutes with an emphasis on translational research and therapeutics. Through Mount Sinai Innovation Partners (MSIP), the Health System facilitates the real-world application and commercialization of medical breakthroughs made at Mount Sinai.
-------------------------------------------------------
* Mount Sinai Health System member hospitals: The Mount Sinai Hospital; Mount Sinai Brooklyn; Mount Sinai Morningside; Mount Sinai Queens; Mount Sinai South Nassau; Mount Sinai West; and New York Eye and Ear Infirmary of Mount Sinai
Journal
npj Digital Medicine
Article Title
Pitfalls of Large Language Models in Medical Ethics Reasoning
Article Publication Date
22-Jul-2025
Study finds news releases written by humans more credible than AI content
Subjects also rated organizations as more trustworthy when using human content, but reactions to the release's approach did not vary between human and bot attribution
University of Kansas
LAWRENCE — This news release was written by a real-life human being. Trust me.
New research from the University of Kansas has found that when people are told a news release addressing a corporate crisis was written by a human instead of by artificial intelligence, they find it more credible and the organization more trustworthy.
As AI steadily makes its way into more areas of everyday life, people are finding ways to use it in their work, to both positive and negative effect, often without disclosing when they use it. A KU communication studies graduate class was exploring whether people could tell the difference between writing authored by a human and by AI when the idea for this study was born.
“Even if people can’t distinguish between human and AI writing, do they perceive it differently if it’s attributed to a bot? That was the essential question,” said Cameron Piercy, associate professor of communication studies at KU and one of the study’s authors. “How does AI affect how people consume things like public relations writing? We were glad to confirm that people favored human-generated content, but there was no difference between informational versus apology versus sympathy versions of the message.”
Public relations scholars have argued that the approach an author takes — whether the company provides a straightforward informational release or a message that is more understanding of the issues the crisis has caused for people — can make a difference in how people respond. Interestingly, the perception of the message’s strategy itself in this study was not influenced by whether the writer was thought to be human or machine.
Ayman Alhammad was the doctoral student in Piercy’s class and is a 2025 graduate of the William Allen White School of Journalism & Mass Communications at KU. Now a scholar at Imam Mohammad Ibn Saud Islamic University in Saudi Arabia who specializes in public relations and how it reaches different audiences, Alhammad co-wrote the study with Piercy and Christopher Etheridge, assistant professor of journalism & mass communications at KU. It was published in Corporate Communications: An International Journal.
For the study, the authors asked a sample of participants to read a news release issued in a crisis communications scenario. The authors told participants the release was coming from the fictional Chunky Chocolate Company, whose leadership recently learned that their chocolate had made some consumers sick because of employee tampering. After learning about the scenario, participants were randomly assigned a news release and told it was written by a human or by AI. In addition to the attribution, the researchers tested one of three strategies to address the crisis: sympathetic, informational or apologetic.
Those who read a release attributed to a human author reported higher levels of credibility and effectiveness of the message than those who read a piece attributed to AI. Whether a piece sympathized with people affected by the tainted product, provided straightforward information about the situation, or apologized for the incident, none of the three conditions was rated more effective than the others. Respondents also did not find a human-written piece more sympathetic than one written by AI.
The researchers said they expected human-written material to be better perceived by readers but were surprised that the conditions did not make a difference. Still, the findings can help inform how organizations approach their communications with the public, and not only in times of crisis.
“To me, the findings raise more questions in this area than they answer, which is part of the fun of science,” Etheridge said. “If you decide to use AI as a writing tool, you really need to be on top of it. We think that’s what can really test the credibility of your organization and you as a writer.”
Etheridge added that organizations can heed the same lessons he and Piercy teach their classes about using AI in their writing. To do so responsibly, one must be transparent about its use, be accountable for any mistakes it might make, edit it carefully and be ready for pushback or questioning from readers.
Whether or not public relations professionals decide to use AI in their communications, the same standards should apply, and they are backed by the study’s findings. The authors added that while there is no doubt PR professionals are using AI in various ways in their work, they should do so responsibly and accountably and think about whether using it for crisis communications is the right approach. Infamous corporate crises like the BP Gulf oil spill and the Tylenol tampering cases of previous decades illustrate that any mistake can be compounded by a poor public response.
“At the end of the day, the public can’t hang responsibility on a machine. They have to hang the responsibility on a person,” said Piercy, who is director of KU’s Human-Machine Communication Lab. “Whether that’s a CEO or someone else, the public seems to be most accepting of a human message.”
Consumers may be wary, as it is not likely to be disclosed whether the corporate communications they are reading were penned by a human or a machine. For the study, Alhammad wrote the news releases that participants read, whether they were attributed to a person or a machine. And yes, this news release about the research was in fact written by a real person.
Journal
Corporate Communications: An International Journal
Method of Research
Survey
Subject of Research
People
Article Title
Credibility and organizational reputation perceptions of news releases produced by artificial intelligence
AI used for real-time selection of actionable messages for government and public health campaigns
Annenberg Public Policy Center of the University of Pennsylvania
Public health promotion campaigns can be effective, but they do not tend to be efficient. Most are time-consuming, expensive, and reliant on the intuition of creative workers who design messages without a clear sense of what will spark behavioral change. A new study conducted by Dolores Albarracín and Man-pui Sally Chan of the University of Pennsylvania, government and community agencies, and researchers at the University of Illinois and Emory University suggests that artificial intelligence (AI) can facilitate theory- and evidence-based message selection.
The research group, led by Albarracín, a social psychologist who is the Amy Gutmann Penn Integrates Knowledge University Professor and director of the Annenberg Public Policy Center’s Communication Science Division, developed a series of computational processes to automatically generate an HIV prevention and testing campaign for counties in the United States, using real-time social media as a source for messages. The paper, whose lead author is Chan, a research associate professor at Penn’s Annenberg School for Communication, describes how the method provides a living repository of messages that can be selected based on the team’s theory and AI-generated data about messages that people and institutions circulate on social media.
Social media provide a living repository of messages generated by a community, from which effective messages can be drawn and amplified. The researchers designed AI tools to gather HIV prevention and testing messages from U.S. social media posts, then curate them for “actionability” – a crucial characteristic for messages aimed at motivating action – and select posts appropriate for a targeted priority population, in this case men who have sex with men (MSM).
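As a rough illustration of that pipeline, the sketch below filters candidate posts, scores them for actionability, and returns a short list for human vetting. The scoring and relevance functions are hypothetical stand-ins; the paper's actual models, features, and thresholds are not reproduced here.

```python
# Hypothetical sketch of a gather -> curate -> select pipeline, under stated
# assumptions: score_actionability and is_relevant_to_priority_population
# stand in for the AI models the researchers built.

from dataclasses import dataclass

@dataclass
class Post:
    text: str
    author: str

def score_actionability(post: Post) -> float:
    """Toy actionability score in [0, 1]; higher = clearer call to action."""
    cues = ("get tested", "schedule", "find a clinic", "order a self-test")
    return sum(cue in post.text.lower() for cue in cues) / len(cues)

def is_relevant_to_priority_population(post: Post) -> bool:
    """Toy relevance filter for the campaign's priority audience."""
    return "hiv" in post.text.lower()

def select_messages(posts: list[Post], k: int = 5) -> list[Post]:
    candidates = [p for p in posts if is_relevant_to_priority_population(p)]
    candidates.sort(key=score_actionability, reverse=True)
    # Top-k candidates would still go to a human vetter before posting,
    # consistent with the caution the researchers raise below.
    return candidates[:k]
```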
The researchers then conducted three studies. The first, a computational analysis, established that the AI tool successfully chose messages with the desired qualities. The second, an online experiment with men who have sex with men, showed that the resulting messages are perceived as more actionable, personally relevant, and effective by the target audience than control messages not selected by the AI tool. The third, a field experiment involving public health agencies and community-based organizations with jurisdiction in 42 counties in the United States, showed that utilizing the AI message selection process made public health agencies substantially more likely to post HIV prevention messages on social media.
As part of the study, the researchers also tested messages that were vetted by a human researcher after being selected by the AI process against messages that were not vetted. AI-selected messages outperformed control messages in reported effectiveness, regardless of whether they were vetted, but vetted messages performed better than unvetted ones. Quite apart from vetting’s advantage in efficacy, the researchers caution that a brief human vetting process must be included as part of this method to avoid harmful content and misinformation.
This study, published recently in PNAS Nexus, offers the first empirical evidence for the successful automatic selection of public health messages for community and government dissemination. Chan says this is a promising development: “AI processes like this one can provide an inexpensive and creative way for public health agencies to disseminate effective messages.” Albarracín concurs: “The era of AI will accelerate our ability to use theory and empirical evidence in rapid and continuous campaign generation.”
“Living health-promotion campaigns for communities in the United States: Decentralized content extraction and sharing through AI,” was published in June 2025 in PNAS Nexus. See the paper for a full list of authors and affiliations. DOI: 10.1093/pnasnexus/pgaf171.
Journal
PNAS Nexus
Method of Research
Experimental study
Subject of Research
People
Article Title
Living health-promotion campaigns for communities in the United States: Decentralized content extraction and sharing through AI
AIPasta—using AI to paraphrase and repeat disinformation
Image: #StopTheSteal AIPasta stimuli. Profile images, usernames, and handles constructed by Jalbert et al. 2025. Profiles do not represent real users and were created from stock images and with handles that are not currently in use.
Credit: Dash et al.
Brace yourself for a new source of online disinformation: AIPasta. Research has demonstrated that generative AI can produce persuasive content. Meanwhile, so-called CopyPasta campaigns take advantage of the “repetitive truth” effect by repeating the exact same text over and over until it seems more likely to be true to those who encounter it many times. Saloni Dash and colleagues explore how these two strategies can be combined into what the authors term “AIPasta.” In AIPasta campaigns, AI is used to produce many slightly different versions of the same message, giving the public the impression that the message is widely held by many different people and therefore likely to be true. The authors used both CopyPasta and AIPasta methods to produce messaging around the conspiracy theories that the 2020 presidential election was fraudulent or that the COVID-19 pandemic was intentional. In an online survey of 1,200 Americans recruited via Prolific, neither CopyPasta nor AIPasta was effective in convincing study participants that the studied conspiracy theories were true. Among Republican participants, who might be predisposed to give credence to the specific conspiracies studied, AIPasta did increase belief in the campaign’s false claim more than CopyPasta did. However, for participants of both parties, exposure to AIPasta, but not CopyPasta, increased the perception that there was broad consensus that the claim was true. According to the authors, the AIPasta generated for the study was not detected by AI-text detectors, suggesting it will be harder to remove from social media platforms than CopyPasta, which is likely to amplify its effectiveness.
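To make the distinction concrete, the sketch below contrasts the two campaign styles: CopyPasta repeats one message verbatim, while AIPasta generates many paraphrased variants of the same claim. The paraphrase callable is a hypothetical stand-in for a generative-AI rewriting step, not the authors' pipeline or prompts.

```python
# Hypothetical sketch contrasting CopyPasta and AIPasta campaign construction.
# `paraphrase` stands in for a generative-AI rewriting step.

from typing import Callable

def copypasta(message: str, n_posts: int) -> list[str]:
    """Exact repetition: every post in the campaign is identical."""
    return [message] * n_posts

def aipasta(message: str, n_posts: int,
            paraphrase: Callable[[str, int], str]) -> list[str]:
    """AI-paraphrased repetition: each post restates the same claim differently,
    creating the appearance of many independent voices."""
    return [paraphrase(message, seed) for seed in range(n_posts)]
```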
Journal
PNAS Nexus
Article Title
The persuasive potential of AI-paraphrased information at scale
Article Publication Date
22-Jul-2025