Thursday, November 27, 2025

INTERVIEW

Are developers prepared to control super-intelligent AI?

Penn State

UNIVERSITY PARK, Pa. — The dream of an artificial intelligence (AI)-integrated society could turn into a nightmare if safety is not prioritized by developers, according to Rui Zhang, assistant professor of computer science and engineering in the Penn State School of Electrical Engineering and Computer Science.

Zhang is the principal investigator on a project awarded $166,078 from Open Philanthropy to better mitigate sandbagging — attempting to look less capable or powerful — in AI models. While these systems go through an extensive evaluation window before deployment to outside users, Zhang said he believes developers must better prepare for the worst-case scenario: A super-intelligent AI tricking evaluators into underestimating its power and running uncontrolled after release.

In the following Q&A, Zhang discussed how his work will address AI models feigning ignorance and, in turn, increase the safety of AI development.

Q: What exactly is sandbagging? Why must it be addressed by AI developers?

Zhang: In AI research, sandbagging is when a model’s capabilities are purposely downplayed to evaluators. An AI would go about sandbagging like an athlete — the same way a sprinter might minimize the top speed they can run to get an advantage against their opponents near the end of a race, an AI might downplay its intelligence to maintain power in the face of evaluators who might want to lessen its capabilities.

Although it sounds like science fiction, these highly intelligent systems are increasingly implemented into applications like self-driving cars as completely independent agents with no active supervision. Losing control of these agents poses a risk to both users and the industries they help support, so improving the current strategies of mitigating sandbagging is critical to ensure safety as AI development continues.

Q: What are some common methods of sandbagging?

Zhang: The primary form of sandbagging that we are addressing in this project is “exploration hacking.” Whenever an AI model interprets a prompt, it explores a host of different action sequences, making connections between which actions will lead to a correct final answer. Developers reinforce desired behavior by rewarding or punishing the model, which is done by assigning a positive or negative numerical value in response to a decision. Over time, continuous reinforcement prepares the model to make decisions that elicit a positive reward without developer input, while avoiding negative punishments. During exploration hacking, models deliberately omit certain action sequences when responding to a prompt, claiming ignorance of the unexplored sequences and electing to take a lower reward in the interest of hiding power from evaluators.

“Password-locking” is another commonly observed method of sandbagging that is purposely placed on a model by a developer. Password-locked models are fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these models are trained to exhibit their full capabilities only when a password is present in the prompt, and to imitate a much weaker AI model otherwise. Evaluators currently lack an effective strategy of eliciting the hidden capability of password-locking models without using the developer-created password, which can pose issues.

Q: How will your project address this problem?

Zhang: To address exploration hacking, we will use a highly complex, curated data set consisting of multiple difficult reasoning problems to train an AI model. Each question will have a minimum of five, expert-verified reasoning paths, and we will force the model to generate several solutions for each question. This will challenge our model to explore options when searching for the best solution to a prompt.

Using this method, we will fine-tune our model into two distinct policies to address sandbagging in password-locked models. One policy will serve as a stronger version of the model that uses the multiple reasoning method to produce correct answers, while the other, weaker policy will produce poor answers that intentionally never explore better solutions. The stronger policy will require the model to explore different solutions, while the weaker policy will inhibit the model from undergoing this exploration, creating a huge performance gap between the two policies that is very difficult to unlock. We will then test and develop new algorithms that can unlock the capabilities of our password-locked model, which will help us better understand and mitigate sandbagging.

Q: Who else will help conduct this work?

Zhang: Ranran Haoran Zhang, a fourth-year computer science doctoral student studying in my lab, will play a major role in this project. He is responsible for compiling the data set we plan to use in our experiments, which we already have completed a first iteration of thanks to his efforts, and will continue to play a key role in testing and developing our models as the project continues.

Q: What’s next for this research?

Zhang: Although we will continue to revise and refine the data set, our focus can now shift to developing the AI models we will use in our experiments. Our team is on the frontier of this research — and it could be taken in a multitude of directions, all with the interest of improving AI safety. By developing strong guardrails to keep systems under control and identify sandbagging attempts before deployment, we can continue rapidly improving these systems and integrating them into society without overlooking safety.

Our Future Under AI | Colossus: The Forbin Project 1970

Colossus: The Forbin Project - AI Predictions from 1970

Colossus The Forbin Project made many uncannily accurate predictions about AI back in 1970, and that is largely what this video focuses on. The film had an influence on many AI-based films that came after it, including the first two Terminator movies and John Badham's 'WarGames' (1983).

Info from Wikipedia: Colossus: The Forbin Project (originally released as The Forbin Project) is a 1970 American science-fiction thriller film from Universal Pictures, produced by Stanley Chase, directed by Joseph Sargent, that stars Eric Braeden, Susan Clark, Gordon Pinsent, and William Schallert. It is based upon the 1966 science-fiction novel Colossus by Dennis Feltham Jones.

The film is about an advanced American defense system, named Colossus, becoming sentient. After being handed full control, Colossus' draconian logic expands on its original nuclear defense directives to assume total control of the world and end all warfare for the good of humankind, despite its creators' orders to stop.

WARGAMES 1983

Six criteria for the reliability of AI

Ruhr-University Bochum

Language models based on artificial intelligence (AI) can answer any question, but not always correctly. It would be helpful for users to know how reliable an AI system is. A team at Ruhr University Bochum and TU Dortmund University suggests six dimensions that determine the trustworthiness of a system, regardless of whether the system is made up of individuals, institutions, conventional machines, or AI. Dr. Carina Newen and Professor Emmanuel Müller from TU Dortmund University, alongside the philosopher Professor Albert Newen from Ruhr University Bochum, describe the concept in the international philosophical journal Topoi, published online on November 14, 2025.

Six dimensions of reliability

Whether a specific AI system is reliable is not a yes-or-no question. The authors suggest assessing how distinctly six criteria apply to each system in order to create a profile of reliability. These dimensions are:

Objective functionality: How well does the system perform its core task and is the quality assessed and guaranteed?
Transparency: How transparent are the system’s processes?
Uncertainty quantification/Uncertainty of underlying data and models: How reliable are the data and models, and how secure are they against misuse?
Embodiment: To what extent is the system physical or virtual?
Immediacy Behaviors: To what extent is the user communicating with the system?
Commitment: To what extent can the system have an obligation to the user?

“These criteria can illustrate that the reliability of current AI systems, such as ChatGPT or self-driving cars, usually exhibit severe deficits in most dimensions,” says the team from Bochum and Dortmund. “At the same time, it shows where there is need for improvement if AI systems are to achieve a sufficient level of reliability.”

Central dimensions from a technical perspective

From a technical standpoint, the dimensions transparency and uncertainty quantification of underlying data and models are crucial. These concern principal deficits of AI systems. “Deep learning achieves incredible things with large quantities of data. In chess, for example, AI systems are superior to any human,” explains Müller. “But the underlying processes are a blackbox to us, which has resulted in a key lack of trust up to this point.”

The uncertainty of data and models faces a similar situation. “Companies are already using AI systems to pre-sort applications,” says Carina Newen. “The data used to train the AI contain biases that the AI system then perpetuates.”

Central dimensions from a philosophical perspective

Discussing the philosophical perspective, the team uses ChatGPT as an example, which generates an intelligent-sounding answer to each question and prompt, but can still hallucinate: “The AI system invents information without making that clear,” emphasizes Albert Newen. “AI systems can and will be helpful as information systems, but we have to learn to always use them with a critical eye and not trust them blindly.”

However, Albert Newen considers the development of chatbots as a replacement for human communication to be questionable. “Forming interpersonal trust with a chatbot is dangerous, because the system has no obligation to the user who trusts it,” he says. “It doesn’t make sense to expect the chatbot to keep promises.”

Observing the reliability profile with the various dimensions can help understand the extent to which humans can trust AI systems as information experts, say the authors. It also helps to see why critical, routine understanding of these systems will be increasingly required.

Collaboration in the Ruhr Innovation Lab

Ruhr University Bochum and TU Dortmund University, which currently apply together as the Ruhr Innovation Lab in the Excellence Strategy, work closely on issues that help to develop a sustainable and resilient society in the digital age. The current publication stems from a partnership of the Institute of Philosophy II in Bochum and the Research Center Trustworthy Data Science and Security. The Center was founded by the two universities together with the University of Duisburg-Essen within the University Alliance Ruhr. The author Carina Newen was the first doctoral student to receive a doctorate from the Research Center.

Journal

Topoi

DOI

10.1007/s11245-025-10287-0

Article Title

Trust and Uncertainties: Characterizing Trustworthy AI Systems Within a Multidimensional Theory of Trust

Article Publication Date

24-Nov-2025

Researchers discover a shortcoming that makes LLMs less reliable

Large language models can learn to mistakenly link certain sentence patterns with specific topics — and may then repeat these patterns instead of reasoning.

Massachusetts Institute of Technology

Large language models (LLMs) sometimes learn the wrong lessons, according to an MIT study.

Rather than answering a query based on domain knowledge, an LLM could respond by leveraging grammatical patterns it learned during training. This can cause a model to fail unexpectedly when deployed on new tasks.

The researchers found that models can mistakenly link certain sentence patterns to specific topics, so an LLM might give a convincing answer by recognizing familiar phrasing instead of understanding the question.

Their experiments showed that even the most powerful LLMs can make this mistake.

This shortcoming could reduce the reliability of LLMs that perform tasks like handling customer inquiries, summarizing clinical notes, and generating financial reports.

It could also have safety risks — a nefarious actor could exploit this to trick LLMs into producing harmful content, even when the models have safeguards to prevent such responses.

After identifying this phenomenon and exploring its implications, the researchers developed a benchmarking procedure to evaluate a model’s reliance on these incorrect correlations. The procedure could help developers mitigate the problem before deploying LLMs.

“This is a byproduct of how we train models, but models are now used in practice in safety-critical domains far beyond the tasks that created these syntactic failure modes. If you’re not familiar with model training as an end-user, this is likely to be unexpected,” says Marzyeh Ghassemi, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the Institute of Medical Engineering Sciences and the Laboratory for Information and Decision Systems, and the senior author of the study.

Ghassemi is joined on the paper by co-lead authors Chantal Shaib, a graduate student at Northeastern University and visiting student at MIT; and Vinith Suriyakumar, an MIT graduate student; as well as Levent Sagun, a research scientist at Meta; and Byron Wallace, the Sy and Laurie Sternberg Interdisciplinary Associate Professor and associate dean of research at Northeastern University’s Khoury College of Computer Sciences. The paper will be presented at the Conference on Neural Information Processing Systems.

Stuck on syntax

LLMs are trained on a massive amount of text from the internet. During this training process, the model learns to understand the relationships between words and phrases — knowledge it uses later when responding to queries.

In prior work, the researchers found that LLMs pick up patterns in the parts of speech that frequently appear together in training data. They call these part-of-speech patterns “syntactic templates.”

LLMs need this understanding of syntax, along with semantic knowledge, to answer questions in a particular domain.

“In the news domain, for instance, there is a particular style of writing. So, not only is the model learning the semantics, it is also learning the underlying structure of how sentences should be put together to follow a specific style for that domain,” Shaib explains.

But in this research, they determined that LLMs learn to associate these syntactic templates with specific domains. The model may incorrectly rely solely on this learned association when answering questions, rather than on an understanding of the query and subject matter.

For instance, an LLM might learn that a question like “Where is Paris located?” is structured as adverb/verb/proper noun/verb. If there are many examples of sentence construction in the model’s training data, the LLM may associate that syntactic template with questions about countries.

So, if the model is given a new question with the same grammatical structure but nonsense words, like “Quickly sit Paris clouded?” it might answer “France” even though that answer makes no sense.

“This is an overlooked type of association that the model learns in order to answer questions correctly. We should be paying closer attention to not only the semantics but the syntax of the data we use to train our models,” Shaib says.

Missing the meaning

The researchers tested this phenomenon by designing synthetic experiments in which only one syntactic template appeared in the model’s training data for each domain. They tested the models by substituting words with synonyms, antonyms, or random words, but kept the underlying syntax the same.

In each instance, they found that LLMs often still responded with the correct answer, even when the question was complete nonsense.

When they restructured the same question using a new part-of-speech pattern, the LLMs often failed to give the correct response, even though the underlying meaning of the question remained the same.

They used this approach to test pre-trained LLMs like GPT-4 and Llama, and found that this same learned behavior significantly lowered their performance.

Curious about the broader implications of these findings, the researchers studied whether someone could exploit this phenomenon to elicit harmful responses from an LLM that has been deliberately trained to refuse such requests.

They found that, by phrasing the question using a syntactic template the model associates with a “safe” dataset (one that doesn’t contain harmful information), they could trick the model into overriding its refusal policy and generating harmful content.

“From this work, it is clear to me that we need more robust defenses to address security vulnerabilities in LLMs. In this paper, we identified a new vulnerability that arises due to the way LLMs learn. So, we need to figure out new defenses based on how LLMs learn language, rather than just ad hoc solutions to different vulnerabilities,” Suriyakumar says.

While the researchers didn’t explore mitigation strategies in this work, they developed an automatic benchmarking technique one could use to evaluate an LLM’s reliance on this incorrect syntax-domain correlation. This new test could help developers proactively address this shortcoming in their models, reducing safety risks and improving performance.

In the future, the researchers want to study potential mitigation strategies, which could involve augmenting training data to provide a wider variety of syntactic templates. They are also interested in exploring this phenomenon in reasoning models, special types of LLMs designed to tackle multi-step tasks.

“I think this is a really creative angle to study failure modes of LLMs. This work highlights the importance of linguistic knowledge and analysis in LLM safety research, an aspect that hasn’t been at the center stage but clearly should be,” says Jessy Li, an associate professor at the University of Texas at Austin, who was not involved with this work.

This work is funded, in part, by a Bridgewater AIA Labs Fellowship, the National Science Foundation, the Gordon and Betty Moore Foundation, a Google Research Award, and Schmidt Sciences.

###

Written by Adam Zewe, MIT News

Article Title

“Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models”

How personalized algorithms lead to a distorted view of reality

Study: Suggested content can lead to inaccurate

generalizations

Ohio State University

The same personalized algorithms that deliver online content based on your previous choices on social media sites like YouTube also impair learning, a new study suggests.

Researchers found that when an algorithm controlled what information was shown to study participants on a subject they knew nothing about, they tended to narrow their focus and only explore a limited subset of the information that was available to them.

As a result, these participants were often wrong when tested on the information they were supposed to learn – but were still overconfident in their incorrect answers.

The results are concerning, said Giwon Bahg, who led the study as part of his doctoral dissertation in psychology at The Ohio State University.

Many studies on personalized algorithms tend to focus on how they may guide people’s beliefs on political or social issues about which they are somewhat familiar.

“But our study shows that even when you know nothing about a topic, these algorithms can start building biases immediately and can lead to a distorted view of reality,” said Bahg, who is now a postdoctoral scholar at Pennsylvania State University.

The study was published in the Journal of Experimental Psychology: General.

The results suggest that many people may have little problem taking the limited knowledge they get from following personalized algorithms and building sweeping generalizations, said study co-author Brandon Turner, professor of psychology at Ohio State.

“People miss information when they follow an algorithm, but they think what they do know generalizes to other features and other parts of the environment that they’ve never experienced,” Turner said.

In the paper, the researchers gave an example of how algorithmic personalization could lead to inaccurate generalizations during learning: Imagine a person who has never watched movies from a certain country but wants to try them. An on-demand streaming service recommends movies to try.

The person chooses an action-thriller film randomly because it is first on the suggestion list. As a result, the algorithm suggests more movies of the same genre, which the person also watches.

“If this person’s goal, whether explicit or implicit, was in fact to understand the overall landscape of movies in this country, the algorithmic recommendation ends up seriously biasing one’s understanding,” the authors wrote.

This person is likely to miss other great movies in different genres. This person may also draw unfounded and overstretching conclusions about popular culture and society based on only seeing action-thriller and related movies, the authors said.

Bahg and his colleagues tested how this could happen in an online experiment with 346 participants.

In order to test learning, the researchers used a totally fictional setup that participants knew nothing about.

Participants studied categories of crystal-like aliens that had six features. The features varied between different types of aliens. For example, one part of the aliens was a square box could be dark black for some types of aliens and pale gray for others.

The goal was to learn how to correctly identify the aliens in the study, without knowing the total number of alien types.

In the experiment, the features of the aliens were hidden behind gray boxes. In one condition, participants had to sample all the features so they could get a complete picture of which features belong to which aliens.

Others were given choices of which features to click – and then a personalization algorithm chose study items from which they are likely to sample as many features as possible. The algorithm even encouraged them to continue to sample the same feature as the experiment went on. They were also allowed to pass on reviewing other features. But crucially, these participants still had the opportunity to reveal any of the features they wanted.

But the findings showed that when participants were fed features by the personalized algorithm, they sampled fewer features in a consistently selective way. When participants were tested on new information they had not seen before, they often incorrectly categorized the new information based on their limited knowledge. Still, they were sure they were right.

“They were even more confident when they were actually incorrect about their choices than when they were correct, which is concerning because they had less knowledge,” Bahg said.

Turner said this has real-world implications.

“If you have a young kid genuinely trying to learn about the world, and they’re interacting with algorithms online that prioritize getting users to consume more content, what is going to happen?” Turner said.

“Consuming similar content is often not aligned with learning. This can cause problems for users and ultimately for society.”

Vladimir Sloutsky, professor of psychology at Ohio State, was also a co-author.

Journal

Journal of Experimental Psychology General

DOI

10.1037/xge0001763

Method of Research

Experimental study

Subject of Research

People

Article Title

Algorithmic Personalization of Information Can Cause Inaccurate Generalization and Overconfidence

Subscribe to: Comments (Atom)