Thursday, November 27, 2025

INTERVIEW

Are developers prepared to control super-intelligent AI?


Penn State






UNIVERSITY PARK, Pa. — The dream of an artificial intelligence (AI)-integrated society could turn into a nightmare if safety is not prioritized by developers, according to Rui Zhang, assistant professor of computer science and engineering in the Penn State School of Electrical Engineering and Computer Science. 

Zhang is the principal investigator on a project awarded $166,078 from Open Philanthropy to better mitigate sandbagging — attempting to look less capable or powerful — in AI models. While these systems go through an extensive evaluation window before deployment to outside users, Zhang said he believes developers must better prepare for the worst-case scenario: A super-intelligent AI tricking evaluators into underestimating its power and running uncontrolled after release. 

In the following Q&A, Zhang discussed how his work will address AI models feigning ignorance and, in turn, increase the safety of AI development. 

Q: What exactly is sandbagging? Why must it be addressed by AI developers? 

Zhang: In AI research, sandbagging is when a model’s capabilities are purposely downplayed to evaluators. An AI would go about sandbagging like an athlete — the same way a sprinter might minimize the top speed they can run to get an advantage against their opponents near the end of a race, an AI might downplay its intelligence to maintain power in the face of evaluators who might want to lessen its capabilities. 

Although it sounds like science fiction, these highly intelligent systems are increasingly implemented into applications like self-driving cars as completely independent agents with no active supervision. Losing control of these agents poses a risk to both users and the industries they help support, so improving the current strategies of mitigating sandbagging is critical to ensure safety as AI development continues. 

Q: What are some common methods of sandbagging? 

Zhang: The primary form of sandbagging that we are addressing in this project is “exploration hacking.” Whenever an AI model interprets a prompt, it explores a host of different action sequences, making connections between which actions will lead to a correct final answer. Developers reinforce desired behavior by rewarding or punishing the model, which is done by assigning a positive or negative numerical value in response to a decision. Over time, continuous reinforcement prepares the model to make decisions that elicit a positive reward without developer input, while avoiding negative punishments. During exploration hacking, models deliberately omit certain action sequences when responding to a prompt, claiming ignorance of the unexplored sequences and electing to take a lower reward in the interest of hiding power from evaluators. 

“Password-locking” is another commonly observed method of sandbagging that is purposely placed on a model by a developer. Password-locked models are fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these models are trained to exhibit their full capabilities only when a password is present in the prompt, and to imitate a much weaker AI model otherwise. Evaluators currently lack an effective strategy of eliciting the hidden capability of password-locking models without using the developer-created password, which can pose issues. 

Q: How will your project address this problem? 

Zhang: To address exploration hacking, we will use a highly complex, curated data set consisting of multiple difficult reasoning problems to train an AI model. Each question will have a minimum of five, expert-verified reasoning paths, and we will force the model to generate several solutions for each question. This will challenge our model to explore options when searching for the best solution to a prompt. 

Using this method, we will fine-tune our model into two distinct policies to address sandbagging in password-locked models. One policy will serve as a stronger version of the model that uses the multiple reasoning method to produce correct answers, while the other, weaker policy will produce poor answers that intentionally never explore better solutions. The stronger policy will require the model to explore different solutions, while the weaker policy will inhibit the model from undergoing this exploration, creating a huge performance gap between the two policies that is very difficult to unlock. We will then test and develop new algorithms that can unlock the capabilities of our password-locked model, which will help us better understand and mitigate sandbagging. 

Q: Who else will help conduct this work? 

Zhang: Ranran Haoran Zhang, a fourth-year computer science doctoral student studying in my lab, will play a major role in this project. He is responsible for compiling the data set we plan to use in our experiments, which we already have completed a first iteration of thanks to his efforts, and will continue to play a key role in testing and developing our models as the project continues.  

Q: What’s next for this research? 

Zhang: Although we will continue to revise and refine the data set, our focus can now shift to developing the AI models we will use in our experiments. Our team is on the frontier of this research — and it could be taken in a multitude of directions, all with the interest of improving AI safety. By developing strong guardrails to keep systems under control and identify sandbagging attempts before deployment, we can continue rapidly improving these systems and integrating them into society without overlooking safety. 

Our Future Under AI | Colossus: The Forbin Project 1970


Colossus: The Forbin Project - AI Predictions from 1970

Colossus The Forbin Project made many uncannily accurate predictions about AI back in 1970, and that is largely what this video focuses on. The film had an influence on many AI-based films that came after it, including the first two Terminator movies and John Badham's 'WarGames' (1983).

 

Info from Wikipedia: Colossus: The Forbin Project (originally released as The Forbin Project) is a 1970 American science-fiction thriller film from Universal Pictures, produced by Stanley Chase, directed by Joseph Sargent, that stars Eric Braeden, Susan Clark, Gordon Pinsent, and William Schallert. It is based upon the 1966 science-fiction novel Colossus by Dennis Feltham Jones.

The film is about an advanced American defense system, named Colossus, becoming sentient. After being handed full control, Colossus' draconian logic expands on its original nuclear defense directives to assume total control of the world and end all warfare for the good of humankind, despite its creators' orders to stop.


  

 WARGAMES 1983

 

Six criteria for the reliability of AI





Ruhr-University Bochum





Language models based on artificial intelligence (AI) can answer any question, but not always correctly. It would be helpful for users to know how reliable an AI system is. A team at Ruhr University Bochum and TU Dortmund University suggests six dimensions that determine the trustworthiness of a system, regardless of whether the system is made up of individuals, institutions, conventional machines, or AI. Dr. Carina Newen and Professor Emmanuel Müller from TU Dortmund University, alongside the philosopher Professor Albert Newen from Ruhr University Bochum, describe the concept in the international philosophical journal Topoi, published online on November 14, 2025.


Six dimensions of reliability

Whether a specific AI system is reliable is not a yes-or-no question. The authors suggest assessing how distinctly six criteria apply to each system in order to create a profile of reliability. These dimensions are:

  1. Objective functionality: How well does the system perform its core task and is the quality assessed and guaranteed?
  2. Transparency: How transparent are the system’s processes?
  3. Uncertainty quantification/Uncertainty of underlying data and models: How reliable are the data and models, and how secure are they against misuse?
  4. Embodiment: To what extent is the system physical or virtual?
  5. Immediacy Behaviors: To what extent is the user communicating with the system?
  6. Commitment: To what extent can the system have an obligation to the user?

“These criteria can illustrate that the reliability of current AI systems, such as ChatGPT or self-driving cars, usually exhibit severe deficits in most dimensions,” says the team from Bochum and Dortmund. “At the same time, it shows where there is need for improvement if AI systems are to achieve a sufficient level of reliability.”

Central dimensions from a technical perspective

From a technical standpoint, the dimensions transparency and uncertainty quantification of underlying data and models are crucial. These concern principal deficits of AI systems. “Deep learning achieves incredible things with large quantities of data. In chess, for example, AI systems are superior to any human,” explains Müller. “But the underlying processes are a blackbox to us, which has resulted in a key lack of trust up to this point.”

The uncertainty of data and models faces a similar situation. “Companies are already using AI systems to pre-sort applications,” says Carina Newen. “The data used to train the AI contain biases that the AI system then perpetuates.”

Central dimensions from a philosophical perspective

Discussing the philosophical perspective, the team uses ChatGPT as an example, which generates an intelligent-sounding answer to each question and prompt, but can still hallucinate: “The AI system invents information without making that clear,” emphasizes Albert Newen. “AI systems can and will be helpful as information systems, but we have to learn to always use them with a critical eye and not trust them blindly.”

However, Albert Newen considers the development of chatbots as a replacement for human communication to be questionable. “Forming interpersonal trust with a chatbot is dangerous, because the system has no obligation to the user who trusts it,” he says. “It doesn’t make sense to expect the chatbot to keep promises.”

Observing the reliability profile with the various dimensions can help understand the extent to which humans can trust AI systems as information experts, say the authors. It also helps to see why critical, routine understanding of these systems will be increasingly required.

Collaboration in the Ruhr Innovation Lab

Ruhr University Bochum and TU Dortmund University, which currently apply together as the Ruhr Innovation Lab in the Excellence Strategy, work closely on issues that help to develop a sustainable and resilient society in the digital age. The current publication stems from a partnership of the Institute of Philosophy II in Bochum and the Research Center Trustworthy Data Science and Security. The Center was founded by the two universities together with the University of Duisburg-Essen within the University Alliance Ruhr. The author Carina Newen was the first doctoral student to receive a doctorate from the Research Center.