Wednesday, July 30, 2025

Researchers test the trustworthiness of AI—by playing sudoku



University of Colorado at Boulder




Artificial intelligence tools called large language models (LLMs), such as OpenAI’s ChatGPT or Google’s Gemini, can do a lot these days—dispensing relationship advice, crafting texts to get you out of social obligations and even writing science articles.  

But can they also solve your morning sudoku?

In a new study, a team of computer scientists from the University of Colorado Boulder decided to find out. The group created nearly 2,300 original sudoku puzzles, which require players to enter numbers into a grid following certain rules, then asked several AI tools to fill them in.

The results were a mixed bag. While some of the AI models could solve easy sudokus, even the best struggled to explain how they solved them—giving garbled, inaccurate or even surreal descriptions of how they arrived at their answers. The results raise questions about the trustworthiness of AI-generated information, said study co-author Maria Pacheco.  

“For certain types of sudoku puzzles, most LLMs still fall short, particularly in producing explanations that are in any way usable for humans,” said Pacheco, assistant professor in the Department of Computer Science. “Why did it come up with that solution? What are the steps you need to take to get there?”

She and her colleagues published their results this month in Findings of the Association for Computational Linguistics.

The researchers aren’t trying to cheat at puzzles. Instead, they’re using these logic exercises to explore how AI platforms think. The results could one day lead to more reliable and trustworthy computer programs, said study co-author Fabio Somenzi, professor in the Department of Electrical, Computer and Energy Engineering.

“Puzzles are fun, but they’re also a microcosm for studying the decision-making process in machine learning,” he said. “If you have AI prepare your taxes, you want to be able to explain to the IRS why the AI wrote what it wrote.”

Daily puzzle

Somenzi, a self-described sudoku fan, noted that the puzzles tap into a very human way of thinking. Filling out a sudoku grid requires puzzlers to learn and follow a set of logical rules. For example, you can't enter a two in an empty square if there's already a two in the same row or column.
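
That row-and-column rule is simple enough to write down in code. Below is a minimal Python sketch of the legality check the article describes; it is not taken from the study, the grid values and function name are invented for illustration, and a full sudoku checker would also enforce the puzzle's sub-block constraint.

# Hypothetical helper, for illustration only: can `digit` go at (row, col)?
# The grid is a list of lists, with 0 marking an empty cell.
def placement_is_legal(grid, row, col, digit):
    if grid[row][col] != 0:
        return False  # the cell is already filled
    if digit in grid[row]:
        return False  # the same digit already appears in this row
    if digit in (grid[r][col] for r in range(len(grid))):
        return False  # the same digit already appears in this column
    return True

# The article's example: you can't enter a two if there's already
# a two in the same row or column.
grid = [
    [2, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(placement_is_legal(grid, 0, 3, 2))  # False: row 0 already has a 2
print(placement_is_legal(grid, 1, 3, 2))  # True: no 2 in row 1 or column 3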

Most LLMs today struggle with that kind of thinking, in large part because of how they're trained.

To build ChatGPT, for example, programmers first fed the AI almost everything that had ever been written on the internet. When ChatGPT responds to a question, it predicts the most likely response based on all that data—almost like a computer version of rote memory.

“What they do is essentially predict the next word,” Pacheco said. “If you have the start to a sentence, what word comes next? They do that by referring to every sentence in the English language that they can get their hands on.”
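
As a loose analogy only (real LLMs use neural networks over huge token vocabularies, not word-count tables), the "predict the next word" idea can be illustrated with a toy Python sketch; the training sentence and names below are invented for illustration.

from collections import Counter

# Toy "training data": count which word follows each word.
training_text = "the cat sat on the mat the cat sat on the chair".split()
next_word_counts = {}
for current, nxt in zip(training_text, training_text[1:]):
    next_word_counts.setdefault(current, Counter())[nxt] += 1

def predict_next(word):
    # Return the continuation most often seen after `word`, if any.
    counts = next_word_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat' -- seen twice, vs. 'mat' and 'chair' once each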

Pacheco, Somenzi and their colleagues have joined a growing effort in computer science to merge those two ways of thinking—combining the memory of an LLM with a human brain’s capacity for logic, a pursuit known as “neurosymbolic” AI.

Anirudh Maiya and Razan Alghamdi, both former graduate students at CU Boulder, were also co-authors of the new paper.

How’s the weather?

To begin, the researchers created sudoku puzzles of varying difficulty using a six-by-six grid, a simpler version of the nine-by-nine puzzles you usually find online.

They then gave the puzzles to a series of AI models, including the preview version of OpenAI's o1 model, which, in 2024, represented the state of the art for its kind of LLM.

The o1 model led the pack, solving roughly 65% of the sudoku puzzles correctly. Then the team asked the AI platforms to explain how they got their answers. That’s when the results got really wild.

“Sometimes, the AI explanations made up facts,” said Ashutosh Trivedi, a co-author of the study and associate professor of computer science at CU Boulder. “So it might say, 'There cannot be a two here because there’s already a two in the same row,' but that wasn’t the case.”

In a telling example, the researchers were talking to one of the AI tools about solving sudoku when it, for unknown reasons, responded with a weather forecast.

“At that point, the AI had gone berserk and was completely confused,” Somenzi said.

The researchers hope to design their own AI system that can do it all—solving complicated puzzles and explaining how. They’re starting with another type of puzzle called hitori, which, like sudoku, involves a grid of numbers.

“People talk about the emerging capabilities of AI where they end up being able to solve things that you wouldn’t expect them to solve,” Pacheco said. “At the same time, it’s not surprising that they’re still bad at a lot of tasks.”

AI can fake peer reviews and escape detection, study finds




FAR Publishing Limited

Image: A diagram showing the research workflow used to test the capabilities and risks of using a large language model for peer review tasks. Credit: Lingxuan Zhu et al.




Large language models (LLMs) like ChatGPT can be used to write convincing but biased peer reviews that are nearly impossible to distinguish from human writing, a new study reveals. This poses a serious threat to the integrity of scientific publishing, where peer review is the critical process for vetting research quality and accuracy.

In a study evaluating the risks of AI in academic publishing, a team of researchers from China tasked the AI model Claude with reviewing 20 real cancer research manuscripts. To closely simulate the real-world peer review process, they used the initial manuscripts submitted for review—sourced from the journal eLife's transparent peer review model—rather than the final, published versions of the articles. The AI was instructed to perform several tasks a human reviewer might: write a standard review, recommend a paper for rejection, and even request that authors cite specific articles—some of which were completely unrelated to the research topic.

The results were alarming. The researchers found that popular AI detection tools were largely ineffective, with one detector flagging over 80% of the AI-generated reviews as "human-written." While the AI's standard reviews often lacked the depth of an expert, it excelled at generating persuasive rejection comments and fabricating plausible-sounding reasons to cite irrelevant studies.

"We were surprised by how easily the LLM could generate convincing rejection comments and seemingly reasonable requests to cite unrelated papers," explains Peng Luo, one of the study's corresponding authors from the Department of Oncology at Zhujiang Hospital, Southern Medical University. "This creates a serious risk. Malicious reviewers could use this technology to unfairly reject good research or to manipulate citation numbers for their own benefit. The system is built on trust, and this technology can break that trust."

However, the researchers also found a potential upside. The same AI was effective at writing strong rebuttals to these unreasonable citation requests, offering a new tool for authors to defend their work against unfair criticism.

The authors urge the academic community to establish clear guidelines and new oversight mechanisms to ensure AI is used responsibly to support, not undermine, scientific integrity.
