Researchers test the trustworthiness of AI—by playing sudoku
Artificial intelligence tools called large language models (LLMs), such as OpenAI’s ChatGPT or Google’s Gemini, can do a lot these days—dispensing relationship advice, crafting texts to get you out of social obligations and even writing science articles.
But can they also solve your morning sudoku?
In a new study, a team of computer scientists from the University of Colorado Boulder decided to find out. The group created nearly 2,300 original sudoku puzzles, which require players to enter numbers into a grid following certain rules, then asked several AI tools to fill them in.
The results were a mixed bag. While some of the AI models could solve easy sudokus, even the best struggled to explain how they solved them—giving garbled, inaccurate or even surreal descriptions of how they arrived at their answers. The results raise questions about the trustworthiness of AI-generated information, said study co-author Maria Pacheco.
“For certain types of sudoku puzzles, most LLMs still fall short, particularly in producing explanations that are in any way usable for humans,” said Pacheco, assistant professor in the Department of Computer Science. “Why did it come up with that solution? What are the steps you need to take to get there?”
She and her colleagues published their results this month in Findings of the Association for Computational Linguistics.
The researchers aren’t trying to cheat at puzzles. Instead, they’re using these logic exercises to explore how AI platforms think. The results could one day lead to more reliable and trustworthy computer programs, said study co-author Fabio Somenzi, professor in the Department of Electrical, Computer and Energy Engineering.
“Puzzles are fun, but they’re also a microcosm for studying the decision-making process in machine learning,” he said. “If you have AI prepare your taxes, you want to be able to explain to the IRS why the AI wrote what it wrote.”
Daily puzzle
Somenzi, who is a self-described sudoku fan, noted that the puzzles tap into a very human way of thinking. Filling out a sudoku grid requires puzzlers to learn and follow a set of logical rules. For example, you can’t enter a two in an empty square if there’s already a two in the same row or column.
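As a rough illustration of the kind of rule the puzzles enforce, here is a minimal Python sketch (not the study's code) that checks whether a digit can legally be placed in a six-by-six grid, assuming the common layout of two-row-by-three-column boxes:

```python
# Minimal sketch (not the study's code): check whether placing `digit`
# at (row, col) in a six-by-six sudoku grid breaks any rule.
# Assumes the common 2-row-by-3-column box layout; 0 marks an empty cell.

def placement_is_legal(grid, row, col, digit):
    # Rule 1: the digit must not already appear in the same row or column.
    if digit in grid[row]:
        return False
    if any(grid[r][col] == digit for r in range(6)):
        return False
    # Rule 2: the digit must not already appear in the same 2x3 box.
    box_row, box_col = 2 * (row // 2), 3 * (col // 3)
    for r in range(box_row, box_row + 2):
        for c in range(box_col, box_col + 3):
            if grid[r][c] == digit:
                return False
    return True
```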
Most LLMs today struggle at that kind of thinking, in large part because of how they’re trained.
To build ChatGPT, for example, programmers first fed the AI vast amounts of text scraped from the internet. When ChatGPT responds to a question, it predicts the most likely response based on all that data—almost like a computer version of rote memory.
“What they do is essentially predict the next word,” Pacheco said. “If you have the start to a sentence, what word comes next? They do that by referring to every sentence in the English language that they can get their hands on.”
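A toy bigram model, sketched below, captures the spirit of that description. Real LLMs use neural networks over far larger contexts, and the tiny corpus here is purely illustrative:

```python
from collections import Counter, defaultdict

# Toy illustration of "predict the next word" (real LLMs use neural
# networks over huge contexts, not raw bigram counts like this).
corpus = "the puzzle has a rule . the puzzle has a grid . the grid has a rule ."
words = corpus.split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    return bigrams[word].most_common(1)[0][0] if bigrams[word] else None

print(predict_next("the"))   # 'puzzle' (seen twice, vs. 'grid' once)
print(predict_next("has"))   # 'a'
```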
Pacheco, Somenzi and their colleagues have joined a growing effort in computer science to merge those two ways of thinking—combining the memory of an LLM with a human brain’s capacity for logic, a pursuit known as “neurosymbolic” AI.
Anirudh Maiya and Razan Alghamdi, both former graduate students at CU Boulder, were also co-authors of the new paper.
How’s the weather?
To begin, the researchers created sudoku puzzles of varying difficulty using a six-by-six grid (a simpler version of the nine-by-nine puzzles you usually find online).
They then gave the puzzles to a series of AI models, including the preview version of OpenAI's o1 model, which at the time of the experiments represented the state of the art for its kind of LLM.
The o1 model led the pack, solving roughly 65% of the sudoku puzzles correctly. Then the team asked the AI platforms to explain how they got their answers. That’s when the results got really wild.
“Sometimes, the AI explanations made up facts,” said Ashutosh Trivedi, a co-author of the study and associate professor of computer science at CU Boulder. “So it might say, 'There cannot be a two here because there’s already a two in the same row,' but that wasn’t the case.”
In a telling example, the researchers were talking to one of the AI tools about solving sudoku when it, for unknown reasons, responded with a weather forecast.
“At that point, the AI had gone berserk and was completely confused,” Somenzi said.
The researchers hope to design their own AI system that can do it all—solving complicated puzzles and explaining how. They’re starting with another type of puzzle called hitori, which, like sudoku, involves a grid of numbers.
“People talk about the emerging capabilities of AI where they end up being able to solve things that you wouldn’t expect them to solve,” Pacheco said. “At the same time, it’s not surprising that they’re still bad at a lot of tasks.”
Article Title
Explaining Puzzle Solutions in Natural Language: An Exploratory Study on 6x6 Sudoku
AI can fake peer reviews and escape detection, study finds
FAR Publishing Limited
Image: A diagram showing the research workflow used to test the capabilities and risks of using a large language model for peer review tasks. Credit: Lingxuan Zhu et al.
Large language models (LLMs) like ChatGPT can be used to write convincing but biased peer reviews that are nearly impossible to distinguish from human writing, a new study reveals. This poses a serious threat to the integrity of scientific publishing, where peer review is the critical process for vetting research quality and accuracy.
In a study evaluating the risks of AI in academic publishing, a team of researchers from China tasked the AI model Claude with reviewing 20 real cancer research manuscripts. To closely simulate the real-world peer review process, they used the initial manuscripts submitted for review—sourced from the journal eLife's transparent peer review model—rather than the final, published versions of the articles. The AI was instructed to perform several tasks a human reviewer might: write a standard review, recommend a paper for rejection, and even request that authors cite specific articles—some of which were completely unrelated to the research topic.
The results were alarming. The researchers found that popular AI detection tools were largely ineffective, with one detector flagging over 80% of the AI-generated reviews as "human-written." While the AI's standard reviews often lacked the depth of an expert, it excelled at generating persuasive rejection comments and fabricating plausible-sounding reasons to cite irrelevant studies.
"We were surprised by how easily the LLM could generate convincing rejection comments and seemingly reasonable requests to cite unrelated papers," explains Peng Luo, one of the study's corresponding authors from the Department of Oncology at Zhujiang Hospital, Southern Medical University. "This creates a serious risk. Malicious reviewers could use this technology to unfairly reject good research or to manipulate citation numbers for their own benefit. The system is built on trust, and this technology can break that trust."
However, the researchers also found a potential upside. The same AI was effective at writing strong rebuttals to these unreasonable citation requests, offering a new tool for authors to defend their work against unfair criticism.
The authors urge the academic community to establish clear guidelines and new oversight mechanisms to ensure AI is used responsibly to support, not undermine, scientific integrity.
Journal
Clinical and Translational Discovery
Method of Research
Experimental study
Subject of Research
Not applicable
Article Title
Evaluating the potential risks of employing large language models in peer review.
COI Statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
AI predicts mental health crises from sparse digital data
Image: Real-time risk prediction from multimodal data. Credit: Wang et al./Med Research
A ground-breaking approach using "small data" machine learning could revolutionize mental healthcare by predicting individual crises before they escalate. Published in Med Research, the study demonstrates how models like Tabular Prior-data Fitted Networks (TabPFN) analyze sparse, irregular digital footprints—such as sleep patterns, typing dynamics, and movement—to forecast depressive relapses or manic episodes with clinical-level accuracy.
Traditional mental health assessments rely on infrequent clinical interviews, missing subtle real-time warning signs. In contrast, this method detects deterioration from fragmented data streams, even with fewer than 100 data points per patient. For example, GPS-derived social withdrawal combined with erratic typing patterns predicted bipolar episodes 24 hours in advance during trials.
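A minimal sketch of how such a small-data tabular prediction might be set up with the open-source tabpfn package is shown below. The features (sleep, typing speed, mobility radius), the synthetic data, and the plain default classifier settings are illustrative assumptions, not the study's pipeline:

```python
import numpy as np
from tabpfn import TabPFNClassifier  # assumes the open-source tabpfn package

# Synthetic stand-in data: ~80 patients, 3 passively sensed features
# (hours of sleep, typing speed, daily mobility radius). Labels mark
# whether a relapse occurred in the following 24 hours. Purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = (X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=80) > 0).astype(int)

clf = TabPFNClassifier()                 # designed to work with <100 samples
clf.fit(X[:60], y[:60])                  # fit on the first 60 patients
risk = clf.predict_proba(X[60:])[:, 1]   # relapse probability for the rest
print(risk.round(2))
```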
"We bridge the gap between sparse digital phenotyping and actionable clinical insights," said lead author Dr. Peng Wang of Vrije Universiteit Amsterdam and Erasmus Universiteit Rotterdam. "This isn’t just forecasting symptoms—it’s pre-empting crises by translating hidden behavioral shifts into personalized risk scores."
The framework addresses key challenges:
- Real-time adaptation: Updates predictions within hours, not days.
- Uncertainty quantification: Provides confidence intervals (e.g., "72% relapse risk ±8%").
- Clinical integration: Alerts flow directly into electronic health records, prompting timely interventions.
Future steps include clinical validation and deploying these models on edge devices like smartwatches for privacy-preserving monitoring.
Journal
Med Research
Method of Research
Commentary/editorial
Subject of Research
Not applicable
Article Title
Harnessing Small-Data Machine Learning for Transformative Mental Health Forecasting: Towards Precision Psychiatry With Personalised Digital Phenotyping.
NYU Tandon researchers develop AI agent that solves cybersecurity challenges autonomously
New framework called EnIGMA demonstrates improved performance in automated vulnerability detection using interactive tools
NYU Tandon School of Engineering
Artificial intelligence agents — AI systems that can work independently toward specific goals without constant human guidance — have demonstrated strong capabilities in software development and web navigation. Their effectiveness in cybersecurity has remained limited, however.
That may soon change, thanks to a research team from NYU Tandon School of Engineering, NYU Abu Dhabi and other universities that developed an AI agent capable of autonomously solving complex cybersecurity challenges.
The system, called EnIGMA, was presented this month at the International Conference on Machine Learning (ICML) 2025 in Vancouver, Canada.
"EnIGMA is about using Large Language Model agents for cybersecurity applications," said Meet Udeshi, a NYU Tandon Ph.D. student and co-author of the research. Udeshi is advised by Ramesh Karri, Chair of NYU Tandon's Electrical and Computer Engineering Department (ECE) and a faculty member of the NYU Center for Cybersecurity and NYU Center for Advanced Technology in Telecommunications (CATT), and by Farshad Khorrami, ECE professor and CATT faculty member. Both Karri and Khorrami are co-authors on the paper, with Karri serving as a senior author.
To build EnIGMA, the researchers started with an existing framework called SWE-agent, which was originally designed for software engineering tasks. However, cybersecurity challenges required specialized tools that didn't exist in previous AI systems. "We have to restructure those interfaces to feed it into an LLM properly. So we've done that for a couple of cybersecurity tools," Udeshi explained.
The key innovation was developing what they call "Interactive Agent Tools" that convert visual cybersecurity programs into text-based formats the AI can understand. Traditional cybersecurity tools like debuggers and network analyzers use graphical interfaces with clickable buttons, visual displays, and interactive elements that humans can see and manipulate.
"Large language models process text only, but these interactive tools with graphical user interfaces work differently, so we had to restructure those interfaces to work with LLMs," Udeshi said.
The team built their own dataset by collecting and structuring Capture The Flag (CTF) challenges specifically for large language models. These gamified cybersecurity competitions simulate real-world vulnerabilities and have traditionally been used to train human cybersecurity professionals.
"CTFs are like a gamified version of cybersecurity used in academic competitions. They're not true cybersecurity problems that you would face in the real world, but they are very good simulations," Udeshi noted.
Paper co-author Minghao Shao, an NYU Tandon Ph.D. student and Global Ph.D. Fellow at NYU Abu Dhabi who is advised by Karri and Muhammad Shafique, Professor of Computer Engineering at NYU Abu Dhabi and ECE Global Network Professor at NYU Tandon, described the technical architecture: "We built our own CTF benchmark dataset and created a specialized data loading system to feed these challenges into the model." Shafique is also a co-author on the paper.
The framework includes specialized prompts that provide the model with instructions tailored to cybersecurity scenarios.
EnIGMA demonstrated superior performance across multiple benchmarks. The system was tested on 390 CTF challenges across four different benchmarks, achieving state-of-the-art results and solving more than three times as many challenges as previous AI agents.
During the research conducted approximately 12 months ago, "Claude 3.5 Sonnet from Anthropic was the best model, and GPT-4o was second at that time," according to Udeshi.
The research also identified a previously unknown phenomenon called "soliloquizing," where the AI model generates hallucinated observations without actually interacting with the environment, a discovery that could have important consequences for AI safety and reliability.
Beyond this technical finding, the potential applications extend outside of academic competitions. "If you think of an autonomous LLM agent that can solve these CTFs, that agent has substantial cybersecurity skills that you can use for other cybersecurity tasks as well," Udeshi explained. The agent could potentially be applied to real-world vulnerability assessment, with the ability to "try hundreds of different approaches" autonomously.
The researchers acknowledge the dual-use nature of their technology. While EnIGMA could help security professionals identify and patch vulnerabilities more efficiently, it could also potentially be misused for malicious purposes. The team has notified representatives from major AI companies including Meta, Anthropic, and OpenAI about their results.
In addition to Karri, Khorrami, Shafique, Udeshi and Shao, the paper's authors are Talor Abramovich (Tel Aviv University), Kilian Lieret (Princeton University), Haoran Xi (NYU Tandon), Kimberly Milner (NYU Tandon), Sofija Jancheska (NYU Tandon), John Yang (Stanford University), Carlos E. Jimenez (Princeton University), Prashanth Krishnamurthy (NYU Tandon), Brendan Dolan-Gavitt (NYU Tandon), Karthik Narasimhan (Princeton University), and Ofir Press (Princeton University).
Funding for the research came from Open Philanthropy, Oracle, the National Science Foundation, the Army Research Office, the Department of Energy, and NYU Abu Dhabi Center for Cybersecurity and Center for Artificial Intelligence and Robotics.
Method of Research
Computational simulation/modeling
AI system helps researchers unlock hidden potential in newly discovered materials
Developed by U of T Engineering researchers, the tool uses early-stage data to predict the potential real-world use for a new material
Each year, researchers around the world create thousands of new materials — but many of them never reach their full potential. A new AI tool from the University of Toronto's Faculty of Applied Science & Engineering could change that by predicting how a new material could best be used, right from the moment it’s made.
In a study published in Nature Communications, a team led by Professor Seyed Mohamad Moosavi introduces a multimodal AI tool that can predict how well a new material might perform in the real world.
The system focuses on a class of porous materials known as metal-organic frameworks (MOFs). Moosavi says that last year alone, materials scientists created more than 5,000 different types of MOFs, which have tunable properties that lead to a wide range of potential applications.
For example, MOFs can be used to separate CO2 from other gases in a waste stream, preventing the carbon from reaching the atmosphere and contributing to climate change. They can also be used to deliver drugs to particular areas of the body, or to add new functions to advanced electronic devices.
According to Moosavi, one major challenge facing the field is that a MOF created for one purpose often turns out to have the ideal properties for a completely different application.
For example, one of the team's previous studies found that a material originally synthesized for photocatalysis was instead very effective for carbon capture, but that discovery came only seven years later.
“In materials discovery, the typical question is, ‘What is the best material for this application?’” says Moosavi.
“We flipped the question and asked, ‘What’s the best application for this new material?’ With so many materials made every day, we want to shift the focus from ‘what material do we make next’ to ‘what evaluation should we do next.’”
This approach aims to reduce the time lag between discovery and deployment of MOFs.
To help make this possible, PhD student Sartaaj Khan developed a multimodal machine learning system trained on various types of data typically available immediately after synthesis — specifically, the precursor chemicals used to make the material, and its powder X-ray diffraction (PXRD) pattern.
“Multimodality matters,” says Khan. “Just as humans use different senses — such as vision and language — to understand the world, combining different types of material data gives our model a more complete picture.”
The AI system uses a multimodal pretraining strategy to gain insights into a material’s geometry and chemical environment, enabling it to make accurate property predictions without needing post-synthesis structural characterization.
This can speed up the discovery process and help researchers recognize promising materials before they’re overlooked or shelved.
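Below is a toy PyTorch sketch of the general idea of fusing two such modalities, a fingerprint vector for the precursor chemicals and a powder X-ray diffraction pattern, into a single property prediction. The layer sizes, input dimensions, and target are illustrative assumptions, not the architecture from the paper:

```python
import torch
import torch.nn as nn

# Toy sketch of multimodal fusion (not the paper's architecture): one branch
# encodes a precursor-chemical fingerprint, the other a PXRD intensity
# pattern; their embeddings are concatenated and mapped to a predicted
# property (e.g., CO2 uptake).
class MOFPropertyModel(nn.Module):
    def __init__(self, fp_dim=1024, pxrd_dim=900, hidden=128):
        super().__init__()
        self.fp_encoder = nn.Sequential(nn.Linear(fp_dim, hidden), nn.ReLU())
        self.pxrd_encoder = nn.Sequential(nn.Linear(pxrd_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, fingerprint, pxrd_pattern):
        joint = torch.cat([self.fp_encoder(fingerprint),
                           self.pxrd_encoder(pxrd_pattern)], dim=-1)
        return self.head(joint)

model = MOFPropertyModel()
fake_fp = torch.rand(4, 1024)      # batch of 4 precursor fingerprints
fake_pxrd = torch.rand(4, 900)     # matching PXRD intensity vectors
print(model(fake_fp, fake_pxrd).shape)   # torch.Size([4, 1])
```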
To test the model, the team conducted a ‘time-travel’ experiment. They trained the AI on material data available before 2017 and asked it to evaluate materials synthesized after that date.
The system successfully flagged several materials — originally developed for other purposes — as strong candidates for carbon capture. Some of those are now undergoing experimental validation in collaboration with the National Research Council of Canada.
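The evaluation protocol amounts to a temporal train/test split. A hedged sketch with pandas, using made-up column names and values rather than the authors' dataset, looks like this:

```python
import pandas as pd

# Hedged sketch of a "time-travel" evaluation split (illustrative column
# names and values, not the authors' dataset): train only on materials
# reported before 2017, then evaluate on everything reported after.
materials = pd.DataFrame({
    "name": ["MOF-A", "MOF-B", "MOF-C", "MOF-D"],
    "year": [2012, 2015, 2019, 2021],
    "co2_uptake": [1.8, 2.4, 3.1, 2.0],   # made-up target values
})

train = materials[materials["year"] < 2017]
test = materials[materials["year"] >= 2017]
print(len(train), "training materials;", len(test), "held-out materials")
```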
Looking ahead, Moosavi plans to integrate the AI into the self-driving laboratories (SDLs) at the University of Toronto's Acceleration Consortium, a global hub for automated materials discovery.
“SDLs automate the process of designing, synthesizing and testing new materials,” he says.
“When one lab creates a new material, our system could evaluate it — and potentially reroute it to another lab better equipped to assess its full potential. That kind of seamless inter-lab coordination could accelerate materials discovery.”
Journal
Nature Communications