Researchers found a better way to teach large language models new skills
North Carolina State University
Researchers have developed a technique that significantly improves the performance of large language models without increasing the computational power necessary to fine-tune the models. The researchers demonstrated that their technique improves the performance of these models over previous techniques in tasks including commonsense reasoning, arithmetic reasoning, instruction following, code generation, and visual recognition.
Large language models are artificial intelligence systems that are pretrained on huge data sets. After pretraining, these models predict which words should follow each other in order to respond to user queries. However, the nonspecific nature of pretraining means that there is ample room for improvement with these models when the user queries are focused on specific topics, such as when a user requests the model to answer a math question or to write computer code.
“In order to improve a model’s ability to perform more specific tasks, you need to fine-tune the model,” says Tianfu Wu, co-corresponding author of a paper on the work and an associate professor of computer engineering at North Carolina State University. “However, these models are so large that it is not feasible to re-train the entire model. Instead, you want to determine the smallest number of changes necessary to improve the model’s performance. We’ve developed a technique, called WeGeFT (pronounced wee-gift), that represents a significant advance for fine-tuning these large models.”
The big breakthrough for fine-tuning these large models was LoRA, short for low-rank adaptation, which came out in 2022. LoRA works by using mathematical tools to identify a small subset of key parameters that are most likely to improve a model’s performance on a specific task. There have been many attempts to improve upon LoRA, but Wu and his collaborators found that these previous efforts either required significantly more computational power to improve performance, or used the same amount of computing power without improving performance.
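For readers who want the gist in code, below is a minimal sketch of the low-rank adapter idea behind LoRA, written in PyTorch; the class and variable names are illustrative, not taken from any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a small trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # pretrained weights stay fixed
        d_out, d_in = base.weight.shape
        # Only these two small matrices are trained; the effective weight
        # becomes W + (alpha / rank) * B @ A.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: starts as a no-op

        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```

Because only the small matrices A and B are trained, the number of trainable parameters per layer drops from d_out × d_in to rank × (d_out + d_in), which is what makes fine-tuning very large models feasible on modest hardware.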
“WeGeFT builds on LoRA, but incorporates additional mathematical tools that allow us to determine which of the key parameters the model is already familiar with and which parameters the model would need to ‘learn,’” says Wu. “By placing more weight on the truly novel parameters, we are able to improve model performance compared to LoRA without incorporating significant new computational demands.”
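The release does not spell out WeGeFT’s formulation, so the sketch below is a hand-wavy illustration of the “weight-generative” idea rather than the published algorithm: the update is generated from the frozen pretrained weight itself, so the learned matrices act on directions the model already encodes.

```python
import torch
import torch.nn as nn

class WeightGenerativeLinear(nn.Module):
    """Illustrative only (not the published WeGeFT algorithm): generate the
    fine-tuning residual as a learned low-rank transform of the frozen
    pretrained weight W, i.e. W + W @ A @ B instead of LoRA's W + B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # pretrained weights stay fixed
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.01)  # hypothetical names
        self.B = nn.Parameter(torch.zeros(rank, d_in))          # zero init: no-op at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.base.weight @ self.A @ self.B   # residual generated from W itself
        return self.base(x) + x @ delta.T
```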
In proof-of-concept testing, the researchers found that WeGeFT performed as well as or better than LoRA and its many variants across a variety of downstream tasks: commonsense reasoning, arithmetic reasoning, instruction following, code generation, and visual recognition.
“We think this is a valuable step forward,” Wu says. “We are now exploring ways that WeGeFT could also be used to identify elements of the model that are responsible for harmful outputs, with the goal of improving AI alignment and performing model ‘surgery’ to improve safety and outputs. We expect that work to be forthcoming.”
The paper, “WeGeFT: Weight-Generative Fine-Tuning for Multi-Faceted Efficient Adaptation of Large Models,” will be presented July 17 at the International Conference on Machine Learning, being held in Vancouver, Canada. Co-corresponding author of the paper is Chinmay Savadikar, a Ph.D. student at NC State. The paper was co-authored by Xi Song, an independent researcher.
This work was done with support from the National Science Foundation under grants 1909644, 2024688 and 2013451; and from the Army Research Office under grants W911NF1810295 and W911NF2210010.
Method of Research
Experimental study
Subject of Research
Not applicable
Article Title
WeGeFT: Weight-Generative Fine-Tuning for Multi-Faceted Efficient Adaptation of Large Models
BU creates new language model to help people obtain more accurate answers to science, health questions
PodGPT combines expert knowledge from podcasts with the latest research articles
(Boston)—The rise of generative artificial intelligence (AI), particularly large language models (LLMs), has marked a transformative shift in data analysis, interpretation and content generation. These models, trained on extensive textual datasets, have demonstrated the ability to generate contextually accurate and linguistically rich outputs, with profound implications for fields such as science and medicine, where models like OpenAI’s GPT-4 have shown remarkable aptitude. However, the full potential of LLMs in science, technology, engineering, mathematics and medicine (STEMM) remains underexplored, particularly in integrating non-traditional data modalities such as audio content.
In a new study, researchers from Boston University introduce PodGPT, a computer program that learns from science and medicine podcasts to get better at understanding and answering scientific questions.
“By integrating spoken content, we aim to enhance our model’s understanding of conversational language and extend its application to more specialized contexts within STEMM disciplines,” explains corresponding author Vijaya B. Kolachalama, PhD, FAHA, associate professor of medicine and computer science at Boston University Chobanian & Avedisian School of Medicine. “This is special because it uses real conversations, like expert interviews and talks, instead of just written material, helping it better understand how people actually talk about science in real life.”
Kolachalama and his colleagues collected more than 3,700 hours of publicly available science and medicine podcasts and turned the speech into text using advanced software. They then trained a computer model to learn from this information. Following this, they tested the model on a variety of quizzes in subjects like biology, math, and medicine, including questions in different languages, to see how well it performed. The results demonstrated that incorporating STEMM audio podcast data enhanced their model’s ability to understand and generate precise and comprehensive information.
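The release does not name the transcription software or training stack. As a rough sketch of the kind of pipeline described, assuming OpenAI’s open-source Whisper for speech-to-text and a Hugging Face causal language model for continued training (the file names are placeholders, and GPT-2 stands in for the study’s actual base model):

```python
import whisper                                     # openai-whisper, for speech-to-text
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

# 1. Transcribe podcast audio to text (file names are placeholders).
asr = whisper.load_model("base")
transcripts = [asr.transcribe(path)["text"]
               for path in ["episode_001.mp3", "episode_002.mp3"]]

# 2. Tokenize transcripts for causal language modeling. GPT-2 is a small,
#    openly available stand-in for the base LLM used in the study.
tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token

def encode(example):
    out = tok(example["text"], truncation=True, padding="max_length", max_length=512)
    out["labels"] = out["input_ids"].copy()        # train to predict the next token
    return out

data = Dataset.from_dict({"text": transcripts}).map(encode, remove_columns=["text"])

# 3. Continue training the base model on the transcript corpus.
model = AutoModelForCausalLM.from_pretrained("gpt2")
Trainer(model=model,
        args=TrainingArguments(output_dir="podgpt_sketch", num_train_epochs=1),
        train_dataset=data).train()
```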
According to the researchers, this study shows that voice-based content like podcasts can be used to train AI tools. Kolachalama is also a Founding Member of the Faculty of Computing & Data Sciences at Boston University and an affiliate of the university’s Hariri Institute for Computing.
“This opens the door to using all kinds of audio, like lectures or interviews, to build smarter and more human-like technology. It also shows promise in making science more accessible in many languages, helping people across the world learn and stay informed,” said Kolachalama.
The researchers believe that this technology will not only make scientific and medical knowledge easier to access, but that hearing experts discuss their fields will also help people make more informed decisions about their health and education.
“This could help improve understanding and diagnosis in many health conditions such as Alzheimer's disease, cardiovascular disease, infectious diseases, cancer and mental health. It may also support learning in areas like public health and planetary health,” said Kolachalama.
These findings appear online in the journal npj Biomedical Innovations.
This project was supported by grants from the Karen Toffler Charitable Trust (V.B.K.), the National Institute on Aging’s Artificial Intelligence and Technology Collaboratories (P30-AG073104 and P30-AG073105, V.B.K.), the American Heart Association (20SFRN35460031, V.B.K. & R.A.), Gates Ventures (R.A. & V.B.K.), and the National Institutes of Health (R01-HL159620 [V.B.K.], R01-AG083735 [R.A. & V.B.K.], R01-AG062109 [R.A. & V.B.K.], and U19-AG068753 [R.A.]).
Note to editors: V.B.K. is a co-founder and equity holder of deepPath Inc. and CogniScreen, Inc. He also serves on the scientific advisory board of Altoida Inc. R.A. is a scientific advisor to Signant Health and NovoNordisk. The remaining authors declare no competing interests.
Journal
npj Biomedical Innovations
Method of Research
Computational simulation/modeling
Subject of Research
Not applicable
Article Title
PodGPT: an audio-augmented large language model for research and education
Article Publication Date
7-Jul-2025
From position to meaning: how AI learns to read
A study in JSTAT describes the sharp shift in text comprehension strategies during neural network training
The language capabilities of today’s artificial intelligence systems are astonishing. We can now engage in natural conversations with systems like ChatGPT, Gemini, and many others, with a fluency nearly comparable to that of a human being. Yet we still know very little about the internal processes in these networks that lead to such remarkable results.
A new study published in the Journal of Statistical Mechanics: Theory and Experiment (JSTAT) reveals a piece of this mystery. It shows that when small amounts of data are used for training, neural networks initially rely on the position of words in a sentence. However, as the system is exposed to enough data, it transitions to a new strategy based on the meaning of the words. The study finds that this transition occurs abruptly, once a critical data threshold is crossed — much like a phase transition in physical systems. The findings offer valuable insights for understanding the workings of these models.
Just like a child learning to read, a neural network starts by understanding sentences based on the positions of words: depending on where words are located in a sentence, the network can infer their relationships (are they subjects, verbs, objects?). However, as the training continues — the network “keeps going to school” — a shift occurs: word meaning becomes the primary source of information.
This, the new study explains, is what happens in a simplified model of the self-attention mechanism — a core building block of transformer language models, like the ones we use every day (ChatGPT, Gemini, Claude, etc.). A transformer is a neural network architecture designed to process sequences of data, such as text, and it forms the backbone of many modern language models. Transformers specialize in understanding relationships within a sequence and use the self-attention mechanism to assess the importance of each word relative to the others.
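As a concrete reference point, here is a minimal sketch of scaled dot-product self-attention in Python; a full transformer head adds learned query, key, and value projections, and the model analyzed in the paper simplifies further still.

```python
import numpy as np

def self_attention(X: np.ndarray) -> np.ndarray:
    """Minimal single-head scaled dot-product self-attention over X of shape (L, d)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # relevance of each token to every other
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ X                              # outputs mix tokens by relevance
```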
“To assess relationships between words,” explains Hugo Cui, a postdoctoral researcher at Harvard University and first author of the study, “the network can use two strategies, one of which is to exploit the positions of words.” In a language like English, for example, the subject typically precedes the verb, which in turn precedes the object. “Mary eats the apple” is a simple example of this sequence.
“This is the first strategy that spontaneously emerges when the network is trained,” Cui explains. “However, in our study, we observed that if training continues and the network receives enough data, at a certain point — once a threshold is crossed — the strategy abruptly shifts: the network starts relying on meaning instead.”
“When we designed this work, we simply wanted to study which strategies, or mix of strategies, the networks would adopt. But what we found was somewhat surprising: below a certain threshold, the network relied exclusively on position, while above it, only on meaning.”
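To make the two strategies tangible, here is a toy construction (ours, not the paper’s model) in which token content and position occupy separable halves of each embedding, so a single attention head can read out either signal depending on its projection matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 8
half = d // 2
# Toy setup: token content lives in the first half of each embedding,
# position in the second half, so the two signals are cleanly separable.
tok = np.concatenate([rng.normal(size=(L, half)), np.zeros((L, half))], axis=1)
pos = np.concatenate([np.zeros((L, half)), rng.normal(size=(L, half))], axis=1)
X = tok + pos

def attention(X, P):
    """Attention weights when queries/keys are the projection P of the input."""
    q = X @ P
    s = q @ q.T / np.sqrt(d)
    s -= s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    return w / w.sum(axis=-1, keepdims=True)

P_sem = np.diag([1.0] * half + [0.0] * half)  # reads only token content
P_pos = np.diag([0.0] * half + [1.0] * half)  # reads only position

print(attention(X, P_sem))  # "semantic" weights: ignore where words sit
print(attention(X, P_pos))  # "positional" weights: ignore what the words are
```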
Cui describes this shift as a phase transition, borrowing a concept from physics. Statistical physics studies systems composed of enormous numbers of particles (like atoms or molecules) by describing their collective behavior statistically. Similarly, neural networks — the foundation of these AI systems — are composed of large numbers of “nodes,” or neurons (named by analogy to the human brain), each connected to many others and performing simple operations. The system’s intelligence emerges from the interaction of these neurons, a phenomenon that can be described with statistical methods.
This is why we can speak of an abrupt change in network behavior as a phase transition, similar to how water, under certain conditions of temperature and pressure, changes from liquid to gas.
“Understanding from a theoretical viewpoint that the strategy shift happens in this manner is important,” Cui emphasizes. “Our networks are simplified compared to the complex models people interact with daily, but they can give us hints to begin to understand the conditions that cause a model to stabilize on one strategy or another. This theoretical knowledge could hopefully be used in the future to make the use of neural networks more efficient, and safer.”
The research by Hugo Cui, Freya Behrens, Florent Krzakala, and Lenka Zdeborová, titled “A Phase Transition between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention”, is published in JSTAT as part of the Machine Learning 2025 special issue and is included in the proceedings of the NeurIPS 2024 conference.
Journal
Journal of Statistical Mechanics: Theory and Experiment
Method of Research
Data/statistical analysis
Article Title
A Phase Transition between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention
Article Publication Date
7-Jul-2025