How to use AI for discovery — without leading science astray
A new statistical technique allows researchers to safely use machine learning predictions to test scientific hypotheses
Over the past decade, AI has permeated nearly every corner of science: Machine learning models have been used to predict protein structures, estimate the fraction of the Amazon rainforest that has been lost to deforestation and even classify faraway galaxies that might be home to exoplanets.
But while AI can be used to speed scientific discovery — helping researchers make predictions about phenomena that may be difficult or costly to study in the real world — it can also lead scientists astray. In the same way that chatbots sometimes “hallucinate,” or make things up, machine learning models can sometimes present misleading or downright false results.
In a paper published online today (Thursday, Nov. 9) in Science, researchers at the University of California, Berkeley, present a new statistical technique for safely using the predictions obtained from machine learning models to test scientific hypotheses.
The technique, called prediction-powered inference (PPI), uses a small amount of real-world data to correct the output of large, general models — such as AlphaFold, which predicts protein structures — in the context of specific scientific questions.
“These models are meant to be general: They can answer many questions, but we don't know which questions they answer well and which questions they answer badly — and if you use them naively, without knowing which case you're in, you can get bad answers,” said study author Michael Jordan, the Pehong Chen Distinguished Professor of electrical engineering and computer science and of statistics at UC Berkeley. “With PPI, you're able to use the model, but correct for possible errors, even when you don’t know the nature of those errors at the outset.”
The risk of hidden biases
When scientists conduct experiments, they’re not just looking for a single answer — they want to obtain a range of plausible answers. This is done by calculating a “confidence interval,” which, in the simplest case, can be found by repeating an experiment many times and seeing how the results vary.
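In the simplest setting, that interval is the mean of the repeated measurements plus or minus a margin based on their spread. A minimal sketch in Python, with synthetic measurements standing in for real experimental runs:

```python
import math
import random

# Hypothetical repeated experiment: each run returns a noisy measurement
# of an unknown quantity (here, a true value of 10.0).
random.seed(0)
measurements = [10.0 + random.gauss(0, 1) for _ in range(100)]

n = len(measurements)
mean = sum(measurements) / n
# Sample variance and standard error of the mean
var = sum((x - mean) ** 2 for x in measurements) / (n - 1)
se = math.sqrt(var / n)

# 95% confidence interval for the mean (normal approximation)
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(ci)
```

With more repetitions, the interval narrows; the point of the Berkeley work is to get intervals like this one even when most of the "measurements" are model predictions rather than data.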
In most scientific studies, a confidence interval refers to a summary or combined statistic, not to individual data points. Unfortunately, machine learning systems focus on individual data points, and thus do not provide scientists with the kinds of uncertainty assessments they care about. For instance, AlphaFold predicts the structure of a single protein, but it doesn't provide a notion of confidence for that structure, nor a way to obtain confidence intervals that refer to general properties of proteins.
Scientists may be tempted to treat the predictions from AlphaFold as if they were data and use them to compute classical confidence intervals, ignoring the fact that these predictions are not data. The problem with this approach is that machine learning systems have many hidden biases that can skew the results. These biases arise, in part, from the data on which they are trained, which generally come from existing scientific research that may not have had the same focus as the current study.
“Indeed, in scientific problems, we're often interested in phenomena which are at the edge between the known and the unknown,” Jordan said. “Very often, there isn't much data from the past at that edge, and that makes generative AI models even more likely to ‘hallucinate,’ producing output that is unrealistic.”
Calculating valid confidence intervals
PPI allows scientists to incorporate the predictions from models like AlphaFold without making any assumptions about how the model was built or the data it was trained on. To do this, PPI requires a small amount of data that is unbiased, with respect to the specific hypothesis being investigated, paired with machine learning predictions corresponding to that data. By bringing these two sources of evidence together, PPI is able to form valid confidence intervals.
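The idea can be sketched in a few lines for the simple case of estimating a population mean: the average prediction over a large unlabeled pool is corrected by the average prediction error measured on the small labeled set, and the variability of both terms feeds into the interval. The sketch below is a simplified illustration with synthetic numbers, not the authors' implementation; the bias value and sample sizes are invented:

```python
import math
import random

random.seed(1)
z = 1.96  # 95% normal quantile

# Synthetic stand-ins: a large pool of model predictions (cheap, but with a
# hidden bias) and a small labeled set with both predictions and true values.
true_mean = 5.0
bias = 0.8  # hidden bias in the model's predictions

big_preds = [true_mean + bias + random.gauss(0, 1) for _ in range(10_000)]
small_true = [true_mean + random.gauss(0, 1) for _ in range(200)]
small_preds = [y + bias + random.gauss(0, 0.3) for y in small_true]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Rectifier: average error of the predictions on the labeled data.
errors = [y - f for y, f in zip(small_true, small_preds)]
rectifier = mean(errors)

# Prediction-powered point estimate and confidence interval.
theta = mean(big_preds) + rectifier
width = z * math.sqrt(var(big_preds) / len(big_preds)
                      + var(errors) / len(errors))
ci = (theta - width, theta + width)

# Naive estimate that treats predictions as data (inherits the bias).
naive = mean(big_preds)
```

In this toy run, the naive estimate inherits the model's bias, while the corrected estimate recenters near the true value, at the cost of a somewhat wider interval that honestly reflects the extra uncertainty.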
For example, the research team applied the PPI technique to algorithms that can pinpoint areas of deforestation in the Amazon using satellite imagery. These models were accurate, overall, when tested individually on regions in the forest; however, when these assessments were combined to estimate deforestation across the entire Amazon, the confidence intervals became highly skewed. This is likely because the model struggled to recognize certain newer patterns of deforestation.
With PPI, the team was able to correct for the bias in the confidence interval using a small number of human-labeled regions of deforestation.
The team also showed how the technique can be applied to a variety of other research, including questions about protein folding, galaxy classification, gene expression levels, counting plankton, and the relationship between income and private health insurance.
“There’s really no limit on the type of questions that this approach could be applied to,” Jordan said. “We think that PPI is a much-needed component of modern data-intensive, model-intensive and collaborative science.”
Additional co-authors include Anastasios N. Angelopoulos, Stephen Bates, Clara Fannjiang and Tijana Zrnic of UC Berkeley. This research was supported by the Office of Naval Research (N00014-21-1-2840) and the National Science Foundation.
JOURNAL
Science
METHOD OF RESEARCH
Data/statistical analysis
SUBJECT OF RESEARCH
Not applicable
ARTICLE TITLE
Prediction-powered inference
ARTICLE PUBLICATION DATE
9-Nov-2023
New AI noise-canceling headphone technology lets wearers pick which sounds they hear
Most anyone who’s used noise-canceling headphones knows that hearing the right noise at the right time can be vital. Someone might want to erase car horns when working indoors, but not when walking along busy streets. Yet people can’t choose what sounds their headphones cancel.
Now, a team led by researchers at the University of Washington has developed deep-learning algorithms that let users pick which sounds filter through their headphones in real time. The team is calling the system “semantic hearing.” Headphones stream captured audio to a connected smartphone, which cancels all environmental sounds. Either through voice commands or a smartphone app, headphone wearers can select which sounds they want to include from 20 classes, such as sirens, baby cries, speech, vacuum cleaners and bird chirps. Only the selected sounds will be played through the headphones.
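In outline, the final selection step is simple once the hard part, separating the captured audio into per-class tracks in real time, is done. The sketch below assumes that separation has already happened; the class names and waveform values are placeholders for illustration, not the UW system's output:

```python
# Minimal sketch of the selection step, assuming a separation model has
# already split the captured audio into per-class waveforms.
def select_sounds(separated, chosen_classes):
    """Sum only the waveforms for the classes the wearer opted in to."""
    length = max(len(w) for w in separated.values())
    out = [0.0] * length
    for name, waveform in separated.items():
        if name in chosen_classes:
            for i, sample in enumerate(waveform):
                out[i] += sample
    return out

# Tiny placeholder "waveforms" (three samples each) for three classes.
separated = {
    "siren":  [0.5, 0.4, 0.3],
    "speech": [0.1, 0.2, 0.1],
    "vacuum": [0.9, 0.9, 0.9],
}
playback = select_sounds(separated, {"siren", "speech"})
# The vacuum track is excluded; only siren and speech reach the headphones.
```

The real difficulty, as described below, is doing the separation itself within the system's tight latency budget while preserving spatial cues.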
The team presented its findings Nov. 1 at UIST ’23, the ACM Symposium on User Interface Software and Technology, in San Francisco. In the future, the researchers plan to release a commercial version of the system.
“Understanding what a bird sounds like and extracting it from all other sounds in an environment requires real-time intelligence that today’s noise canceling headphones haven’t achieved,” said senior author Shyam Gollakota, a UW professor in the Paul G. Allen School of Computer Science & Engineering. “The challenge is that the sounds headphone wearers hear need to sync with their visual senses. You can’t be hearing someone’s voice two seconds after they talk to you. This means the neural algorithms must process sounds in under a hundredth of a second.”
Because of this time crunch, the semantic hearing system must process sounds on a device such as a connected smartphone, instead of on more robust cloud servers. Additionally, because sounds from different directions arrive in people’s ears at different times, the system must preserve these delays and other spatial cues so people can still meaningfully perceive sounds in their environment.
Tested in environments such as offices, streets and parks, the system was able to extract sirens, bird chirps, alarms and other target sounds, while removing all other real-world noise. When 22 participants rated the system’s audio output for the target sound, they said that on average the quality improved compared to the original recording.
In some cases, the system struggled to distinguish between sounds that share many properties, such as vocal music and human speech. The researchers note that training the models on more real-world data might improve these outcomes.
Additional co-authors on the paper were Bandhav Veluri and Malek Itani, both UW doctoral students in the Allen School; Justin Chan, who completed this research as a doctoral student in the Allen School and is now at Carnegie Mellon University; and Takuya Yoshioka, director of research at AssemblyAI.
For more information, contact semantichearing@cs.washington.edu.
ARTICLE TITLE
Semantic Hearing: Programming Acoustic Scenes with Binaural Hearables
MIT engineers are on a failure-finding mission
The team’s new algorithm finds failures and fixes in all sorts of autonomous systems, from drone teams to power grids.
From vehicle collision avoidance to airline scheduling systems to power supply grids, many of the services we rely on are managed by computers. As these autonomous systems grow in complexity and ubiquity, so too do the ways in which they can fail.
Now, MIT engineers have developed an approach that can be paired with any autonomous system to quickly identify a range of potential failures in that system before it is deployed in the real world. What’s more, the approach can suggest fixes for those failures to avoid system breakdowns.
The team has shown that the approach can root out failures in a variety of simulated autonomous systems, including a small and large power grid network, an aircraft collision avoidance system, a team of rescue drones, and a robotic manipulator. In each of the systems, the new approach, in the form of an automated sampling algorithm, quickly identifies a range of likely failures as well as repairs to avoid those failures.
The new algorithm takes a different tack from other automated searches, which are designed to spot the most severe failures in a system. These approaches, the team says, could miss subtler though significant vulnerabilities that the new algorithm can catch.
“In reality, there’s a whole range of messiness that could happen for these more complex systems,” says Charles Dawson, a graduate student in MIT’s Department of Aeronautics and Astronautics. “We want to be able to trust these systems to drive us around, or fly an aircraft, or manage a power grid. It’s really important to know their limits and in what cases they’re likely to fail.”
Dawson and Chuchu Fan, assistant professor of aeronautics and astronautics at MIT, are presenting their work this week at the Conference on Robotic Learning.
Sensitivity over adversaries
In 2021, a major system meltdown in Texas got Fan and Dawson thinking. In February of that year, winter storms rolled through the state, bringing unexpectedly frigid temperatures that set off failures across the power grid. The crisis left more than 4.5 million homes and businesses without power for multiple days. The system-wide breakdown made for the worst energy crisis in Texas’ history.
“That was a pretty major failure that made me wonder whether we could have predicted it beforehand,” Dawson says. “Could we use our knowledge of the physics of the electricity grid to understand where its weak points could be, and then target upgrades and software fixes to strengthen those vulnerabilities before something catastrophic happened?”
Dawson and Fan’s work focuses on robotic systems and finding ways to make them more resilient in their environment. Prompted in part by the Texas power crisis, they set out to expand their scope, to spot and fix failures in other more complex, large-scale autonomous systems. To do so, they realized they would have to shift the conventional approach to finding failures.
Designers often test the safety of autonomous systems by identifying their most likely, most severe failures. They start with a computer simulation of the system that represents its underlying physics and all the variables that might affect the system’s behavior. They then run the simulation with a type of algorithm that carries out “adversarial optimization” — an approach that automatically optimizes for the worst-case scenario by making small changes to the system, over and over, until it can zero in on those changes that are associated with the most severe failures.
“By condensing all these changes into the most severe or likely failure, you lose a lot of complexity of behaviors that you could see,” Dawson notes. “Instead, we wanted to prioritize identifying a diversity of failures.”
To do so, the team took a more “sensitive” approach. They developed an algorithm that automatically generates random changes within a system and assesses the sensitivity, or potential failure of the system, in response to those changes. The more sensitive a system is to a certain change, the more likely that change is associated with a possible failure.
The approach enables the team to root out a wider range of possible failures. The method also allows researchers to identify fixes by backtracking through the chain of changes that led to a particular failure.
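As a rough illustration of the sampling idea (not the team's Bayesian algorithm), consider a toy system whose cost spikes only when two components are weakened together. Randomly sampling perturbations and keeping every sample that crosses a failure threshold surfaces that correlated failure mode; the grid model, cost function and threshold below are all invented:

```python
import random

# Toy stand-in for a system simulation: a "power grid" whose cost spikes
# when two specific lines are weakened together. Each perturbation is a
# pair of weakening levels in [0, 1], one per line.
def simulate(perturbation):
    line_a, line_b = perturbation
    # Individually each weakened line is tolerable; together they fail.
    return line_a + line_b + (5.0 if line_a > 0.7 and line_b > 0.7 else 0.0)

random.seed(2)
FAILURE_THRESHOLD = 5.0

# Sample many random perturbations rather than optimizing for the single
# worst one, so a diverse set of failure modes can surface.
samples = [(random.random(), random.random()) for _ in range(2_000)]
scored = [(simulate(p), p) for p in samples]
failures = [(cost, p) for cost, p in scored if cost > FAILURE_THRESHOLD]

# The surviving samples reveal the hidden correlation: every failure has
# both lines weakened at once.
print(f"{len(failures)} failures found out of {len(samples)} samples")
```

Backtracking from the failing samples to the perturbations that produced them is what points toward a repair: in this toy case, hardening either line alone would break the correlation.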
“We recognize there’s really a duality to the problem,” Fan says. “There are two sides to the coin. If you can predict a failure, you should be able to predict what to do to avoid that failure. Our method is now closing that loop.”
Hidden failures
The team tested the new approach on a variety of simulated autonomous systems, including a small and large power grid. In those cases, the researchers paired their algorithm with a simulation of generalized, regional-scale electricity networks. They showed that, while conventional approaches zeroed in on a single power line as the most vulnerable to fail, the team’s algorithm found that, if combined with a failure of a second line, a complete blackout could occur.
“Our method can discover hidden correlations in the system,” Dawson says. “Because we’re doing a better job of exploring the space of failures, we can find all sorts of failures, which sometimes includes even more severe failures than existing methods can find.”
The researchers showed similarly diverse results in other autonomous systems, including simulations of aircraft collision avoidance and rescue drone coordination. To see whether their failure predictions in simulation would bear out in reality, they also demonstrated the approach on a robotic manipulator — a robotic arm that is designed to push and pick up objects.
The team first ran their algorithm on a simulation of a robot that was directed to push a bottle out of the way without knocking it over. When they ran the same scenario in the lab with the actual robot, they found that it failed in the ways the algorithm predicted — for instance, knocking the bottle over or not quite reaching it. When they applied the algorithm’s suggested fix, the robot successfully pushed the bottle away.
“This shows that, in reality, this system fails when we predict it will, and succeeds when we expect it to,” Dawson says.
In principle, the team’s approach could find and fix failures in any autonomous system, as long as it comes with an accurate simulation of its behavior. Dawson envisions that one day the approach could be made into an app that designers and engineers can download and apply to tune and tighten their own systems before testing in the real world.
“As we increase the amount that we rely on these automated decision-making systems, I think the flavor of failures is going to shift,” Dawson says. “Rather than mechanical failures within a system, we’re going to see more failures driven by the interaction of automated decision-making and the physical world. We’re trying to account for that shift by identifying different types of failures, and addressing them now.”
This research is supported, in part, by NASA, the National Science Foundation, and the U.S. Air Force Office of Scientific Research.
###
Written by Jennifer Chu, MIT News
Paper: “A Bayesian approach to breaking things: efficiently predicting and repairing failure modes via sampling”
https://openreview.net/forum?id=fNLBmtyBiC
ARTICLE TITLE
“A Bayesian approach to breaking things: efficiently predicting and repairing failure modes via sampling”