Keeping a human in the loop: Managing the ethics of AI in medicine
Artificial intelligence (AI)—of ChatGPT fame—is increasingly used in medicine to improve diagnosis and treatment of diseases, and to avoid unnecessary screening for patients. But AI medical devices could also harm patients and worsen health inequities if they are not designed, tested, and used with care, according to an international task force that included a University of Rochester Medical Center bioethicist.
Jonathan Herington, PhD, was a member of the AI Task Force of the Society of Nuclear Medicine and Molecular Imaging, which laid out recommendations on how to ethically develop and use AI medical devices in two papers published in the Journal of Nuclear Medicine. In short, the task force called for increased transparency about the accuracy and limits of AI and outlined ways to ensure all people have access to AI medical devices that work for them—regardless of their race, ethnicity, gender, or wealth.
While the burden of proper design and testing falls to AI developers, health care providers are ultimately responsible for properly using AI and shouldn’t rely too heavily on AI predictions when making patient care decisions.
“There should always be a human in the loop,” said Herington, who is assistant professor of Health Humanities and Bioethics at URMC and was one of three bioethicists added to the task force in 2021. “Clinicians should use AI as an input into their own decision making, rather than replacing their decision making.”
This requires that doctors truly understand how a given AI medical device is intended to be used, how well it performs at that task, and any limitations—and they must pass that knowledge on to their patients. Doctors must weigh the relative risks of false positives versus false negatives for a given situation, all while taking structural inequities into account.
When using an AI system to identify probable tumors in PET scans, for example, health care providers must know how well the system performs at identifying this specific type of tumor in patients of the same sex, race, ethnicity, etc., as the patient in question.
“What that means for the developers of these systems is that they need to be very transparent,” said Herington.
According to the task force, it’s up to the AI developers to make accurate information about their medical device’s intended use, clinical performance, and limitations readily available to users. One way they recommend doing that is to build alerts right into the device or system that inform users about the degree of uncertainty of the AI’s predictions. That might look like heat maps on cancer scans that show whether areas are more or less likely to be cancerous.
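As a rough sketch of what such an alert could look like in practice (illustrative only, not code from the task force; the PET slice and tumor-probability map below are random stand-ins for real model output), a per-pixel uncertainty overlay might be rendered along these lines:

```python
# Illustrative sketch: overlay a hypothetical per-pixel tumor-probability map
# on a PET slice as a heat map, so a reader can see where the model is least certain.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
pet_slice = rng.random((128, 128))    # stand-in for a real PET image
tumor_prob = rng.random((128, 128))   # stand-in for model output in [0, 1]

# Uncertainty is highest where the predicted probability sits near 0.5.
uncertainty = 1.0 - 2.0 * np.abs(tumor_prob - 0.5)

fig, ax = plt.subplots()
ax.imshow(pet_slice, cmap="gray")
overlay = ax.imshow(uncertainty, cmap="hot", alpha=0.4)  # heat-map overlay
fig.colorbar(overlay, ax=ax, label="model uncertainty")
ax.set_title("Regions where the AI prediction is least certain")
plt.show()
```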
To minimize that uncertainty, developers must carefully define the data they use to train and test their AI models, and should use clinically relevant criteria to evaluate the model’s performance. It’s not enough to simply validate algorithms used by a device or system. AI medical devices should be tested in so-called “silent trials”, meaning their performance would be evaluated by researchers on real patients in real time, but their predictions would not be available to the health care provider or applied to clinical decision making.
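In outline, a silent trial can be as simple as scoring real cases and logging the results for later comparison against confirmed outcomes, without ever surfacing the prediction to the clinician. The sketch below only illustrates that idea; the model interface, arguments, and file name are hypothetical:

```python
# Illustrative sketch of a "silent trial" workflow: the model scores real cases
# and its predictions are logged for later evaluation, but nothing is shown to
# the clinician or used in care decisions. All names here are hypothetical.
import csv
from datetime import datetime, timezone

def silent_trial_record(case_id, model, image, log_path="silent_trial_log.csv"):
    prediction = model.predict(image)  # hypothetical model interface
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow(
            [case_id, prediction, datetime.now(timezone.utc).isoformat()]
        )
    # Deliberately return nothing: during the silent evaluation period the
    # prediction must not reach the clinician-facing workflow.
    return None
```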
Developers should also design AI models to be useful and accurate in all contexts in which they will be deployed.
“A concern is that these high-tech, expensive systems would be deployed in really high-resource hospitals, and improve outcomes for relatively well-advantaged patients, while patients in under-resourced or rural hospitals wouldn’t have access to them—or would have access to systems that make their care worse because they weren’t designed for them,” said Herington.
Currently, AI medical devices are being trained on datasets in which Latino and Black patients are underrepresented, meaning the devices are less likely to make accurate predictions for patients from these groups. In order to avoid deepening health inequities, developers must ensure their AI models are calibrated for all racial and gender groups by training them with datasets that represent all of the populations the medical device or system will ultimately serve.
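One hedged illustration of what such a check might involve is stratifying a held-out test set by demographic group and comparing calibration within each group. This is only a sketch of the general approach; the column names and grouping variable below are hypothetical:

```python
# Illustrative sketch: check calibration separately for each demographic group
# in a held-out test set (column names are hypothetical placeholders).
import pandas as pd
from sklearn.calibration import calibration_curve

def calibration_by_group(df: pd.DataFrame, group_col: str = "ethnicity"):
    """Expects a binary 'label' column and a model 'predicted_prob' column."""
    for group, subset in df.groupby(group_col):
        frac_pos, mean_pred = calibration_curve(
            subset["label"], subset["predicted_prob"], n_bins=10
        )
        # Large gaps between predicted probabilities and observed frequencies
        # flag groups for which the model is poorly calibrated and likely
        # needs more representative training data.
        print(group, abs(frac_pos - mean_pred).max())
```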
Though these recommendations were developed with a focus on nuclear medicine and medical imaging, Herington believes they can and should be applied to AI medical devices broadly.
“The systems are becoming ever more powerful all the time and the landscape is shifting really quickly,” said Herington. “We have a rapidly closing window to solidify our ethical and regulatory framework around these things.”
JOURNAL
Journal of Nuclear Medicine
METHOD OF RESEARCH
Commentary/editorial
SUBJECT OF RESEARCH
Not applicable
ARTICLE TITLE
Ethical Considerations for Artificial Intelligence in Medical Imaging: Data Collection, Development, and Evaluation
ARTICLE PUBLICATION DATE
12-Oct-2023
COI STATEMENT
Melissa McCradden acknowledges funding from the SickKids Foundation pertaining to her role as the John and Melinda Thompson Director of AI in Medicine at the Hospital for Sick Children. Abhinav Jha acknowledges support from NIH R01EB031051-02S1. Sven Zuehlsdorff is a full-time employee of Siemens Medical Solutions USA, Inc. No other potential conflict of interest relevant to this article was reported.
To excel at engineering design, generative AI must learn to innovate, study finds
AI models that prioritize similarity falter when asked to design something completely new.
ChatGPT and other deep generative models are proving to be uncanny mimics. These AI supermodels can churn out poems, finish symphonies, and create new videos and images by automatically learning from millions of examples of previous works. These enormously powerful and versatile tools excel at generating new content that resembles everything they’ve seen before.
But as MIT engineers say in a new study, similarity isn’t enough if you want to truly innovate in engineering tasks.
“Deep generative models (DGMs) are very promising, but also inherently flawed,” says study author Lyle Regenwetter, a mechanical engineering graduate student at MIT. “The objective of these models is to mimic a dataset. But as engineers and designers, we often don’t want to create a design that’s already out there.”
He and his colleagues make the case that if mechanical engineers want help from AI to generate novel ideas and designs, they will have to first refocus those models beyond “statistical similarity.”
“The performance of a lot of these models is explicitly tied to how statistically similar a generated sample is to what the model has already seen,” says co-author Faez Ahmed, assistant professor of mechanical engineering at MIT. “But in design, being different could be important if you want to innovate.”
In their study, Ahmed and Regenwetter reveal the pitfalls of deep generative models when they are tasked with solving engineering design problems. In a case study of bicycle frame design, the team shows that these models end up generating new frames that mimic previous designs but falter on engineering performance and requirements.
When the researchers presented the same bicycle frame problem to DGMs that they specifically designed with engineering-focused objectives, rather than only statistical similarity, these models produced more innovative, higher-performing frames.
The team’s results show that similarity-focused AI models don’t quite translate when applied to engineering problems. But, as the researchers also highlight in their study, with some careful planning of task-appropriate metrics, AI models could be an effective design “co-pilot.”
“This is about how AI can help engineers be better and faster at creating innovative products,” Ahmed says. “To do that, we have to first understand the requirements. This is one step in that direction.”
The team’s new study appeared recently online, and will be in the December print edition of the journal Computer-Aided Design. The research is a collaboration between computer scientists at the MIT-IBM Watson AI Lab and mechanical engineers in MIT’s DeCoDe Lab. The study’s co-authors include Akash Srivastava and Dan Gutfreund at the MIT-IBM Watson AI Lab.
Framing a problem
As Ahmed and Regenwetter write, DGMs are “powerful learners, boasting unparalleled ability” to process huge amounts of data. DGM is a broad term for any machine-learning model that is trained to learn the distribution of a dataset and then use it to generate new, statistically similar content. The enormously popular ChatGPT is one type of deep generative model known as a large language model, or LLM, which incorporates natural language processing capabilities to generate realistic text in response to conversational queries. Other popular models for image generation include DALL-E and Stable Diffusion.
Because of their ability to learn from data and generate realistic samples, DGMs have been increasingly applied in multiple engineering domains. Designers have used deep generative models to draft new aircraft frames, metamaterial designs, and optimal geometries for bridges and cars. But for the most part, the models have mimicked existing designs without improving on their performance.
“Designers who are working with DGMs are sort of missing this cherry on top, which is adjusting the model’s training objective to focus on the design requirements,” Regenwetter says. “So, people end up generating designs that are very similar to the dataset.”
In the new study, he outlines the main pitfalls in applying DGMs to engineering tasks, and shows that the fundamental objective of standard DGMs does not take into account specific design requirements. To illustrate this, the team invokes a simple case of bicycle frame design and demonstrates that problems can crop up as early as the initial learning phase. As a model learns from thousands of existing bike frames of various sizes and shapes, it might consider two frames of similar dimensions to have similar performance, when in fact a small disconnect in one frame — too small to register as a significant difference in statistical similarity metrics — makes the frame much weaker than the other, visually similar frame.
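A toy numerical example (not drawn from the paper) makes the point: two frame parameter vectors can sit almost on top of each other under a distance-based similarity measure while differing substantially in structural behavior. The dimensions below are invented for illustration:

```python
# Toy illustration: nearly identical bike-frame parameter vectors can still
# differ sharply in structural performance. Values are hypothetical.
import numpy as np

frame_a = np.array([58.0, 42.0, 3.0])  # hypothetical tube lengths and wall thickness (mm)
frame_b = np.array([58.0, 42.0, 2.4])  # visually similar, slightly thinner wall

similarity_gap = np.linalg.norm(frame_a - frame_b)  # small: about 0.6 mm overall

# A thin-walled tube's bending stiffness scales roughly with wall thickness,
# so the thinner frame is noticeably weaker despite the tiny parameter change.
stiffness_ratio = frame_b[2] / frame_a[2]  # about 0.8, a 20 percent drop
print(similarity_gap, stiffness_ratio)
```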
Beyond “vanilla”
The researchers carried the bicycle example forward to see what designs a DGM would actually generate after having learned from existing designs. They first tested a conventional “vanilla” generative adversarial network, or GAN — a model that has been widely used in image and text synthesis and is tuned simply to generate statistically similar content. They trained the model on a dataset of thousands of bicycle frames, including commercially manufactured designs and less conventional, one-off frames designed by hobbyists.
Once the model learned from the data, the researchers asked it to generate hundreds of new bike frames. The model produced realistic designs that resembled existing frames. But none of the designs showed significant improvement in performance, and some were even a bit inferior, with heavier, less structurally sound frames.
The team then carried out the same test with two other DGMs that were specifically designed for engineering tasks. The first model is one that Ahmed previously developed to generate high-performing airfoil designs. He built this model to prioritize statistical similarity as well as functional performance. When applied to the bike frame task, this model generated realistic designs that also were lighter and stronger than existing designs. But it also produced physically “invalid” frames, with components that didn’t quite fit or overlapped in physically impossible ways.
“We saw designs that were significantly better than the dataset, but also designs that were geometrically incompatible because the model wasn’t focused on meeting design constraints,” Regenwetter says.
The last model the team tested was one that Regenwetter built to generate new geometric structures. This model was designed with the same priorities as the previous models, but with the added ingredient of design constraints: it prioritized physically viable frames, with no disconnections or overlapping bars, for instance. This last model produced the highest-performing designs, which were also physically feasible.
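In spirit, going beyond statistical similarity means adding performance and constraint terms to the training objective. The sketch below shows that general idea for a GAN-style generator loss; it is not the authors’ actual models, and every term and weight here is a hypothetical stand-in:

```python
# Minimal sketch (not the authors' models): augment a generator's loss with
# performance and constraint-violation penalties so it is rewarded for more
# than statistical similarity. All terms and weights are hypothetical.
import torch

def design_aware_generator_loss(disc_scores, performance, constraint_violation,
                                w_perf=1.0, w_constraint=10.0):
    # Standard non-saturating GAN term: fool the discriminator (similarity).
    similarity_loss = torch.nn.functional.softplus(-disc_scores).mean()
    # Reward predicted engineering performance (e.g., stiffness-to-weight).
    performance_loss = -performance.mean()
    # Penalize physically invalid geometry (e.g., overlapping or disconnected bars).
    constraint_loss = constraint_violation.clamp(min=0).mean()
    return similarity_loss + w_perf * performance_loss + w_constraint * constraint_loss
```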
“We found that when a model goes beyond statistical similarity, it can come up with designs that are better than the ones that are already out there,” Ahmed says. “It’s a proof of what AI can do, if it is explicitly trained on a design task.”
If DGMs can be built with other priorities, such as performance, design constraints, and novelty, Ahmed foresees that “numerous engineering fields, such as molecular design and civil infrastructure, would greatly benefit. By shedding light on the potential pitfalls of relying solely on statistical similarity, we hope to inspire new pathways and strategies in generative AI applications outside multimedia.”
###
Written by Jennifer Chu, MIT News
Paper: “Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design”
https://www.sciencedirect.com/science/article/abs/pii/S0010448523001410
JOURNAL
Computer-Aided Design
ARTICLE TITLE
“Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design”