Artificial intelligence: Algorithms improve medical image analysis
Algorithms based on deep learning can detect tumors – KIT researchers among the best teams in the international autoPET competition
Karlsruhe Institute of Technology (KIT)
Imaging techniques play a key role in the diagnosis of cancer. Precisely determining the location, size, and type of tumors is essential for choosing the right therapy. The most important imaging techniques include positron emission tomography (PET) and computed tomography (CT). PET uses radionuclides to visualize metabolic processes in the body: radioactively labeled glucose, usually fluorine-18 fluorodeoxyglucose (FDG), is administered, and because malignant tumors have a considerably higher metabolic rate than benign tissue, they take up more of the tracer and stand out in the images. In CT, the body is scanned layer by layer with X-rays to visualize the anatomy and localize tumors.
Automation Can Save Time and Improve Evaluation
Cancer patients sometimes have hundreds of lesions, i.e. pathological changes caused by tumor growth. To obtain a uniform picture, all of these lesions have to be captured. Doctors determine the size of the tumor lesions by manually marking them in 2D slice images – an extremely time-consuming task. “Automated evaluation using an algorithm would save an enormous amount of time and improve the results,” explains Professor Rainer Stiefelhagen, Head of the Computer Vision for Human-Computer Interaction Lab (cv:hci) at KIT.
Rainer Stiefelhagen and Zdravko Marinov, a doctoral student at cv:hci, took part in the international autoPET competition in 2022 and came in fifth among the 27 teams, which together comprised 359 participants from all over the world. The Karlsruhe researchers formed a team with Professor Jens Kleesiek and Lars Heiliger from the Essen-based IKIM – Institute for Artificial Intelligence in Medicine. Organized by Tübingen University Hospital and LMU Hospital Munich, autoPET combined imaging and machine learning: the task was to automatically segment metabolically active tumor lesions visualized in whole-body PET/CT scans. To train their algorithms, the participating teams had access to a large annotated PET/CT dataset. All algorithms submitted for the final phase of the competition were based on deep learning, a variant of machine learning that uses multi-layered artificial neural networks to recognize complex patterns and correlations in large amounts of data. The seven best teams from the autoPET competition have now reported on the possibilities of automated analysis of medical image data in the journal Nature Machine Intelligence.
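To make the task concrete, the sketch below shows, in deliberately simplified form, what such a deep-learning segmentation setup looks like: a small 3D convolutional network takes co-registered PET and CT volumes as a two-channel input and is trained to predict a voxel-wise lesion mask. This is a minimal PyTorch illustration under those assumptions, not any participating team's actual model.

```python
# Minimal sketch (not any team's actual model): a tiny 3D fully convolutional
# network that takes co-registered PET and CT volumes as a two-channel input
# and predicts a voxel-wise lesion probability map. Real challenge entries
# used far deeper architectures and careful preprocessing.
import torch
import torch.nn as nn

class TinyLesionSegmenter(nn.Module):
    def __init__(self, in_channels: int = 2):  # channel 0: PET, channel 1: CT
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 1, kernel_size=1),  # per-voxel lesion logit
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss, a common objective for lesion segmentation."""
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (probs.sum() + target.sum() + eps)

if __name__ == "__main__":
    model = TinyLesionSegmenter()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Dummy stand-in for one annotated PET/CT patch and its lesion mask.
    petct_patch = torch.randn(1, 2, 32, 32, 32)
    lesion_mask = (torch.rand(1, 1, 32, 32, 32) > 0.95).float()

    optimizer.zero_grad()
    logits = model(petct_patch)
    loss = dice_loss(logits, lesion_mask) + nn.functional.binary_cross_entropy_with_logits(logits, lesion_mask)
    loss.backward()
    optimizer.step()
    print(f"training loss on dummy patch: {loss.item():.3f}")
```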
Algorithm Ensemble Excels in the Detection of Tumor Lesions
As the researchers explain in their publication, an ensemble of the top-rated algorithms proved superior to any individual algorithm and was able to detect tumor lesions efficiently and precisely. “While the performance of the algorithms in image data evaluation does depend in part on the quantity and quality of the data, the algorithm design is another crucial factor, for example with regard to the decisions made in the post-processing of the predicted segmentation,” explains Stiefelhagen. Further research is needed to improve the algorithms and make them more robust against external influences so that they can be used in everyday clinical practice. The aim is to fully automate the analysis of medical PET and CT image data in the near future.
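The following minimal sketch illustrates the ensembling and post-processing ideas described above: voxel-wise lesion probabilities from several trained models are averaged, thresholded, and then cleaned by discarding very small connected components. The threshold and minimum component size are arbitrary illustrative values, not the settings used in the published ensemble.

```python
# Minimal sketch of model ensembling with simple post-processing (illustrative
# values only, not the published ensemble's configuration).
import numpy as np
from scipy import ndimage

def ensemble_segmentation(prob_maps: list[np.ndarray],
                          threshold: float = 0.5,
                          min_voxels: int = 10) -> np.ndarray:
    """Combine per-model probability maps into one binary lesion mask."""
    mean_probs = np.mean(prob_maps, axis=0)   # average the models' predictions
    mask = mean_probs > threshold             # binarize the averaged map

    # Post-processing: drop connected components smaller than min_voxels,
    # which often removes spurious detections.
    labeled, num_components = ndimage.label(mask)
    sizes = ndimage.sum(mask, labeled, index=range(1, num_components + 1))
    for component_id, size in enumerate(sizes, start=1):
        if size < min_voxels:
            mask[labeled == component_id] = False
    return mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Dummy probability maps standing in for three trained models' outputs.
    maps = [rng.random((32, 32, 32)) for _ in range(3)]
    lesion_mask = ensemble_segmentation(maps)
    print("predicted lesion voxels:", int(lesion_mask.sum()))
```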
Original publication
Sergios Gatidis, Marcel Früh, Matthias P. Fabritius, Sijing Gu, Konstantin Nikolaou, Christian La Fougère, Jin Ye, Junjun He, Yige Peng, Lei Bi, Jun Ma, Bo Wang, Jia Zhang, Yukun Huang, Lars Heiliger, Zdravko Marinov, Rainer Stiefelhagen, Jan Egger, Jens Kleesiek, Ludovic Sibille, Lei Xiang, Simone Bendazzoli, Mehdi Astaraki, Michael Ingrisch, Clemens C. Cyran & Thomas Küstner: Results from the autoPET challenge on fully automated lesion segmentation in oncologic PET/CT imaging. Nature Machine Intelligence, 2024. DOI: 10.1038/s42256-024-00912-9
Journal
Nature Machine Intelligence
Subject of Research
People
Article Title
Results from the autoPET challenge on fully automated lesion segmentation in oncologic PET/CT imaging.
How good are AI doctors at medical conversations?
Researchers design a more realistic test to evaluate AI's clinical communication skill
Harvard Medical School
Artificial intelligence tools such as ChatGPT have been touted for their promise to alleviate clinician workload by triaging patients, taking medical histories and even providing preliminary diagnoses.
These tools, known as large-language models, are already being used by patients to make sense of their symptoms and medical test results.
But while these AI models perform impressively on standardized medical tests, how well do they fare in situations that more closely mimic the real world?
Not that great, according to the findings of a new study led by researchers at Harvard Medical School and Stanford University.
For their analysis, published Jan. 2 in Nature Medicine, the researchers designed an evaluation framework — or a test — called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) and deployed it on four large-language models to see how well they performed in settings closely mimicking actual interactions with patients.
All four large-language models did well on medical exam-style questions, but their performance worsened when engaged in conversations more closely mimicking real-world interactions.
This gap, the researchers said, underscores a twofold need: first, to create more realistic evaluations that better gauge the fitness of clinical AI models for use in the real world and, second, to improve the ability of these tools to make diagnoses based on more realistic interactions before they are deployed in the clinic.
Evaluation tools like CRAFT-MD, the research team said, can not only assess AI models more accurately for real-world fitness but could also help optimize their performance in the clinic.
"Our work reveals a striking paradox - while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor's visit," said study senior author Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School. “The dynamic nature of medical conversations - the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms - poses unique challenges that go far beyond answering multiple choice questions. When we switch from standardized tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy."
A better test to check AI’s real-world performance
Right now, developers test the performance of AI models by asking them to answer multiple choice medical questions, typically derived from the national exam for graduating medical students or from tests given to medical residents as part of their certification.
“This approach assumes that all relevant information is presented clearly and concisely, often with medical terminology or buzzwords that simplify the diagnostic process, but in the real world this process is far messier,” said study co-first author Shreya Johri, a doctoral student in the Rajpurkar Lab at Harvard Medical School. “We need a testing framework that reflects reality better and is, therefore, better at predicting how well a model would perform.”
CRAFT-MD was designed to be one such more realistic gauge.
To simulate real-world interactions, CRAFT-MD evaluates how well large-language models can collect information about symptoms, medications, and family history and then make a diagnosis. One AI agent poses as a patient, answering questions in a conversational, natural style. Another AI agent grades the accuracy of the final diagnosis rendered by the large-language model. Human experts then evaluate the outcomes of each encounter for the ability to gather relevant patient information, diagnostic accuracy when presented with scattered information, and adherence to prompts.
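In outline, such an evaluation loop can be sketched as follows: a simulated patient agent answers the questions of the doctor model under test, and a grader agent checks the final diagnosis against the ground truth. The call_llm() function and the Vignette structure below are hypothetical placeholders standing in for whichever model API and case format are used; this is a sketch of the described control flow, not the authors' CRAFT-MD implementation.

```python
# Illustrative sketch only: the control flow of a CRAFT-MD-style evaluation.
from dataclasses import dataclass

def call_llm(role_prompt: str, conversation: list[str]) -> str:
    """Hypothetical stand-in for a chat-model API call; plug in the model under test."""
    raise NotImplementedError("connect this to an actual large-language-model API")

@dataclass
class Vignette:
    case_facts: str        # symptoms, history, medications hidden from the doctor model
    true_diagnosis: str

def run_encounter(vignette: Vignette, max_turns: int = 10) -> dict:
    transcript: list[str] = []
    for _ in range(max_turns):
        # The "doctor" model decides what to ask next, seeing only the dialogue so far.
        question = call_llm("You are a clinician taking a history.", transcript)
        transcript.append(f"Doctor: {question}")
        if question.lower().startswith("final diagnosis:"):
            break
        # The patient agent answers conversationally, grounded in the hidden vignette.
        answer = call_llm(f"You are a patient. Facts: {vignette.case_facts}", transcript)
        transcript.append(f"Patient: {answer}")

    diagnosis = transcript[-1] if transcript else ""
    # A grader agent compares the proposed diagnosis with the ground truth;
    # in CRAFT-MD, human experts additionally review these outcomes.
    verdict = call_llm(
        f"Does this diagnosis match '{vignette.true_diagnosis}'? Answer yes or no.",
        [diagnosis],
    )
    return {"transcript": transcript, "correct": verdict.strip().lower().startswith("yes")}
```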
The researchers used CRAFT-MD to test four AI models — both proprietary (commercial) and open-source ones — for performance in 2,000 clinical vignettes featuring conditions common in primary care and across 12 medical specialties.
All AI models showed limitations, particularly in their ability to conduct clinical conversations and reason based on information given by patients. That, in turn, compromised their ability to take medical histories and render appropriate diagnoses. For example, the models often struggled to ask the right questions to gather pertinent patient history, missed critical information during history taking, and had difficulty synthesizing scattered information. The accuracy of these models declined when they were presented with open-ended information rather than multiple choice answers. The models also performed worse in back-and-forth exchanges — as most real-world conversations are — than in summarized conversations.
Recommendations for optimizing AI’s real-world performance
Based on these findings, the team offers a set of recommendations both for AI developers who design AI models and for regulators charged with evaluating and approving these tools.
These include:
- Use of conversational, open-ended questions that more accurately mirror unstructured doctor-patient interactions in the design, training, and testing of AI tools
- Assessing models for their ability to ask the right questions and to extract the most essential information
- Designing models capable of following multiple conversations and integrating information from them
- Designing AI models capable of integrating textual data (notes from conversations) with non-textual data (images, EKGs)
- Designing more sophisticated AI agents that can interpret non-verbal cues such as facial expressions, tone, and body language
Additionally, the evaluation should include both AI agents and human experts, the researchers recommend, because relying solely on human experts is labor-intensive and expensive. For example, CRAFT-MD outpaced human evaluators, processing 10,000 conversations in 48 to 72 hours, plus 15 to 16 hours of expert evaluation. In contrast, human-based approaches would require extensive recruitment and an estimated 500 hours for patient simulations (nearly 3 minutes per conversation) and about 650 hours for expert evaluations (nearly 4 minutes per conversation). Using AI evaluators as a first line also has the advantage of eliminating the risk of exposing real patients to unverified AI tools.
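For readers who want to check the comparison, the short calculation below reproduces the quoted time estimates from the figures in the press release (10,000 conversations at roughly 3 and 4 minutes each for human patient simulation and expert evaluation, respectively).

```python
# Quick check of the time estimates quoted above (figures from the press release;
# the per-conversation minutes are the stated approximations).
N_CONVERSATIONS = 10_000

human_simulation_hours = N_CONVERSATIONS * 3 / 60   # ~3 min per conversation
human_evaluation_hours = N_CONVERSATIONS * 4 / 60   # ~4 min per conversation

print(f"human patient simulation: ~{human_simulation_hours:.0f} h")   # ~500 h
print(f"human expert evaluation:  ~{human_evaluation_hours:.0f} h")   # ~667 h (reported as ~650 h)
print("CRAFT-MD: 48-72 h of automated processing plus 15-16 h of expert review")
```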
The researchers said they expect that CRAFT-MD itself will also be updated and optimized periodically to integrate improved patient-AI models.
“As a physician scientist, I am interested in AI models that can augment clinical practice effectively and ethically,” said study co-senior author Roxana Daneshjou, assistant professor of Biomedical Data Science and Dermatology at Stanford University. “CRAFT-MD creates a framework that more closely mirrors real-world interactions and thus it helps move the field forward when it comes to testing AI model performance in health care.”
Authorship, funding, disclosures
Publication DOI: 10.1038/s41591-024-03328-5
Additional authors included Jaehwan Jeong and Hong-Yu Zhou, Harvard Medical School; Benjamin A. Tran, Georgetown University; Daniel I. Schlessinger, Northwestern University; Shannon Wongvibulsin, University of California-Los Angeles; Leandra A. Barnes, Zhuo Ran Cai and David Kim, Stanford University; and Eliezer M. Van Allen, Dana-Farber Cancer Institute.
The work was supported by the HMS Dean’s Innovation Award and a Microsoft Accelerate Foundation Models Research grant awarded to Pranav Rajpurkar. SJ received further support through the IIE Quad Fellowship.
Daneshjou reported receiving personal fees from DWA, personal fees from Pfizer, personal fees from L'Oreal, personal fees from VisualDx, stock options from MDAlgorithms and Revea outside the submitted work, and a patent for TrueImage pending. Schlessinger is the co-founder of FixMySkin Healing Balms, a shareholder in Appiell Inc. and K-Health, a consultant with Appiell Inc. and LuminDx, and an investigator for AbbVie and Sanofi. Van Allen serves as an advisor to Enara Bio, Manifold Bio, Monte Rosa, Novartis Institute for Biomedical Research, and Serinus Bio. Van Allen provides research support to Novartis, BMS, Sanofi, and NextPoint. Van Allen holds equity in Tango Therapeutics, Genome Medical, Genomic Life, Enara Bio, Manifold Bio, Microsoft, Monte Rosa, Riva Therapeutics, Serinus Bio, and Syapse. Van Allen has filed for institutional patents on chromatin mutations and immunotherapy response and on methods for clinical interpretation, does intermittent legal consulting on patents for Foley & Hoag, and serves on the editorial board of Science Advances.
Journal
Nature Medicine
Method of Research
Computational simulation/modeling
Subject of Research
Not applicable
Article Title
An evaluation framework for clinical use of large language models in patient interaction tasks
Article Publication Date
2-Jan-2025