Brains vs. bytes: Study compares diagnoses made by AI and clinicians
University of Maine
A University of Maine study compared how well artificial intelligence models and human clinicians handled complex or sensitive medical cases.
The study, published in the Journal of Health Organization and Management in May, evaluated more than 7,000 anonymized medical queries from the United States and Australia. The findings outlined where the technology showed promise and what limitations need to be addressed before AI is unleashed on patients — and may inform the future development of AI tools, clinical procedures and public policy. The study also informs efforts to use AI to support healthcare professionals at a time of growing workforce shortages and rising clinician burnout.
The results showed that most AI-generated responses aligned with expert standards for accuracy, especially on factual and procedural queries, but the models often struggled with “why” and “how” questions.
The study also found that while responses were consistent within a given session, inconsistencies appeared when users posed the same questions in later tests. These discrepancies raise concerns, particularly when a patient’s health is at stake. The findings add to a growing body of evidence that will define AI’s role in healthcare.
“This isn’t about replacing doctors and nurses,” said C. Matt Graham, author of the study and associate professor of information systems and security management at the Maine Business School. “It’s about augmenting their abilities. AI can be a second set of eyes; it can help clinicians sift through mountains of data, recognize patterns and offer evidence-based recommendations in real time.”
The study also compared health metrics, including patient satisfaction, cost and treatment efficacy, across both countries. In Australia, which has a universal healthcare model, patients reported higher satisfaction and paid roughly one-quarter of the cost compared with patients in the U.S., who also waited twice as long to see providers. Graham notes in the study that health system, regulatory and cultural differences like these will ultimately influence how AI is received and used, and that models should be trained to account for these variations.
Artificial emotional intelligence
While the accuracy of a diagnosis matters, so does the way it is delivered. In the study, AI responses frequently lacked the emotional engagement and empathetic nuance often conveyed by human clinicians.
The length of AI responses was strikingly consistent, with most falling between 400 and 475 words. Responses by human clinicians showed far more variation, with more concise answers written in response to simpler questions.
Vocabulary analysis revealed that AI regularly used clinical terms in its responses, which may be hard to understand or feel insensitive to some patients. In situations involving topics such as mental health or terminal illness, AI struggled to convey the compassion that is critical in effective patient-provider relationships.
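As a rough illustration of how such length and vocabulary comparisons are typically carried out (the study's own analysis code is not published, and the sample lists and jargon lexicon below are invented for the example), the checks might look like this in Python:

```python
# Generic illustration of a length and vocabulary comparison between AI and
# clinician responses. The sample texts and the clinical-jargon lexicon are
# placeholders invented for this sketch, not the study's actual data or code.
from statistics import mean, pstdev

ai_responses = ["..."]          # would hold the AI-generated answers
clinician_responses = ["..."]   # would hold the clinician-written answers

CLINICAL_JARGON = {"idiopathic", "etiology", "contraindicated", "prognosis"}


def length_stats(responses):
    """Word count per response, summarized as mean and spread."""
    counts = [len(text.split()) for text in responses]
    return mean(counts), pstdev(counts)


def jargon_rate(responses):
    """Share of words drawn from the clinical-jargon lexicon."""
    words = [w.strip(".,;:").lower() for text in responses for w in text.split()]
    return sum(w in CLINICAL_JARGON for w in words) / max(len(words), 1)


print("AI length (mean, sd):", length_stats(ai_responses))
print("Clinician length (mean, sd):", length_stats(clinician_responses))
print("AI jargon rate:", jargon_rate(ai_responses))
```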
“Healthcare professionals offer healing that is grounded in human connection, through sight, touch, presence and communication — experiences that AI cannot replicate,” said Kelley Strout, associate professor in UMaine’s School of Nursing, who was not involved in the study. “The synergy between AI and clinicians’ judgment, compassion and application of evidence-based practice has the potential to transform healthcare systems but only if accompanied by rigorous standards, ethical frameworks and safeguards to monitor for errors and unintended consequences.”
A stretched health system
The study arrives amid widespread and growing shortages in the U.S. healthcare workforce. Across the country, patients face long wait times, high costs and a shortage of primary care and specialty providers. These barriers are particularly acute in rural regions, where limited access often leads to delayed diagnoses and worsening health outcomes.
A report published by the Health Resources and Services Administration in 2024 projected that nonmetro areas will face a 42% shortage of primary care physicians by 2037. While a growing number of nurse practitioners and physician assistants are stepping in to fill the gap, demand for care is growing faster. Between 2022 and 2037, the population of people 65 and older in the U.S. is projected to increase 54%, a trend with significant implications for the demand for health services.
Strout said that while AI could help improve patient access and alleviate challenges — such as burnout, which affects more than half of primary care physicians in the U.S. — its use must be carefully approached.
Prioritizing providers and patients
AI-powered tools could support round-the-clock virtual assistance and complement provider-to-patient communication through tools like online patient portals, which have skyrocketed in popularity since 2020. The technology, however, also raises fears of job displacement, and experts warn that rapid implementation without ethical guardrails may exacerbate disparities and compromise care quality.
“Technology is only one part of the solution,” said Graham. “We need regulatory standards, human oversight and inclusive datasets. Right now, most AI tools are trained on limited populations. If we’re not careful, we risk building systems that reflect and even magnify existing inequalities.”
Strout added that as healthcare systems integrate AI into clinical practice, administrators must ensure that these tools are designed with patients and providers in mind. Lessons from past technology integrations, which at times failed to enhance care delivery, offer valuable guidance for AI developers.
“We must learn from past missteps. The electronic health record (EHR), for example, was largely developed around billing models rather than patient outcomes or provider workflows,” Strout said. “As a result, EHR systems have often contributed to frustration among providers and diminished patient satisfaction. We cannot afford to repeat that history with AI.”
Other factors, such as accountability for mistakes and patient privacy, are top of mind for medical ethicists, policy makers and AI researchers. Solutions to these ethical questions may vary depending on where they are adopted to account for different cultural and regulatory environments.
As AI continues to develop, many experts believe it will improve the efficiency of care and the decision-making that providers offer patients. The study’s findings support the growing consensus that AI’s limited ethical and emotional adaptability means that human clinicians remain indispensable. Graham says that, in addition to improving the performance of AI tools, future research should focus on managing ethical risks and adapting AI to diverse healthcare contexts to ensure the technology augments rather than undermines human care.
"Technology should enhance the humanity of medicine, not diminish it," Graham said. "That means designing systems that support clinicians in delivering care, not replacing them altogether."
Journal
Journal of Health Organization and Management
Article Title
Artificial intelligence vs human clinicians: a comparative analysis of complex medical query handling across the USA and Australia
‘AI scientist’ suggests combinations of widely available non-cancer drugs can kill cancer cells
An ‘AI scientist’, working in collaboration with human scientists, has found that combinations of cheap and safe drugs – used to treat conditions such as high cholesterol and alcohol dependence – could also be effective at treating cancer, a promising new approach to drug discovery.
The research team, led by the University of Cambridge, used the GPT-4 large language model (LLM) to identify hidden patterns buried in the mountains of scientific literature to identify potential new cancer drugs.
To test their approach, the researchers prompted GPT-4 to identify potential new drug combinations that could have a significant impact on a breast cancer cell line commonly used in medical research. They instructed it to avoid standard cancer drugs, identify drugs that would attack cancer cells while not harming healthy cells, and prioritise drugs that were affordable and approved by regulators.
The drug combinations suggested by GPT-4 were then tested by human scientists, both in combination and individually, to measure their effectiveness against breast cancer cells.
In the first lab-based test, three of the 12 drug combinations suggested by GPT-4 worked better than current breast cancer drugs. The LLM then learned from these tests and suggested a further four combinations, three of which also showed promising results.
The results, reported in the Journal of the Royal Society Interface, represent the first instance of a closed-loop system where experimental results guided an LLM, and LLM outputs – interpreted by human scientists – guided further experiments. The researchers say that tools such as LLMs are not a replacement for scientists, but could instead act as supervised AI researchers, with the ability to originate, adapt and accelerate discovery in areas like cancer research.
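To picture the loop concretely, a minimal sketch might look like the following. This is not the study's published code: the prompt wording, the use of the OpenAI chat completions API, the round count and the manual-entry step standing in for the wet-lab work are all assumptions made for illustration.

```python
# Minimal sketch of a human-supervised, closed-loop workflow in the spirit of
# the study: the LLM proposes non-standard drug combinations under constraints,
# human scientists test them in the lab, and the measured results are fed back
# into the next round of prompting. Prompt text and round count are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI API key is configured in the environment

CONSTRAINTS = (
    "Suggest combinations of affordable, regulator-approved drugs that may act "
    "against a commonly used breast cancer cell line. Avoid standard cancer "
    "drugs and prefer agents expected to spare healthy cells."
)


def propose_combinations(lab_feedback: str) -> str:
    """Ask the LLM for candidate drug combinations, including prior lab results."""
    messages = [
        {"role": "system", "content": "You assist with drug-repurposing research."},
        {"role": "user",
         "content": f"{CONSTRAINTS}\n\nPrevious lab results:\n{lab_feedback or 'None yet.'}"},
    ]
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content


def run_lab_assays(suggestions: str) -> str:
    """The human step: scientists review the suggestions, test the combinations
    (and the individual drugs) against cancer cells, and summarize the results."""
    print("LLM suggestions:\n", suggestions)
    return input("Enter a summary of the measured lab results: ")


lab_feedback = ""
for _ in range(2):  # the study ran two such suggestion-and-test rounds
    suggestions = propose_combinations(lab_feedback)
    lab_feedback = run_lab_assays(suggestions)  # results guide the next round
```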
Often, LLMs such as GPT-4 return results that aren’t true, known as hallucinations. But in scientific research, hallucinations can sometimes be a benefit, if they lead to new ideas that are worth testing.
“Supervised LLMs offer a scalable, imaginative layer of scientific exploration, and can help us as human scientists explore new paths that we hadn’t thought of before,” said Professor Ross King from Cambridge’s Department of Chemical Engineering and Biotechnology, who led the research. “This can be useful in areas such as drug discovery, where there are many thousands of compounds to search through.”
Based on the prompts provided by the human scientists, GPT-4 selected drugs based on the interplay between biological reasoning and hidden patterns in the scientific literature.
“This is not automation replacing scientists, but a new kind of collaboration,” said co-author Dr Hector Zenil from King’s College London. “Guided by expert prompts and experimental feedback, the AI functioned like a tireless research partner—rapidly navigating an immense hypothesis space and proposing ideas that would take humans alone far longer to reach.”
The hallucinations – normally viewed as flaws – became a feature, generating unconventional combinations worth testing and validating in the lab. The human scientists inspected the mechanistic reasoning the LLM gave for suggesting these combinations in the first place, feeding their findings back to the system over multiple iterations.
By exploring subtle synergies and overlooked pathways, GPT-4 helped identify six promising drug pairs, all tested through lab experiments. Among the combinations, simvastatin (commonly used to lower cholesterol) and disulfiram (used in alcohol dependence) stood out against breast cancer cells. Some of these combinations show potential for further research in therapeutic repurposing.
These drugs, while not traditionally associated with cancer care, could be potential cancer treatments, although they would first have to go through extensive clinical trials.
“This study demonstrates how AI can be woven directly into the iterative loop of scientific discovery, enabling adaptive, data-informed hypothesis generation and validation in real time,” said Zenil.
“The capacity of supervised LLMs to propose hypotheses across disciplines, incorporate prior results, and collaborate across iterations marks a new frontier in scientific research,” said King. “An AI scientist is no longer a metaphor without experimental validation: it can now be a collaborator in the scientific process.”
The research was supported in part by the Knut and Alice Wallenberg Foundation and the UK Engineering and Physical Sciences Research Council (EPSRC).
Journal
Journal of the Royal Society Interface
Method of Research
Experimental study
Subject of Research
Not applicable
Article Title
Scientific Hypothesis Generation by Large Language Models: Laboratory Validation in Breast Cancer Treatment
Article Publication Date
4-Jun-2025