China’s emerging AI regulation could foster an open and safe future for AI
Summary author: Walter Beckwith
American Association for the Advancement of Science (AAAS)
In a Policy Forum, Yue Zhu and colleagues provide an overview of China’s emerging regulation for artificial intelligence (AI) technologies and its potential contributions to global AI governance. Open-source AI systems from China are rapidly expanding worldwide, even as the country’s regulatory framework remains in flux. In general, AI governance suffers from fragmented approaches, a lack of clarity, and difficulty reconciling innovation with risk management, making global coordination especially hard in the face of rising controversy. Although no official AI law has yet been enacted, experts in China have drafted two influential proposals – the Model AI Law and the AI Law (Scholar’s Proposal) – which serve as key references for ongoing policy discussions. As the nation’s lawmakers prepare to draft a consolidated AI law, Zhu et al. note that their decisions will shape not only China’s innovation but also global collaboration on AI safety, openness, and risk mitigation. Here, the authors discuss China’s emerging AI regulation as structured around six pillars, which together stress exemptive laws, efficient adjudication, and experimentalist requirements while safeguarding against extreme risks. This framework seeks to balance responsible oversight with pragmatic openness, allowing developers to innovate for the long term and collaborate across the global research community. According to Zhu et al., despite the need for greater clarity, harmonization, and simplification, China’s evolving model is poised to shape future legislation and contribute meaningfully to global AI governance by promoting both safety and innovation at a time when international cooperation on extreme risks is urgently needed.
Journal
Science
Article Title
China’s emerging AI regulation towards an open future for AI
Article Publication Date
9-Oct-2025
Collaborative AI passes U.S. medical exams
A group of five AI systems that deliberated answers together as a council scored higher on USMLE exams than any single chatbot alone, suggesting a new paradigm for AI implementation
PLOS
image: Researchers use collaborative A.I. to take U.S. medical exams.
Credit: Nguyen Dang Hoang Nhu, Unsplash (CC0, https://creativecommons.org/publicdomain/zero/1.0/)
A council of five AI models working together, discussing their answers through an iterative process, achieved 97%, 93%, and 94% accuracy on 325 medical exam questions spanning the three stages of the U.S. Medical Licensing Examination (USMLE), according to a new study published October 9th in the open-access journal PLOS Digital Health by researcher Yahya Shaikh of Baltimore, USA, and colleagues.
Over the past several years, many studies have evaluated the performance of large language models (LLMs) on medical knowledge and licensing exams. While scores have improved across LLMs, performance varies when the same question is put to an LLM multiple times: the model generates a range of responses, some of which are incorrect or hallucinated.
In the new study, researchers developed a method to create a council of AI agents—composed of multiple instances of OpenAI’s GPT-4—that undergo coordinated and iterative exchanges designed to arrive at a consensus response. When responses diverge, a facilitator algorithm manages the deliberative process, summarizing the reasoning behind each response and asking the council to deliberate and re-answer the original question.
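To make the method concrete, below is a minimal sketch of such a deliberation loop, assuming access to the OpenAI chat API. The model name, prompt wording, round limit, and consensus rule are illustrative assumptions, not the authors’ published implementation.

```python
# Minimal sketch of a council-of-AIs deliberation loop (illustrative;
# prompts, round limit, and consensus rule are assumptions, not the study's code).
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
COUNCIL_SIZE = 5   # five council members, as in the study
MAX_ROUNDS = 3     # cap on facilitated deliberation rounds (assumed)

def ask(question: str, context: str = "") -> str:
    """Query one council member; nonzero temperature keeps answers diverse."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=1.0,
        messages=[{"role": "user", "content": context + question}],
    )
    return response.choices[0].message.content.strip()

def council_answer(question: str) -> str:
    """Return a consensus answer, deliberating when members disagree."""
    answers = [ask(question) for _ in range(COUNCIL_SIZE)]
    for _ in range(MAX_ROUNDS):
        if len(set(answers)) == 1:  # unanimous: consensus reached
            return answers[0]
        # Facilitator step: summarize the divergent reasoning and re-ask.
        summary = (
            "Council members gave different answers:\n"
            + "\n".join(f"- {a}" for a in set(answers))
            + "\nConsider this reasoning, then answer the question again.\n"
        )
        answers = [ask(question, context=summary) for _ in range(COUNCIL_SIZE)]
    return Counter(answers).most_common(1)[0][0]  # fall back to majority vote
```

Because USMLE questions are multiple-choice, checking answers for exact agreement is a reasonable stand-in for detecting consensus; free-text answers would need a looser matching rule.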
When the council was given 325 publicly available USMLE questions, including those focused on foundational biomedical sciences as well as clinical diagnosis and management, the system achieved consensus responses that were correct 97%, 93%, and 94% of the time for Step 1, Step 2 CK, and Step 3, respectively, outperforming single-instance GPT-4 models. In cases where the initial responses were not unanimous, the council’s deliberations reached a consensus that was the correct answer 83% of the time. For questions that required deliberation, the council corrected over half (53%) of the responses that a simple majority vote had gotten wrong.
The authors suggest that collective decision-making among AIs can enhance accuracy and lead to more trustworthy tools for healthcare, where accuracy is critical. However, they note that the paradigm has not yet been tested in real clinical scenarios.
“By demonstrating that diverse AI perspectives can refine answers, we challenge the notion that consistency alone defines a ‘good’ AI,” say the authors. “Instead, embracing variability through teamwork might unlock new possibilities for AI in medicine and beyond.”
Yahya Shaikh says, “Our study shows that when multiple AIs deliberate together, they achieve the highest-ever performance on medical licensing exams, scoring 97%, 93%, and 94% across Steps 1–3, without any special training on or access to medical data. This demonstrates the power of collaboration and dialogue between AI systems to reach more accurate and reliable answers. Our work provides the first clear evidence that AI systems can self-correct through structured dialogue, with the performance of the collective better than the performance of any single AI.”
Zishan Siddiqui notes, “This study isn’t about evaluating AI’s USMLE test-taking prowess, the kind that would make its mama proud, its papa brag, and grab headlines. Instead, we describe a method that improves accuracy by treating AI’s natural response variability as a strength. It allows the system to take a few tries, compare notes, and self-correct, and it should be built into future tools for education and, where appropriate, clinical care.”
Zainab Asiyah notes, “Semantic entropy didn't just measure data, but it told a story. It shows a struggle, ups and downs, and a resolution, so much like a human journey. It revealed a human side to LLMs. The numbers show how LLMs could actually convince each other to take on viewpoints and converse to change each other’s minds…even if it was the wrong answer.”
In your coverage, please use this URL to provide access to the freely available paper in PLOS Digital Health: https://plos.io/4n7eY5L
Citation: Shaikh Y, Jeelani-Shaikh ZA, Jeelani MM, Javaid A, Mahmud T, Gaglani S, et al. (2025) Collaborative intelligence in AI: Evaluating the performance of a council of AIs on the USMLE. PLOS Digit Health 4(10): e0000787. https://doi.org/10.1371/journal.pdig.0000787
Author countries: United States, Malaysia, Pakistan
Funding: The author(s) received no specific funding for this work.
Journal
PLOS Digital Health
Method of Research
Computational simulation/modeling
An AI system with detailed diagnostic reasoning makes its case
“Dr. CaBot” goes head-to-head with a human expert to work through a challenging medical case
Harvard Medical School
video: A narrated video presentation of a challenging medical case produced by Dr. CaBot.
Credit: Manrai lab
At a glance:
- Researchers have developed an AI system called Dr. CaBot that spells out its reasoning as it works through challenging medical cases and reaches a diagnosis.
- For the first time, the New England Journal of Medicine has published an AI-generated diagnosis — produced by Dr. CaBot — alongside one from a human clinician in its medical case study series.
- The tool holds potential for use in medical education and research.
Except for one key aspect, the setup is a familiar one in medicine: An expert diagnostician presents a particularly challenging case to a roomful of colleagues, carefully walking them through the patient’s symptoms and initial test results. The physician explains her reasoning in detail as she breaks down the case and every possibility she considered, aided by a slide deck. At the end of the five-minute talk, she reveals her diagnosis and the next steps she would recommend.
The twist? This time, the physician in question is an artificial intelligence system called Dr. CaBot.
Researchers at Harvard Medical School are developing Dr. CaBot as a medical education tool. The system, which operates in both presentation and written formats, shows how it reasons through a case, offering what’s called a differential diagnosis — a comprehensive list of possible conditions that explain what’s going on — and narrowing down the possibilities until it reaches a final diagnosis.
Dr. CaBot’s ability to spell out its “thought process” rather than focusing solely on reaching an accurate answer distinguishes it from other AI diagnostic tools. It is also one of only a few models designed to tackle more complex medical cases.
“We wanted to create an AI system that could generate a differential diagnosis and explain its detailed, nuanced reasoning at the level of an expert diagnostician,” said Arjun (Raj) Manrai, assistant professor of biomedical informatics in the Blavatnik Institute at HMS. Manrai created Dr. CaBot with Thomas Buckley, a Harvard Kenneth C. Griffin Graduate School of Arts and Sciences doctoral student and a member of the Manrai lab.
Although the system is not yet ready for use in the clinic, Manrai and his team have been providing demonstrations of Dr. CaBot at Boston-area hospitals. Now, Dr. CaBot has a chance to prove itself by going head-to-head with an expert diagnostician in The New England Journal of Medicine’s famed Case Records of the Massachusetts General Hospital, also known as clinicopathological conferences, or CPCs. It marks the first time the journal is publishing an AI-generated diagnosis.
The resulting medical case discussion, published Oct. 8 in NEJM, offers a window into Dr. CaBot’s capabilities, showcasing its usefulness for medical educators and students — and hinting at its potential for physicians in the clinic. As the researchers continue to improve Dr. CaBot, they hope that it will serve as a useful model for other medical-AI teams around the world.
One hundred years of medical cases
The concept of CPCs dates back to the late 1800s, when physicians at Massachusetts General Hospital began using patient case studies for medical education. In 1900, Mass General pathologist Richard Cabot — for whom Dr. CaBot is named — formalized these as part of the curriculum for HMS doctors-in-training. Since 1923, NEJM has been continuously publishing the cases as CPCs to teach physicians how other physicians reason through complex cases.
“The cases are pretty legendary. They’re known to be extremely challenging, filled with distractions and red herrings,” Manrai said.
Each CPC consists of a detailed presentation of the case from the patient’s doctors. Then, an expert not involved in the case is invited to give a presentation to colleagues at Mass General explaining their reasoning, step-by-step, and providing a differential diagnosis before homing in on the most likely possibility. After that, the patient’s doctors reveal the actual diagnosis. The diagnostician’s write-up is published in NEJM along with the case presentation.
The Oct. 8 NEJM article includes a typical case presentation along with a carefully reasoned differential diagnosis from expert diagnostician Gurpreet Dhaliwal of San Francisco Veterans Affairs Medical Center and the University of California, San Francisco, whom Manrai describes as “a real, modern Dr. House.” After that, Dr. CaBot’s differential diagnosis appears.
Manrai and Buckley were encouraged to see that although Dr. CaBot reasoned through the case differently than Dhaliwal, it reached a comparable final diagnosis.
From Dr. Cabot to Dr. CaBot
During graduate school, Manrai became fascinated by how CPCs demystify the process that physicians use to arrive at a diagnosis. They reminded him of the mystery novels he enjoyed growing up.
More recently, his lab and others have studied the accuracy of AI models for providing patient diagnoses. Manrai wondered whether it was possible to design a system that could go further.
The core of Dr. CaBot is OpenAI’s o3 large language reasoning model. In building the system, Buckley, who is a Dunleavy Fellow in HMS’ AI in Medicine track, needed to augment o3 with new abilities.
One is Dr. CaBot’s ability to efficiently search millions of clinical abstracts from high-impact journals, which helps it properly cite its work and avoid factual hallucinations. Dr. CaBot can also search its “brain” of several thousand CPCs and use these examples to replicate the style of an expert diagnostician in NEJM. The team is working closely with clinician collaborators at Beth Israel Deaconess Medical Center and other Harvard-affiliated hospitals to continue refining the system.
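As a rough illustration of how this kind of retrieval augmentation can work, here is a hedged sketch. The embedding model, cosine-similarity search, and prompt below are assumptions chosen for illustration; they are not the Manrai lab’s actual pipeline.

```python
# Illustrative retrieval-augmented diagnosis sketch (not the lab's code).
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed abstracts or case write-ups for similarity search."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Offline step: index a corpus of clinical abstracts and past CPC write-ups.
corpus = ["abstract 1 ...", "abstract 2 ...", "CPC write-up ..."]  # placeholders
corpus_vecs = embed(corpus)

def retrieve(query: str, k: int = 5) -> list[str]:
    """Return the k corpus documents most similar to the query case."""
    q = embed([query])[0]
    scores = corpus_vecs @ q / (
        np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q)
    )
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

def diagnose(case_text: str) -> str:
    """Ask the reasoning model for a differential, grounded in retrieved texts."""
    evidence = "\n\n".join(retrieve(case_text))
    prompt = (
        "Using the following literature and example case discussions, reason "
        "through a differential diagnosis for the case, citing your sources.\n\n"
        f"{evidence}\n\nCase:\n{case_text}"
    )
    resp = client.chat.completions.create(
        model="o3", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```

Grounding the prompt in retrieved abstracts is one way a system can cite real sources rather than invent them, which is the hallucination-avoidance behavior described above.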
Dr. CaBot delivers two main products.
The first is a roughly five-minute, narrated, slide-based video presentation of a case, in which the system explains how it reasoned through the possibilities to come to a diagnosis. The presentations are “surprisingly lifelike,” Buckley said, complete with filler words like “um,” “uh,” and “you know” as well as colloquial phrases.
During the team’s demonstrations, “the realness of the narrated presentation seems to connect with physicians,” Manrai said.
The other is a detailed written version of Dr. CaBot’s reasoning and diagnosis.
Taking Dr. CaBot on the road
The researchers are eager for physicians to engage with Dr. CaBot and provide expert feedback. To this end, they are planning more demonstrations at local hospitals, and they published a paper describing the system on a preprint server. They see the NEJM CPC as another opportunity for input.
“Dr. CaBot’s AI-generated discussion has not been analyzed for correctness; any factual errors present have been retained so that the reader can observe the strengths and limitations of the system,” the editor’s note on the CPC reads, concluding, “whether AI has a legitimate use in clinical decision making is up to the reader to determine.”
Dr. CaBot is also available online, where users can test the system on new cases for educational and research purposes, and review presentations and write-ups for 15 existing cases ranging from “A Newborn Girl With Skin Lesions” to “An 89-Year-Old Man With Progressive Dyspnea.”
“We’re really trying to stick our necks out,” Manrai said. “There’s great potential to be embarrassed, but you learn a lot by playing a video for actual clinicians for five minutes. We’re getting so much feedback that way.”
Although the primary use case for Dr. CaBot is as an educational tool, its ability to rapidly sift through millions of clinical abstracts could also make it a valuable research aid.
According to Manrai and Buckley, the tool would need further improvement, validation, and the addition of patient privacy protections before it could be considered for implementation in real-world settings. However, the team noted that physicians are already expressing interest.
The advantages of an AI system are that it is always available, doesn’t get tired, isn’t juggling responsibilities, and can quickly search vast quantities of medical literature, they said.
Manrai added that there’s evidence physicians are using AI tools “in amounts that I think would surprise a lot of folks,” including ChatGPT and a physician-specific platform called OpenEvidence.
“We’re very nascent in human-AI collaboration,” Manrai said, but the field is evolving rapidly. Eventually, Dr. CaBot might join the AI toolbox that physicians are already exploring as they determine how to best help their patients.
Authorship, funding, disclosures
Additional authors on the NEJM CPC include Michael Hood, Akwi Asombang, and Elizabeth Hohmann.
Disclosure forms provided by the authors are available with the full text of the article at NEJM.org.
Journal
New England Journal of Medicine
Article Title
Case 28-2025: A 36-Year-Old Man with Abdominal Pain, Fever, and Hypoxemia
Article Publication Date
8-Oct-2025