New report: U.S. Government is using AI more, but still has a long way to go
Larger agencies leading the way
As is every large organization, the U.S. government is assessing how to best integrate artificial intelligence into its procedures and workflows. While AI has undeniable risks, it also has the potential to make work significantly more efficient and effective in a broad range of ways, from automating simpler tasks to unearthing unexpected insights.
Over the past decade, the federal government has made the adoption of AI a priority. Both the Biden administration and the two Trump administrations have emphasized the need for federal government AI adoption to improve service delivery, foster data-driven analysis, promote national competitiveness, and strengthen national security.
New research from the Brookings Institution has found that while the scope and pace of this adoption have accelerated over the past three years, AI use across the federal government remains concentrated in a few large agencies. More widespread adoption has been slowed by several factors, including workforce capacity constraints, a risk-averse culture, funding challenges, and a lack of trust in AI’s usefulness and safety.
“While the federal government has made progress on using AI, there’s still a long way to go,” says Brookings fellow Valerie Wirtschafter, the author of the report. To understand the current state of AI adoption across the federal government, she analyzed data on federal government AI use from 2023 to 2025 as well as federal jobs data. In addition, she interviewed current and former technology specialists across eight federal agencies.
Over the past few years, AI use by the federal government has grown. More agencies are using it, and the amount they use it has also increased. In 2023, 21 agencies, including 13 large agencies and eight midsize agencies, reported using AI; no small agencies participated. By last year, 41 agencies (13 large, 17 midsize, and 11 small) reported AI use. In 2025, 41 agencies documented more than 3,600 distinct projects that used AI, a 69% increase from the previous year and five times the number reported in 2023. While many of these cases focused on streamlining operations and facilitating back-office processes, others involved more mission-oriented work, including benefits delivery, health and medical services, and law enforcement.
However, there are still significant disparities among agencies. Over the past three years, five agencies accounted for over half of the total AI use. In 2025, large agencies (more than 15,000 employees) accounted for more than three-quarters of all AI use. While more small and midsize agencies are starting to experiment with AI, large agencies are scaling their efforts more aggressively. AI-focused workers also continue to represent a small fraction of the overall federal technology workforce.
Wirtschafter identified several key bottlenecks to adoption. Some of these apply only to certain agencies, such as those handling sensitive health or security data. Others stem from issues that have hindered federal adoption of technology for decades, such as outdated equipment and infrastructure.
Hiring challenges remain a key obstacle to integrating AI into federal agencies. Among the issues: the federal government has a slow hiring timeline and limited pathways for career advancement for technologists. The executive branch has rolled out efforts to improve hiring timelines, and Congress has explored possibilities for improving AI-focused hiring across agencies.
It is worth noting that since the second Trump administration laid off nearly 300,000 federal workers last year, the number of AI-focused federal job listings has dropped significantly, part of an overall decline in hiring. Wirtschafter argues that these layoffs may have undermined efforts to recruit AI expertise into the federal government because many recent hires were still probationary. She says that it’s likely that the layoffs led to the departure of at least some AI-focused employees.
The federal government also tends to have a risk-averse culture that discourages experimentation and innovation. In addition, the opacity of AI processes (it’s often unclear how a program came to its conclusions) can undermine trust and deter use, especially for sensitive work. The growing politicization of some large language models (LLMs) is another challenge that could impede adoption. For example, Grok, developed by Elon Musk’s xAI, has a well-documented history of reflecting his political values and generating questionable content, while Anthropic’s Claude has been dubiously labeled a “supply chain risk” by the Department of Defense following contract disagreements with the agency.
Wirtschafter offers a series of recommendations to help the federal government more effectively adopt AI. These include:
- Streamlining the hiring process for AI-related jobs;
- Creating new job paths so AI-focused workers have a chance to advance;
- Investing in AI literacy and treating it as a core job requirement;
- Documenting and sharing AI success stories across the government;
- Increasing transparency around AI usage across agencies; and
- Focusing AI investment on high‑impact projects that clearly improve people’s lives.
Read the full report here.
Method of Research
Survey
Subject of Research
People
Article Title
Assessing the state of AI adoption across the federal government
AI system automates coding for scientific research
Empirical Research Assistance outperforms software written by experts
Key Takeaways
- A new AI tool called Empirical Research Assistance (ERA) can automatically write high-performance scientific software.
- ERA could significantly accelerate scientific discovery across many domains.
A research team at Google co-led by Michael Brenner, Catalyst Professor of Applied Mathematics and Physics at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS) and Google research scientist, has produced a new artificial intelligence system that can automatically write scientific software programs that surpass the performance of human-written programs.
Described in a paper published in Nature, the system is called Empirical Research Assistance (ERA), and the project was co-led by Brenner and Shibl Mourad from Google DeepMind. Harvard Ph.D. students Qian-Ze Zhu, Ryan Krueger, and Sarah Martinson contributed as Google student researchers while working in Brenner’s group. The research was done in Brenner’s capacity as a Catalyst Professor, a position established by the University to enhance relationships between academia and the private sector by supporting senior faculty in research roles at external companies.
Across modern science, customized software is constantly used to test specific hypotheses or interpret complex data. The authors refer to this type of computer program as “empirical software” – a program whose sole purpose is to maximize how well it does on a scientific task, like making weather predictions or forecasting hospitalizations during a disease outbreak. Any problem for which the quality of a solution can be expressed as a numerical value – its “score” – is called a scorable task.
Empirical software for solving such scorable tasks underpins major advances across many fields, including three recent chemistry Nobel prizes. But the specialized, custom-built software to tackle these experiments is labor-intensive, requiring a human to test and sharpen code many times over.
The new ERA system removes this bottleneck by essentially automating the full cycle of scientific software design and refinement – a process that normally takes human experts months or even years.
The system combines the Google Gemini large language model with a search strategy to explore and refine thousands of pieces of code – far faster and with greater breadth than a human could.
Starting with a baseline piece of code aimed at a specific problem, the new AI system proposes modifications, such as adding new components or switching out algorithms, with the goal of improving a predefined quality score. For example: How accurately can this model predict the spread of a disease based on past hospitalization numbers? How well does this model predict the shape of proteins based on these amino acid sequences?
The system uses a method called tree search, also used in game-playing systems like AlphaGo, to decide which promising ideas to pursue and which to discard as it gets better at tasks such as predicting hospitalization numbers or protein shapes.
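The loop described above (propose modifications, score them, pursue the promising ones) can be illustrated with a toy sketch. It is not ERA's implementation: the article does not specify the search procedure, so this uses a simple best-first tree search rather than the Monte Carlo variant associated with AlphaGo. The function `propose_children` is a hypothetical stand-in for the language model, and the data series and parameters are made up for illustration.

```python
import heapq
import itertools

# Toy "scorable task" (illustrative data, not from the paper): predict each
# value of a series from the two previous values using weights (a, b).
SERIES = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 34.0]


def score(candidate):
    """Quality score: negative mean squared error, so higher is better."""
    a, b = candidate
    errors = [
        (a * SERIES[i - 2] + b * SERIES[i - 1] - SERIES[i]) ** 2
        for i in range(2, len(SERIES))
    ]
    return -sum(errors) / len(errors)


def propose_children(candidate, step=0.25):
    """Hypothetical stand-in for the LLM: small modifications of a candidate."""
    a, b = candidate
    return [
        (a + da, b + db)
        for da, db in itertools.product((-step, 0.0, step), repeat=2)
        if (da, db) != (0.0, 0.0)
    ]


def tree_search(root, budget=200):
    """Best-first search: always expand the highest-scoring unexplored node."""
    best, best_score = root, score(root)
    frontier = [(-best_score, root)]  # heapq is a min-heap, so negate scores
    seen = {root}                     # avoid re-scoring duplicate candidates
    for _ in range(budget):
        if not frontier:
            break
        _, node = heapq.heappop(frontier)
        for child in propose_children(node):
            if child in seen:
                continue
            seen.add(child)
            child_score = score(child)
            heapq.heappush(frontier, (-child_score, child))
            if child_score > best_score:
                best, best_score = child, child_score
    return best, best_score


best, best_score = tree_search(root=(0.0, 0.0))
print(best, best_score)
```

In this sketch the candidates are just pairs of numbers. In a system like the one described, each node would instead be a full program and the score would come from running it on held-out scientific data.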
The AI does not work in isolation. In the process, it can be guided by research ideas in papers or textbooks. These ideas can be provided directly by a user or retrieved automatically and incorporated into later versions of the code.
“This ability to integrate and recombine research ideas enables the system to find ‘needle-in-a-haystack’ solutions that human researchers might never get to test,” Brenner said.
To prove it, the Harvard and Google team applied the ERA system to a diverse set of scientific problems. Zhu’s role in the project was to use ERA to predict the activity of more than 70,000 neurons in the brain of a zebrafish and compare it against actual neural data.
In one experiment, the team prompted ERA to use an existing neuron-modeling library to build more physically accurate simulations of neural activity. Zhu would have needed weeks or months to learn the new software package, but ERA could assemble and tune the models automatically.
“This new system is going to accelerate scientific discovery by allowing you to explore a lot of ideas at the same time,” Zhu said. “Previously it might take you a week to implement some specific methods, but now you can just run them in parallel in a few hours.”
On one test, the ERA system generated 14 models for predicting COVID-19 hospitalizations that outperformed the best U.S. Centers for Disease Control models used during the pandemic.
In another experiment, ERA discovered four new methods for integrating single-cell RNA sequencing datasets, beating top human-designed approaches.
By reducing the time required for exploration of a set of ideas from months to hours or days, the new system could save significant time for scientists to focus on “truly creative and critical challenges, and to continue to define and prioritize the fundamental research questions and societal challenges that scientific research can help address,” according to a Google blog post about the breakthrough.
Journal
Nature
Method of Research
Computational simulation/modeling
Subject of Research
Not applicable
Article Title
An AI system to help scientists write expert-level empirical software
Article Publication Date
19-May-2026
Watching the detectors: Researchers probe efficacy – and danger – of AI detection tools
University of Florida
Image: Patrick Traynor, Ph.D., professor and interim chair, UF Department of Computer & Information Science & Engineering. Credit: University of Florida
Patrick Traynor, Ph.D., has questions.
When the professor and interim chair of the University of Florida Department of Computer & Information Science & Engineering saw reports in the media positing that scientific literature is increasingly being generated by artificial intelligence, he wondered, “How do they know?”
Traynor knows the detectors that determine the presence of AI-generated text, known as AIGT, in publications are themselves AI systems. They use the same large language models, called LLMs, that less-than-honest researchers could be using to generate their text.
How good could they be?
Spoiler alert: They’re not very good.
In a paper to be presented at this week’s 2026 IEEE Symposium on Security and Privacy, Traynor and his co-authors assert that current AIGT detectors are not effective or robust tools for determining the presence of AI-generated text. The results, the researchers said, indicate that commercially available AIGT detectors are “poorly suited for deployment in academic or high-stakes contexts.”
In other words, when the results matter, the tools are not effective. The paper, “AI Wrote My Paper and All I Got Was This False Negative: Measuring the Efficacy of Commercial AI Text Detectors,” co-authored by Seth Layton, Ph.D., Bernardo B.P. Madeiros and Kevin Butler, Ph.D., examined common commercially available solutions that test for AIGT and found wildly inconsistent efficacy rates: false positive rates between 0.05% and 68.6% and false negative rates between 0.3% and 99.6%.
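For readers unfamiliar with the two error rates, a minimal sketch of how they are computed may help. A false positive here is a human-written text flagged as AI-generated, and a false negative is an AI-generated text that passes as human. The function name, the labels, and the detector verdicts below are all hypothetical; they are not data from the study.

```python
def error_rates(labels, predictions):
    """Return (false positive rate, false negative rate).

    labels and predictions are lists of booleans where True means
    "AI-generated". labels hold the ground truth; predictions hold the
    detector's verdicts.
    """
    pairs = list(zip(labels, predictions))
    humans = sum(1 for truth, _ in pairs if not truth)
    ai_texts = sum(1 for truth, _ in pairs if truth)
    if humans == 0 or ai_texts == 0:
        raise ValueError("need at least one human and one AI-generated text")
    # Human-written texts wrongly flagged as AI-generated.
    false_positives = sum(1 for truth, pred in pairs if not truth and pred)
    # AI-generated texts wrongly passed as human-written.
    false_negatives = sum(1 for truth, pred in pairs if truth and not pred)
    return false_positives / humans, false_negatives / ai_texts


# Made-up example: four human-written papers followed by four AI clones.
labels = [False, False, False, False, True, True, True, True]
predictions = [False, False, False, True, True, False, False, False]

fpr, fnr = error_rates(labels, predictions)
print(f"false positive rate: {fpr:.0%}, false negative rate: {fnr:.0%}")
```

A detector can look strong on one rate and weak on the other, which is why both are reported as ranges across the tools tested.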
Researchers also found that with a simple tweak to the LLM, the detectors were rendered basically useless, incapable of distinguishing AIGT from human-generated text.
“These current tools are not reliable or robust enough to use to measure the problem,” co-author Traynor said. “We really can’t use them to adjudicate these decisions. People’s careers are on the line here.”
A recent article in Nature asserts the problem is widespread.
“The fear of many in the research community is that poor-quality or entirely fabricated research produced by large language models could overwhelm the ability of current quality-control systems to detect it, thereby polluting the scientific canon,” Nature reporter Miryam Naddaf noted in the article.
However, as Traynor and his team point out, this conclusion could not be reliably reached using current tools. In short, while it may feel like the use of AI in academic writing is widespread, current evidence to support that claim is inadequate.
The issue, of course, is personal for Traynor. He pointed out that in academia, individual merit is literally measured by an author’s intellectual output and publications. Suspicion or accusations of having used AI-generated text in scholarly submissions can stain a researcher’s reputation and can adversely affect their career.
The study’s methodology was very meta. Using a set of all papers submitted to top-tier security conferences prior to the advent of ChatGPT (about 6,000 papers), the researchers directed LLMs to create AIGT clones of the very same papers. The combined dataset was then evaluated by the five most popular commercial AIGT detectors on the market.
While two of the five detectors performed well, researchers found that by making trivial changes to the AIGT, the reliability fell dramatically. Researchers simply asked the LLM to generate the AI version of the papers using more complex vocabulary (researchers call this a lexical complexity attack), and the detectors were more easily fooled.
AIGT detectors, it seems, are easily swayed by fancy words.
And while the research concludes AIGT detectors pose high-stakes risks in academia, don’t think Traynor and his team are AI naysayers. They believe LLMs and AI have great potential to speed up science, to help us find new insights. But Traynor cautions against a prevailing notion that AI is all-knowing.
“It’s not an oracle. It doesn’t always know the answer,” Traynor said. “It’s happy to give us answers, but whether or not those answers have value, we still need people to figure that out. This paper shows us that for as many studies as we see claiming that a certain percentage of academic work is AI-generated, we actually don’t have tools to measure any of that.”
Co-author Layton agreed, adding that the research should remind the public to view all AI claims skeptically, just as scientists view all evidence with skepticism.
“We demand that such claims include substantial proof that they are correct,” Traynor reiterated.
