Wednesday, August 02, 2023

Speech deepfakes frequently fool humans, even after training on how to detect them


New study suggests improving automated detectors may be the best tactic to deal with speech deepfakes


Peer-Reviewed Publication

PLOS

Warning: Humans cannot reliably detect speech deepfakes

Image: The researchers suggest that training people to detect speech deepfakes is unrealistic, and efforts should focus on improving automated detectors.

Credit: Adrian Swancar, Unsplash, CC0 (https://creativecommons.org/publicdomain/zero/1.0/)




In a study involving more than 500 people, participants correctly identified speech deepfakes only 73 percent of the time, and efforts to train participants to detect deepfakes had minimal effects. Kimberly Mai and colleagues at University College London, UK, presented these findings in the open-access journal PLOS ONE on August 2, 2023.

Speech deepfakes are synthetic voices produced by machine-learning models. Deepfakes may resemble a specific real person’s voice, or they may be unique. Tools for making speech deepfakes have recently improved, raising concerns about security threats. For instance, they have already been used to trick bankers into authorizing fraudulent money transfers. Research on detecting speech deepfakes has primarily focused on automated, machine-learning detection systems, but few studies have addressed humans’ detection abilities.

Therefore, Mai and colleagues asked 529 people to complete an online activity that involved identifying speech deepfakes among multiple audio clips of both real human voices and deepfakes. The study was run in both English and Mandarin, and some participants were provided with examples of speech deepfakes to help train their detection skills.

Participants correctly identified deepfakes 73 percent of the time. Training participants to recognize deepfakes helped only slightly. Because participants were aware that some of the clips would be deepfakes—and because the researchers did not use the most advanced speech synthesis technology—people in real-world scenarios would likely perform worse than the study participants.

English and Mandarin speakers showed similar detection rates, though when asked to describe the speech features they used for detection, English speakers more often referenced breathing, while Mandarin speakers more often referenced cadence, pacing between words, and fluency.

The researchers also found that participants’ individual detection performance was worse than that of top-performing automated detectors. However, when responses were aggregated at the crowd level, participants performed about as well as automated detectors, and they coped better with unfamiliar conditions for which automated detectors may not have been directly trained.
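The benefit of crowd-level aggregation can be illustrated with a toy model. The sketch below is not the study’s analysis; it simply assumes each listener independently labels a clip correctly with probability 0.73 (the individual rate reported above) and computes how often a majority vote of an odd-sized panel gets the right answer:

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent listeners,
    each correct with probability p, votes for the right label.
    Uses an odd panel size so there are no ties."""
    assert n % 2 == 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

p = 0.73  # individual detection rate reported in the study
for n in (1, 5, 25):
    print(f"panel of {n}: {majority_accuracy(p, n):.3f}")
```

Under this independence assumption, even a small panel outperforms any single listener, which is consistent with the study’s observation that crowd-averaged human judgments approach the performance of automated detectors. Real listeners’ errors are correlated, so actual gains would be smaller.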

Speech deepfakes are likely only to become more difficult to detect. Given their findings, the researchers conclude that training people to detect speech deepfakes is unrealistic and that efforts should focus on improving automated detectors. In the meantime, they suggest, crowdsourced evaluation of potential deepfake speech is a reasonable mitigation.

The authors add: “The study finds that humans could only detect speech deepfakes 73% of the time, and performance was the same in English and Mandarin.”

#####

In your coverage please use this URL to provide access to the freely available article in PLOS ONE: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0285333

Citation: Mai KT, Bray S, Davies T, Griffin LD (2023) Warning: Humans cannot reliably detect speech deepfakes. PLoS ONE 18(8): e0285333. https://doi.org/10.1371/journal.pone.0285333

Author Countries: UK

Funding: KM and SB are supported by the Dawes Centre for Future Crime (https://www.ucl.ac.uk/future-crime/). KM is supported by EPSRC under grant EP/R513143/1 (https://www.ukri.org/councils/epsrc). SB is supported by EPSRC under grant EP/S022503/1. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Humans unable to detect over a quarter of deepfake speech samples


Peer-Reviewed Publication

UNIVERSITY COLLEGE LONDON




The study, published today in PLOS ONE, is the first to assess human ability to detect artificially generated speech in a language other than English.

Deepfakes are synthetic media intended to resemble a real person’s voice or appearance. They fall under the category of generative artificial intelligence (AI), a type of machine learning (ML) that trains an algorithm to learn the patterns and characteristics of a dataset, such as video or audio of a real person, so that it can reproduce original sound or imagery.

While early deepfake speech algorithms may have required thousands of samples of a person’s voice to be able to generate original audio, the latest pre-trained algorithms can recreate a person’s voice using just a three-second clip of them speaking [1]. Open-source algorithms are freely available and, while some expertise would be beneficial, it would be feasible for an individual to train them within a few days [2].

Tech firm Apple recently announced software for iPhone and iPad that allows a user to create a copy of their voice using 15 minutes of recordings [3].

Researchers at UCL used a text-to-speech (TTS) algorithm trained on two publicly available datasets, one in English and one in Mandarin, to generate 50 deepfake speech samples in each language. These samples were different from the ones used to train the algorithm to avoid the possibility of it reproducing the original input.

These artificially generated samples, along with genuine samples, were played for 529 participants to test whether they could distinguish real speech from fakes. Participants were only able to identify fake speech 73% of the time, and this improved only slightly after they received training to recognise features of deepfake speech.

Kimberly Mai (UCL Computer Science), first author of the study, said: “Our findings confirm that humans are unable to reliably detect deepfake speech, whether or not they have received training to help them spot artificial content. It’s also worth noting that the samples that we used in this study were created with algorithms that are relatively old, which raises the question whether humans would be less able to detect deepfake speech created using the most sophisticated technology available now and in the future.”

The next step for the researchers is to develop better automated speech detectors as part of ongoing efforts to create detection capabilities to counter the threat of artificially generated audio and imagery.

Though there are benefits from generative AI audio technology, such as greater accessibility for those whose speech may be limited or who may lose their voice due to illness, there are growing fears that such technology could be used by criminals and nation states to cause significant harm to individuals and societies.

Documented cases of deepfake speech being used by criminals include one 2019 incident where the CEO of a British energy company was convinced to transfer hundreds of thousands of pounds to a false supplier by a deepfake recording of his boss’s voice [4].

Professor Lewis Griffin (UCL Computer Science), senior author of the study, said: “With generative artificial intelligence technology getting more sophisticated and many of these tools openly available, we’re on the verge of seeing numerous benefits as well as risks. It would be prudent for governments and organisations to develop strategies to deal with abuse of these tools, certainly, but we should also recognise the positive possibilities that are on the horizon.”

Notes to Editors:

1. For more information, see the Microsoft website.

2. The Alan Turing Institute has published a report on voice cloning at scale, available here.

3. See the Apple website for more details.

4. More details of the case can be found on the Wall Street Journal.

For more information, please contact:

 Dr Matt Midgley

+44 (0)20 7679 9064

m.midgley@ucl.ac.uk

Publication:

Kimberly Mai et al. ‘Warning: humans cannot reliably detect speech deepfakes’ is published in PLOS ONE and is strictly embargoed until 2 August 2023 19:00 BST / 14:00 ET.

DOI: https://doi.org/10.1371/journal.pone.0285333

About UCL – London’s Global University

UCL is a diverse global community of world-class academics, students, industry links, external partners, and alumni. Our powerful collective of individuals and institutions works together to explore new possibilities.

Since 1826, we have championed independent thought by attracting and nurturing the world's best minds. Our community of more than 50,000 students from 150 countries and over 16,000 staff pursues academic excellence, breaks boundaries and makes a positive impact on real world problems.

We are consistently ranked among the top 10 universities in the world and are one of only a handful of institutions rated as having the strongest academic reputation and the broadest research impact.

We have a progressive and integrated approach to our teaching and research – championing innovation, creativity and cross-disciplinary working. We teach our students how to think, not what to think, and see them as partners, collaborators and contributors.  

For almost 200 years, we have been proud to open higher education to students from a wide range of backgrounds and to change the way we create and share knowledge.

We were the first in England to welcome women to university education and that courageous attitude and disruptive spirit is still alive today. We are UCL.

www.ucl.ac.uk | Follow @uclnews on Twitter | Read news at www.ucl.ac.uk/news/ | Listen to UCL podcasts on SoundCloud | Find out what’s on at UCL Minds
