Humans Struggle to Detect Deepfake Speech, New Research Reveals
New research from UCL has found that humans can identify artificially generated speech only 73% of the time. Surprisingly, accuracy was the same whether the speech was in English or Mandarin.
Deepfakes are synthetic media designed to mimic a real person’s voice or appearance. A form of generative artificial intelligence (AI), they use machine learning algorithms to learn the patterns and characteristics of real individuals from datasets of recordings or images, and then recreate that person’s sound or likeness in new material.
In the past, deepfake speech algorithms required a large number of samples to generate convincing audio. However, the latest pre-trained algorithms can recreate a person’s voice from just a three-second clip of them speaking. Open-source algorithms are freely available, and with some expertise, individuals can train them within a few days.
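To give a sense of how accessible this has become, the snippet below is a minimal sketch of zero-shot voice cloning using the open-source Coqui TTS library and its XTTS v2 model. This is an illustrative assumption rather than the system examined in the study, and the file names are hypothetical placeholders.

```python
# Minimal sketch of zero-shot voice cloning with the open-source
# Coqui TTS library (pip install TTS). Illustrative only: this is not
# the algorithm used in the UCL study, and the file paths below are
# hypothetical placeholders.
from TTS.api import TTS

# Load a pre-trained multilingual voice-cloning model (XTTS v2).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone a voice from a few seconds of reference audio and synthesize
# an arbitrary new sentence in that voice.
tts.tts_to_file(
    text="This sentence was never spoken by the reference speaker.",
    speaker_wav="reference_clip.wav",  # hypothetical ~3-second sample
    language="en",
    file_path="cloned_output.wav",
)
```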
Even tech giant Apple recently introduced software that allows users to create a replica of their own voice using only 15 minutes of recordings.
To investigate how well humans can distinguish real from fake speech, researchers at UCL used a text-to-speech (TTS) algorithm trained on two publicly available datasets, one in English and one in Mandarin, to generate 50 deepfake speech samples in each language. The samples were deliberately different from the sentences in the training data, so that the algorithm was not simply reproducing its original input.
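The article does not detail the generation pipeline, but the stimulus-creation step can be sketched with any pre-trained single-speaker TTS model. The example below assumes Coqui TTS’s publicly available VITS checkpoint trained on the English LJSpeech dataset; the held-out sentences are hypothetical, and this is not necessarily the model or data the study used.

```python
# Sketch of generating deepfake stimuli from held-out sentences with a
# pre-trained single-speaker TTS model. Assumptions for illustration:
# Coqui TTS's public VITS/LJSpeech checkpoint and made-up sentences,
# not the study's actual model or datasets.
from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/vits")

# Hypothetical held-out sentences: none appear in the training data,
# so the model must synthesize them from scratch rather than replay
# anything it was trained on.
held_out_sentences = [
    "The committee will meet again next Thursday afternoon.",
    "Rainfall this spring was well below the seasonal average.",
]

for i, sentence in enumerate(held_out_sentences):
    tts.tts_to_file(text=sentence, file_path=f"deepfake_{i:02d}.wav")
```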
These generated samples were played to 529 participants, alongside genuine recordings, to test their ability to tell real speech from fake. Participants identified fake speech only 73% of the time, and their accuracy improved only slightly after they were trained to recognize aspects of deepfake speech.
Kimberly Mai, first author of the study and affiliated with UCL Computer Science, commented on the findings: “Our research confirms that humans are unable to consistently detect deepfake speech, regardless of whether they have received training to identify artificial content. It is worth noting that the samples used in our study were created using relatively old algorithms, which raises concerns about human detection capabilities when faced with deepfake speech generated using the most advanced technology available today and in the future.”
The researchers’ next step is to develop more sophisticated automated speech detectors, part of an ongoing effort to build detection capabilities that can counter the threat of artificially generated audio and imagery.
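The article does not describe these detectors, but a common starting point is a classifier over acoustic features. The sketch below is one such baseline under stated assumptions: MFCC features extracted with librosa, pooled over time, and classified with scikit-learn logistic regression, reading from hypothetical real/ and fake/ directories of WAV files. It is not the detector being developed by the UCL team.

```python
# Baseline sketch of an automated deepfake-speech detector: MFCC
# features pooled over time, fed to a logistic-regression classifier.
# Illustrative assumptions: librosa and scikit-learn, plus hypothetical
# real/ and fake/ directories of WAV files. This is not the detector
# developed in the UCL study.
from pathlib import Path

import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def mfcc_features(path, sr=16000, n_mfcc=20):
    """Load a clip and summarize it as the mean/std of its MFCCs."""
    audio, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


X, y = [], []
for label, folder in enumerate(["real", "fake"]):  # hypothetical dirs
    for wav in Path(folder).glob("*.wav"):
        X.append(mfcc_features(wav))
        y.append(label)  # 0 = real, 1 = fake

X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), np.array(y), test_size=0.2, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

A simple baseline like this can beat the 50% chance level on older synthesis systems, but keeping pace with state-of-the-art generators typically requires learned features and far larger, more diverse training sets.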
While generative AI audio technology offers benefits, such as improving accessibility for people with limited speech or those who have lost their voice to illness, there are growing concerns that criminals and nation states could exploit it to cause harm.
Instances of criminals using deepfake speech have already been documented, including a 2019 case in which the CEO of a British energy company was convinced by a deepfake recording of his boss’s voice to transfer large sums of money to a fraudulent supplier.
Professor Lewis Griffin, senior author of the study and affiliated with UCL Computer Science, highlighted the need for governments and organizations to devise strategies to address the potential misuse of generative AI tools, while also acknowledging that the technology’s benefits should be recognized alongside these risks.
As generative AI technology advances and its tools become more accessible, a careful balance must be struck between harnessing its positive potential and protecting individuals and societies from its negative consequences.
In conclusion, deepfake speech remains a concerning and challenging phenomenon. The ability of AI algorithms to generate remarkably realistic audio poses significant risks that demand urgent attention. Researchers continue to improve detection capabilities, but a comprehensive strategy combining technological advances with societal awareness will be essential to navigate this evolving landscape of synthetic media.