Welcome to the wonderful world of voice cloning (VC). In the same way digital technology can be used to alter photos, VC can be used to synthesize speech in someone’s actual voice. All that’s needed is a five-second sample of the target speaker, and you can make them say anything!
“…Researchers …are now claiming that they can clone a person’s voice using just a 5-second clip. They explain that this can be done because they have trained a neural network, what we often call artificial intelligence or machine learning, on hours and hours of a wide variety of speakers so that it understands how we speak and then it can take a 5 second clip from an individual it has not heard before and clone a voice and get them to say things that were not in the clip.”
The researchers (their paper is hosted on arXiv, which is operated by Cornell University) call this “text-to-speech (TTS) synthesis.” From the abstract:
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech without transcripts from thousands of speakers, to generate a fixed-dimensional embedding vector from only seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2 that generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder network that converts the mel spectrogram into time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the multispeaker TTS task, and is able to synthesize natural speech from speakers unseen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
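The three-stage pipeline the abstract describes (speaker encoder → Tacotron 2-style synthesizer → WaveNet vocoder) is easier to picture as data flow. Here is a minimal sketch of that flow with toy stand-in functions; the embedding dimension, mel settings, and hop size are illustrative assumptions, and none of this is the authors’ actual model code:

```python
import numpy as np

EMBED_DIM = 256   # fixed-dimensional speaker embedding (size is an assumption)
N_MELS = 80       # mel channels, as used in Tacotron 2
HOP = 200         # waveform samples generated per mel frame (illustrative)

def speaker_encoder(reference_audio: np.ndarray) -> np.ndarray:
    """Stand-in for component (1): a few seconds of reference audio
    are mapped to a fixed-dimensional embedding vector."""
    seed = abs(int(reference_audio.sum() * 1e6)) % (2**32)
    v = np.random.default_rng(seed).standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)  # speaker embeddings are typically L2-normalized

def synthesizer(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for component (2): text plus speaker embedding yields a
    mel spectrogram. One frame per character here, purely for illustration."""
    n_frames = len(text)
    mel = np.random.default_rng(n_frames).standard_normal((N_MELS, n_frames))
    return mel + speaker_embedding[:N_MELS, None]  # condition on the speaker

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stand-in for component (3): mel spectrogram to time-domain waveform."""
    return np.tanh(np.repeat(mel.mean(axis=0), HOP))  # HOP samples per frame

# Wire the three independently trained stages together, as in the paper.
reference = np.sin(np.linspace(0, 1000, 16000 * 5))  # fake 5-second reference clip
embedding = speaker_encoder(reference)
mel = synthesizer("hello world", embedding)
waveform = vocoder(mel)
```

The key point the abstract makes is visible in the wiring: only the small embedding vector carries speaker identity into the synthesizer, which is why a previously unheard speaker can be cloned from seconds of audio.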
The team has provided some astonishing examples of how this sounds. You can listen to some of them in this two-minute video explaining the research. They provide actual recordings of five-second samples from five different speakers, and then completely synthesized phrases and sentences that sound remarkably like the actual speakers.
Do you find the idea that anyone could potentially be made to say anything a bit scary? Certainly the implications for court cases relying on audio-recording evidence get murky. In the age of “fake news”, things can go to a whole new level. So, yes, it is a bit scary. Granted, the technology hasn’t advanced to the point where trained listeners and researchers couldn’t distinguish the fake from the genuine, but how far away can the day be when that becomes a problem?
Developing the means to distinguish the genuine from the fabricated is proper research from an ID perspective. The designed “fake” speech would likely carry detectable characteristics that would support a design inference for the cloned audio. Such analysis would require a design paradigm.
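To make the idea of “detectable characteristics” concrete, here is a minimal sketch of one statistic such a detector might examine. Everything here is a fabricated illustration, not a validated forensic method: the “real” and “cloned” signals are just noise, and the 4 kHz cutoff is an arbitrary assumption; the point is only that vocoder-style smoothing leaves a measurable spectral fingerprint.

```python
import numpy as np

def hf_energy_ratio(waveform: np.ndarray, sr: int = 16000) -> float:
    """Fraction of spectral energy above 4 kHz -- one simple statistic a
    detector might check (neural vocoders often smear or attenuate high bands)."""
    spectrum = np.abs(np.fft.rfft(waveform)) ** 2
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / sr)
    return float(spectrum[freqs > 4000].sum() / spectrum.sum())

rng = np.random.default_rng(0)
# Toy stand-ins: "real" speech as broadband noise, "cloned" speech as the same
# noise passed through a short moving-average filter (mimicking vocoder smoothing).
real = rng.standard_normal(16000)
cloned = np.convolve(rng.standard_normal(16000), np.ones(8) / 8, mode="same")

# A practical detector would learn a decision threshold from labeled examples;
# here we simply compare the statistic for the two signals.
print(hf_energy_ratio(real) > hf_energy_ratio(cloned))
```

The low-passed “cloned” signal loses most of its high-frequency energy, so the ratio separates the two cases cleanly in this toy setting; real anti-spoofing systems combine many such features with a trained classifier.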