Voice Cloning Is Here. What Did You Say, Dear?


Welcome to the wonderful world of voice cloning (VC). Just as digital technology can be used to alter photos, VC can be used to synthesize speech in a specific person’s voice. All that’s needed is a five-second sample of the target speaker, and you can make them say anything!

“…Researchers …are now claiming that they can clone a person’s voice using just a 5-second clip. They explain that this can be done because they have trained a neural network, what we often call artificial intelligence or machine learning, on hours and hours of a wide variety of speakers so that it understands how we speak and then it can take a 5 second clip from an individual it has not heard before and clone a voice and get them to say things that were not in the clip.”

The researchers from Cornell University call this “Text-To-Speech Synthesis”. From the Abstract:

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech without transcripts from thousands of speakers, to generate a fixed-dimensional embedding vector from only seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2 that generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder network that converts the mel spectrogram into time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the multispeaker TTS task, and is able to synthesize natural speech from speakers unseen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
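For readers who think in code, the three-stage data flow described in the abstract can be sketched roughly as follows. This is only a toy illustration, not the researchers' implementation: every function is a hypothetical stand-in with no trained network behind it, and the sizes (EMBED_DIM, N_MELS, samples_per_frame) are made-up values chosen to keep the example small. It shows only how a short reference clip becomes a fixed-size speaker vector, how that vector conditions the text-to-spectrogram step, and how the spectrogram is then turned into waveform samples.

```python
import math
import random

EMBED_DIM = 8  # hypothetical size; real speaker embeddings are much larger
N_MELS = 4     # hypothetical number of mel channels (must be <= EMBED_DIM)


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def speaker_encoder(reference_frames):
    """Stage 1 stand-in: pool any number of frames into one fixed-size vector."""
    n = len(reference_frames)
    mean = [sum(f[i] for f in reference_frames) / n for i in range(EMBED_DIM)]
    return l2_normalize(mean)


def synthesizer(text, speaker_embedding):
    """Stage 2 stand-in: one fake 'mel frame' per character, shifted by the embedding."""
    bias = speaker_embedding[:N_MELS]
    return [[(ord(ch) % 32) / 32.0 + b for b in bias] for ch in text]


def vocoder(mel_frames, samples_per_frame=4):
    """Stage 3 stand-in: expand each frame into a few time-domain samples."""
    waveform = []
    for frame in mel_frames:
        level = sum(frame) / len(frame)
        for k in range(samples_per_frame):
            waveform.append(level * math.sin(2 * math.pi * k / samples_per_frame))
    return waveform


def clone_voice(reference_frames, text):
    """Chain the three stages: reference audio + text -> waveform samples."""
    embedding = speaker_encoder(reference_frames)
    mel = synthesizer(text, embedding)
    return vocoder(mel)


def random_speaker_embedding(rng):
    """A random unit vector, standing in for a sampled 'novel speaker'."""
    return l2_normalize([rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)])


if __name__ == "__main__":
    rng = random.Random(0)
    # Pretend these 10 frames are features taken from a short reference clip.
    clip = [[rng.random() for _ in range(EMBED_DIM)] for _ in range(10)]
    samples = clone_voice(clip, "hello")
    print(len(samples), "samples generated")
```

The point of the sketch is the separation of concerns: the speaker vector is computed once from the reference clip, independently of the text, which is why the same short sample can be reused to "say" any sentence. The last helper mirrors the abstract's final claim, where a randomly sampled embedding is fed to the synthesizer in place of one derived from a real person.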

The Cornell team has provided some astonishing examples of how this sounds. You can listen to some of them in this two-minute video explaining the research. They provide actual 5-second recordings of 5 different speakers, and then generate multiple fully synthesized phrases and sentences that sound remarkably like the actual speaker.

Do you find the idea that anyone could potentially be made to say anything a bit scary? Certainly the implications for court cases relying on audio recording evidence get murky. In the age of “fake news”, things can go to a whole new level. So, yeah, it is a bit scary. Granted, the technology hasn’t advanced to the point where trained listeners and researchers can’t tell the false from the genuine, but how far away can the day be when that becomes problematic?

Developing the means to distinguish the genuine from the fabricated is proper research from an ID perspective. The designed “fake” speech would likely have detectable characteristics that would lead to a design inference for the cloned speech. Such analysis would require a design paradigm.

Comments

As the movie The Matrix asked: "What is real?"
DonaldM, November 16, 2019, 09:49 AM PDT

This would be a godsend for some politicians. They could claim they were misquoted, and then if given a recording, say that their voice was synthesized: "no I never said that". It could also be used to confuse legal matters, as someone could swap another person's voice for something said, thereby confusing a judge or jury. With care, it could even be done with video recordings - changing just enough similar-sounding words to shift the meaning without a lip reader knowing any difference. All sorts of possibilities for disrupting things, undermining evidence, placing false audio for fake news.
Fasteddious, November 15, 2019, 12:51 PM PDT

"This has been possible, and demonstrated, for a LONG time. Digital speech processing could do this in the ’70s." Not quite. Certainly we've had the ability to synthesize speech (and even singing) for quite a while. But the ability to extract the timbre of a specific voice, coupled with the ability to input any text or phrase we want the voice to say, that's new. There's been some of that available with things like Symphonic Choirs for music production, for example. But even a casual listener, listening to the track in isolation, would know it was synthesized speech (or singing). The VC program takes all that to a whole new level we haven't seen before.
DonaldM, November 15, 2019, 11:58 AM PDT

This has been possible, and demonstrated, for a LONG time. Digital speech processing could do this in the '70s. The only difference is that it takes a whole lot less time and labor with neural networks.
polistra, November 15, 2019, 09:30 AM PDT
