It takes just 3.7 seconds of audio to clone a voice. This impressive (and a bit alarming) feat was announced by Chinese tech giant Baidu. A year ago, the company's voice cloning tool, Deep Voice, required 30 minutes of audio to do the same. This illustrates just how fast the technology to create artificial voices is advancing. In a short time, AI voice generation has become both more capable and more realistic, which makes it easier for the technology to be misused.
Capabilities of AI Voice Generation
As with all artificial intelligence algorithms, the more data voice cloning tools such as Deep Voice receive for training, the more realistic the results. When you listen to several cloning examples, it's easier to appreciate the breadth of what the technology can do, including switching the gender of a voice and altering accents and styles of speech.
Google unveiled Tacotron 2, a text-to-speech system that combines a deep neural network with the company's speech generation method, WaveNet. The network turns text into a spectrogram, a visual representation of how the frequencies in audio change over time, and WaveNet turns that spectrogram into audio. WaveNet is also used to generate the voice for Google Assistant. This iteration of the technology is so good that it's nearly impossible to tell which voice is AI generated and which is human. The algorithm has learned how to pronounce challenging words and names that would once have been a tell-tale sign of a machine, as well as how to better enunciate words.
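For readers curious what a spectrogram looks like as data, here is a minimal sketch in plain Python. It is not Google's code. Tacotron 2 works with mel-scaled spectrograms, while this illustration computes a simple linear-frequency one. The function name, frame size, hop size and test tone are all assumptions chosen for the example.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Slice audio into overlapping frames and measure the strength of each
    frequency in every frame. Returns one list of magnitudes per frame,
    covering frequency bins 0 .. frame_size // 2."""
    # A Hann window fades each frame in and out to reduce edge artifacts.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        bins = []
        for k in range(frame_size // 2 + 1):
            # Naive discrete Fourier transform for frequency bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# Hypothetical input: a pure 1,000 Hz tone sampled 8,000 times per second.
sample_rate = 8000
tone_hz = 1000
samples = [math.sin(2 * math.pi * tone_hz * t / sample_rate)
           for t in range(512)]

spec = magnitude_spectrogram(samples)

# The strongest bin in the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(len(spec), "frames;", "strongest frequency:", peak_hz, "Hz")
```

Each row of the result describes a brief slice of time and each column a frequency band, which is why a spectrogram can be drawn as an image. A speech system works with such grids rather than with raw sound waves for part of its pipeline.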
These advances in Google's voice generation technology have allowed Google Assistant to offer celebrity cameos. John Legend's voice is now an option on any device in the United States with Google Assistant, such as Google Home, Google Home Hub, and smartphones. The crooner's voice will only respond to certain questions, such as "What's the weather" and "How far away is the moon," and is available to sing happy birthday on command. Google anticipates that we'll soon have more celebrity cameos to choose from.
Another example shows just how precise the technology has become: an AI model of Jordan Peterson (the author of 12 Rules for Life) sounds just like him rapping Eminem's "Lose Yourself." The creator of the model used just six hours of Peterson talking, taken from readily available recordings of him online, to train the machine learning algorithm. It takes short audio clips and learns how to synthesise speech in the style of the speaker. Take a listen, and you'll hear just how successful it was.
This advanced technology opens the door for companies such as Lyrebird to provide new services and products. Lyrebird uses artificial intelligence to create voices for chatbots, audiobooks, video games, text readers and more. The company acknowledges on its website that "with great innovation comes great responsibility," underscoring how important it is for pioneers of this technology to take great care to avoid its misuse.
How This Technology Could Be Misused
Like other new technologies, artificial voice can have many benefits but can also be used to mislead people. As the AI algorithms get better and it becomes difficult to discern what's real and what's artificial, there will be more opportunities to use them to fabricate the truth.
According to research, our brains don't register significant differences between real and artificial voices. In fact, it's harder for our brains to detect fake voices than to detect fake images.
Now that these AI systems need only a short amount of training audio to create a viable artificial voice that mimics an individual's speaking style and tone, the opportunity for abuse increases. So far, researchers have not been able to identify a neural mechanism by which the brain distinguishes real voices from fake ones. Consider how artificial voices might be used in an interview, news segment or press conference to make listeners believe they are hearing an authority figure in the government or the CEO of a company.
Raising awareness that this technology exists, and of how sophisticated it is, will be the first step toward safeguarding listeners from artificial voices that are used to mislead them. The real fear is that people can be fooled into acting on something fake because it sounds like it's coming from somebody real. Some people are attempting to find a technical solution to safeguard us. However, no technical solution will be 100% foolproof. Our ability to critically assess a situation, evaluate the source of information and verify its validity will become increasingly important.