When Machines Mimic Your Voice

Last month, in a Facebook post, Prime Minister Lee Hsien Loong drew attention to what is purportedly the first publicly reported AI heist. The known facts of the case are simple. On a Friday afternoon in March this year, the managing director of a British energy company was duped into believing that he was on a voice call with his boss. The voice at the other end demanded that he immediately wire hundreds of thousands of dollars to a supplier in Hungary to save the company from late-payment fines. Apart from the detailed homework done by the thief or thieves to make the request appear convincing, what made the subordinate comply was the quality of the fake voice. As the unfortunate managing director later recalled, the voice was strikingly realistic, down to the tonality, punctuation and German accent.

While the case might seem Hollywood-like, those in the AI community are already cognizant of the possibilities of such technologies. Back in 2017, the startup Lyrebird published a video of what appeared to be Barack Obama making a pitch for the company. Judge for yourself how realistic the voice sounds.

Published by Lyrebird on YouTube (2017 Sep 4)

According to the company, now the AI research division of Descript, only one minute of audio data was required to generate the Obama pitch. Running on publicly available cloud compute resources, machine learning models were trained to determine the features that make every voice unique, such as accent, pitch, cadence and speed. These features are not pre-defined by humans but learned automatically by the models over training cycles.

Fast forward to 2019 and we can readily appreciate the progress since then. At the Neural Information Processing Systems Conference (formerly NIPS, now renamed NeurIPS) at the end of last year, Google, building upon the previous work of others, published a paper detailing how speech audio in the voices of many different speakers, including those unseen during training, can be generated with the help of transfer learning. To provide a simplified technical view, the system can be broken down into three components, each trained independently.

Simplified view of system (Original source)
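For readers who think in code, the data flow between the three components can be sketched in a few lines of Python. In the paper the components are a speaker encoder, a synthesizer and a vocoder. Everything below is a toy stand-in rather than the actual models: the function bodies, the constants EMBED_DIM, N_MELS and HOP, and the clone_voice helper are all hypothetical and exist only to show what is passed from one stage to the next.

```python
import math
from typing import List

EMBED_DIM = 4  # hypothetical size; real speaker embeddings are much larger
N_MELS = 3     # hypothetical number of spectrogram bands
HOP = 8        # hypothetical number of waveform samples per spectrogram frame


def speaker_encoder(reference_audio: List[float]) -> List[float]:
    """Toy stand-in for component 1: clip of any length -> fixed-size unit vector."""
    chunk = max(1, len(reference_audio) // EMBED_DIM)
    raw = [
        sum(abs(x) for x in reference_audio[i * chunk:(i + 1) * chunk])
        for i in range(EMBED_DIM)
    ]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
    return [v / norm for v in raw]


def synthesizer(text: str, embedding: List[float]) -> List[List[float]]:
    """Toy stand-in for component 2: text + speaker vector -> spectrogram frames."""
    frames = []
    for ch in text:
        base = (ord(ch) % 32) / 32.0
        # Each frame depends on the text AND on the speaker embedding.
        frames.append([base + embedding[b % EMBED_DIM] for b in range(N_MELS)])
    return frames


def vocoder(frames: List[List[float]]) -> List[float]:
    """Toy stand-in for component 3: spectrogram frames -> waveform samples."""
    waveform = []
    for frame in frames:
        level = sum(frame) / len(frame)
        waveform.extend(level * math.sin(2 * math.pi * n / HOP) for n in range(HOP))
    return waveform


def clone_voice(text: str, reference_audio: List[float]) -> List[float]:
    """Chain the three stages; only the embedding carries speaker identity."""
    embedding = speaker_encoder(reference_audio)
    return vocoder(synthesizer(text, embedding))
```

The point of the sketch is the interface, not the arithmetic: the speaker's identity is compressed into a single fixed-size vector, so a voice unseen during training needs only a short reference clip to be encoded, while the synthesizer and vocoder remain unchanged.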

A more in-depth technical explanation is beyond the scope of this article, and the interested reader is encouraged to follow the references. Readers who are more inclined toward engineering work may be interested to note that an open source implementation of the system was recently released by a master's graduate from the University of Liège. Do check it out and see how well it works.

In the 1991 action film Terminator 2: Judgment Day, John Connor's foster mother is killed by the T-1000 assassin android, which then mimics the way she talks. While other properties of the shape-shifting machine still belong firmly in the realm of fiction, the ability to clone a human being's voice no longer appears very far-fetched.

Top YouTube comment: “This scene is about two robots talking to each other, both trying to convince the other that they’re human.”

References & Further Readings