- The new deep learning model named MelNet can produce human intonation with uncanny accuracy.
- Once trained, it can reproduce anybody’s voice in clips lasting a few seconds.
- Researchers demonstrate how precisely it can clone Bill Gates’ voice.
There have been huge advances in machine learning techniques in recent years. These techniques now work remarkably well at recognizing objects and faces and at generating realistic images.
When it comes to audio, however, artificial intelligence has been something of a disappointment. Even the best text-to-speech systems lack basic features such as changes in intonation. Think of the machine-generated voice used by Stephen Hawking: its flat delivery sometimes made his sentences genuinely hard to follow.
Now, scientists at Facebook AI Research have developed a method to overcome the limitations of existing text-to-speech systems. They have built a generative model — named MelNet — that can produce human intonation with uncanny accuracy. In fact, it can speak fluently with anybody’s voice.
How Is MelNet Different From Existing Machine Speech?
Most deep-learning speech algorithms are trained on large audio databases to reproduce real speech patterns. The major issue with this approach is the type of training data. Typically, these algorithms are trained on audio waveform recordings, which have complex structure at drastically varying timescales.
These recordings represent how the amplitude of sound varies with time: a single second of audio contains tens of thousands of time steps, and the waveform carries meaningful structure at many different scales.
Existing generative models of waveforms (such as SampleRNN and WaveNet) can only backpropagate through a fraction of a second. Therefore, they cannot capture the high-level structure that emerges on the scale of several seconds.
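To make the timescale problem concrete, here is a back-of-the-envelope sketch in Python. The sample rate and gradient-window size are illustrative assumptions, not figures from the MelNet paper:

```python
# Rough illustration of why waveform-level autoregression struggles with
# long-range structure. All numbers below are illustrative assumptions.
sample_rate = 16_000        # samples per second, a common rate for speech audio
clip_seconds = 5
total_samples = sample_rate * clip_seconds   # 80,000 time steps in a 5-second clip

bptt_window = 8_000         # hypothetical truncated-backprop window, in samples
coverage = bptt_window / sample_rate         # fraction of a second per gradient window

print(f"{total_samples} samples in {clip_seconds} s of audio")
print(f"gradients only flow across {coverage:.2f} s of that clip")
```

With numbers like these, any structure that unfolds over several seconds simply never fits inside the window the model learns from.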
MelNet, on the other hand, uses spectrograms (instead of audio waveforms) to train deep-learning networks. Spectrograms are 2D time-frequency representations that show the entire spectrum of audio frequencies and how they vary with time.
Spectrogram and waveform patterns of the same 4-second audio content
While a 1D time-domain waveform captures the change over time of a single variable (amplitude), a spectrogram captures how energy is distributed across many frequencies at each moment. Audio information is therefore packed far more densely in spectrograms.
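To see this density difference in practice, the snippet below computes a log-mel spectrogram with the librosa library. It is only a sketch: the file path, sample rate, hop length, and number of mel bins are common defaults chosen for illustration, not MelNet’s actual settings.

```python
import librosa

# Load a mono audio file at 16 kHz (the path and sample rate are placeholders).
y, sr = librosa.load("speech.wav", sr=16_000)

# 2D time-frequency representation: n_mels frequency bins per frame.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)

print("waveform time steps:", y.shape[0])         # tens of thousands per second
print("spectrogram frames:", log_mel.shape[1])    # roughly 62 frames per second here
print("spectrogram shape:", log_mel.shape)        # (80 mel bins, frames)
```

A few dozen frames per second is a far shorter sequence than tens of thousands of waveform samples, which is what lets a model reason over several seconds at once.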
This enables MelNet to produce unconditional speech and music samples with consistency over several seconds. It is also capable of conditional speech generation and text-to-speech synthesis, entirely end-to-end.
Reference: arXiv:1906.01083 | GitHub
Spectrograms are a lossy representation and can yield over-smoothed audio. To reduce the information loss, the researchers modeled high-resolution spectrograms; to limit over-smoothing, they used a highly expressive autoregressive model.
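For intuition about what “autoregressive over spectrograms” means, here is a toy PyTorch sketch that predicts each spectrogram frame from the frames before it. It is not MelNet’s architecture (the paper describes a far more expressive multi-scale model that also factorizes within each frame); the class name, layer sizes, and training step are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameAutoregressor(nn.Module):
    """Toy autoregressive model over spectrogram frames (not MelNet itself)."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, frames):
        # frames: (batch, time, n_mels); predict frame t from frames < t
        # by shifting the input one step to the right.
        shifted = torch.cat([torch.zeros_like(frames[:, :1]), frames[:, :-1]], dim=1)
        h, _ = self.rnn(shifted)
        return self.out(h)

# One training step on fake data: minimize error between predicted and actual frames.
model = FrameAutoregressor()
spec = torch.randn(4, 200, 80)       # a batch of fake log-mel spectrograms
loss = nn.functional.mse_loss(model(spec), spec)
loss.backward()
```

The key idea this sketch shares with the real model is the factorization: each frame is generated conditioned on everything generated so far, which is what gives the output coherence across time.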
Results Are Impressive
The researchers trained MelNet on numerous TED talks, and it was then able to reproduce a speaker’s voice saying arbitrary phrases lasting a few seconds. Below are two examples of MelNet using Bill Gates’ voice to say random phrases.
“Port is a strong wine with a smoky taste.”
“We frown when events take a bad turn.”
More examples are available on GitHub.
Although MelNet creates remarkably lifelike audio clips, it cannot yet generate longer sentences or full paragraphs. Nevertheless, the system could improve human-computer interaction.
Many customer-care conversations involve short phrases. MelNet could be used to automate such interactions or to replace current automated voice systems, improving the caller experience.
Read: Facebook AI Converts Music From One Style To Another
On a negative note, the technology raises the specter of a new era of fake audio content. And like other advances in artificial intelligence, it raises more ethical questions than it answers.