- Google has developed Tacotron 2, a system that makes machine-generated speech sound less robotic and more human.
- They used neural networks trained on text transcripts and speech examples.
- The system synthesizes speech with WaveNet-level audio quality and Tacotron-level prosody.
Research on generating natural speech from a given text (text-to-speech synthesis, TTS) has been going on for decades. In the last few years, there has been impressive progress.
You are probably familiar with Google’s voice service, which is available in both male and female voices. The robotic voice of assistants such as Microsoft’s Cortana or Apple’s Siri is a staple of our culture. Over the years, Google’s AI voice has come to sound less robotic and more human, and it is now almost indistinguishable from a human voice.
Google engineers incorporated ideas from past work such as WaveNet and Tacotron, and enhanced the techniques to arrive at a new system, Tacotron 2. To achieve human-like speech, they used neural networks trained only on text transcripts and speech examples, rather than using complicated linguistic and acoustic features as input.
The system contains two main components:
- A recurrent sequence-to-sequence feature prediction network, optimized for TTS, that maps a sequence of letters to a sequence of features encoding the audio.
- An improved version of WaveNet that produces time-domain waveform samples based on the predicted spectrogram frames.
Tacotron 2’s model architecture
The sequence-to-sequence model predicts an 80-dimensional audio spectrogram, with one frame computed every 12.5 milliseconds, that captures the words as well as speed, volume and intonation. An enhanced version of WaveNet then converts these features into a 24 kHz waveform of 16-bit samples.
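To make these numbers concrete, here is a minimal Python sketch of the arithmetic they imply. It is not taken from the Tacotron 2 code: the function names are hypothetical, and the signed-integer sample range is an assumption about how the 16-bit samples are stored.

```python
from __future__ import annotations

# Values quoted in the article.
SAMPLE_RATE_HZ = 24_000   # output waveform sample rate
FRAME_HOP_MS = 12.5       # one spectrogram frame every 12.5 ms
N_FEATURES = 80           # features per spectrogram frame
BITS_PER_SAMPLE = 16      # resolution of each output sample


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      hop_ms: float = FRAME_HOP_MS) -> int:
    """Waveform samples that correspond to one spectrogram frame."""
    return round(sample_rate_hz * hop_ms / 1000)


def spectrogram_shape(duration_s: float) -> tuple[int, int]:
    """(frames, features) for an utterance of the given length."""
    frames = round(duration_s * 1000 / FRAME_HOP_MS)
    return frames, N_FEATURES


def sample_range(bits: int = BITS_PER_SAMPLE) -> tuple[int, int]:
    """Smallest and largest value of a signed integer sample (assumed layout)."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


if __name__ == "__main__":
    print(samples_per_frame())      # 300
    print(spectrogram_shape(2.0))   # (160, 80)
    print(sample_range())           # (-32768, 32767)
```

In other words, every predicted frame of 80 values has to be expanded into 300 waveform samples, which gives a rough sense of how much work the WaveNet stage does per second of audio.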
The resulting system synthesizes speech with WaveNet-level audio quality and Tacotron-level prosody. It can be trained on data without relying on any complicated feature engineering, and achieves state-of-the-art sound quality very close to that of a natural human voice.
Unlike much of the company’s other core artificial intelligence research, this technology is immediately useful to Google. For instance, WaveNet, which first appeared in 2016, is now used in Google Assistant. Tacotron 2 would be a more powerful addition to the service.
Reference: arXiv | 1712.05884
Below are some samples. For each sentence, one recording is generated by the artificial intelligence program and the other is spoken by a human. Can you figure out which one is the AI?
“That girl did a video about Star Wars lipstick.”
“George Washington was the first President of the United States.”
“She earned a doctorate in sociology at Columbia University.”
In an evaluation, Google asked human listeners to rate the naturalness of the speech. The model achieved a Mean Opinion Score (MOS) of 4.53, comparable to the 4.58 MOS for professionally recorded speech.
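For readers unfamiliar with the metric, a MOS is simply the average of listener ratings. The sketch below is illustrative only: it assumes the conventional 1 to 5 rating scale, the ratings are made up rather than taken from Google's evaluation, and the confidence interval uses a plain normal approximation.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Average listener rating, assuming a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


def confidence_half_width(ratings, z=1.96):
    """Half-width of an approximate 95% confidence interval around the MOS."""
    return z * stdev(ratings) / sqrt(len(ratings))


if __name__ == "__main__":
    # Hypothetical ratings from eight listeners for one synthesized sentence.
    ratings = [5, 4, 5, 4.5, 4, 5, 4.5, 4]
    print(mean_opinion_score(ratings))                 # 4.5
    print(round(confidence_half_width(ratings), 3))
```

With only a handful of ratings the interval is wide, which is why a gap as small as 4.53 versus 4.58 only means something when many ratings are collected.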
More Samples: Google.Github.io
Additional Capabilities of Tacotron 2
It can pronounce complex and out-of-context words.
“Basilar membrane and otolaryngology are not auto-correlations.”
It is robust to spelling errors.
“Thisss isrealy awhsome.”
It learns stress and intonation (capitalizing words changes the overall intonation).
“The buses aren’t the problem, they actually provide a solution.”
“The buses aren’t the PROBLEM, they actually provide a SOLUTION.”
It is good at tongue twisters.
“Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?”
The samples sound great, but there are still a few problems to be solved. The system has trouble pronouncing some complicated words, such as “merlot” and “decorum”, and in extreme cases it randomly produces strange noises.
For now, the system can’t generate audio in real time, and the generated speech can’t be controlled, for example by directing it to sound sad or happy. Furthermore, it is trained to mimic only one female voice; to speak like another woman or like a man, developers would need to train the system again.