- Google develops a new AI that can focus on a particular voice in a crowded area.
- It uses a combination of both visual and auditory signals to separate the voices.
- The technology also has the potential to provide a better video captioning system for overlapping speakers by serving as a pre-processing step for speech recognition.
Humans are exceptionally good at picking out a particular voice in a crowded area while mentally muting all other sounds. However, this remains a tough challenge for machines: they are still not good at separating individual speech when two or more people are talking, or in the presence of background noise.
Now Google has developed an audio-visual model based on deep learning that can isolate a single audio signal from a mixture of voices and background noise. The AI analyzes the video and enhances the voices of selected people while suppressing all other sounds.
It doesn’t require any special audio or video format; it works on all common video formats with a single audio track. Users can select the face of the person they want to listen to in a video, or let the algorithm choose one based on context.
The technology uses a combination of the visual and auditory signals of a video to separate the voices. The algorithm can identify which person is currently speaking based on the movements of their mouth. These visual signals significantly improve the quality of speech separation in mixed speech and associate the separated audio tracks with the visible speakers.
How It Was Made
Engineers collected a large number of high-quality YouTube videos of talk shows and lectures to produce training samples, then filtered them down to 2,000 hours of clips containing clean speech, with no audience noise, mixed-in music, or other background interference.
They then used this content to create synthetic mixtures that combine face videos and their associated clean speech with background noise from different sources. On these mixtures, they trained a multi-stream convolutional neural network to separate the voices of individual speakers from mixed-speech video.
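The mixing step can be sketched as follows. This is a minimal illustration, not the exact recipe used: `mix_at_snr`, its SNR-based scaling, and the toy signals are assumptions for demonstration.

```python
import numpy as np

def mix_at_snr(clean_speech, noise, snr_db):
    """Mix a clean speech waveform with noise at a target SNR (in dB).

    Both inputs are 1-D float arrays at the same sample rate; the noise
    is scaled so the mixture has the requested signal-to-noise ratio.
    """
    speech_power = np.mean(clean_speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale factor that brings the noise to the desired level.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return clean_speech + scale * noise

# Toy example: a sine "voice" plus random noise at 5 dB SNR.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
mixture = mix_at_snr(speech, noise, snr_db=5.0)
```

Because the clean speech and the interference are generated separately, the clean target for each training sample is known exactly, which is what makes supervised training possible.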
Both a spectrogram representation of the soundtrack and face thumbnails of the speakers in each frame (extracted from the video) are fed into the neural network. During training, the network gradually learns how to encode the auditory and visual signals and fuse them into a joint audio-visual representation.
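One common reading of "fuse them together" is concatenating per-time-step audio and visual features into a single joint feature vector. The sketch below assumes made-up feature sizes and placeholder encoder outputs; it only illustrates the fusion idea, not the paper's exact architecture.

```python
import numpy as np

T = 75  # time steps (e.g. 3 s of video at 25 fps, with audio features aligned)

# Placeholder encoder outputs: one feature vector per time step per stream.
audio_features = np.random.default_rng(2).standard_normal((T, 256))
visual_features = np.random.default_rng(3).standard_normal((T, 128))

# Fuse by concatenating along the feature axis; downstream layers then
# operate on the joint audio-visual representation.
fused = np.concatenate([audio_features, visual_features], axis=1)
print(fused.shape)  # (75, 384)
```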
Meanwhile, the network also learns to produce a time-frequency mask for each speaker. It then multiplies the noisy input spectrogram by a speaker's mask to output that speaker's clean speech while suppressing interference and noise.
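The masking step itself is a simple element-wise multiplication. A minimal sketch, assuming real-valued ratio-style masks that partition the mixture (the placeholder masks here stand in for the network's predictions):

```python
import numpy as np

# Noisy mixture spectrogram (frequency bins x time frames), complex-valued.
rng = np.random.default_rng(1)
mixture_spec = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))

# Per-speaker time-frequency masks predicted by the network (placeholders
# here); each entry says how much of a bin belongs to that speaker.
mask_a = rng.uniform(0, 1, size=(257, 100))
mask_b = 1.0 - mask_a  # in the simplest case, the masks partition the mixture

# Element-wise multiplication isolates each speaker's spectrogram;
# an inverse STFT would then turn each one back into a waveform.
speech_a = mask_a * mixture_spec
speech_b = mask_b * mixture_spec
```

In this idealized case the two masked spectrograms sum back to the original mixture; the network's real masks only approximate such a partition.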
The network is implemented in TensorFlow (an open-source machine learning framework), whose built-in operations are used to perform the waveform processing and the short-time Fourier transform (STFT). All network layers, excluding the mask layer, are followed by Rectified Linear Unit (ReLU) activations.
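The STFT step can be sketched in plain NumPy (TensorFlow provides it as a built-in op). The frame length and hop below, 400 and 160 samples at 16 kHz, are illustrative choices, not necessarily the ones used:

```python
import numpy as np

def stft(waveform, frame_length=400, hop=160):
    """Short-time Fourier transform: split the waveform into overlapping
    windowed frames and take the FFT of each (real input -> rfft)."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(waveform) - frame_length) // hop
    frames = np.stack([
        waveform[i * hop : i * hop + frame_length] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=-1)  # shape: (frames, frame_length//2 + 1)

# One second of 16 kHz audio -> a (time frames x frequency bins) spectrogram.
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = stft(audio)
print(spec.shape)  # (98, 201)
```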
Batch normalization is applied after all convolutional layers. Training used a batch size of 6 samples and ran for 5 million steps (batches). Audio is resampled to 16 kHz, and stereo audio is downmixed to mono before computing the short-time Fourier transform.
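The audio preprocessing described above can be sketched like this; linear interpolation serves here as a simple stand-in for a proper band-limited resampler, and the exact resampling method used is an assumption:

```python
import numpy as np

def preprocess_audio(stereo, orig_sr, target_sr=16_000):
    """Downmix stereo to mono and resample to the target rate.

    Linear interpolation is used as a simple stand-in for a proper
    band-limited resampler.
    """
    mono = stereo.mean(axis=1)               # average the two channels
    n_out = int(len(mono) * target_sr / orig_sr)
    old_t = np.arange(len(mono)) / orig_sr   # original sample times
    new_t = np.arange(n_out) / target_sr     # target sample times
    return np.interp(new_t, old_t, mono)

# Two seconds of 44.1 kHz stereo -> two seconds of 16 kHz mono.
stereo = np.zeros((88_200, 2))
mono16k = preprocess_audio(stereo, orig_sr=44_100)
print(mono16k.shape)  # (32000,)
```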
All face embeddings are resampled to 25 frames per second before training, which for a 3-second clip results in an input visual stream of 75 face embeddings. Zero vectors are used in place of embeddings for frames in which a face was missing.
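Building that visual input stream can be sketched as follows; the embedding size (1024) and the dictionary-of-detections representation are assumptions for illustration:

```python
import numpy as np

FPS = 25          # embeddings are resampled to 25 frames per second
CLIP_SECONDS = 3  # so each training clip contributes 75 embeddings
EMB_DIM = 1024    # illustrative embedding size

def build_visual_stream(detected, n_frames=FPS * CLIP_SECONDS, dim=EMB_DIM):
    """Stack per-frame face embeddings; frames where the face was not
    detected get a zero vector, as described above."""
    stream = np.zeros((n_frames, dim))
    for frame_idx, emb in detected.items():
        stream[frame_idx] = emb
    return stream

# Example: the face was only detected in two of the 75 frames.
detected = {0: np.ones(EMB_DIM), 40: np.full(EMB_DIM, 0.5)}
stream = build_visual_stream(detected)
print(stream.shape)  # (75, 1024)
```

Padding with zero vectors keeps the input tensor a fixed shape, so clips with intermittent face detections can share the same network architecture.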
The technology could have countless applications, from speech recognition in videos to speech enhancement, especially where multiple people are speaking. It could also broaden the range of microphones and acoustic environments in which such systems work. For now, YouTube and Hangouts seem like two natural places to begin; ultimately, it could be applied to voice-amplifying earbuds or a device like Google Glass.
The technique also has the potential to provide a better video captioning system for overlapping speakers by serving as a pre-processing step for speech recognition. That would make it easier for deaf and hard-of-hearing viewers to participate in teleconferences and enjoy videos.