- A new self-supervised machine learning system named PixelPlayer can locate specific parts of the image that generate sounds in a music video.
- It can separate and alter the volume of individual (unlabeled) instruments.
Musicians spend hours watching YouTube videos to learn how to play certain parts of their favorite songs. Wouldn’t it be great if you could play a music video and listen to only the instrument you wanted to learn or hear?
Recently, researchers at MIT set out to make this possible. They developed a deep learning system that can analyze music videos and isolate the sounds coming from particular instruments, making them softer or louder as desired.
It’s a self-supervised machine learning system, meaning it does not require you to specify what the instruments are or what sounds they produce. Let’s dig deeper and find out how they developed it.
The system, named PixelPlayer, is trained on more than 60 hours of unlabeled videos, so that it can learn to locate the specific parts of an image that generate sound and separate the input audio into components, each representing the sound associated with a particular pixel.
Until now, scientists have relied on audio alone to split the sources of sound. PixelPlayer brings in a new element: vision. Since vision provides the self-supervision signal, labeling each instrument is not necessary.
The technique exploits the natural synchronization between audio and video to jointly parse images and sounds. For instance, it can take a video of a trumpet-and-tuba duet and separate the sound coming from each instrument.
Image credit: MIT/CSAIL
More specifically, the AI splits the input sound into ‘N’ channels, where each channel corresponds to a different instrument category. It can also localize the sound sources, assigning a different sound wave to each pixel in the input clip.
One neural network processes the audio, another processes the visuals, and a third ‘synthesizer’ network links specific pixels with sound waves. Although the researchers could not explain every aspect of how the model learns which sounds are produced by which instruments, they reported that the system picks up on real acoustic characteristics.
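The three-network design can be illustrated with a minimal numpy sketch. This is an illustrative mock-up, not the authors' actual code: all array shapes, names, and the random stand-ins for the networks are assumptions. The idea is that the audio network predicts N spectrogram masks (one per source channel), the video network produces a per-pixel weight vector over those channels, and the synthesizer combines them to recover the sound assigned to any clicked pixel.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 4            # number of source channels (instrument slots) -- illustrative
F, T = 256, 64   # spectrogram size: frequency bins x time frames
H, W = 14, 14    # spatial resolution of the video feature map

# Stand-ins for the learned networks' outputs:
mixture = np.abs(rng.standard_normal((F, T)))  # input mixture magnitude spectrogram
masks = rng.random((N, F, T))                  # "audio net": one soft mask per channel
channels = masks * mixture                     # N separated spectrogram channels
pixel_feats = rng.random((H, W, N))            # "video net": per-pixel channel weights

def sound_at_pixel(y, x):
    """'Synthesizer': weight each audio channel by the pixel's visual
    features and sum, yielding the spectrogram assigned to that pixel."""
    w = pixel_feats[y, x]                          # shape (N,)
    return np.tensordot(w, channels, axes=(0, 0))  # shape (F, T)

spec = sound_at_pixel(7, 3)
print(spec.shape)  # (256, 64)
```

In the real system the masks and pixel features come from trained convolutional networks, and the per-pixel spectrogram is converted back to a waveform; the sketch only shows how vision and audio are tied together at the pixel level.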
For instance, certain harmonic frequencies correspond to the violin (and similar instruments), whereas rapid pulse-like patterns correspond to instruments like the xylophone.
This type of AI opens up many possibilities: one can listen to and edit the sound of a specific instrument simply by clicking on it in the video. It could also be integrated into robots to help them better interpret surrounding sounds, such as those of vehicles or animals.
Furthermore, the ability to alter individual instruments’ volumes could help engineers enhance the audio quality of old concert footage. Producers could use this AI to preview how a certain combination of instruments would sound; for example, an acoustic guitar could be swapped for an electric one.
In this study, the researchers demonstrated a system that can analyze and identify the sounds of over 20 common instruments. The network struggles to differentiate between subcategories of instruments, such as a tenor versus an alto saxophone. With more training data, however, it should be able to recognize many more instruments.