- The Facebook AI research team builds a universal music translation network.
- It replicates the audio it hears and plays it back in various styles, genres and instruments.
- It can process unheard musical sources, like claps or whistles, and produce high-quality audio.
When it comes to music, humans have always been creative in replicating songs and turning them into other forms by clapping, whistling, or playing them on different instruments.
Although music was one of the first areas to be digitized and processed by computers and algorithms, today’s artificial intelligence is still far inferior to humans at mimicking audio.
Now the Facebook AI research team has developed a universal music translation network that can convert music from one form to another. It replicates the music it hears and plays it back in different styles, genres and instruments.
How Did They Do It?
This AI system is based on two recent techniques:
- Synthesizing high-quality audio with auto-regressive models
- Translating between domains in an unsupervised manner
The auto-regressive models are trained as decoders and can produce high-quality, realistic audio. The second technique is what makes the approach practical: a supervised formulation would require a large dataset of paired samples across numerous musical instruments.
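The core of autoregressive audio synthesis is that each output sample is conditioned on everything generated so far. The minimal sketch below illustrates that loop in plain Python; `toy_decoder` is a made-up stand-in for a trained WaveNet-style neural decoder, not the actual model.

```python
import numpy as np

def autoregressive_generate(predict_next, seed, n_samples):
    """Generate audio one sample at a time; each new sample is
    conditioned on the full sequence generated so far."""
    samples = list(seed)
    for _ in range(n_samples):
        samples.append(predict_next(samples))
    return np.array(samples)

# Toy stand-in for a trained neural decoder: a damped echo of the
# last two samples (a real model predicts a distribution per sample).
def toy_decoder(history):
    return 0.5 * history[-1] + 0.25 * history[-2]

audio = autoregressive_generate(toy_decoder, seed=[1.0, 0.5], n_samples=100)
```

The sequential dependence is also why such decoders are slow to sample from: each of the thousands of samples per second must wait for the previous one.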
The researchers developed a single universal encoder and applied it to every input. This removes the burden of retraining the entire network and enables the conversion of previously unheard musical domains into any of the domains encountered during training.
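The resulting shape of the system can be sketched as one shared encoder feeding a set of per-domain decoders. The class and the tiny lambda "networks" below are hypothetical stand-ins meant only to show the wiring, not the paper's actual architecture.

```python
# One shared encoder feeds a separate decoder per output domain.
# Adding a new output domain means adding (and training) only its
# decoder; the shared encoder is left untouched.

class UniversalTranslator:
    def __init__(self, encoder):
        self.encoder = encoder   # shared across all domains
        self.decoders = {}       # one decoder per output domain

    def add_domain(self, name, decoder):
        self.decoders[name] = decoder

    def translate(self, audio, target_domain):
        latent = self.encoder(audio)             # domain-independent code
        return self.decoders[target_domain](latent)

# Toy stand-ins: the encoder halves the signal, decoders rescale it.
t = UniversalTranslator(encoder=lambda x: [v * 0.5 for v in x])
t.add_domain("piano",  lambda z: [v * 2.0 for v in z])
t.add_domain("violin", lambda z: [v * 4.0 for v in z])
result = t.translate([1.0, 2.0], "violin")
```

Because the latent code is shared, any input the encoder can handle can in principle be routed to any registered decoder.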
Network architecture | Domain confusion is applied only during training
They trained the universal encoder with a domain confusion network, ensuring that domain-specific information isn’t encoded. The universal encoder doesn’t memorize the input data, but encodes it in a semantic manner. To enforce this, the researchers distorted the input signal with random local pitch modulation.
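One simple way to approximate such a distortion is to resample short segments of the waveform by small random factors, which locally stretches or compresses the signal and shifts its pitch. The function below is an illustrative augmentation of that kind, not the paper's exact implementation; the segment length and shift range are arbitrary choices.

```python
import numpy as np

def random_local_pitch_distort(signal, seg_len=2048, max_shift=0.1, rng=None):
    """Distort audio by resampling short segments with random factors,
    approximating a random local pitch modulation (illustrative only)."""
    rng = rng or np.random.default_rng(0)
    out = []
    for start in range(0, len(signal), seg_len):
        seg = signal[start:start + seg_len]
        factor = 1.0 + rng.uniform(-max_shift, max_shift)
        new_len = max(1, int(round(len(seg) * factor)))
        # Linear-interpolation resampling stretches or compresses the
        # segment, shifting its local pitch up or down.
        idx = np.linspace(0, len(seg) - 1, new_len)
        out.append(np.interp(idx, np.arange(len(seg)), seg))
    return np.concatenate(out)

# A 440 Hz tone at 16 kHz, distorted segment by segment.
tone = np.sin(2 * np.pi * 440 * np.arange(8192) / 16000)
distorted = random_local_pitch_distort(tone)
```

Since the encoder only ever sees distorted audio but is scored on reconstructing the original, it cannot succeed by memorizing raw input samples.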
Since the network is trained as a denoising auto-encoder, it is capable of recovering the undistorted form of the original input signal. The system gradually learns to project out-of-domain input signals onto the appropriate output domain.
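The denoising objective itself can be shown with a toy model: distort the input, then penalize reconstruction error against the clean original. A single linear map stands in for the real WaveNet-style decoder here, purely to keep the sketch dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.eye(8) + 0.5 * rng.normal(size=(8, 8))  # toy "network", far from ideal
W0 = W.copy()                                  # kept only for comparison

for step in range(500):
    clean = rng.normal(size=8)                 # original signal
    noisy = clean + 0.1 * rng.normal(size=8)   # stand-in for the distortion
    recon = W @ noisy                          # attempted reconstruction
    err = recon - clean                        # loss is against the CLEAN signal
    W -= 0.05 * np.outer(err, noisy)           # SGD step on the squared error
```

The key point is in the `err` line: the target is the undistorted signal, so the model must learn to undo the corruption rather than copy its input.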
The researchers trained their network on six domains of classical music, comprising thousands of samples from those domains. They used the cuDNN-accelerated PyTorch deep learning framework on eight NVIDIA Tesla V100 GPUs, and fully training the network took eight days.
The AI is not as good as professional musicians, but listeners often found it difficult to tell which clip was the original audio and which was artificially generated.
The system can effectively process unheard musical sources, like claps or whistles, and produce high-quality audio. New musical instruments can be integrated without retraining the complete network.
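Integrating a new instrument then amounts to training a fresh decoder against the frozen shared encoder. In the deliberately minimal sketch below the "decoder" is a single trainable gain and the encoder a fixed halving, both made-up stand-ins; only the decoder's parameter is ever updated.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(x):
    """Stand-in for the trained universal encoder; its weights
    are never touched while the new decoder is trained."""
    return 0.5 * x

gain = 0.1  # the new instrument's decoder: one trainable parameter here
for _ in range(500):
    x = rng.normal(size=16)                 # audio from the new instrument
    z = frozen_encoder(x)                   # encoder is used, not updated
    err = gain * z - x                      # reconstruction error
    gain -= 0.05 * float(np.mean(err * z))  # update ONLY the decoder
```

Because the encoder's latent space is already domain-independent, the new decoder only has to learn the mapping from that space back to its own instrument's sound.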
According to the developers, their work may open new doors to other complex tasks, such as automatic composition and transcription of music. Moreover, the decoders can be made more ‘creative’ by decreasing the size of the latent space, which yields natural-sounding outputs whose association with the original audio is lost.