- A new deep-learning-based system can show individuals mirroring the moves of their favorite dance stars.
- The algorithm doesn’t require any expensive 3D or motion capture data to generate high-quality videos.
Artificial Intelligence (AI) is changing everything, from the way we interact with electronic devices to space exploration. It has become an essential part of the technology industry, and there is no going back.
Recently, researchers at the University of California presented a deep-learning-based algorithm that can transfer the motion of a person dancing in one video to another video, making any target subject (an amateur) look like a professional dancer.
The algorithm follows a simple approach: ‘do as I do’. It needs only a few minutes of video of the target subject performing standard moves. So get excited, because now you can turn yourself into a world-class ballerina or a pop star like Michael Jackson.
How Does It Work?
To perform frame-by-frame motion transfer between two video subjects, the developers needed a mapping between images of the two individuals (source and target). They found that a keypoint-based pose, which encodes body position but not appearance, can serve as an intermediate representation between the two subjects.
They therefore designed an intermediate representation that looks like a pose stick figure, and extracted a pose stick figure from each frame of the target video using a pose detection algorithm.
To transfer motion from source to target, they fed the source's pose stick figures to the trained model, which produced images of the target in the same poses as the source. They further combined these two modules to improve the quality of the results.
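The stick-figure idea can be illustrated with a toy rasterizer: 2D keypoints are connected by line segments and drawn onto a blank canvas, so the resulting image encodes only pose, not appearance. The limb list and function below are illustrative, not the paper's actual rendering code:

```python
import numpy as np

# Hypothetical limb list: pairs of keypoint indices joined by a line segment.
LIMBS = [(0, 1), (1, 2), (2, 3)]

def draw_stick_figure(keypoints, height, width):
    """Rasterize 2D keypoints into a stick-figure image by drawing a
    line segment per limb; the result encodes pose but no appearance."""
    canvas = np.zeros((height, width), dtype=np.uint8)
    for a, b in LIMBS:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        # Sample enough points along the segment to leave no gaps.
        steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        for t in np.linspace(0.0, 1.0, steps):
            x = int(round(x0 + t * (x1 - x0)))
            y = int(round(y0 + t * (y1 - y0)))
            if 0 <= y < height and 0 <= x < width:
                canvas[y, x] = 255
    return canvas
```

A real system would draw thicker, color-coded limbs, but the principle is the same: the generator only ever sees these abstract figures, never the source person's pixels.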
Overall, the task is divided into three stages:
- Pose detection
- Global pose normalization
- Mapping pose stick figures to the target
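Of these stages, global pose normalization is a simple coordinate transform: the source keypoints are rescaled and translated so that the source subject's height and floor position match the target's. A minimal sketch follows; the exact statistics the paper computes (e.g. how heights and ankle positions are estimated per frame) may differ:

```python
import numpy as np

def normalize_pose(src_kp, src_ankle_y, src_height, tgt_ankle_y, tgt_height):
    """Globally normalize source keypoints into the target's frame:
    scale by the ratio of subject heights, then translate so the
    ankle (floor) positions line up."""
    scale = tgt_height / src_height
    kp = src_kp.astype(float).copy()
    # Vertical: scale about the source ankle line, re-anchor at target's.
    kp[:, 1] = tgt_ankle_y + scale * (kp[:, 1] - src_ankle_y)
    # Horizontal: scale the spread about the figure's own center.
    cx = kp[:, 0].mean()
    kp[:, 0] = cx + scale * (kp[:, 0] - cx)
    return kp
```

Without this step, a tall source dancer mapped onto a short target would produce stick figures the target's generator never saw during training.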
To obtain temporally smooth results, they conditioned on the current frame’s pose stick figure together with the previously synthesized frame. This allowed them to significantly reduce jittering in the outputs. For lower-frame-rate videos, they applied median smoothing to the keypoints over time, whereas for higher-frame-rate videos (120 fps) they used Gaussian smoothing.
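Both keypoint-smoothing variants are standard temporal filters. The sketch below (NumPy only, kernel sizes chosen arbitrarily) treats the keypoints of a video as a `(frames, keypoints, 2)` array and filters each coordinate along the time axis:

```python
import numpy as np

def gaussian_smooth(keypoints, sigma=2.0):
    """Smooth each keypoint coordinate over time with a Gaussian kernel
    (the variant used for high-frame-rate, e.g. 120 fps, videos)."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    n_frames = keypoints.shape[0]
    flat = keypoints.reshape(n_frames, -1).astype(float)
    # Edge-pad in time so the first/last frames are handled sensibly.
    padded = np.pad(flat, ((radius, radius), (0, 0)), mode="edge")
    out = np.empty_like(flat)
    for j in range(flat.shape[1]):
        out[:, j] = np.convolve(padded[:, j], kernel, mode="valid")
    return out.reshape(keypoints.shape)

def median_smooth(keypoints, window=5):
    """Median-filter keypoints over time (the variant for
    lower-frame-rate videos); robust to single-frame detector spikes."""
    r = window // 2
    padded = np.pad(keypoints, ((r, r), (0, 0), (0, 0)), mode="edge")
    return np.stack([np.median(padded[i:i + window], axis=0)
                     for i in range(keypoints.shape[0])])
```

The median filter suits noisy low-frame-rate detections because it discards outlier frames entirely, while the Gaussian filter gives smoother trajectories when frames are densely sampled.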
Moreover, to produce high-quality images with sharp details, Generative Adversarial Networks (GANs) were added to the setup.
The conditional generative adversarial networks are trained on videos of amateur dancers performing a variety of poses, captured at 120 fps. Each subject contributed at least 20 minutes of video.
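The underlying pix2pixHD architecture trains with a least-squares GAN objective. A minimal sketch of just those two loss terms (the full system adds further terms, such as feature-matching losses, which are omitted here):

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss for a conditional GAN:
    push outputs on real (pose, image) pairs toward 1 and outputs
    on generated pairs toward 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Generator loss: push the discriminator's outputs on
    generated images toward 1 (i.e. fool the discriminator)."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)
```

"Conditional" here means both networks see the pose stick figure alongside the image, so the discriminator judges not just realism but whether the image matches the requested pose.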
The developers used NVIDIA GeForce GTX 1080 Ti and TITAN Xp GPUs with the PyTorch deep learning framework and CUDA acceleration for both training and inference. The image translation algorithm is based on the pix2pixHD architecture designed by NVIDIA.
The algorithm is capable of generating videos where motions are transferred between a wide range of video subjects, without requiring any expensive 3D or motion capture data.
However, the system isn’t perfect yet. Despite the temporal-coherence module and keypoint pre-smoothing, the outputs often suffer from jittering. Errors mostly arise when the speed of motion at inference time doesn’t match the movements observed during training.
To address these issues, the researchers are currently working on pose estimation techniques that are better optimized for motion transfer.