- Given an image of a person, the new neural network can generate images of that person in different poses.
- It works by breaking the complex problem into smaller subtasks and training them together as a single generative neural network.
Humans have an incredible power of imagination. If we can picture someone in our head, we can just as easily picture them doing different activities. Machines, so far, have lacked this ability.
Now, MIT engineers have taken a step to change this: they have built an AI (artificial intelligence) that looks at an image of a person doing an activity like yoga, and then assigns a new pose to that person.
The system can perform this task across different activities. For instance, it can create a goofy image of someone swinging a badminton racket on a cricket field. The AI can even take a single photo and generate a video of a particular action, which is notable because the system was not explicitly trained to do so.
How Did They Develop This?
To make the image look real, the system must retain the background and the person’s original appearance, and render the body parts consistently with the new posture. It’s not easy to create a new body configuration while keeping shading and edges correct.
A change of pose causes complex changes in image space, such as multiple moving parts and self-occlusions. Background pixels that become uncovered must also be filled with suitable content. This can be tricky, especially when shadows, occlusions, and visual gaps in the source image should not appear in the target pose.
To deal with these problems, the researchers trained a supervised learning model on thousands of photos and their poses. It takes a source image and a two-dimensional target pose as input, and produces the final photo.
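The article does not say how the two-dimensional target pose is fed to the network. One common encoding, assumed here purely for illustration, is a stack of Gaussian heatmaps with one map per joint. The function name, the `(row, col)` joint format, and the `sigma` value below are all hypothetical.

```python
import math

def pose_to_heatmaps(joints, height, width, sigma=1.5):
    """Encode 2D joints as one Gaussian heatmap per joint.

    joints: list of (row, col) pixel coordinates (assumed format).
    Returns a list of height x width grids (nested lists of floats).
    """
    heatmaps = []
    for joint_row, joint_col in joints:
        grid = [
            [
                math.exp(
                    -((r - joint_row) ** 2 + (c - joint_col) ** 2)
                    / (2 * sigma ** 2)
                )
                for c in range(width)
            ]
            for r in range(height)
        ]
        heatmaps.append(grid)
    return heatmaps

# Toy example: two joints on an 8x8 canvas.
target_pose = [(2, 3), (5, 6)]
maps = pose_to_heatmaps(target_pose, height=8, width=8)
```

Each map peaks at its joint and fades smoothly with distance, which gives a network a spatial signal aligned with the image grid rather than a bare list of coordinates.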
Network architecture | Credit: MIT CSAIL
The key idea is to break the complex problem into a series of simpler ‘subtasks’, which vary from one image to another, and train them together as a single generative neural network. Until now, this kind of research has relied on the network to merge motion and appearance data from the input frame.
The system divides the source image into two groups: a background layer and several foreground layers, each associated with a different body part. Splitting the foreground into multiple layers lets the body parts be moved spatially to their target positions.
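The layer split can be pictured with a toy example. In the system described here the network performs the division itself; the sketch below instead assumes the per-part masks are handed in, and works on a tiny grayscale image stored as nested lists. The function and variable names are hypothetical.

```python
def split_layers(image, part_masks):
    """Split a grayscale image into per-part foreground layers and a background.

    image: nested list of floats (H x W).
    part_masks: list of H x W masks with values in [0, 1], one per body part
                (assumed to be given; the real system predicts them).
    """
    height, width = len(image), len(image[0])
    foreground_layers = [
        [[image[r][c] * mask[r][c] for c in range(width)] for r in range(height)]
        for mask in part_masks
    ]
    # The background keeps whatever no body part claims, clamped at zero.
    background_mask = [
        [
            max(0.0, 1.0 - sum(mask[r][c] for mask in part_masks))
            for c in range(width)
        ]
        for r in range(height)
    ]
    background = [
        [image[r][c] * background_mask[r][c] for c in range(width)]
        for r in range(height)
    ]
    return foreground_layers, background

image = [[0.2, 0.4], [0.6, 0.8]]
arm_mask = [[1.0, 0.0], [0.0, 0.0]]
leg_mask = [[0.0, 0.0], [0.0, 1.0]]
layers, background = split_layers(image, [arm_mask, leg_mask])
```

After the split, each body part lives in its own layer and can be moved without dragging its neighbours or the background along.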
Then the body parts that have different positions in the target pose are adjusted and fused to synthesize a new foreground layer, while the background is separately filled in with suitable content.
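As a rough illustration of moving and fusing body-part layers, the sketch below shifts each layer by a whole number of pixels and merges the results by taking the brightest value at each pixel. Both choices are simplifying assumptions; the article does not specify which transformation or fusion rule the network uses, and the helper names are hypothetical.

```python
def shift_layer(layer, dy, dx):
    """Translate a layer by (dy, dx) pixels, filling vacated pixels with 0."""
    height, width = len(layer), len(layer[0])
    shifted = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            new_r, new_c = r + dy, c + dx
            if 0 <= new_r < height and 0 <= new_c < width:
                shifted[new_r][new_c] = layer[r][c]
    return shifted

def fuse_layers(layers):
    """Fuse layers into one foreground by taking the per-pixel maximum."""
    height, width = len(layers[0]), len(layers[0][0])
    return [
        [max(layer[r][c] for layer in layers) for c in range(width)]
        for r in range(height)
    ]

torso = [[0.0, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.0]]
arm = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
# Hypothetical target pose: the torso stays put, the arm moves up one row.
moved = [shift_layer(torso, 0, 0), shift_layer(arm, -1, 0)]
foreground = fuse_layers(moved)
```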
Finally, the background and foreground layers are joined to form the target image. All of these tasks are implemented as a single network and trained jointly, using only a single target image as the supervised label.
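Joining the two layers can be thought of as a per-pixel blend controlled by a foreground mask. The sketch below uses standard alpha compositing on toy grayscale grids; it is an assumption about how such a join might look, not a description of the researchers' exact operation, and the names are hypothetical.

```python
def composite(foreground, background, mask):
    """Blend per pixel: mask * foreground + (1 - mask) * background."""
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]

foreground = [[1.0, 1.0], [1.0, 1.0]]
background = [[0.0, 0.25], [0.5, 0.75]]
# 1.0 = pure foreground, 0.0 = pure background, 0.5 = soft edge.
mask = [[1.0, 0.0], [0.5, 0.0]]
target = composite(foreground, background, mask)
```

A soft mask value such as 0.5 mixes the two layers, which is one way to avoid hard, cut-out edges around the person.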
Researchers have demonstrated the network on images captured from more than 250 YouTube videos of people doing workouts/yoga and playing golf/tennis.
The results show that the network can accurately transfer and reconstruct the given poses. Given a series of poses and a source image, it can even generate a temporally coherent video portraying an action.
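One simple way to read the video result is that the same source image is paired with each pose in the sequence, producing one frame per pose. The sketch below assumes exactly that. The `mark_joints` function is a hypothetical stand-in for the trained network, which is not reproduced here.

```python
def generate_video(source_image, pose_sequence, synthesize):
    """Produce one frame per target pose, always from the same source image."""
    return [synthesize(source_image, pose) for pose in pose_sequence]

def mark_joints(source_image, pose):
    """Stand-in for the trained network (assumption): copy the image and
    paint each joint of the target pose as a bright pixel."""
    frame = [row[:] for row in source_image]
    for r, c in pose:
        frame[r][c] = 1.0
    return frame

source = [[0.0] * 4 for _ in range(3)]
# A single "hand" joint sweeping left to right across three poses.
poses = [[(1, 0)], [(1, 1)], [(1, 2)]]
frames = generate_video(source, poses, mark_joints)
```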
The system still has several limitations: it is not yet advanced enough to regenerate some elements of the original images, and it often struggles with the subtle details of faces, hands and backgrounds. Moreover, the system fails to recreate some activities (like figure skating and dancing) when people turn their backs to the camera.
The researchers plan to extend the system to focus explicitly on creating videos, and to explore what can be done with three-dimensional poses.
According to the researchers, upcoming versions of such networks could have several tangible uses. For example, they could help players visualize themselves using the correct form, and help self-driving vehicles predict future actions from various angles.