- Facebook is developing models to precisely detect and track the movements of different parts of the body (not just the face) in real time.
- The full body tracking and segmentation technology is based on the Mask R-CNN framework.
- They’ve developed a lighter version of the framework that can be efficiently implemented on mobile processors.
Enabling phones and tablets to run augmented reality (AR) applications well has been a major focus for many tech giants, including Apple and Google. Most of them have released tools that make it easier for developers to build AR features.
Not to be left behind, the Facebook AI Camera Team is working in the same field, aiming to develop creative tools that help people express themselves in new ways. Current applications already include a real-time face tracker that lets users apply filters, add makeup, create animoji, or even replace their face with an avatar.
But what if you could create and share “full body custom animated characters”, replacing not just your face but your whole body with an avatar? It is no secret that Facebook is researching both virtual reality and augmented reality for entertainment and communication. This time, they aim to take things a step further by modifying and replacing your entire body.
Basic Requirements
To move from the face to the entire body, the first requirement is to precisely detect and track the movements of different parts of the body in real time. That is harder than it sounds, because people vary widely in pose and identity.
For instance, a person might be wearing shorts or a long coat, and is often partly hidden by objects or other people in the scene. These factors make robust body tracking difficult when the only sensor is a phone or tablet camera.
So far, the team has built a system that can precisely detect human body poses (in the foreground) and segment a person from the background. The lightweight system (only a few MB) is still in development, and it runs on mobile devices in real time. In the near future, it will enable numerous new applications such as using gestures to control games, making body masks, or de-identifying people.
Mask R-CNN2Go Architecture
The full body tracking and segmentation technology is based on the Mask R-CNN framework. It is a simple and flexible framework that detects the objects in a picture and, at the same time, produces a high-quality segmentation mask for each instance.
Mobile devices have limited storage and computing power compared to GPU servers. The original Mask R-CNN model, which is based on ResNet, is too large and too slow to run on mobile devices. The researchers therefore chose to develop a lighter version that can be implemented efficiently on mobile processors.
To do that, they reduced the size of the model by tuning the number of convolution layers and the width of each layer, since the convolution layers take most of the processing time.
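To see why depth and width are the natural knobs to turn, it helps to count parameters. The sketch below is only an illustration: the layer widths are hypothetical and are not the actual Mask R-CNN2Go configuration, and it models the trunk as a plain stack of 3x3 convolutions rather than a ResNet.

```python
def conv_params(in_channels, out_channels, kernel_size=3):
    """Weights plus biases of one standard 2D convolution layer."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + out_channels


def trunk_params(widths, in_channels=3, kernel_size=3):
    """Total parameters of a plain stack of convolutions.

    `widths` lists the output channels of each layer, so its length is the
    depth of the stack and its values are the layer widths.
    """
    total = 0
    for out_channels in widths:
        total += conv_params(in_channels, out_channels, kernel_size)
        in_channels = out_channels
    return total


# Hypothetical configurations, chosen only to show how the count scales.
wide = [64, 128, 256, 256]
narrow = [32, 64, 128, 128]   # same depth, half the width
shallow = [64, 128, 256]      # same widths, one layer fewer

for name, widths in [("wide", wide), ("narrow", narrow), ("shallow", shallow)]:
    print(name, trunk_params(widths))
```

Because each layer's weight count is the product of its input and output channels, halving every width cuts the total by roughly a factor of four, while removing a layer only removes that layer's share. The multiply-add cost per pixel scales the same way, which is why narrowing layers is an effective way to speed up a model on a phone.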
Mask R-CNN extends Faster R-CNN by adding a branch that predicts an object mask in parallel with the existing branch for bounding box recognition. It is simple to train and easy to generalize to other tasks. Mask R-CNN adds only a small overhead to Faster R-CNN, running at 5 frames per second.
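The mask branch produces a small, fixed-size soft mask for each detected box. In the Mask R-CNN paper, that mask is resized to the size of the box and binarized at a threshold of 0.5. The sketch below illustrates that post-processing step in plain Python. The function name `paste_mask` and the nearest-neighbour resize are simplifying assumptions made for this example, not the framework's actual code.

```python
def paste_mask(soft_mask, box, image_height, image_width, threshold=0.5):
    """Resize a small per-instance soft mask to its box and binarize it.

    soft_mask: m x m list of floats in [0, 1] from the mask branch.
    box: (x0, y0, x1, y1) integer pixel box, with x1 and y1 exclusive.
    Returns an image-sized grid of 0/1 values.
    """
    x0, y0, x1, y1 = box
    mask_h, mask_w = len(soft_mask), len(soft_mask[0])
    box_h, box_w = y1 - y0, x1 - x0
    full = [[0] * image_width for _ in range(image_height)]
    # Clip the box to the image so partly visible people are handled.
    for y in range(max(y0, 0), min(y1, image_height)):
        # Nearest-neighbour lookup into the low-resolution mask.
        src_y = min((y - y0) * mask_h // box_h, mask_h - 1)
        for x in range(max(x0, 0), min(x1, image_width)):
            src_x = min((x - x0) * mask_w // box_w, mask_w - 1)
            if soft_mask[src_y][src_x] >= threshold:
                full[y][x] = 1
    return full


# Toy 2 x 2 mask whose left column is "person" and right column is not.
toy_mask = [[0.9, 0.2],
            [0.8, 0.1]]
person = paste_mask(toy_mask, box=(1, 1, 5, 5), image_height=6, image_width=6)
for row in person:
    print("".join(str(v) for v in row))
```

The resulting binary grid is what an application would use to separate a person from the background, for example to swap the background or to draw a body mask.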
The lightweight Mask R-CNN has 5 major modules –
- Trunk Model – has multiple convolution layers that create deep feature representations of an image.
- Region Proposal Network – suggests candidate objects at predefined aspect ratios. The features extracted from each object bounding box are sent to the detection head.
- Detection Head – decides whether each object is a person, and produces a bounding box for every person in the given picture.
- Key Point Head (KPH) and Segmentation Head (SH) – take the features that an ROI-Align layer extracts from the bounding boxes produced by the detection head. The KPH has an architecture similar to the SH: it predicts a mask for each predefined key point on the body, and a single maximum sweep generates the final coordinates.
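The last step of the key point head can be sketched in a few lines. The code below reads the "single maximum sweep" as one pass over each key point's score grid that keeps the highest-scoring cell. That reading, the function name `keypoints_from_heatmaps`, and the key point names are assumptions made for illustration, not details confirmed by Facebook.

```python
def keypoints_from_heatmaps(heatmaps, box):
    """Pick the peak of each key point grid and map it to image coordinates.

    heatmaps: dict of key point name -> h x w grid of scores for one person.
    box: (x0, y0, x1, y1) bounding box of that person in image pixels.
    Returns a dict of name -> (x, y, score).
    """
    x0, y0, x1, y1 = box
    result = {}
    for name, grid in heatmaps.items():
        h, w = len(grid), len(grid[0])
        best_score, best_row, best_col = grid[0][0], 0, 0
        # One pass over the grid, keeping the maximum seen so far.
        for row in range(h):
            for col in range(w):
                if grid[row][col] > best_score:
                    best_score, best_row, best_col = grid[row][col], row, col
        # Map the centre of the winning cell back into the person's box.
        x = x0 + (best_col + 0.5) * (x1 - x0) / w
        y = y0 + (best_row + 0.5) * (y1 - y0) / h
        result[name] = (x, y, best_score)
    return result


# Toy 2 x 2 grids for two hypothetical key points of one detected person.
toy_heatmaps = {
    "nose": [[0.1, 0.2],
             [0.9, 0.3]],
    "left_wrist": [[0.0, 0.7],
                   [0.1, 0.2]],
}
points = keypoints_from_heatmaps(toy_heatmaps, box=(10, 20, 30, 60))
for name, (x, y, score) in points.items():
    print(name, x, y, score)
```

Because the sweep is a single linear pass per key point with no iterative refinement, it adds very little work on top of the network itself, which suits a mobile processor.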
Reference: arXiv | arXiv:1703.06870 | Facebook
Modular Design – Low Power
The core framework has been optimized to run deep learning algorithms in real time. Using GPU and CPU libraries such as SNPE, Metal and NNPack allowed the engineers to improve computational speed on mobile devices. All of this is done with a modular design, without altering the standard model definition.
For now, the Facebook AI Camera Team is focusing on new model architectures that will lead to more efficient designs, fit mobile processors better, and consume less battery power.