- New convolutional network learns to copy colors from one reference frame to subsequent frames.
- While doing so, it can follow different objects and track through occlusions.
- It can also track human poses.
Teaching machines to track objects in a video is one of the most difficult tasks in computer vision, largely because it requires huge labeled training datasets for tracking. Of course, recording and labeling everything that happens on Earth would be impractical.
That’s why it’s necessary to build a system that learns to track without human supervision, instead utilizing an enormous amount of raw, unlabeled clips. Why does it matter so much, you ask? Well, tracking objects in videos is useful for numerous applications, such as object interaction, activity recognition, video stylization, and much more.
Now, researchers at Google have developed a convolutional network that learns to copy colors from a single reference frame. Instead of estimating colors directly from a grayscale frame, the model is constrained to copy colors from the first frame of the video.
In order to copy the right colors, the network needs to learn to internally point to the right region. This new model can follow different objects and track through occlusions without being trained on large labeled datasets.
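The pointing mechanism can be sketched as a soft attention operation: each target pixel compares its learned feature to the features of the reference-frame pixels and copies a similarity-weighted average of their colors. Below is a minimal, hypothetical NumPy sketch of that idea; the function name, array shapes, and the plain dot-product similarity are simplifying assumptions, not the paper's exact architecture.

```python
import numpy as np

def copy_colors(ref_feats, tgt_feats, ref_colors):
    """Copy colors from a reference frame to a target frame by soft attention.

    ref_feats:  (N, D) embeddings of reference-frame pixels (computed from grayscale input)
    tgt_feats:  (M, D) embeddings of target-frame pixels
    ref_colors: (N, C) colors of the reference-frame pixels

    Hypothetical simplification: each target pixel "points" at reference
    pixels via a softmax over feature similarities and takes a weighted
    average of their colors.
    """
    sim = tgt_feats @ ref_feats.T                  # (M, N) pairwise similarities
    sim -= sim.max(axis=1, keepdims=True)          # subtract row max for numerical stability
    weights = np.exp(sim)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over reference pixels
    return weights @ ref_colors                    # (M, C) predicted colors
```

If the features of a target pixel closely match one reference pixel, the softmax concentrates on it and the network effectively "copies" that pixel's color, which is what forces it to point correctly.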
To develop this artificial intelligence system, the researchers leveraged the temporal coherency of color, which provides abundant training data for teaching a convolutional network to track specific regions in a video. There are exceptional cases when color isn’t temporally coherent, for instance, a light switching on instantly. In general, however, colors remain stable over time.
Predicted colors from colorized single frame reference | Credit: Google
First, the video is decolorized, and the network then recolorizes it by copying colors from the reference frame. Because a scene may contain several objects of the same color, the network must learn to follow particular regions or objects in order to copy the right colors, and this is how the machine learns to track.
The researchers used the Kinetics dataset (half a million video clips depicting everyday activities) to train their model. They converted every video frame except the first to grayscale and trained the network to predict the original colors in the subsequent frames.
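The data preparation step described above can be sketched as follows. This is a hypothetical helper, assuming frames arrive as RGB float arrays and using the standard Rec. 601 luma weights for grayscale conversion; the real pipeline also quantizes colors into discrete classes before training.

```python
import numpy as np

def make_training_pair(frames):
    """Prepare one training example: keep the first frame's colors as the
    reference, convert every frame to grayscale as network input, and use
    the later frames' original colors as prediction targets.

    frames: (T, H, W, 3) float array of RGB video frames in [0, 1].
    """
    # Rec. 601 luma weights for RGB -> grayscale
    gray = frames @ np.array([0.299, 0.587, 0.114])  # (T, H, W)
    ref_colors = frames[0]      # colors the network may copy from
    target_colors = frames[1:]  # colors it must reconstruct
    return gray, ref_colors, target_colors
```

The key point is that the supervision signal (the target colors) comes for free from the video itself, so no human labeling is required.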
To copy the original colors from a single frame, the convolutional network learned to internally point to the right regions. This forces the network to learn an explicit pointing mechanism, which can then be repurposed for object tracking.
The network tracks object without supervision | Credit: Google
Even though the model is never trained with labeled object identities, it learns to track any object or visual region in a video given only a single (first) frame. It can track a single point or an outlined entity in the video.
To turn video colorization into tracking, the researchers made only one change: instead of propagating colors throughout the clip, they propagate labels representing the target regions.
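Because colors and labels are propagated by the same mechanism, this switch amounts to reusing the attention weights learned for colorization and multiplying them against labels instead of colors. A minimal sketch, under the assumptions of the earlier simplification (the function and shapes are illustrative, not the paper's exact interface):

```python
import numpy as np

def propagate_labels(weights, ref_labels):
    """Reuse the colorization attention weights to propagate a segmentation
    mask instead of colors.

    weights:    (M, N) softmax attention from target pixels to reference
                pixels, i.e. the same weights used to copy colors
    ref_labels: (N, K) one-hot region labels on the first frame

    Returns soft label distributions for the target pixels; taking the
    argmax over K gives the tracked region for each pixel.
    """
    return weights @ ref_labels
```

No retraining is needed: the network was trained only on colorization, and labels simply ride along the pointers it already learned.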
Tracking movements of human skeleton | Credit: Google
The network is also capable of tracking human poses: it requires only an initial frame labeled with key-points and does the rest of the work. However, predicting key-points in the following frames is harder than it sounds, because it requires fine-grained localization of each key-point as the people in the video deform.
The researchers demonstrated the network’s pose-tracking ability on the JHMDB dataset (a fully annotated dataset of human poses and actions), where they tracked a human joint skeleton.
The network obtains performance similar to optical flow-based methods, indicating that it may be learning motion features. It learns to track human poses and video segments well enough to slightly outperform recent optical flow-based techniques.
The model isn’t perfect yet. In some experiments, it failed to colorize videos and track segments. Therefore, researchers plan to further improve the video colorization process, which may ultimately translate into enhanced self-supervised tracking.