- New deep learning algorithm allows editors to quickly colorize a whole video by colorizing one frame in the scene.
- It’s highly accurate, efficient and up to 50 times faster than previous methods.
Videos contain a great deal of redundant data between frames, and manually colorizing each black-and-white frame takes a vast amount of time. Such redundancies have been extensively exploited in video encoding and compression, but they are less explored in higher-level video processing tasks like colorizing a clip.
There are numerous algorithms (such as bilateral CNN models, similarity-guided filtering, and optical-flow-based warping) that process local relationships between consecutive frames to propagate data. They use either apparent motion or pre-designed pixel-level features to model the similarities among frames and pixels.
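As a concrete illustration of the flow-based-warping family of baselines, here is a minimal sketch (not code from any of the cited methods): given a flow field that tells each target pixel where it came from in the reference frame, colors are simply fetched along that flow. Function and variable names are ours.

```python
import numpy as np

def warp_colors(color_ref, flow):
    """Propagate colors from a reference frame to the next frame by
    warping along a (dy, dx) flow field -- the classic optical-flow
    baseline.  color_ref: (H, W, 3); flow: (H, W, 2) integer offsets
    pointing from each target pixel back into the reference frame."""
    h, w, _ = color_ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys + flow[..., 0], 0, h - 1)
    src_x = np.clip(xs + flow[..., 1], 0, w - 1)
    return color_ref[src_y, src_x]

# Toy example: the scene shifts one pixel to the right, so every target
# pixel fetches its color from one pixel to the left in the reference.
ref = np.zeros((4, 4, 3), dtype=np.uint8)
ref[:, 1] = [255, 0, 0]            # a red vertical stripe at x = 1
flow = np.zeros((4, 4, 2), dtype=int)
flow[..., 1] = -1                  # look one pixel left
warped = warp_colors(ref, flow)
# The red stripe now appears at x = 2 in the warped frame.
```

Because such warping is purely pixel-level, errors accumulate wherever the estimated motion is wrong, which motivates the learned approach described next.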
However, these algorithms have several limitations: for instance, they cannot express high-level relationships between frames and cannot accurately reflect the structure of the picture. To overcome these limitations, researchers at NVIDIA have developed a new deep-learning-based algorithm that enables editors to rapidly colorize a whole clip by colorizing a single frame in the scene.
How It Works
To explicitly learn high-level similarity between consecutive frames, the researchers developed a temporal propagation network, which consists of a propagation component that transfers the characteristics of one frame (such as color) to another. To do this, it uses a linear transformation matrix whose entries are predicted by a convolutional neural network (CNN).
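Stripped of the CNN that predicts the weights, the propagation step itself is plain linear algebra: each target pixel's color is a weighted combination of reference-pixel colors. The sketch below hard-codes a tiny weight matrix to show that core operation; the names and shapes are illustrative, not taken from the paper's code.

```python
import numpy as np

# Per-pixel (a, b) chroma values of the colorized key frame, flattened
# to one row per pixel.  In the real system these come from an editor's
# colorized frame; here they are toy values.
colors_ref = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [0.5, 0.5],
                       [0.2, 0.8]])

# G plays the role of the CNN-predicted linear transformation matrix:
# row i holds the weights with which target pixel i mixes the
# reference pixels.  Rows sum to 1 so colors stay in range.
G = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

# The propagation step: colors for the new frame in one matrix product.
colors_target = G @ colors_ref
# Target pixel 1 blends reference pixels 1 and 2 equally -> [0.25, 0.75].
```

In the actual network, a matrix like `G` is produced per frame pair by the CNN from the grayscale content, so the transfer adapts to image structure rather than being fixed.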
The CNN decides which colors should be transferred from the colorized frame and fills them in across the remaining black-and-white frames. How is this technique different from the others? Better colorization can be obtained through an interactive approach: the editor annotates a portion of a single image, and the network propagates those annotations into a finished product.
For learning propagation in the temporal domain, the researchers enforced two rules. First, the propagation between frames must be invertible. Second, the target element must be preserved throughout the whole process.
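One hedged way to read these two rules is as training penalties: a round-trip term for invertibility (propagating key → target → key should reproduce the key frame) and a reconstruction term for preservation (the propagated colors should match the ground truth). The sketch below expresses both as mean-squared errors over the toy linear-propagation matrices; the loss names and formulation are ours, not the paper's exact objective.

```python
import numpy as np

def roundtrip_loss(G_fwd, G_bwd, colors_key):
    """Invertibility rule: propagating forward and then backward
    should return the original key-frame colors."""
    recon = G_bwd @ (G_fwd @ colors_key)
    return float(np.mean((recon - colors_key) ** 2))

def preservation_loss(G_fwd, colors_key, colors_target_gt):
    """Preservation rule: the propagated target colors should stay
    close to the ground-truth target colors."""
    pred = G_fwd @ colors_key
    return float(np.mean((pred - colors_target_gt) ** 2))

# When the backward map is the exact inverse of the forward map, the
# round-trip loss vanishes.
G_fwd = np.array([[0.0, 1.0],
                  [1.0, 0.0]])      # forward map: swap two pixels
G_bwd = np.linalg.inv(G_fwd)       # backward map: the inverse swap
key = np.array([[1.0, 0.0],
                [0.0, 1.0]])
```

During training, minimizing such terms pushes the learned transformations toward being invertible and content-preserving, which is what the two rules demand.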
They showed that the proposed technique doesn't require any image-based segmentation method to achieve decent results comparable to existing state-of-the-art methodologies.
To train this network, the researchers used NVIDIA Titan Xp GPUs. It was trained on hundreds of clips from several datasets for high-dynamic-range, color, and mask propagation. The experiments use the ACT dataset, which contains 7,260 video sequences with approximately 600,000 frames.
Advantages of Proposed Technique
- High Accuracy: The new method achieves far better video quality as compared to previous works.
- High Efficiency: It executes in real time, running up to 50 times faster than previous methods, and it further improves efficiency by processing all video frames in parallel.
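The parallelism claim follows from the structure of the method: every frame is colorized directly from the key frame rather than from the previous output, so no frame waits on another. Under that assumption, the per-frame linear maps collapse into one batched matrix multiply, as this illustrative sketch shows (all names are ours):

```python
import numpy as np

# Toy dimensions: 8 frames, 16 pixels per frame, 2 chroma channels.
num_frames, num_pixels, chroma = 8, 16, 2
rng = np.random.default_rng(0)

colors_key = rng.random((num_pixels, chroma))        # colorized key frame
G_all = rng.random((num_frames, num_pixels, num_pixels))  # one map per frame

# Because the frames are independent, a single batched matmul
# colorizes all of them at once -- no sequential dependency.
colors_all = G_all @ colors_key        # shape: (num_frames, num_pixels, 2)

# Equivalent frame-by-frame loop, for comparison.
colors_seq = np.stack([G @ colors_key for G in G_all])
```

On a GPU, the batched form maps directly onto parallel hardware, which is where the reported speedup over sequential, frame-chained propagation comes from.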
The current technique offers a simple way to propagate data over time in clips. In the coming years, the researchers plan to explore how to incorporate high-level vision cues, such as tracking and semantic segmentation, into temporal propagation.