Google AI Can Create Short Video Clips From Two Still Images

  • Researchers develop a fully-convolutional deep network to extrapolate views beyond two input images.
  • They use a layered representation to capture surfaces hidden in the input images and to predict output views.
  • The system can handle both indoor and outdoor pictures. 

Over the past decade, photography has changed considerably, driven by better smartphone hardware and features such as synthetic defocus and high dynamic range imaging. These innovations have replicated capabilities of conventional cameras.

Smartphones now ship with new types of sensors, including depth sensors and multiple lenses, enabling applications that go well beyond conventional photography. While stereo cameras have been around for a while, dual-lens cameras with a small baseline have started appearing on the market. Some VR devices, built with dual cameras spaced about eye-distance apart, also capture stereo images and video.

Motivated by the rapid growth of these stereo cameras, researchers at Google have developed an artificial intelligence system that can create short video clips from two still images captured with VR, stereo, and dual-lens cameras, such as those on the iPhone 8 or X.

Google is already a leading player in AI research. In the last couple of years, it has published numerous notable studies, including a system that predicts heart disease by scanning your eyes, a voice AI described as indistinguishable from humans, an AI that creates another AI that beats human-written code, and a model that spotted an exoplanet in distant space.

How Did They Do This?

The researchers focused on extrapolating views beyond the two input pictures. The first challenge is handling pictures with transparency and reflections. The second is rendering occluded pixels.

To deal with these problems, they learned view extrapolation from a large amount of visual data using a deep learning system. The objective is to train a deep neural network to infer a global scene representation from which novel views of the same scene can be synthesized, extrapolating beyond the two given images.

First, they needed a scene representation that can be predicted from the input images and then reused to predict other output images. Second, they required a representation that can capture surfaces that are obstructed or hidden in both input images. To fulfill these conditions, they developed the MultiPlane Image, a layered representation.

The training data was extracted from the world’s most popular video streaming platform, YouTube. The researchers collected stereo pairs along with additional images taken a small distance away from each input stereo pair.

In short, the study includes three key elements:

  1. A learning framework for stereo magnification.
  2. MultiPlane Image, a layered representation to perform view synthesis.
  3. Online videos to learn view synthesis.

System overview | Courtesy of researchers

To infer the multiplane representation, they used a fully-convolutional deep network. As the system overview figure above shows, the network predicts an alpha image for each plane. The color image for each plane is blended from the reference source image and a predicted background image.
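The blending and layering described above can be sketched in a few lines of NumPy. This is only an illustration, not the researchers' code: the function names are hypothetical, and the use of a per-plane blend weight and the standard "over" compositing rule are assumptions about how the planes are combined.

```python
import numpy as np

def blend_plane_colors(reference, background, weights):
    """Mix the reference image and the predicted background for each plane.

    reference, background: (H, W, 3) arrays.
    weights: (D, H, W) array in [0, 1]; 1 means "use the reference image".
    Returns a (D, H, W, 3) array of per-plane color images.
    (The per-plane weight is an assumption made for this sketch.)
    """
    w = weights[..., None]
    return w * reference[None] + (1.0 - w) * background[None]

def composite_back_to_front(colors, alphas):
    """Alpha-composite planes ordered from farthest (index 0) to nearest."""
    out = np.zeros(colors.shape[1:], dtype=np.float64)
    for color, alpha in zip(colors, alphas):
        a = alpha[..., None]
        # Standard "over" operator: nearer plane covers what is behind it.
        out = color * a + out * (1.0 - a)
    return out
```

With two planes, for example, a fully opaque far plane that takes its color from the background and a half-transparent near plane that takes its color from the reference image yield an even mix of the two images.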

In the training phase, the network learns to predict a multiplane image representation, which is passed through a differentiable rendering module to reconstruct target views. In the testing phase, the multiplane image representation is inferred once for each scene and can then be reused to synthesize novel views with minimal computation.
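To see why rendering a new view from an already inferred representation is cheap, consider a simplified sketch. It assumes the camera only moves sideways and that every plane faces the camera, so each plane simply shifts by a number of pixels inversely proportional to its depth before the planes are layered. The function names, parameters, and whole-pixel shifts are illustrative assumptions, not the system's actual renderer.

```python
import numpy as np

def shift_horizontal(image, shift):
    """Shift an (H, W, C) array by `shift` pixels along the width, zero-filled."""
    out = np.zeros_like(image)
    w = image.shape[1]
    if shift == 0:
        out[:] = image
    elif 0 < shift < w:
        out[:, shift:] = image[:, :w - shift]
    elif -w < shift < 0:
        out[:, :w + shift] = image[:, -shift:]
    return out

def render_view(colors, alphas, depths, focal_px, camera_shift):
    """Render a view for a camera moved sideways by `camera_shift`.

    colors: (D, H, W, 3), alphas: (D, H, W), depths: length D,
    all ordered from farthest plane to nearest.
    focal_px and camera_shift are hypothetical parameters for this sketch.
    """
    out = np.zeros(colors.shape[1:], dtype=np.float64)
    for color, alpha, depth in zip(colors, alphas, depths):
        # Moving the camera right moves scene content left; near planes move more.
        shift = -int(round(focal_px * camera_shift / depth))
        c = shift_horizontal(color, shift)
        a = shift_horizontal(alpha[..., None], shift)
        out = c * a + out * (1.0 - a)
    return out
```

Calling render_view repeatedly with different camera_shift values produces the frames of a short clip without running the network again, which is the point of inferring the representation only once per scene.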

Reference: arXiv:1805.09817 | Google 

The researchers trained their system on more than 7,000 real estate YouTube videos, using NVIDIA Tesla P100 GPUs and the cuDNN-accelerated TensorFlow framework.


The view synthesis system based on multiplane images is trained on a large and varied dataset, and can handle both indoor and outdoor pictures.

The researchers claimed that their system performed better than previous techniques – it can efficiently magnify the narrow baseline (on the order of a centimeter) of stereo pictures shot by stereo cameras and smartphones.

However, the system has one drawback: it struggles to place the fine details of complex backgrounds at the correct depth.

The researchers believe that the system can generalize to a wide range of tasks, such as extrapolating from a single input image or from more than two, and producing light fields that enable view movement in multiple dimensions.

Written by
Varun Kumar
