Google’s New AI Can Create A Video With Just The Start & End Frames

  • The new 3D convolutional neural network can fill in the sequences between the start and end frame.
  • It uses a latent representation generator to produce a variety of video sequences.

Recent advances in artificial neural network architectures and generative adversarial networks have boosted the development of image and video synthesis methods. Most existing research focuses on two tasks: unconditional video generation and video prediction. Both involve generating or predicting plausible new videos from a limited number of past frames.

Recently, a research team at Google focused on the problem of creating diverse and plausible video sequences when only two frames (a start frame and an end frame) are available. This process, called inbetweening, is usually performed by training recurrent neural networks built on either gated recurrent units (GRUs) or long short-term memory (LSTM) cells.

However, in this study, the researchers have shown that inbetweening can instead be addressed with a 3D convolutional neural network. A major advantage of this method is its simplicity: because it uses no recurrent elements, the gradient paths are shorter, which enables deeper networks and more stable training.
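To see why a 3D convolution suits this task, note that its kernel slides jointly over time, height, and width, so a single layer mixes information across neighbouring frames without any recurrence. Below is a minimal, illustrative numpy implementation of one 3D convolution pass over a small video tensor; the shapes and kernel are arbitrary choices for the sketch, not values from the paper.

```python
import numpy as np

def conv3d(video, kernel):
    """Naive 'valid' 3D convolution (cross-correlation) over (T, H, W)."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # Each output value summarises a spatio-temporal block,
                # i.e. a few consecutive frames at once.
                out[t, i, j] = np.sum(video[t:t + kt, i:i + kh, j:j + kw] * kernel)
    return out

video = np.random.rand(16, 8, 8)     # 16 frames of 8x8 pixels
kernel = np.ones((3, 3, 3)) / 27.0   # averages a 3x3x3 spatio-temporal block
out = conv3d(video, kernel)
print(out.shape)  # (14, 6, 6)
```

Because every output already depends on several frames, stacking such layers grows the temporal receptive field without the long gradient chains of an RNN.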

Fully Convolutional Model

In a convolutional network, it is quite easy to enforce temporal consistency with the start and end frames provided as inputs. The model has three key components:

  1. A 2D convolutional image encoder that maps the input key frames to a latent space.
  2. A 3D convolutional latent representation generator that fuses the key-frame information while progressively increasing the temporal resolution.
  3. A video generator that decodes the latent representation into video frames.
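The three components above can be sketched as a simple shape-level pipeline. The code below is an illustrative stand-in only: the linear projections, latent size, and linear blending are placeholders for the paper's convolutional encoder, 3D convolutional generator, and video decoder, chosen so the data flow and tensor shapes are easy to follow.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8   # key-frame resolution (illustrative)
D = 32      # latent dimensionality (illustrative)
T = 16      # frames in the final video

W_enc = rng.standard_normal((H * W, D))  # stand-in for the 2D conv encoder
W_dec = rng.standard_normal((D, H * W))  # stand-in for the video decoder

def encode(frame):                              # component 1
    return frame.reshape(-1) @ W_enc

def generate_latents(z_start, z_end, noise):    # component 2
    # Stand-in for the 3D conv latent generator: blend the two key-frame
    # latents over time and perturb with noise, one latent per output frame.
    alphas = np.linspace(0.0, 1.0, T)[:, None]
    return (1 - alphas) * z_start + alphas * z_end + 0.1 * noise

def decode(latents):                            # component 3
    return (latents @ W_dec).reshape(T, H, W)

start, end = rng.random((H, W)), rng.random((H, W))
noise = rng.standard_normal((T, D))
video = decode(generate_latents(encode(start), encode(end), noise))
print(video.shape)  # (16, 8, 8)
```

The key point is the division of labour: only the middle component reasons about time, while the encoder and decoder operate frame by frame.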

Reference: arXiv:1905.10240 | NVIDIA

The team first tried to generate the video directly from the encoded representations of the start and end frames, but the results were unsatisfactory. That is why they designed the latent representation generator, which stochastically fuses the key-frame representations and steadily increases the temporal resolution of the final video.
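The idea of steadily increasing temporal resolution can be illustrated with a toy upsampling loop: start from the two key-frame latents and repeatedly double the number of time steps, injecting noise at each stage so the fusion is stochastic. This crude repeat-and-perturb scheme is an assumption made for the sketch; the actual model uses learned 3D convolutions conditioned on the key frames.

```python
import numpy as np

rng = np.random.default_rng(42)

def upsample_in_time(latents, noise_scale=0.1):
    """Double the temporal resolution of a (T, D) latent sequence.

    Each latent is repeated, then perturbed with noise so that the
    inserted steps differ stochastically from their neighbours.
    """
    doubled = np.repeat(latents, 2, axis=0)
    return doubled + noise_scale * rng.standard_normal(doubled.shape)

latents = rng.standard_normal((2, 64))  # start and end key-frame latents
while len(latents) < 16:                # 2 -> 4 -> 8 -> 16 time steps
    latents = upsample_in_time(latents)
print(latents.shape)  # (16, 64)
```

Because fresh noise enters at every stage, running the loop again with a different seed yields a different plausible latent trajectory between the same two endpoints.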

The team tested their model on various publicly available datasets, including UCF101 Action Recognition, BAIR, and KTH Action Database.

Examples of frames created by the new model | Courtesy of researchers 

In each experiment, every sample contained a total of 16 frames, of which 14 were generated by the convolutional neural network. The model was run more than 100 times for each pair of key frames, and the whole process was repeated 10 times for each model variant.


In all cases, the model was able to create realistic video sequences, provided the key frames were roughly half a second apart. Moreover, the researchers showed that a variety of distinct sequences can be created for the same pair of key frames simply by altering the input noise vector that drives the generative process. This new method offers a valuable alternative perspective for future studies on video generation.
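Noise-driven diversity can be demonstrated in miniature: with the key-frame latents held fixed, two different noise seeds produce two different in-between trajectories. The linear blending below is again an illustrative stand-in for the stochastic generator, not the paper's method.

```python
import numpy as np

def sample_sequence(seed, z_start, z_end, T=16):
    # Illustrative only: linearly inbetween two latents, then add
    # seeded noise standing in for the model's stochastic generation.
    rng = np.random.default_rng(seed)
    alphas = np.linspace(0.0, 1.0, T)[:, None]
    blend = (1 - alphas) * z_start + alphas * z_end
    return blend + 0.1 * rng.standard_normal((T, z_start.size))

z0, z1 = np.zeros(64), np.ones(64)
a = sample_sequence(seed=1, z_start=z0, z_end=z1)
b = sample_sequence(seed=2, z_start=z0, z_end=z1)
print(np.allclose(a, b))  # False: different noise, different in-between path
```

Both sequences share the same endpoints up to noise, yet differ everywhere in between, which is exactly the variety the researchers report.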

Written by
Varun Kumar

I am a professional technology and business research analyst with more than a decade of experience in the field. My main areas of expertise include software technologies, business strategies, competitive analysis, and staying up-to-date with market trends.

I hold a Master's degree in computer science from GGSIPU University. If you'd like to learn more about my latest projects and insights, please don't hesitate to reach out to me via email at [email protected].
