- The new 3D convolutional neural network can fill in the frames between a given start and end frame.
- It uses a latent representation generator to produce a variety of plausible video sequences.
Recent advances in artificial neural network architectures and generative adversarial networks have accelerated the development of image and video synthesis methods. Most existing research focuses on two tasks: unconditional video generation and video prediction. Both involve generating or predicting plausible new video frames from a limited number of past frames.
Recently, a research team at Google tackled the problem of creating diverse and plausible video sequences when only two frames, a start and an end frame, are available. This process, called inbetweening, is usually performed with recurrent neural networks built on either gated recurrent units (GRUs) or long short-term memory (LSTM) cells.
In this study, however, the researchers showed that inbetweening can be addressed with a 3D convolutional neural network. A major advantage of this approach is its simplicity: because it uses no recurrent elements, the shorter gradient paths allow for deeper networks and more stable training.
Fully Convolutional Model
In a convolutional network, it is straightforward to enforce temporal consistency with the start and end frames provided as inputs. The model has three key components:
- A 2D convolutional image encoder for mapping input key frames to a latent space.
- A 3D convolutional latent representation generator that fuses information from the input frames while progressively increasing the temporal resolution.
- A video generator for decoding the latent representation into video frames.
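Put together, the pipeline above can be sketched at the level of tensor shapes. The following is a minimal illustration; all sizes (64×64 frames, 8× spatial downsampling, 16 output frames) are assumptions for the sake of the example, not the paper's actual configuration:

```python
# Shape-level sketch of the three-stage pipeline; all sizes are
# illustrative assumptions, not the paper's actual configuration.

def encode_keyframe(frame_hw=(64, 64), channels=64):
    """2D conv encoder: map one key frame to a latent feature map (8x downsampled)."""
    h, w = frame_hw
    return (channels, h // 8, w // 8)

def generate_latents(start_latent, end_latent, num_frames=16):
    """3D conv generator: grow from the 2 key-frame latents to num_frames
    time steps, doubling temporal resolution at each stage.
    (In the real model, end_latent is fused in as well.)"""
    c, h, w = start_latent
    t = 2
    while t < num_frames:
        t = min(t * 2, num_frames)
    return (t, c, h, w)

def decode_video(latents, frame_hw=(64, 64)):
    """Video generator: decode each latent time step back to an RGB frame."""
    t, _, _, _ = latents
    return (t, 3, *frame_hw)

z_start, z_end = encode_keyframe(), encode_keyframe()
video = decode_video(generate_latents(z_start, z_end))
print(video)  # (16, 3, 64, 64): 16 frames, including the two key frames
```

The point of the shape walk-through is that the temporal dimension only appears in the middle stage: the encoder and decoder work frame by frame, while the 3D convolutional generator is what turns two key-frame latents into a full sequence.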
The team first tried to generate the video directly from the encoded representations of the start and end frames, but the results were unsatisfactory. This motivated the latent representation generator, which stochastically fuses the key-frame representations and progressively increases the temporal resolution toward that of the final video.
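The idea of stochastic fusion with progressively increasing temporal resolution can be illustrated with a toy sketch. Here scalars stand in for feature maps, and noisy midpoint interpolation stands in for the learned 3D convolutions; this is an illustrative assumption, not the paper's code:

```python
import random

def upsample_with_noise(latents, rng, noise_scale=0.1):
    """Roughly double temporal resolution: insert a noisy midpoint
    between each adjacent pair of latent time steps."""
    out = []
    for a, b in zip(latents, latents[1:]):
        out.extend([a, (a + b) / 2 + rng.gauss(0, noise_scale)])
    out.append(latents[-1])
    return out

rng = random.Random(0)
seq = [0.0, 1.0]                  # the two key-frame representations
for _ in range(3):                # 2 -> 3 -> 5 -> 9 time steps
    seq = upsample_with_noise(seq, rng)
print(len(seq), seq[0], seq[-1])  # 9 0.0 1.0: endpoints stay pinned to the key frames
```

Because the noise is injected at every resolution stage, each run produces a different in-between trajectory, while the first and last steps always match the given key frames.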
The team tested their model on various publicly available datasets, including UCF101 Action Recognition, BAIR, and KTH Action Database.
Examples of frames created by the new model | Courtesy of researchers
In the final evaluation, every generated sample contained 16 frames, 14 of which were produced by the network (the remaining two being the given key frames). The model was run more than a hundred times for each pair of key frames, and the whole procedure was repeated ten times for each model variant.
In all cases, the model was able to produce realistic video sequences, provided the key frames were roughly half a second apart. Moreover, the researchers showed that a variety of distinct sequences can be generated simply by altering the input noise vector that drives the generative process. This method offers a valuable alternative direction for future work on video generation.
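The role of the noise vector can be sketched as follows. The `inbetween` helper is a hypothetical stand-in for the trained generator (noisy interpolation rather than the actual network), used only to show that fixed key frames plus different noise yield different sequences:

```python
import random

def inbetween(start, end, num_mid, rng):
    """Hypothetical stand-in for the generator: noisy linear
    interpolation between fixed start and end key frames."""
    frames = []
    for i in range(1, num_mid + 1):
        t = i / (num_mid + 1)
        frames.append((1 - t) * start + t * end + rng.gauss(0, 0.05))
    return frames

# Same key frames, different noise -> different in-between sequences.
seq_a = inbetween(0.0, 1.0, 14, random.Random(1))
seq_b = inbetween(0.0, 1.0, 14, random.Random(2))
print(len(seq_a), seq_a != seq_b)  # 14 True
```

This mirrors the evaluation setup described above: 14 generated frames per sample, with repeated runs over the same key-frame pair producing diverse results.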