- Researchers demonstrate a new type of video-to-video synthesis.
- It allows developers to render fully interactive 3D environments from real-world videos.
- It can create 30-second-long videos at 2K resolution.
Almost two decades ago, NVIDIA introduced the world’s first GPU, delivering a major leap in 3D gaming performance. Now, the company has unveiled an artificial intelligence tool that allows developers to render fully synthetic, interactive three-dimensional environments from real-world videos.
The ability to model and recreate real-world dynamics is crucial for developing intelligent agents. Synthesizing continuous visual experiences has a variety of applications in computer graphics and robotics. It could help developers create realistic scenes without manually specifying lighting, materials, and scene geometry.
In this work, researchers have demonstrated a new type of video-to-video synthesis. The objective is to learn a mapping function that efficiently transforms an input video into an output video. They have synthesized high-resolution, temporally consistent videos using generators, discriminators, and spatio-temporal adversarial learning.
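To make the idea of spatio-temporal adversarial learning concrete, here is a minimal PyTorch-style sketch (not the researchers’ actual code): a generator produces each frame from the current high-level description plus a few previously generated frames, while one discriminator judges individual frames and another judges short stacks of consecutive frames. All module names, shapes, and layer choices below are illustrative assumptions.

```python
# Minimal sketch of the video-to-video mapping idea (illustrative, not NVIDIA's code).
import torch
import torch.nn as nn

class FrameGenerator(nn.Module):
    """Maps (current label map, K previous frames) -> next RGB frame."""
    def __init__(self, label_channels=4, past_frames=2):
        super().__init__()
        in_ch = label_channels + 3 * past_frames
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, label_map, prev_frames):
        x = torch.cat([label_map] + prev_frames, dim=1)
        return self.net(x)

class PatchDiscriminator(nn.Module):
    """Scores realism of a stack of channels (a single frame or a short clip)."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

# The image discriminator judges single frames; the video discriminator judges
# a short window of consecutive frames stacked along the channel axis, which is
# what enforces temporal consistency in this sketch.
G = FrameGenerator()
D_image = PatchDiscriminator(in_ch=3)
D_video = PatchDiscriminator(in_ch=3 * 3)  # three consecutive frames

labels = torch.zeros(1, 4, 64, 128)
prev = [torch.zeros(1, 3, 64, 128), torch.zeros(1, 3, 64, 128)]
fake_frame = G(labels, prev)                               # (1, 3, 64, 128)
frame_score = D_image(fake_frame)                          # realism of one frame
clip_score = D_video(torch.cat([fake_frame] * 3, dim=1))   # realism of a 3-frame window
```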
Using Neural Networks To Render High-Level Descriptions
To render the synthetic three-dimensional world in real time, they started with a conditional generative neural network and trained it on existing videos. The networks gradually learned to render objects such as vehicles, buildings, and trees.
With existing technology, developers need to model each object individually, a process that is both time-consuming and expensive. The new tool, by contrast, is based on a model that automatically learns from real video and creates virtual worlds for automotive, gaming, robotics, architecture, and virtual reality applications.
Reference: arXiv:1808.06601 | NVIDIA | GitHub
It can create interactive environments based on real locations, or make people appear to dance like their favorite rock stars. The network works on high-level descriptions of a 3D scene, such as edge maps describing the locations of objects and their general attributes, for example whether a particular part of the image contains a building or a car. It then draws on real-world scenes to fill in the details.
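As an illustration of what such a high-level description might look like in practice, the sketch below one-hot encodes a per-pixel label map (road, building, car, tree) into the tensor a conditional generator could consume. The class IDs are hypothetical and do not correspond to any real dataset’s label set.

```python
# Illustrative only: turning a per-pixel label map into the generator's conditioning input.
import torch
import torch.nn.functional as F

CLASSES = {"road": 0, "building": 1, "car": 2, "tree": 3}

def labels_to_condition(label_map: torch.Tensor, num_classes: int = len(CLASSES)) -> torch.Tensor:
    """label_map: (H, W) integer class IDs -> (1, num_classes, H, W) one-hot tensor."""
    one_hot = F.one_hot(label_map.long(), num_classes)     # (H, W, C)
    return one_hot.permute(2, 0, 1).unsqueeze(0).float()   # (1, C, H, W)

# Example: a tiny 4x4 "scene" that is mostly road with a car in one corner.
label_map = torch.zeros(4, 4, dtype=torch.long)
label_map[0, 0] = CLASSES["car"]
condition = labels_to_condition(label_map)
print(condition.shape)  # torch.Size([1, 4, 4, 4])
```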
The neural networks were trained on videos of actual urban areas. The researchers created a demo that lets people navigate a virtual urban world rendered by the networks. Since the scenes are created synthetically, it is easy to edit, add, or modify objects in the virtual scene.
The demo runs on NVIDIA Tensor Core GPUs and, according to the report, provides a whole new experience of interactive graphics. The neural networks were trained on an NVIDIA DGX-1 with Tesla V100 GPUs, using the CUDA Deep Neural Network library (cuDNN). The team selected several thousand clips from the Cityscapes and ApolloScape datasets to train the networks.
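The report does not describe the exact data pipeline, but a rough sketch of how short clips of consecutive frames might be sampled from a directory of extracted video frames is shown below; the directory layout, file naming, and clip length are assumptions made for illustration.

```python
# Hypothetical loader for short training clips of consecutive frames.
import os

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class ClipDataset(Dataset):
    """Yields clips of `clip_len` consecutive frames from a folder of PNG frames."""

    def __init__(self, frame_dir: str, clip_len: int = 6):
        self.frames = sorted(
            os.path.join(frame_dir, f)
            for f in os.listdir(frame_dir)
            if f.endswith(".png")
        )
        self.clip_len = clip_len

    def __len__(self):
        return max(0, len(self.frames) - self.clip_len + 1)

    def __getitem__(self, idx):
        clip = []
        for path in self.frames[idx: idx + self.clip_len]:
            img = np.array(Image.open(path).convert("RGB"))               # (H, W, 3) uint8
            clip.append(torch.from_numpy(img).permute(2, 0, 1).float() / 255.0)  # (3, H, W)
        return torch.stack(clip)                                          # (clip_len, 3, H, W)
```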
Testing
They carried out multiple tests and obtained both quantitative and qualitative results, which show that the synthesized scenes look more realistic than those generated by existing state-of-the-art methods.
This new AI can produce 30-second-long videos at 2K resolution. It also provides high-level control over the output: for instance, one can easily add new objects or replace the trees in a scene with buildings.
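Such control comes from editing the high-level input rather than the output pixels. A hedged sketch of that idea, reusing the same hypothetical class IDs as the earlier one-hot example, is shown here.

```python
# Hedged sketch of a high-level edit: relabel every "tree" pixel as "building"
# in the semantic map, then re-run the generator on the edited map.
# Class IDs are hypothetical.
import torch

CLASSES = {"road": 0, "building": 1, "car": 2, "tree": 3}

def replace_class(label_map: torch.Tensor, src_id: int, dst_id: int) -> torch.Tensor:
    """Return a copy of label_map with every src_id pixel relabeled as dst_id."""
    edited = label_map.clone()
    edited[edited == src_id] = dst_id
    return edited

# A tiny 4x4 "scene": mostly road, with a strip of trees along the top row.
label_map = torch.zeros(4, 4, dtype=torch.long)
label_map[0, :] = CLASSES["tree"]
edited_map = replace_class(label_map, CLASSES["tree"], CLASSES["building"])
# Feeding `edited_map` (one-hot encoded) back through the generator would
# synthesize the same scene with buildings where the trees used to be.
```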
The approach is not perfect and fails in several scenarios, such as rendering turning vehicles, because the input maps do not carry enough information. However, this could be addressed by integrating additional 3D cues such as depth maps.
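How such 3D cues would be wired in is not specified in the report; one plausible sketch is simply to concatenate a normalized depth channel onto the semantic conditioning tensor, as below (all shapes are illustrative assumptions).

```python
import torch

# Assumed shapes: batch of 1, 4 semantic classes, 256x512 frames.
semantic_condition = torch.zeros(1, 4, 256, 512)   # one-hot label channels
depth_map = torch.rand(1, 1, 256, 512)             # normalized depth as an extra 3D cue
conditioned_input = torch.cat([semantic_condition, depth_map], dim=1)  # (1, 5, 256, 512)
```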
Though the study is at an early stage, this technique could make it much easier and cheaper to develop virtual environments for a variety of domains.