- The new AI optimally splits multiple scenes within a video and tags each scene with metadata.
- It generates an inverted index of video content that can be searched via text input.
- This can help users and organizations utilize video content in a new manner.
In recent years, the volume of video content has grown significantly, making it necessary to develop technologies that can efficiently manage, consume, and recommend heterogeneous videos. It can be difficult for consumers, or even data analysts, to understand video content and locate the information they seek.
That’s why we have video scene detection tools that retrieve, summarize, and classify the content of multi-chaptered videos. Video scene detection is the process of temporally splitting a video into logical scenes. A scene is simply a set of shots that together portray a story, and each shot is made up of a number of frames.
Now, researchers at IBM have built a new technique that can efficiently detect specific scenes in a video. It can help users and organizations utilize video content in a new manner. But dozens of methods already perform the same task, so how is this one different?
Using “Normalized Cost” for Video Scene Detection
A video is divided into multiple scenes, and AI tags each scene with metadata (like where it happens, what’s going on, who is starring) to augment the content with useful information. This creates an inverted index of video content that can be searched via text input.
To do this, the researchers formalized the scene detection task as a generic optimization process of sequential grouping. They selected optimal division points according to features representing the shots in the video, using a distance matrix to calculate the similarity or dissimilarity between them.
They developed a normalized cost function that optimally works in difficult scenarios like weak scene indications and widely varying scene lengths. It provides a desired mathematical formulation to enable correct and unbiased partitioning.
In addition, they used deep neural networks to efficiently capture the semantic components of a scene. The networks precisely identify correlated components, enabling scene partitioning that requires in-depth understanding of the content.
Given a clip, the researchers applied shot boundary detection and extracted a feature vector for each shot using both audio and visual features. This allowed them to measure the distance between every pair of shots.
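As a minimal sketch of this step (assuming cosine distance over hypothetical L2-normalized shot embeddings; the actual audio-visual features used by the researchers differ), the pairwise distance matrix can be built like this:

```python
import numpy as np

# Hypothetical shot features: one L2-normalized vector per shot.
# In practice these would come from deep audio + visual models.
rng = np.random.default_rng(0)
shot_features = rng.normal(size=(6, 8))
shot_features /= np.linalg.norm(shot_features, axis=1, keepdims=True)

# Cosine distance between every pair of shots:
# 0 for identical directions, up to 2 for opposite ones.
distance_matrix = 1.0 - shot_features @ shot_features.T
```

Any distance that is low for similar shots and high for dissimilar ones would serve the same role here.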
Ideal distance matrix and real distance matrix | Courtesy of researchers
The idea is that shots from different scenes will have a high distance value, while shots within the same scene will have a low one. If the features are chosen appropriately, the distance matrix will likely exhibit a block-diagonal layout, where each block represents a scene. For a real clip the distance matrix doesn’t look ideal, of course, but the block structure is still observable.
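To make the block-diagonal idea concrete, here is a toy ideal matrix for six shots grouped into three scenes (an illustrative construction, not the researchers' data): distance 0 within a scene and 1 across scenes yields zero-blocks along the diagonal.

```python
import numpy as np

# Scene label of each of six consecutive shots: two scenes of
# lengths 2 and 3, then a final one-shot scene.
scene_of_shot = np.array([0, 0, 1, 1, 1, 2])

# Ideal distance: 0 within a scene, 1 across scenes.
ideal = (scene_of_shot[:, None] != scene_of_shot[None, :]).astype(float)
```

Real matrices are noisy versions of this pattern, which is why the partitioning is posed as an optimization rather than simple thresholding.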
Overview of the proposed technique
The next step is to formulate the optimization problem with a cost function over distance values. Optimal scene division can be achieved by minimizing the cost function.
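A minimal sketch of such an optimization, using a simple additive within-scene distance cost and plain dynamic programming (the paper's actual normalized cost function and algorithm are more sophisticated):

```python
import numpy as np

def block_cost(D, i, j):
    # Sum of pairwise distances among shots i..j (inclusive):
    # low when the shots are mutually similar, i.e. likely one scene.
    return D[i:j + 1, i:j + 1].sum()

def split_scenes(D, k):
    """Split n shots into k contiguous scenes, minimizing total
    within-scene distance. C[m, j] = best cost of grouping shots
    0..j into m scenes; back[m, j] = last shot of scene m-1."""
    n = D.shape[0]
    C = np.full((k + 1, n), np.inf)
    back = np.zeros((k + 1, n), dtype=int)
    for j in range(n):
        C[1, j] = block_cost(D, 0, j)
    for m in range(2, k + 1):
        for j in range(m - 1, n):
            for s in range(m - 2, j):  # last scene is shots s+1..j
                c = C[m - 1, s] + block_cost(D, s + 1, j)
                if c < C[m, j]:
                    C[m, j] = c
                    back[m, j] = s
    # Recover scene boundaries (index of the first shot of each
    # scene after the first one).
    bounds, j = [], n - 1
    for m in range(k, 1, -1):
        j = back[m, j]
        bounds.append(j + 1)
    return sorted(bounds)

# Toy ideal distance matrix for scenes [0,0,1,1,1,2].
labels = np.array([0, 0, 1, 1, 1, 2])
D = (labels[:, None] != labels[None, :]).astype(float)
boundaries = split_scenes(D, 3)  # expected: scenes start at shots 2 and 5
```

Note that a plain additive cost like this one is biased toward short scenes, which is exactly the kind of imbalance the researchers' normalized cost is designed to avoid.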
This technique has several substantial advantages. It doesn’t involve any sensitive parameters, so you don’t need to fine-tune the method to make it applicable to various kinds of content. Also, a novel dynamic programming formulation makes the scene partitioning process fast.
Since the technique is formulated as an optimization problem over sequential data, it can work on other tasks as well, including change point detection in data analysis and audio partitioning.
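To illustrate this generality with a toy example (my own construction, not from the researchers), a piecewise-constant 1-D signal yields the same kind of block-diagonal distance matrix as video shots, so the identical sequential-grouping machinery applies to change point detection:

```python
import numpy as np

# A 1-D signal with a single change point at index 4.
signal = np.array([0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0])

# Pairwise absolute-difference distance matrix: near-zero blocks
# on either side of the change point, large values across it.
D = np.abs(signal[:, None] - signal[None, :])
```

Minimizing the same within-block cost over this matrix recovers the change point, just as it recovers scene boundaries for video.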
Furthermore, comprehensive evaluation indicates that the technique outperforms existing state-of-the-art methods on real-world videos.