Video classification aims to automatically mine the underlying concepts (e.g., actions, events) in videos and plays an
essential role in intelligent video analysis. However, most existing algorithms exploit only the visual cues of these concepts and ignore
external knowledge that could model their relationships as videos evolve. In fact, humans have a remarkable ability
to apply acquired knowledge to reason about a dynamically changing world. To narrow this knowledge gap between existing methods and humans, we propose an end-to-end video classification
framework based on a structured knowledge graph, which models the dynamic evolution of knowledge in videos over time. Specifically, we map the concepts of videos to the nodes of the knowledge graph.
To leverage the knowledge graph effectively, we adopt a graph ConvLSTM model that not only identifies local knowledge structures in each video shot but also models dynamic patterns of knowledge evolution
across these shots. Furthermore, we design a novel knowledge-based attention model that accounts for the importance of each video shot and the relationships between concepts. Extensive experimental
results on two video classification benchmarks, UCF101 and YouTube-8M, demonstrate the favorable performance of the proposed framework.
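To make the pipeline described above concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of the core idea: concept nodes in a knowledge graph, an LSTM cell whose gates are computed with graph convolutions to track knowledge evolution across shots, and a soft attention over per-shot states. All dimensions, the toy adjacency matrix, and the random features are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_conv(A_hat, H, W):
    # one graph-convolution step: aggregate neighbor node features, then project
    return A_hat @ H @ W

class GraphConvLSTMCell:
    """Hypothetical minimal cell: LSTM gates computed via graph convolutions."""
    def __init__(self, d_in, d_hid, rng):
        d = d_in + d_hid
        # one projection per gate: input (i), forget (f), output (o), candidate (c)
        self.W = {g: rng.standard_normal((d, d_hid)) * 0.1 for g in "ifoc"}

    def step(self, A_hat, x, h, c):
        z = np.concatenate([x, h], axis=1)            # (n_nodes, d_in + d_hid)
        i = sigmoid(graph_conv(A_hat, z, self.W["i"]))
        f = sigmoid(graph_conv(A_hat, z, self.W["f"]))
        o = sigmoid(graph_conv(A_hat, z, self.W["o"]))
        g = np.tanh(graph_conv(A_hat, z, self.W["c"]))
        c = f * c + i * g                             # update per-node cell state
        h = o * np.tanh(c)                            # per-node hidden state
        return h, c

# Toy knowledge graph over 4 concept nodes, observed across 3 video shots.
n_nodes, d_in, d_hid, n_shots = 4, 8, 16, 3
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], float)
A_hat = A + np.eye(n_nodes)                           # add self-loops
A_hat = np.diag(1.0 / A_hat.sum(1)) @ A_hat           # row-normalize adjacency

cell = GraphConvLSTMCell(d_in, d_hid, rng)
h = np.zeros((n_nodes, d_hid))
c = np.zeros_like(h)
shot_states = []
for _ in range(n_shots):
    x = rng.standard_normal((n_nodes, d_in))          # per-shot concept features
    h, c = cell.step(A_hat, x, h, c)
    shot_states.append(h.mean(axis=0))                # pool nodes into a shot state

# Attention over shots: score each pooled shot state, softmax-normalize.
S = np.stack(shot_states)                             # (n_shots, d_hid)
scores = S @ rng.standard_normal(d_hid)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                                  # shot importance weights
video_repr = alpha @ S                                # video-level feature, (d_hid,)
```

A classifier head (e.g., a linear layer plus softmax over class labels) would then map `video_repr` to class scores; the knowledge-based attention in the paper additionally conditions the shot scores on concept relationships, which this sketch simplifies to a learned scoring vector.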