Research on Relation Learning in Video Understanding (视频理解中的关系学习研究)
Gao Junyu (高君宇)
2020-05-30
Pages: 154
Subtype: Doctoral (博士)
Abstract

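With the spread of portable digital devices and the growth of the mobile Internet, acquiring and transmitting video data has become far more convenient, giving rise to video big data. This massive volume of video urgently calls for intelligent video understanding technology. Video understanding is a process that fuses a video's low-level feature information with its high-level semantic information and serves users' differing needs. Efficient video understanding technology lets computers intelligently carry out a range of video-related tasks, such as video surveillance and video entertainment, so this research has both theoretical significance and practical value.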

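Video big data is characterized by (1) spatio-temporal complexity, (2) a "semantic gap" between low-level features and high-level semantics, (3) a rich variety of categories, (4) multiple modalities, and (5) diverse personalized requirements. These characteristics show up in video data as intricate relational information, which poses a major challenge for intelligent video understanding. In fact, learning the complex and varied relation patterns in video is essential to understanding video content in depth. This thesis centers on how to design effective relation learning methods for video understanding and studies, from the bottom up, three kinds of relational structure in video. First, at the object level, it studies structured relation modeling of object appearance. Next, with objects as the link, it mines the object-semantic relations in video, enabling automatic extraction of high-level video semantics. Finally, it explores the relations between video semantics and user interests, enabling personalized video services.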

The main work and contributions of this thesis are summarized as follows:

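Object appearance relation modeling in videos. Objects are the basic building blocks of a video, and people usually pay more attention to the objects that interest them, so robust object appearance modeling is the foundation of video understanding. Most current methods for object appearance modeling in video optimize a direct binary classification objective and therefore cannot account for the structured relative relations among an object's local image patches. This thesis proposes a relativity-based object appearance modeling method that uses a convolutional neural network to learn a ranking function describing the relative relations between local regions. The relative model and the deep neural network reinforce each other in learning object appearance relations. The thesis further explores the value of spatio-temporal structured relations for object appearance modeling. Existing methods struggle to make full use of an object's spatio-temporal appearance information across different contexts. In fact, spatio-temporal information provides rich features that strengthen the representation of the target object, and context information gives better online adaptation for discriminating the object. Within a Siamese network framework, this thesis jointly organizes a spatio-temporal graph convolutional network and a context-adaptive graph convolutional network to learn an adaptive representation of the object's spatio-temporal structure. Experimental results on the video tracking task verify the effectiveness of the proposed methods for object appearance modeling in videos.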

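Object-semantic relation mining in videos. The associations and cooperation among objects make up the higher-level semantic information in a video. However, most existing algorithms use only a video's low-level visual features for semantic recognition and neglect external knowledge for modeling the explicit relations between objects and semantics. To narrow the knowledge gap between algorithms and humans, this thesis first mines, in a supervised learning setting, the relations between knowledge-guided object information and high-level video semantics. It proposes an end-to-end video semantic understanding framework that uses a structured knowledge graph to learn the dynamic structured relations between objects and semantics within a video. To use the knowledge graph efficiently, the thesis designs a dynamic graph convolution model that both identifies the local knowledge structure in each video segment and models the dynamic knowledge evolution across consecutive segments. Further, to obtain a video semantic understanding model that generalizes better as video big data grows explosively, the thesis carries out object-semantic relation learning in a zero-shot learning setting. It proposes an end-to-end, knowledge-graph-based zero-shot video semantic classification framework composed of a two-branch graph neural network with a prototype branch and an instance branch. By treating objects as the attributes of zero-shot learning, the proposed method jointly models category-attribute, attribute-attribute, and category-category relations in video. Experimental results on video classification and zero-shot video classification demonstrate the effectiveness of the proposed object-semantic relation mining methods.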

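Video semantics-user interest relation learning. The ultimate purpose of video understanding is to serve users. With the rapid development of the Internet, watching videos online has become an indispensable part of daily life, and video recommendation algorithms are an important means of connecting video semantics with user interests. Most current methods assume that user interests are static. In fact, this assumption fails to reflect how user interests change over time, especially on online video platforms where content changes by the day. To address this problem, the thesis designs a unified framework that uses a dynamic recurrent neural network to model users' personalized interests. To model user interests better, it combines video semantic embedding, user interest modeling, and user relevance mining to jointly learn the latent relations between video semantics and user interests. In this way, the proposed framework becomes an interest-aware network that efficiently captures users' dynamically changing interests.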

Other Abstract

With the popularity of portable digital devices and the development of the mobile Internet, the acquisition and transmission of video data have become more convenient, giving rise to video big data. These massive video data urgently require intelligent video understanding technology. Video understanding is a process that integrates the low-level feature information and high-level semantic information of a video and serves users' differing needs. Efficient video understanding technology enables computers to intelligently complete various video-related tasks, such as video surveillance and video entertainment. This research therefore has both theoretical significance and practical value.

Video big data has several characteristics: (1) complex spatial-temporal structure, (2) a semantic gap between low-level features and high-level semantics, (3) a wide variety of categories, (4) multi-modality, and (5) diverse personalized requirements. These characteristics manifest as complex relations in videos, which pose major challenges for intelligent video understanding. In fact, learning these complex relation patterns is essential for understanding video content. This thesis focuses on how to design effective relation learning methods for video understanding. Specifically, we study three types of relation structures in a bottom-up manner. First, at the object level, we study structured relation modeling for object appearance. Then, we use objects as a bridge to explore object-semantic relations, enabling automatic extraction of high-level video semantics. Finally, we study the relations between video semantics and user interests to deliver personalized video services.

The major contributions of this thesis are summarized as follows:

Object appearance relation modeling in videos. Objects are the basic elements of a video, and people usually pay more attention to the objects of interest. A robust object appearance model is therefore the basis of video understanding. Most existing methods directly adopt a binary classification objective for object appearance modeling, which cannot leverage the structured relative relations between local image patches of objects in videos. This thesis proposes a relativity-based object appearance modeling method that uses a convolutional neural network to learn a ranking function describing the relative relationship between local regions. In the proposed algorithm, the relative model and the deep learning model enhance and complement each other in object appearance modeling. Further, this thesis explores spatial-temporal relations in object appearance modeling. Existing methods have difficulty taking full advantage of spatial-temporal target appearance information under different contextual situations. In fact, spatial-temporal information provides diverse features that enhance the object representation, and context information is important for online adaptation. To comprehensively leverage the spatial-temporal structure of target objects and benefit from context information, we jointly organize a spatial-temporal graph convolutional network and a context-adaptive graph convolutional network in a Siamese framework to learn an adaptive representation of the object's spatial-temporal structure. Experimental results on the video tracking task verify the effectiveness of the proposed methods for object appearance modeling in videos.
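The idea of learning a ranking function over local regions, rather than a binary classifier, can be illustrated with a pairwise margin (hinge) ranking loss. This is a minimal sketch under stated assumptions, not the thesis's formulation: the abstract does not give the loss, the thesis scores patches with a CNN where this sketch uses a dot product, and the names `pairwise_ranking_loss`, `linear_score`, and `mean_ranking_loss` as well as the toy features are hypothetical.

```python
def pairwise_ranking_loss(score_hi, score_lo, margin=1.0):
    """Hinge loss: zero once score_hi exceeds score_lo by at least `margin`."""
    return max(0.0, margin - (score_hi - score_lo))


def linear_score(weights, features):
    """Stand-in scorer (assumption): a dot product instead of a CNN."""
    return sum(w * x for w, x in zip(weights, features))


def mean_ranking_loss(weights, ranked_pairs, margin=1.0):
    """Average hinge loss over (better_patch, worse_patch) feature pairs."""
    losses = [
        pairwise_ranking_loss(
            linear_score(weights, better), linear_score(weights, worse), margin
        )
        for better, worse in ranked_pairs
    ]
    return sum(losses) / len(losses)


# Hypothetical 2-D patch features; the first patch of each pair should rank higher.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([2.0, 1.0], [1.0, 2.0])]
good_weights = [1.0, -1.0]   # orders every pair correctly with margin to spare
bad_weights = [-1.0, 1.0]    # orders every pair the wrong way round

print(mean_ranking_loss(good_weights, pairs))
print(mean_ranking_loss(bad_weights, pairs))
```

The loss depends only on score differences within a pair, which is what lets a ranking objective express relative relations that a per-patch binary label cannot.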

Object-semantic relation mining in videos. The relations and interactions between objects constitute high-level semantic information in videos. However, most existing algorithms exploit only the visual cues of videos for semantic recognition and ignore external knowledge for modeling the explicit relations between objects and video semantics. To narrow the knowledge gap between existing methods and humans, this thesis first explores the relations between knowledge-guided object information and high-level video semantics under the supervised learning setting. It proposes an end-to-end video semantic understanding framework based on structured knowledge graphs, which models the dynamic knowledge relations between objects and semantics in videos. To leverage the knowledge graphs effectively, we design a dynamic graph convolution model that not only identifies local knowledge structures in each video shot but also models the dynamic patterns of knowledge evolution across these shots. Furthermore, to obtain a more generalizable model that can handle the explosive growth of video big data, this thesis carries out object-semantic relation learning under the zero-shot setting. We propose an end-to-end zero-shot video classification framework with knowledge graphs, built as a two-branch graph neural network with a prototype branch and an instance branch. By treating objects as attributes in the zero-shot setting, the proposed method jointly models the category-attribute, attribute-attribute, and category-category relations in videos. Experiments on video classification and zero-shot video classification show the effectiveness of the proposed object-semantic relation mining methods.
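For readers unfamiliar with graph convolution, the basic propagation step that such models build on can be sketched as follows. This is a generic single layer, assuming row-normalized adjacency with self-loops followed by a ReLU; it is not the dynamic graph convolution model or the two-branch network of the thesis, whose details the abstract does not give. The function names and the three-node toy graph are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalize_adjacency(adj):
    """Add self-loops, then divide each row by its sum (row normalization)."""
    n = len(adj)
    with_loops = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    return [[v / sum(row) for v in row] for row in with_loops]


def gcn_layer(adj, feats, weight):
    """One layer: average each node with its neighbours, project, apply ReLU."""
    propagated = matmul(normalize_adjacency(adj), feats)
    projected = matmul(propagated, weight)
    return [[max(0.0, v) for v in row] for row in projected]


# Hypothetical path graph 0 - 1 - 2 with one scalar feature per node.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0], [0.0], [-1.0]]
weight = [[1.0]]  # identity projection keeps the arithmetic easy to follow

out = gcn_layer(adj, feats, weight)
print(out)  # [[0.5], [0.0], [0.0]]
```

In a knowledge-graph setting, nodes would correspond to entities such as objects and categories and edges to their relations, so each layer mixes a node's features with those of related nodes.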

Video semantics-user interest relation learning. The ultimate goal of video understanding is to serve users' needs. With the rapid development of the Internet, watching videos online has become an indispensable part of people's daily lives, and video recommendation algorithms have become an important way of connecting video semantics and user interests. Most existing video recommendation methods assume that user profiles (interests) are static. In fact, the static assumption cannot adequately reflect how users' interests change over time, especially in online video recommendation scenarios where video content changes rapidly and users' interests drift frequently across topics. To overcome this issue, this thesis proposes a dynamic recurrent neural network that models users' dynamic interests over time in a unified framework. Furthermore, to build a more comprehensive recommendation system, the proposed model jointly exploits video semantic embedding, user interest modeling, and user relevance mining to model users' preferences. By considering these three factors, the RNN model becomes an interest-aware network that can capture users' dynamic interests effectively.
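The contrast between a static profile and a recurrent model of interest can be made concrete with a small sketch. It assumes a plain Elman-style update, h_t = tanh(W_h h_{t-1} + W_x x_t), with hand-set rather than learned weights and dot-product scoring of candidates; the thesis's actual architecture is not specified in the abstract, and all names and embeddings below are hypothetical.

```python
import math


def rnn_step(state, video_emb, w_state, w_input):
    """One Elman-style update: h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    return [
        math.tanh(
            sum(w_state[i][j] * state[j] for j in range(len(state)))
            + sum(w_input[i][j] * video_emb[j] for j in range(len(video_emb)))
        )
        for i in range(len(w_state))
    ]


def interest_after(history, w_state, w_input):
    """Run the recurrence over a watch history, starting from a zero state."""
    state = [0.0] * len(w_state)
    for emb in history:
        state = rnn_step(state, emb, w_state, w_input)
    return state


def score(state, candidate_emb):
    """Dot-product match between the interest state and a candidate video."""
    return sum(s * c for s, c in zip(state, candidate_emb))


# Hand-set weights (assumption): old interest decays by half, new input adds fully.
w_state = [[0.5, 0.0], [0.0, 0.5]]
w_input = [[1.0, 0.0], [0.0, 1.0]]

sports, cooking = [1.0, 0.0], [0.0, 1.0]  # hypothetical topic embeddings
early = interest_after([sports] * 3, w_state, w_input)
later = interest_after([sports] * 3 + [cooking] * 3, w_state, w_input)

print(score(early, sports) > score(early, cooking))
print(score(later, cooking) > score(later, sports))
```

Because the state is updated at every step, the same user is matched to different candidates before and after the shift in viewing history, which a single static profile vector cannot express.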

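Keyword: video understanding (视频理解); relation learning (关系学习); object appearance modeling (物体表观建模); semantic mining (语义挖掘); personalized applications (个性化应用)
Language: Chinese
Sub direction classification: Image and video processing and analysis (图像视频处理与分析)
Document Type: Thesis (学位论文)
Identifier: http://ir.ia.ac.cn/handle/173211/39179
Collection: 多模态人工智能系统全国重点实验室_多媒体计算 (State Key Laboratory of Multimodal Artificial Intelligence Systems, Multimedia Computing)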
Recommended Citation
GB/T 7714
高君宇. 视频理解中的关系学习研究[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2020.
Files in This Item:
File Name/Size: GJY-Thesis-final.pdf (8793KB)
DocType: Thesis (学位论文)
Access: Open Access (开放获取)
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.