面向多媒体数据的关系学习算法及应用研究 (Research on Relation Learning Algorithms for Multimedia Data and Their Applications)
马璇 (Ma Xuan)
Subtype: Master's thesis
Thesis Advisor: 徐常胜 (Xu Changsheng)
Date: 2021-05
Degree Grantor: 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Place of Conferral: Fifth Conference Room, 3rd Floor, Intelligence Building
Degree Discipline: Pattern Recognition and Intelligent Systems (模式识别与智能系统)
Keywords: Multimedia; Relation Learning; Deep Learning
Abstract

With the spread of digital devices and the growth of the mobile Internet, multimedia data such as text, images, and videos have emerged in large volumes on social media platforms. Multimedia data contain rich semantic information and exhibit complex relations. Studying the relations in multimedia data not only advances intelligent machine understanding of such data, but also drives progress on related tasks such as multimedia retrieval and multimedia question answering. This research therefore has significant theoretical and practical value.

Multimedia data have several characteristics: (1) a "semantic gap" between low-level features and high-level semantics, (2) multi-modality, and (3) complex temporal and spatial structure. These characteristics make learning the relations in multimedia data highly challenging. This thesis focuses on how to design effective relation learning algorithms and how to apply them to concrete tasks. Combining deep learning methods, it studies three problems in a bottom-up manner: first, for objects in multimedia data, it studies multi-modal object relation learning and applies it to the visual relation detection task; next, taking behavior as the research object, it learns multi-modal behavior semantic relations and applies them to the behavior inference task; finally, it learns the temporal relations in multi-modal individual behaviors and applies them to individual health status prediction.

The main contributions and innovations of the thesis are summarized as follows:

1. Multi-modal object relation learning.
Objects are the basic units of multimedia data, and learning the relations between objects is the foundation of multimedia relation learning. Most current object relation learning algorithms consider relation representations in only a single modality and fail to exploit multi-modal information to represent and model relations. To address this problem, the thesis proposes a multi-modal latent factor model with semantic constraints and applies it to the visual relation detection task. The model first combines information from text and image data to obtain multi-modal object representations. It then models the relations between objects with a latent factor model and estimates the existence probability of relation triples. Finally, it computes similarities between relations to constrain the relation learning process, yielding multi-modal relation representations. Experiments on the Visual Relationship and Visual Genome datasets demonstrate the effectiveness of the model.

2. Multi-modal behavior semantic relation learning.
Behavior in multimedia data is usually human-centered, reflecting human activities and status. Compared with objects, behavior carries higher-level semantic information, and learning the semantic relations between behaviors is likewise an important part of understanding and utilizing multimedia data. Existing behavior relation learning methods use only a single modality and represent behavior too simply, making it difficult to fully mine the information behaviors contain. To address this problem, the thesis proposes a multi-modal hierarchical graph network to learn behavior semantic relations and applies it to the behavior inference task. The model first obtains multi-modal behavior representations from text and image data. It then builds an object graph and a behavior graph to model, at two levels, the objects appearing in behaviors and the semantic relations between behaviors. A graph neural network learns representations of the object graph and the behavior graph, which are combined with the multi-modal behavior representation to form the final behavior representation, which is fed into a decoder to complete the behavior inference task. Experiments on the Event2mind dataset demonstrate the effectiveness of the proposed model.

3. Multi-modal individual behavior temporal relation learning.
With the popularity of mobile devices, daily human behavior data can easily be collected by various sensors. These sensor data contain complex temporal relations. Learning the temporal relations of behaviors enables deeper analysis and understanding of individual behavior habits, on the basis of which individual health status can further be predicted. Based on multi-source sensor data, the thesis constructs an individual behavior graph to learn the temporal relations of individual behaviors and thereby complete the health status prediction task. The graph comprises multiple local behavior graphs and one global behavior graph, which capture the short-term local relations and the global temporal relations of behaviors respectively. The thesis learns representations of the local behavior graphs with a heterogeneous graph neural network and a representation of the global behavior graph with a self-attention network, which are used to predict health status. Experiments on the public StudentLife dataset, with comparisons against a variety of traditional machine learning and deep learning methods, demonstrate the effectiveness of the proposed model.

Other Abstract

With the popularity of digital devices and the development of the mobile Internet, enormous amounts of multimedia data such as text, images, and videos have emerged on social media platforms. Multimedia data contain rich semantic information and exhibit complex relations. Studying the relations in multimedia data not only enables computers to understand the data more intelligently, but also promotes the development of related tasks such as multimedia retrieval and multimedia question answering. This research therefore has significant theoretical and practical value.

Multimedia data have several characteristics: (1) a semantic gap between low-level features and high-level semantics, (2) multi-modality, and (3) complex temporal and spatial structure. These characteristics make it extremely challenging to learn the relations in multimedia data. This thesis focuses on how to design effective relation learning algorithms and how to apply them to specific tasks. Combining deep learning methods, three problems are studied in a bottom-up manner: first, multi-modal object relation learning is studied and applied to the visual relation detection task; then, multi-modal behavior semantic relations are learned to perform the behavior inference task; finally, the temporal relations of multi-modal individual behaviors are learned to predict individual health status.

The contributions of the thesis are summarized as follows:

1. Multi-modal object relation learning.
Objects are the basic units of multimedia data, and learning the relations between objects is the basis of multimedia relation learning. Most current object relation learning algorithms consider relation representations in only a single modality and fail to exploit multi-modal information. To solve this problem, the thesis proposes a multi-modal latent factor model with semantic constraints and applies it to the visual relation detection task. The thesis first combines information from the text and image modalities to obtain multi-modal object representations. It then uses a latent factor model to learn the relations between objects and to estimate the existence probability of relation triples. Finally, it computes similarities between relations to constrain the relation learning process, after which the multi-modal relation representations are learned. Experiments on the Visual Relationship and Visual Genome datasets prove the effectiveness of the model.
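The triple-scoring step described above can be sketched in code. This is a minimal illustration under stated assumptions, not the thesis's implementation: the dimensions, the low-rank factorization of each predicate matrix, and the sigmoid squashing are all choices made for the sketch, and the random object embeddings stand in for the fused text-image representations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: d-dim multi-modal object embeddings, k latent factors.
d, k = 16, 8
num_objects, num_predicates = 5, 3

# Object embeddings (in the thesis these would fuse text and image features).
obj_emb = rng.normal(size=(num_objects, d))
# Each predicate p gets a low-rank latent factor matrix R_p = U[p] @ V[p].T.
U = rng.normal(size=(num_predicates, d, k))
V = rng.normal(size=(num_predicates, d, k))

def triple_score(s: int, p: int, o: int) -> float:
    """Bilinear latent-factor score of a (subject, predicate, object) triple."""
    R_p = U[p] @ V[p].T                       # relation matrix, shape (d, d)
    return float(obj_emb[s] @ R_p @ obj_emb[o])

def triple_prob(s: int, p: int, o: int) -> float:
    """Squash the score into an existence probability with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-triple_score(s, p, o)))
```

The semantic constraint in the model would additionally pull the factor matrices of similar predicates toward each other during training; that regularization term is omitted here.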

2. Multi-modal behavior semantic relation learning.
Behavior in multimedia data is usually human-centered, reflecting human activities and status. Compared with objects, behavior contains higher-level semantic information, and learning the semantic relations between behaviors is also an important part of understanding and utilizing multimedia data. However, existing behavior relation learning methods use only a single modality and represent behavior too simply, which makes it difficult to fully explore the information contained in behaviors. To solve these problems, this thesis proposes a multi-modal hierarchical graph network to learn behavior semantic relations and applies it to the behavior inference task. The model first learns multi-modal behavior representations from text and image data. It then builds an object graph and a behavior graph so that object-level and behavior-level representations and relations are modeled simultaneously. After that, the thesis uses a graph neural network to learn the representations of the two graphs, which are combined with the multi-modal representation to form the final behavior representation and fed into the decoder for result generation. Experiments on the Event2mind dataset prove the effectiveness of the proposed model.
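The two-level graph encoding above can be sketched as follows. This is a simplified stand-in, assuming a plain mean-aggregation message-passing layer, tiny illustrative adjacency matrices, and random features in place of the detected objects and multi-modal behavior embeddings; the thesis's actual network and decoder are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # embedding dimension (illustrative)

def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One round of mean-neighbor message passing followed by ReLU."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    return np.maximum((A_hat @ H) / deg @ W, 0.0)

# Object-level graph: 4 objects appearing in the behaviors (edges illustrative).
A_obj = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 0, 0, 1],
                  [0, 1, 1, 0]], dtype=float)
H_obj = rng.normal(size=(4, d))               # stand-ins for object features

# Behavior-level graph: 3 behaviors linked by semantic relations.
A_beh = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)
H_beh = rng.normal(size=(3, d))               # stand-ins for behavior features

W_obj, W_beh = rng.normal(size=(d, d)), rng.normal(size=(d, d))
obj_graph_repr = gcn_layer(A_obj, H_obj, W_obj).mean(axis=0)  # pooled object-graph vector
beh_graph_repr = gcn_layer(A_beh, H_beh, W_beh).mean(axis=0)  # pooled behavior-graph vector

# Combine both graph-level vectors with a multi-modal behavior embedding
# (random here) into the final representation that would feed the decoder.
mm_beh_emb = rng.normal(size=d)
final_repr = np.concatenate([obj_graph_repr, beh_graph_repr, mm_beh_emb])
```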

3. Multi-modal individual behavior temporal relation learning.
The popularity of mobile devices makes it easy to collect individual behavior data with various sensors. These data streams contain complex temporal relations, and learning them benefits the analysis and understanding of individual behaviors, which in turn enables prediction of individual health status. The thesis proposes an individual behavior graph to learn the temporal relations from multi-source sensor data and thereby predict individual health status. The graph contains multiple local context sub-graphs and a global temporal sub-graph, which capture the short-term context relations and the long-term temporal relations of individual behaviors respectively. The thesis learns semantic and structural representations of the local context graphs with a heterogeneous graph neural network, and designs a self-attention network to learn the representation of the global temporal graph, which is finally used to predict health status. Experiments on the public StudentLife dataset, with comparisons against widely used machine learning and deep learning methods, validate the effectiveness of the proposed model.
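The global temporal step above can be sketched as self-attention over the sequence of local-graph summaries. This is an illustrative assumption-laden sketch: each local context sub-graph is reduced to one random vector standing in for the heterogeneous-GNN output, the attention is a single plain scaled dot-product head, and the linear health-status head is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 8, 7  # embedding dimension, number of local (e.g. daily) sub-graphs

# Stand-ins for the heterogeneous-GNN summaries of the T local context graphs.
local_repr = rng.normal(size=(T, d))

def self_attention(X: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(X.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
global_repr = self_attention(local_repr, Wq, Wk, Wv).mean(axis=0)

# Hypothetical binary health-status prediction head.
w, b = rng.normal(size=d), 0.0
prob_positive = 1.0 / (1.0 + np.exp(-(global_repr @ w + b)))
```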
 

Pages: 90
Language: Chinese (中文)
Document Type: Degree thesis (学位论文)
Identifier: http://ir.ia.ac.cn/handle/173211/44775
Collection: 模式识别国家重点实验室_多媒体计算
Recommended Citation (GB/T 7714):
马璇. 面向多媒体数据的关系学习算法及应用研究[D]. 北京: 中国科学院自动化研究所, 2021.
Files in This Item:
马璇-毕业论文-0607.pdf (1933 KB), 学位论文, Open Access, CC BY-NC-SA