面向第一人称视频的多模态跨域行为识别与预测研究 (Research on Multimodal Cross-Domain Action Recognition and Prediction for First-Person Videos)
Author: 黄毅 (Huang Yi)
Date: 2023-05-20
Pages: 146
Degree type: Doctoral
Chinese Abstract

With the development of mobile Internet technology and the spread of smart wearable devices, recording first-person video data has become increasingly automatic and convenient. Technologies for the automatic understanding of first-person videos have emerged accordingly, with broad application prospects in scenarios such as autonomous driving and human-computer interaction, so research on these technologies has important theoretical significance and application value. Action recognition and prediction for first-person videos, as key problems in this field, aim to extract high-level semantic information from the data through computer vision and multimedia analysis techniques, so as to automatically recognize the current actions of the smart-device wearer and predict the actions that may occur in the future.

Although deep learning based video analysis has made great progress, applying it to first-person video data faces challenges from four factors: (1) sample scarcity, (2) multimodal characteristics, (3) domain discrepancy, and (4) spatio-temporal complexity. These characteristics mean that content analysis and action understanding of first-person videos must, under scarce-sample conditions, eliminate the spatio-temporal perception discrepancies across multiple video domains and improve the representational power of first-person video features. At the same time, the complementary multimodal information of first-person data must be fully exploited to obtain more effective multimodal action representations. Furthermore, the model must be able to fully understand the complex spatio-temporal concept relations in long videos and the semantic associations between sequences of changing actions, so as to achieve long-term action understanding. To address these challenges, this dissertation first studies a data-driven cross-domain transfer method and a knowledge-driven multimodal learning method, then studies multimodal cross-domain transfer and zero-shot cross-domain transfer methods under the source-free condition, and finally explores a first-person video action prediction method based on global relational learning.

The main work and innovations of this dissertation are summarized as follows:

1. Cross-domain action recognition based on holographic feature learning. This work uses large-scale third-person videos to assist in mining features of first-person video data, improving the action recognition model through knowledge transfer between videos of different views. To this end, the dissertation proposes a holographic feature learning scheme that incorporates feature information from multiple views: a meta-memory network stores view-related information, while a dynamic meta-hallucination module reads the memory based on the input first-person video, so that the view-specific information of the two views complements each other's features. The holographic representation of the video is learned in a high-dimensional space, which ultimately improves the performance of the action recognition model.

2. Knowledge-driven multimodal action recognition. This work uses external knowledge to assist in mining the multimodal characteristics of first-person video data, integrating the correlation and complementarity of first-person multimodal data under limited-sample conditions. To this end, the dissertation proposes a knowledge-driven multimodal action recognition framework. Concept information about actions and target objects is first extracted from first-person videos and sensor signals. Semantic features of the extracted concepts are then constructed from an external semantic knowledge graph, and a two-branch graph convolutional LSTM network performs feature reasoning and multimodal information fusion based on the concept relations in the knowledge graph, realizing knowledge-driven action recognition. This improves the performance of the action recognition model and reduces the classifier's dependence on large-scale data.

3. Source-free multimodal cross-domain action recognition based on relative alignment. This work studies video cross-domain learning without access to source-domain data, improving the performance of multimodal first-person action recognition models in a new target domain. To this end, the dissertation proposes a multimodal and temporal relative alignment strategy: self-entropy-guided sample division and a sample Mix-Up strategy produce samples at different distances from the source and target distributions, simulating the multimodal and temporal domain discrepancies between source and target videos. Under the constraint of a relative alignment loss, the generated samples are used to eliminate the domain distribution discrepancies between samples, which ultimately improves the cross-domain transfer ability of the action recognition model in the source-free setting.

4. Source-free zero-shot cross-domain action recognition based on counterfactual sample generation. This work studies cross-domain transfer learning when source-domain data are unavailable and samples of some categories cannot be obtained in the new target scene, improving the generalization of first-person action recognition models to zero-shot categories in the new target domain. To this end, the dissertation proposes a counterfactual sample generation method that produces virtual samples of different domains and categories from the characteristics of real target-domain samples. Constrained by prediction consistency and multimodal feature alignment, the generated dual-domain virtual samples help the model learn feature representations that are consistent across the source and target distributions and transfer knowledge to the zero-shot categories of the target domain, which ultimately improves the model's zero-shot cross-domain transfer ability in the source-free setting.

5. Multimodal action prediction based on global relational knowledge distillation. This work studies action prediction for first-person videos, improving prediction performance by fully exploring the semantic associations between actions at different moments in long videos. To this end, the dissertation proposes a multimodal global relational knowledge distillation network that models the relations between video clips with a graph convolutional network and adopts a knowledge distillation strategy: a teacher model first learns discriminative features and global relational knowledge from the complete video, including future clips, and these two kinds of extra knowledge are then distilled into a student model, which ultimately improves the student model's ability to predict future actions.

English Abstract

With the development of mobile Internet technology and the popularity of smart wearable devices, recording first-person video data has become increasingly convenient. Technologies for the automatic understanding of first-person videos have emerged accordingly, with broad application prospects in scenarios such as autonomous driving and human-computer interaction. Research on these technologies therefore has important theoretical significance and application value. As key issues in this field, action recognition and prediction for first-person videos aim to extract high-level semantic information from the data through computer vision and multimedia analysis techniques, so as to automatically recognize the current actions of smart-device wearers and predict the actions they may perform in the future.

Although current deep-learning-based video analysis has made significant progress, applying it to first-person video data raises four challenges: (1) sample scarcity, (2) multimodal characteristics, (3) domain discrepancy, and (4) spatio-temporal complexity. These characteristics mean that the analysis and understanding of first-person videos must, under scarce-sample conditions, diminish the spatio-temporal perception discrepancies across multiple video domains in order to improve the representational power of first-person video features. At the same time, the complementary multimodal information of first-person data must be fully exploited to obtain more effective multimodal feature representations. Furthermore, to achieve long-term understanding, the model must fully capture the complex spatio-temporal concept relations and the semantic associations between changing actions in long videos. To address these challenges, this dissertation first studies a data-driven cross-domain learning method and a knowledge-driven multimodal learning method, then studies multimodal cross-domain learning and zero-shot cross-domain learning without using source data, and finally explores a first-person video action prediction method based on global relational learning.

The main work and innovations of this dissertation are summarized as follows:

1. Holographic feature learning for cross-domain action recognition. This dissertation studies how to use large-scale third-person videos to help mine discriminative features from first-person videos by transferring knowledge between the two views. Specifically, it proposes a holographic feature learning scheme that contains feature information from both views. A meta-memory network stores view-related information, while a dynamic meta-hallucination module reads the memory module based on the input first-person video; the view-related information from the two views is used to complement each other's features. Finally, the holographic feature of the video is learned in a high-dimensional space, which improves the performance of the action recognition model.
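The abstract gives no implementation details, but the memory read-out it describes can be sketched roughly as below. This is a minimal, hypothetical PyTorch sketch: the learnable slot memory, the soft-attention read driven by the first-person feature, the module name, the dimensions, and the concatenation-based "holographic" feature are all illustrative assumptions, not the dissertation's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaMemoryHallucination(nn.Module):
    """Hypothetical sketch: a slot memory storing view-related information,
    read by soft attention from the first-person feature to hallucinate a
    complementary (third-person-like) feature."""
    def __init__(self, feat_dim=1024, num_slots=64):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, feat_dim))    # memory keys
        self.values = nn.Parameter(torch.randn(num_slots, feat_dim))  # memory contents
        self.query_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, ego_feat):
        # ego_feat: (B, D) feature of the input first-person video.
        q = self.query_proj(ego_feat)                                    # (B, D)
        attn = F.softmax(q @ self.keys.t() / q.size(-1) ** 0.5, dim=-1)  # (B, S)
        hallucinated = attn @ self.values                                # (B, D)
        # "Holographic" representation: the first-person feature complemented
        # by the information read from the other view.
        return torch.cat([ego_feat, hallucinated], dim=-1)               # (B, 2D)

# Usage sketch:
# model = MetaMemoryHallucination()
# holo = model(torch.randn(8, 1024))   # -> tensor of shape (8, 2048)
```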

2. Knowledge-driven multimodal action recognition. This dissertation studies how to use external knowledge to help mine the multimodal characteristics of first-person videos, integrating the correlation and complementarity of first-person multimodal data under limited-sample conditions. Specifically, it proposes a knowledge-driven multimodal action recognition framework. First, the conceptual information of actions and objects is extracted from first-person videos and sensor signals. Then, semantic features of the extracted concepts are constructed based on an external semantic knowledge graph, and a two-branch graph convolutional LSTM network performs feature reasoning and multimodal information fusion based on the concept relationships in the knowledge graph, realizing knowledge-driven action recognition. The proposed method improves the performance of the action recognition model and reduces the dependence of classifier learning on large-scale data.
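As a rough illustration only, the two-branch graph convolutional LSTM could be organized along the following lines. The class names, dimensions, the single-layer graph convolution, the mean-pooling, and the concatenation-based fusion are assumptions made for the sketch, not the framework's actual specification.

```python
import torch
import torch.nn as nn

class GraphConvLSTMBranch(nn.Module):
    """One branch: propagate concept features over the knowledge-graph
    adjacency, then update a temporal state with an LSTM cell."""
    def __init__(self, in_dim=300, hid_dim=256):
        super().__init__()
        self.gcn = nn.Linear(in_dim, hid_dim)
        self.lstm = nn.LSTMCell(hid_dim, hid_dim)

    def forward(self, concept_feats, adj, state=None):
        # concept_feats: (N, in_dim) semantic features of the extracted concepts
        # adj: (N, N) normalized adjacency from the external knowledge graph
        h_graph = torch.relu(self.gcn(adj @ concept_feats))   # graph reasoning step
        pooled = h_graph.mean(dim=0, keepdim=True)            # (1, hid_dim)
        return self.lstm(pooled, state)                        # (h, c)

class KnowledgeDrivenFusion(nn.Module):
    """Two branches (video concepts, sensor concepts) fused for classification."""
    def __init__(self, hid_dim=256, num_classes=100):          # num_classes is illustrative
        super().__init__()
        self.video_branch = GraphConvLSTMBranch(hid_dim=hid_dim)
        self.sensor_branch = GraphConvLSTMBranch(hid_dim=hid_dim)
        self.classifier = nn.Linear(2 * hid_dim, num_classes)

    def forward(self, video_concepts, sensor_concepts, adj, v_state=None, s_state=None):
        hv, cv = self.video_branch(video_concepts, adj, v_state)
        hs, cs = self.sensor_branch(sensor_concepts, adj, s_state)
        logits = self.classifier(torch.cat([hv, hs], dim=-1))  # multimodal fusion
        return logits, (hv, cv), (hs, cs)
```

In such a design, the recurrent states would be carried across time steps of the video and sensor streams, while the adjacency stays fixed by the knowledge graph.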

3. Relative alignment for source-free multimodal cross-domain action recognition. This dissertation studies video cross-domain learning without using source data, improving the performance of the multimodal first-person video action recognition model in the target domain. Specifically, it proposes a multimodal and temporal relative alignment strategy that uses self-entropy-guided sample division and a sample Mix-Up strategy to generate samples at different distances from the source and target domains, simulating the multimodal and temporal distribution discrepancies between the two domains. The generated samples are constrained by a relative alignment loss function to eliminate the domain distribution discrepancies between them. As a result, the cross-domain action recognition performance of the model is improved without using source data.
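The loss below is a loose, hypothetical reading of that strategy: the batch is split by prediction self-entropy into a confident ("source-like") and an uncertain ("target-like") half, their features are mixed with a Mix-Up coefficient, and the prediction on the mixed feature is constrained to follow the same mixing ratio. The splitting rule, the feature-level Mix-Up, and the KL-based constraint are all assumptions; the dissertation's actual relative alignment loss may differ.

```python
import torch
import torch.nn.functional as F

def self_entropy(logits):
    """Per-sample prediction entropy, used to rank samples by confidence."""
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1)

def relative_alignment_loss(head, feats, logits, lam=0.7):
    """Hypothetical sketch of a Mix-Up-based relative alignment objective."""
    order = torch.argsort(self_entropy(logits))            # low entropy first
    half = feats.size(0) // 2
    src_like = feats[order[:half]]                          # confident, "source-like"
    tgt_like = feats[order[half:2 * half]]                  # uncertain, "target-like"
    mixed = lam * src_like + (1 - lam) * tgt_like           # feature-level Mix-Up
    # The mixed sample's prediction should sit between the two groups
    # in proportion to the mixing coefficient.
    log_p_mix = F.log_softmax(head(mixed), dim=-1)
    p_ref = lam * F.softmax(head(src_like), dim=-1) + \
            (1 - lam) * F.softmax(head(tgt_like), dim=-1)
    return F.kl_div(log_p_mix, p_ref.detach(), reduction="batchmean")
```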

4. Counterfactual sample generation for source-free zero-shot cross-domain action recognition. This dissertation studies source-free zero-shot cross-domain transfer learning for the case where samples of some categories cannot be obtained in the new target scene, improving the generalization ability of the first-person video action recognition model to zero-shot categories in new target domains without using source data. Specifically, it proposes a counterfactual sample generation method that generates virtual samples of different domains and categories based on the characteristics of real target-domain samples. Under the constraints of prediction consistency and multimodal feature alignment, the generated virtual samples of the two domains help the model learn feature representations that are consistent between the source and target distributions and transfer knowledge from the source model to the zero-shot category classifiers in the target domain. As a result, the source-free zero-shot action recognition performance of the model is improved.
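A minimal sketch of the idea, under explicit assumptions: virtual samples are produced by a small conditional generator that takes a real target-domain feature, a class embedding, and a binary domain code, and the two constraints are written as a classification-consistency term plus an MSE feature-alignment term. The generator architecture, the conditioning scheme, and both loss forms are illustrative guesses, not the dissertation's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CounterfactualGenerator(nn.Module):
    """Hypothetical generator of virtual features for a chosen class/domain,
    conditioned on a real target-domain feature."""
    def __init__(self, feat_dim=1024, sem_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + sem_dim + 1, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, real_feat, class_emb, domain_code):
        # real_feat: (B, feat_dim), class_emb: (B, sem_dim),
        # domain_code: (B, 1) filled with 0.0 ("source-like") or 1.0 ("target-like").
        cond = torch.cat([real_feat, class_emb, domain_code], dim=-1)
        return self.net(cond)

def dual_domain_constraints(classifier, gen_src, gen_tgt, labels):
    """Prediction consistency plus feature alignment between the two virtual domains."""
    pred_loss = F.cross_entropy(classifier(gen_src), labels) + \
                F.cross_entropy(classifier(gen_tgt), labels)
    align_loss = F.mse_loss(gen_src, gen_tgt)
    return pred_loss + align_loss
```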

5. Global relational knowledge distillation for multimodal action prediction. For action prediction in first-person videos, this dissertation aims to improve model performance by fully exploring the semantic correlations between actions at different moments in long videos. Specifically, it proposes a multimodal global relational knowledge distillation network, which uses a graph convolutional network to model the relationships between video clips together with a knowledge distillation strategy. First, a teacher model learns the discriminative features of future clips and the global relational knowledge of the full video, including the future clips. Then, these two kinds of knowledge are distilled into the student model to improve its ability to predict future actions.
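The clip-relation modeling and the two distillation terms could look roughly like the sketch below, where a data-driven softmax affinity between clip features stands in for the relation graph and the teacher's features and affinities over the observed clips are matched by the student. The class and function names, the single GCN layer, and the MSE-based distillation are assumptions made for illustration, not the network's actual definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipRelationGCN(nn.Module):
    """Hypothetical single-layer graph convolution over video-clip features,
    with a learned affinity matrix as the clip-to-clip relation graph."""
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, clips):                                # clips: (B, T, D)
        scale = clips.size(-1) ** 0.5
        affinity = F.softmax(clips @ clips.transpose(1, 2) / scale, dim=-1)  # (B, T, T)
        return torch.relu(self.proj(affinity @ clips)), affinity

def relational_distillation(student_feat, student_aff, teacher_feat, teacher_aff):
    """Distil the teacher's features and global clip relations (computed on the
    full video, future clips included) into the student, which only sees the
    first t observed clips."""
    t = student_feat.size(1)
    feat_kd = F.mse_loss(student_feat, teacher_feat[:, :t].detach())
    rel_kd = F.mse_loss(student_aff, teacher_aff[:, :t, :t].detach())
    return feat_kd + rel_kd
```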

Keywords: first-person video; action recognition; action prediction; multimodal learning; cross-domain learning
Language: Chinese
Sub-direction classification (seven major directions): Multimodal Intelligence
State Key Laboratory planned research direction: Multimodal Collaborative Cognition
Dataset associated with the dissertation to be deposited:
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/52095
Collection: Graduates / Doctoral Dissertations
Recommended citation (GB/T 7714):
黄毅. 面向第一人称视频的多模态跨域行为识别与预测研究[D], 2023.
Files in this item:
File name (size): 面向第一人称视频的多模态跨域行为识别与预(10373KB) | Document type: Dissertation | Access: Restricted | License: CC BY-NC-SA