|Place of Conferral||Institute of Automation, Chinese Academy of Sciences|
|Keyword||Video Understanding; Activity Analysis; Feature Learning; Relation Reasoning|
Activity analysis in videos is an important research topic in computer vision, with wide applications in human-computer interaction, intelligent surveillance, video retrieval, autonomous driving, virtual reality, etc. According to their spatiotemporal complexity, video activities can be categorized into simple actions, group activities, and long activities: a group activity is a combination of multiple actions along the spatial dimension, while a long activity is a combination of sequential actions along the temporal dimension. Aiming at activity recognition with low annotation cost, high efficiency, and high performance, we study human activity recognition across different modalities and complexity levels by means of feature learning and relation reasoning. Specifically, we propose a series of novel methods based on feature learning and relation reasoning to recognize three types of video activity (i.e., simple action, group activity, and long activity) in three representative data forms (i.e., computationally efficient skeleton-based video, widely applied RGB-based video, and video-text multimodal video), which effectively improve the performance of video activity recognition. The main works and contributions are as follows:
(1) Regarding action recognition in skeleton-based videos, a novel approach based on joint learning in the spatiotemporal and frequency domains is proposed. Previous methods for skeleton-based action recognition mainly stack various local network layers, which are limited to the spatiotemporal domain and ignore the intrinsic action patterns in the frequency domain. Moreover, during feature extraction, the stacked local networks face an asynchronization problem between local detailed information and non-local semantic information. Therefore, we propose to jointly learn in the spatiotemporal and frequency domains. Specifically, in the spatiotemporal domain, we propose a synchronous local and non-local block that enables each layer to synchronously mine local details and non-local semantics. In the frequency domain, we propose a frequency attention network that equips the model with the ability to attend to action patterns in the frequency domain, and is compatible with and complementary to mainstream spatiotemporal networks.
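The core idea of frequency-domain attention can be illustrated with a minimal numpy sketch: transform a feature sequence to the frequency domain, re-weight the frequency bins by attention, and transform back. The function name, shapes, and the use of a plain weight vector are illustrative assumptions, not the thesis' exact network.

```python
import numpy as np

def frequency_attention(x, attn_weights):
    """Hypothetical simplification of frequency-domain attention.

    x: (T, C) feature sequence over T frames.
    attn_weights: (T//2 + 1,) non-negative weights over rFFT frequency bins
    (in the thesis these would be learned; here they are given directly).
    Returns the re-weighted sequence back in the spatiotemporal domain.
    """
    spec = np.fft.rfft(x, axis=0)            # (T//2+1, C) frequency spectrum
    spec = spec * attn_weights[:, None]      # emphasize informative frequencies
    return np.fft.irfft(spec, n=x.shape[0], axis=0)

# Sanity check: identity weights recover the input exactly.
T, C = 8, 3
x = np.random.randn(T, C)
y = frequency_attention(x, np.ones(T // 2 + 1))
assert np.allclose(x, y)
```

Because the re-weighting is applied in a separate domain and then inverted, such a module can be dropped alongside ordinary spatiotemporal layers, which is consistent with the compatibility claim above.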
(2) Casting the multi-branch optimization process of multi-stream action recognition networks as a pseudo multi-task learning paradigm, we propose a pseudo multi-task mutual learning strategy that is widely applicable to common multi-stream action recognition networks. Specifically, a soft-margin focal loss is proposed to optimize the separate intra-branch learning process; it automatically attends to hard samples and encourages intrinsic margins in the classifiers. A mutual learning policy is further proposed to facilitate the collaborative inter-branch learning process by encouraging the network branches to learn from each other. Finally, taking the above multi-stream skeleton-based action recognition network as an example, we conduct extensive experiments on four large-scale datasets, and the results clearly indicate that the pseudo multi-task mutual learning strategy is highly effective and robust.
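The two ingredients of a soft-margin focal loss can be sketched as follows: a margin subtracted from the target logit (so the classifier must win by a gap) combined with the focal modulation factor that down-weights easy samples. The exact formulation in the thesis is not given here, so this is a generic margin-plus-focal construction with assumed default hyperparameters.

```python
import numpy as np

def soft_margin_focal_loss(logits, target, gamma=2.0, margin=0.5):
    """Sketch of a margin-augmented focal loss (hyperparameters assumed).

    logits: (K,) class scores; target: index of the true class.
    The margin is subtracted from the target logit so the classifier must
    win by at least `margin`; the focal term (1 - p_t)**gamma shrinks the
    loss of easy samples so training attends to hard ones.
    """
    z = logits.astype(float).copy()
    z[target] -= margin                      # enforce a soft decision margin
    z = z - z.max()                          # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    pt = p[target]
    return -((1.0 - pt) ** gamma) * np.log(pt)

logits = np.array([3.0, 1.0, 0.2])
easy = soft_margin_focal_loss(logits, 0)     # confident, correct: small loss
hard = soft_margin_focal_loss(logits, 2)     # wrong-leaning: large loss
assert hard > easy
```

Each branch of a multi-stream network would minimize this loss on its own logits, while the mutual learning policy adds a separate term pulling the branches' predictions toward each other.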
(3) Regarding group activity recognition in RGB-based videos, a progressive feature learning and relation reasoning framework is proposed. Group activity usually involves a large number of interacting participants and suffers from considerable individual- and video-level noise. To address these problems, we propose a novel progressive approach based on feature learning and relation reasoning. We first construct a semantic relation graph from individual features and positional relations to explicitly model the semantic relations within a group activity. Two reinforcement-learning agents are then proposed to refine the semantic relation graph from the perspectives of high-level semantic relations and low-level spatiotemporal features, respectively. Specifically, a relation-gating (RG) agent adjusts the high-level relation graph to pay more attention to group-relevant relations, while a feature-distilling (FD) agent refines the low-level spatiotemporal features by distilling the most informative frames. Finally, the semantic relation graph, the FD agent, and the RG agent are optimized alternately, successfully improving the performance of group activity recognition.
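The effect of relation gating can be illustrated with a toy stand-in: given pairwise relation strengths between participants, prune weak edges so that subsequent reasoning focuses on group-relevant relations. This quantile-threshold rule is purely illustrative; the thesis uses a learned reinforcement-learning agent, not a fixed threshold.

```python
import numpy as np

def gate_relations(relation_scores, keep_ratio=0.5):
    """Toy stand-in for relation gating (not the thesis' RL agent).

    relation_scores: (N, N) pairwise semantic-relation strengths between
    N participants. Off-diagonal edges below the keep-ratio quantile are
    pruned, concentrating message passing on group-relevant relations.
    """
    n = len(relation_scores)
    flat = relation_scores[~np.eye(n, dtype=bool)]       # off-diagonal values
    threshold = np.quantile(flat, 1.0 - keep_ratio)
    gated = np.where(relation_scores >= threshold, relation_scores, 0.0)
    np.fill_diagonal(gated, 0.0)                         # no self-relations
    return gated

scores = np.array([[0.0, 0.9, 0.1],
                   [0.9, 0.0, 0.2],
                   [0.1, 0.2, 0.0]])
g = gate_relations(scores, keep_ratio=0.5)
assert g[0, 1] == 0.9 and g[0, 2] == 0.0    # strong edge kept, weak one pruned
```

An RL agent replaces the fixed threshold with a learned, progressively refined keep/drop decision per edge, and the FD agent plays the analogous role over frames rather than edges.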
(4) Regarding long activity analysis in multimodal instructional videos, a novel method based on structured prior learning and reasoning is proposed. Multimodal instructional videos naturally contain weakly aligned video-text pairs, and their typical research objects are long activities such as step-by-step procedures. We propose a structured prior learning and reasoning framework for instructional activity analysis. First, joint video-text representations of instructional activities are obtained by self-supervised contrastive learning, which exploits the cross-modal semantic consistency of the video-text pairs. Based on these prior representations, a hierarchical knowledge graph is constructed to explicitly model the hierarchical activity concepts. We further apply a deep-random-walk graph embedding algorithm to the knowledge graph, so that each node embedding encodes the structured priors of the corresponding graph level. Finally, query samples from downstream tasks retrieve structured-prior-enhanced representations via similarity matching at each level of the hierarchical graph to boost task performance. Extensive experiments on a series of downstream tasks, including action segmentation, text-to-video retrieval, and temporal step localization, systematically demonstrate the effectiveness of the proposed method.
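The self-supervised contrastive objective over weakly aligned clip-narration pairs is commonly realized as an InfoNCE-style loss: matched video-text pairs sit on the diagonal of a similarity matrix and are scored against all mismatched pairs in the batch. The sketch below assumes this standard formulation and a typical temperature; the thesis' exact loss may differ.

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style contrastive loss for video-text alignment (assumed form).

    video_emb, text_emb: (B, D) L2-normalized embeddings where row i of each
    matrix comes from the same weakly aligned clip-narration pair.
    """
    sim = video_emb @ text_emb.T / temperature     # (B, B) similarity matrix
    sim = sim - sim.max(axis=1, keepdims=True)     # stable log-softmax
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))                # matched pairs on diagonal

v = np.eye(4, 8)                   # four orthonormal toy embeddings
aligned = info_nce_loss(v, v)      # perfectly aligned pairs: low loss
shuffled = info_nce_loss(v, v[::-1])  # mismatched pairs: high loss
assert aligned < shuffled
```

Minimizing this loss pulls each clip toward its own narration and away from the others', yielding the joint representation on which the hierarchical knowledge graph is then built.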
In summary, our research proceeds from single to multiple modalities, from simple actions to compound activities, and from controlled conditions to open environments. We propose a series of novel methods based on feature learning and relation reasoning to address existing problems in video activity recognition across different modalities and complexity levels. These methods improve recognition performance and effectively reduce the annotation burden, thus promising to narrow the gap between academic research and industrial application of video activity recognition.
|Hu Guyue. Video Activity Recognition Based on Feature Learning and Relation Reasoning [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2021.|