|Place of Conferral||Institute of Automation, Chinese Academy of Sciences|
|Keyword||Video Understanding; Activity Analysis; Feature Learning; Relation Reasoning|
Activity analysis in videos is an important research topic in computer vision, with wide applications in human-computer interaction, intelligent surveillance, video retrieval, autonomous driving, virtual reality, etc. According to their spatiotemporal complexity, video activities can be categorized into simple actions, group activities, and long activities: a group activity is a combination of multiple actions along the spatial dimension, while a long activity is a combination of sequential actions along the temporal dimension. Aiming at activity recognition with low annotation cost, high efficiency, and high performance, we study human activity recognition across different modalities and complexity levels by means of feature learning and relation reasoning. Specifically, we propose a series of novel methods based on feature learning and relation reasoning to recognize three types of video activity (i.e., simple action, group activity, and long activity) in three representative data forms (i.e., computationally efficient skeleton-based video, widely applied RGB-based video, and video-text multimodal video), which effectively improve the performance of video activity recognition. The main works and contributions are as follows:
(1) Regarding action recognition in skeleton-based videos, a novel approach based on joint learning in the spatiotemporal and frequency domains is proposed. Previous methods for skeleton-based action recognition mainly stack various local network layers, which are limited to the spatiotemporal domain and ignore the intrinsic action patterns in the frequency domain. Moreover, during feature extraction, the stacked local networks face an asynchronization problem between local detailed information and non-local semantic information. Therefore, we propose to jointly learn in the spatiotemporal and frequency domains. Specifically, in the spatiotemporal domain, we propose a synchronous local and non-local block that enables each layer to synchronously mine local details and non-local semantics. In the frequency domain, we propose a frequency attention network that equips the model with the ability to attend to action patterns in the frequency domain, and is compatible with and complementary to mainstream spatiotemporal networks.
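The core idea of frequency-domain attention can be illustrated with a minimal numpy sketch: transform a feature sequence to the frequency domain, re-weight the frequency bins by attention, and transform back. The function name, shapes, and the use of a plain weight vector are illustrative assumptions, not the thesis' exact network.

```python
import numpy as np

def frequency_attention(x, attn_weights):
    """Hypothetical simplification of frequency-domain attention.

    x: (T, C) feature sequence over T frames.
    attn_weights: (T//2 + 1,) non-negative weights over rFFT frequency bins
    (in the thesis these would be learned; here they are given directly).
    Returns the re-weighted sequence back in the spatiotemporal domain.
    """
    spec = np.fft.rfft(x, axis=0)            # (T//2+1, C) frequency spectrum
    spec = spec * attn_weights[:, None]      # emphasize informative frequencies
    return np.fft.irfft(spec, n=x.shape[0], axis=0)

# Sanity check: identity weights recover the input exactly.
T, C = 8, 3
x = np.random.randn(T, C)
y = frequency_attention(x, np.ones(T // 2 + 1))
assert np.allclose(x, y)
```

Because the re-weighting is applied in a separate domain and then inverted, such a module can be dropped alongside ordinary spatiotemporal layers, which is consistent with the compatibility claim above.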
(2) Casting the multi-branch optimization process of multi-stream action recognition networks as a pseudo multi-task learning paradigm, we propose a pseudo multi-task mutual learning strategy that is widely applicable to common multi-stream action recognition networks. Specifically, a soft-margin focal loss is proposed to optimize the separate intra-branch learning process; it automatically attends to hard samples and encourages intrinsic margins in the classifiers. A mutual learning policy is further proposed to facilitate the collaborative inter-branch learning process by encouraging the network branches to learn from each other. Finally, taking the above multi-stream skeleton-based action recognition network as an example, we conduct extensive experiments on four large-scale datasets, and the results clearly indicate that the pseudo multi-task mutual learning strategy is highly effective and robust.
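The two ingredients of a soft-margin focal loss can be sketched as follows: a margin subtracted from the target logit (so the classifier must win by a gap) combined with the focal modulation factor that down-weights easy samples. The exact formulation in the thesis is not given here, so this is a generic margin-plus-focal construction with assumed default hyperparameters.

```python
import numpy as np

def soft_margin_focal_loss(logits, target, gamma=2.0, margin=0.5):
    """Sketch of a margin-augmented focal loss (hyperparameters assumed).

    logits: (K,) class scores; target: index of the true class.
    The margin is subtracted from the target logit so the classifier must
    win by at least `margin`; the focal term (1 - p_t)**gamma shrinks the
    loss of easy samples so training attends to hard ones.
    """
    z = logits.astype(float).copy()
    z[target] -= margin                      # enforce a soft decision margin
    z = z - z.max()                          # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    pt = p[target]
    return -((1.0 - pt) ** gamma) * np.log(pt)

logits = np.array([3.0, 1.0, 0.2])
easy = soft_margin_focal_loss(logits, 0)     # confident, correct: small loss
hard = soft_margin_focal_loss(logits, 2)     # wrong-leaning: large loss
assert hard > easy
```

Each branch of a multi-stream network would minimize this loss on its own logits, while the mutual learning policy adds a separate term pulling the branches' predictions toward each other.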
(3) Regarding group activity recognition in RGB-based videos, a progressive feature learning and relation reasoning framework is proposed. Group activity usually involves a large number of interacting participants and suffers from considerable individual- and video-level noise. To address these problems, we propose a novel progressive approach based on feature learning and relation reasoning. We first construct a semantic relation graph from individual features and positional relations to explicitly model the semantic relations within a group activity. Two reinforcement-learning agents are then proposed to refine the semantic relation graph from the perspectives of high-level semantic relations and low-level spatiotemporal features, respectively. Specifically, a relation-gating (RG) agent adjusts the high-level relation graph to pay more attention to group-relevant relations, while a feature-distilling (FD) agent refines the low-level spatiotemporal features by distilling the most informative frames. Finally, the semantic relation graph, the FD agent, and the RG agent are optimized alternately, successfully improving the performance of group activity recognition.
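The effect of relation gating can be illustrated with a toy stand-in: given pairwise relation strengths between participants, prune weak edges so that subsequent reasoning focuses on group-relevant relations. This quantile-threshold rule is purely illustrative; the thesis uses a learned reinforcement-learning agent, not a fixed threshold.

```python
import numpy as np

def gate_relations(relation_scores, keep_ratio=0.5):
    """Toy stand-in for relation gating (not the thesis' RL agent).

    relation_scores: (N, N) pairwise semantic-relation strengths between
    N participants. Off-diagonal edges below the keep-ratio quantile are
    pruned, concentrating message passing on group-relevant relations.
    """
    n = len(relation_scores)
    flat = relation_scores[~np.eye(n, dtype=bool)]       # off-diagonal values
    threshold = np.quantile(flat, 1.0 - keep_ratio)
    gated = np.where(relation_scores >= threshold, relation_scores, 0.0)
    np.fill_diagonal(gated, 0.0)                         # no self-relations
    return gated

scores = np.array([[0.0, 0.9, 0.1],
                   [0.9, 0.0, 0.2],
                   [0.1, 0.2, 0.0]])
g = gate_relations(scores, keep_ratio=0.5)
assert g[0, 1] == 0.9 and g[0, 2] == 0.0    # strong edge kept, weak one pruned
```

An RL agent replaces the fixed threshold with a learned, progressively refined keep/drop decision per edge, and the FD agent plays the analogous role over frames rather than edges.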
(4) Regarding long activity analysis in multimodal instructional videos, a novel method based on structured prior learning and reasoning is proposed. Multimodal instructional videos naturally contain weakly aligned video-text pairs, and their typical research objects are long activities such as step-by-step procedures. We propose a structured prior learning and reasoning framework for instructional activity analysis. First, joint video-text representations of instructional activities are obtained by self-supervised contrastive learning, which exploits the cross-modal semantic consistency of the video-text pairs. Based on these prior representations, a hierarchical knowledge graph is constructed to explicitly model the hierarchical activity concepts. We further apply a deep-random-walk graph embedding algorithm to the knowledge graph, so that each node embedding encodes the structured priors of the corresponding graph level. Finally, query samples from downstream tasks retrieve structured-prior-enhanced representations via similarity matching at each level of the hierarchical graph to boost task performance. Extensive experiments on a series of downstream tasks, including action segmentation, text-to-video retrieval, and temporal step localization, systematically demonstrate the effectiveness of the proposed method.
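The self-supervised contrastive objective over weakly aligned clip-narration pairs is commonly realized as an InfoNCE-style loss: matched video-text pairs sit on the diagonal of a similarity matrix and are scored against all mismatched pairs in the batch. The sketch below assumes this standard formulation and a typical temperature; the thesis' exact loss may differ.

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style contrastive loss for video-text alignment (assumed form).

    video_emb, text_emb: (B, D) L2-normalized embeddings where row i of each
    matrix comes from the same weakly aligned clip-narration pair.
    """
    sim = video_emb @ text_emb.T / temperature     # (B, B) similarity matrix
    sim = sim - sim.max(axis=1, keepdims=True)     # stable log-softmax
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))                # matched pairs on diagonal

v = np.eye(4, 8)                   # four orthonormal toy embeddings
aligned = info_nce_loss(v, v)      # perfectly aligned pairs: low loss
shuffled = info_nce_loss(v, v[::-1])  # mismatched pairs: high loss
assert aligned < shuffled
```

Minimizing this loss pulls each clip toward its own narration and away from the others', yielding the joint representation on which the hierarchical knowledge graph is then built.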
In summary, our research proceeds from single to multiple modalities, from simple actions to compound activities, and from controlled conditions to open environments. We propose a series of novel methods based on feature learning and relation reasoning to address existing problems in video activity recognition across different modalities and complexity levels. These methods improve recognition performance and effectively reduce the annotation burden, thus promising to narrow the gap between academic research and industrial application of video activity recognition.
|Hu Guyue. Video Activity Recognition Based on Feature Learning and Relation Reasoning [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2021.|