CASIA OpenIR
基于局部模式建模的视频行为识别与定位研究 (Research on Video Action Recognition and Localization Based on Local Pattern Modeling)
黄林江
Subtype: Doctoral (博士)
Thesis Advisor: 王亮
2020-05-31
Degree Grantor: 中国科学院大学
Place of Conferral: 中国科学院自动化研究所
Degree Name: Doctor of Engineering (工学博士)
Degree Discipline: Pattern Recognition and Intelligent Systems (模式识别与智能系统)
Keywords: action recognition; action localization; local pattern modeling; human pose; weakly supervised learning
Abstract

Action recognition and action localization are classical problems in computer vision, with wide applications in video surveillance, human-computer interaction, video retrieval, autonomous driving, and motion synthesis and simulation. Action recognition usually refers to action classification: its input is a trimmed video that contains only the action segment and no background. In contrast, action localization takes an untrimmed video as input and aims to obtain both the temporal boundaries of each action in the video and the category of each action segment.

In recent years, deep learning has made significant progress in action recognition and localization, but many problems remain. For action recognition, the main challenges are variations in viewpoint, scale, occlusion, and execution style of the performer; for action localization, a large part of the difficulty comes from the heavy annotation burden. Although many methods have been proposed to address these problems, such as introducing human pose for action recognition or using only weak supervision for action localization, most of them focus on either the fine details or the whole of an action while ignoring its mid-level information, which we call the local patterns of an action. We argue that, in the spatial domain, the local patterns of an action are the body parts of the performer, and in the temporal domain, they are the sub-actions within an action. However, few existing methods explicitly model this information and therefore cannot obtain rich action representations. To address these problems, we make the following contributions:

1. For video action recognition, we propose a pose-guided part-aligned recurrent network. The model uses human pose as auxiliary information and adopts a part-alignment module to explicitly model the spatio-temporal evolution of human body parts, improving recognition accuracy. The part-alignment module consists of several branches, each corresponding to one body part. In each branch, a novel auto-transformer attention mechanism uses a Spatial Transformer Network (STN) to produce the spatial transformation matrix that aligns the input feature maps to the part. On top of this, a pose-guided attention mechanism predicts pose attention maps on the transformed features, and a Spatial De-transformer Network (SDTN) maps these attention maps back to the original coordinate space. Throughout this process, the hidden state of a recurrent neural network guides the generation of the transformation matrices, thereby modeling the spatio-temporal evolution of body parts. Finally, the obtained attention maps are treated as predicted heatmaps for pose estimation, supervised by ground-truth pose maps, and used as weights to generate the feature representation of each part. Based on these part features, we propose a part pooling module that pools the part features into a whole-body feature, which is then fed into a recurrent network to model the temporal information of the video. We validate the effectiveness of the proposed method on two standard action recognition benchmarks.

2. For skeleton-based action recognition, we propose a part-level graph convolutional network. Built on the graph convolutional network (GCN) framework, it introduces a differentiable graph pooling operation and two graph unpooling operations to learn the mapping from human joints to body parts and from body parts back to joints, allowing body-part information to be modeled explicitly within the GCN framework. Based on these operations, we devise a part relation block and a part attention block to learn the spatio-temporal evolution of body parts and to focus on the body parts that are important for a given action. The method is a good complement to regular GCNs, which only model relationships between joints; combined with the original GCN, it effectively improves recognition accuracy.

3. For weakly supervised action localization, we propose a relational prototypical network. The method builds on the prototypical network and first introduces a co-occurrence graph convolutional network to model the relationships among different actions. To obtain consistent representations of local sub-actions and reduce the intra-class distance of actions, we design a clustering loss that pushes features toward their corresponding prototypes, thereby performing clustering. This loss also creates a margin between the features of different actions, improving the model's ability to separate actions from the background. To further remove the influence of the background, we adopt a similarity weighting module and a prototype update strategy that makes the learned prototypes better fit the feature distribution of each individual test video. We achieve state-of-the-art performance on two action localization datasets.
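
The attention-weighted part pooling in contribution 1 can be sketched as follows. This is a minimal pure-Python illustration with hypothetical toy shapes and hand-set "head"/"arm" attention maps (in the thesis the attention maps are predicted and pose-supervised), not the actual implementation:

```python
# Toy sketch: use a per-part attention map as weights to pool a feature map
# into one feature vector per body part, then max-pool parts into a body feature.
# Shapes are illustrative: the feature map is H x W with C channels.

def part_feature(feat, attn):
    """Attention-weighted average of feat (H x W x C) using attn (H x W)."""
    H, W, C = len(feat), len(feat[0]), len(feat[0][0])
    total = sum(attn[i][j] for i in range(H) for j in range(W)) or 1.0
    out = [0.0] * C
    for i in range(H):
        for j in range(W):
            w = attn[i][j] / total
            for c in range(C):
                out[c] += w * feat[i][j][c]
    return out

def part_pooling(part_feats):
    """Element-wise max over parts -> whole-body feature."""
    return [max(p[c] for p in part_feats) for c in range(len(part_feats[0]))]

# 2x2 feature map with 2 channels; each branch attends to different cells.
feat = [[[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]]]
head_attn = [[1.0, 0.0], [0.0, 0.0]]   # hypothetical "head" branch: one cell
arm_attn  = [[0.0, 1.0], [0.0, 1.0]]   # hypothetical "arm" branch: two cells

head = part_feature(feat, head_attn)   # -> [1.0, 2.0]
arm  = part_feature(feat, arm_attn)    # -> [5.0, 6.0]
body = part_pooling([head, arm])       # -> [5.0, 6.0]
```

Normalizing the attention map before weighting makes each part feature a convex combination of spatial features, so a part's representation is dominated by the locations its branch attends to.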
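
The joint-to-part mapping in contribution 2 can be illustrated in the same spirit. In this sketch the assignment matrix S is hand-set and row-normalized purely for illustration; in the thesis the pooling is differentiable and the assignment is learned:

```python
# Toy sketch: graph pooling maps joint features to part features through a
# soft assignment matrix S; a simple unpooling maps them back with S's transpose.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

# 4 joints, 2 parts: joints 0,1 form part 0; joints 2,3 form part 1.
S = [[0.5, 0.5, 0.0, 0.0],    # part 0 averages joints 0 and 1
     [0.0, 0.0, 0.5, 0.5]]    # part 1 averages joints 2 and 3

X = [[1.0], [3.0],            # joint features (1-D for simplicity)
     [5.0], [7.0]]

parts = matmul(S, X)                        # graph pooling: -> [[2.0], [6.0]]
joints_back = matmul(transpose(S), parts)   # graph unpooling back to joints
```

Part-level blocks (relations and attention among parts) then operate on the coarser `parts` graph, and unpooling broadcasts the result back to the joint graph so it can be combined with a regular joint-level GCN.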
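
The clustering loss and prototype update of contribution 3 can be sketched as follows. This toy version uses squared distances and a simple moving-average prototype update; the thesis' exact loss and update strategy may differ:

```python
# Toy sketch: a clustering loss pulls each feature toward the prototype of its
# action class; prototypes are then moved toward the features assigned to them.

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def clustering_loss(features, labels, prototypes):
    """Mean squared distance between features and their class prototypes."""
    return sum(sq_dist(f, prototypes[y])
               for f, y in zip(features, labels)) / len(features)

def update_prototypes(features, labels, prototypes, momentum=0.9):
    """Move each prototype toward the mean of the features assigned to it."""
    for k in range(len(prototypes)):
        assigned = [f for f, y in zip(features, labels) if y == k]
        if not assigned:
            continue
        mean = [sum(vals) / len(assigned) for vals in zip(*assigned)]
        prototypes[k] = [momentum * p + (1 - momentum) * m
                         for p, m in zip(prototypes[k], mean)]
    return prototypes

feats = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
protos = [[0.0, 0.0], [0.0, 0.0]]

loss_before = clustering_loss(feats, labels, protos)   # -> 1.0
protos = update_prototypes(feats, labels, protos)
loss_after = clustering_loss(feats, labels, protos)    # decreases after update
```

Minimizing this loss tightens each action's cluster around its prototype, and updating prototypes at test time is what lets them track the feature distribution of an individual video.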


Pages: 124
Language: Chinese (中文)
Document Type: Thesis (学位论文)
Identifier: http://ir.ia.ac.cn/handle/173211/39132
Collection: 中国科学院自动化研究所
Recommended Citation
GB/T 7714
黄林江. 基于局部模式建模的视频行为识别与定位研究[D]. 中国科学院自动化研究所. 中国科学院大学,2020.
Files in This Item:
File Name/Size: 基于局部模式建模的视频行为识别与定位研究 (6556 KB)
DocType: Thesis
Access: Open Access
License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.