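Research on Several Problems in Human Skeleton-Based Action Recognition with Graph Convolutional Networks (基于图卷积网络的人体骨架行为识别若干问题研究)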


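	Research on Several Problems in Human Skeleton-Based Action Recognition with Graph Convolutional Networks
	Song Yifan (宋一帆)
	2021-05-14
Pages	134
Degree type	Doctoral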
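Chinese abstract	Human action analysis is a classic problem in computer vision: given a video, the task is to recognize the category of the action being performed by the moving subject, or to predict its behavior over the following period of time. Traditional action recognition methods fall into two main categories. The first treats an RGB video as a three-dimensional tensor and models it with a 3D convolutional neural network (CNN). The second models the video frame by frame with a 2D CNN and then uses a recurrent neural network (RNN) to extract temporal information. However, traditional RGB-based methods suffer from severe information redundancy, and environmental factors such as complex backgrounds and lighting conditions seriously degrade the robustness of the recognition model. Meanwhile, with the emergence of new sensors and the steady progress of human pose estimation, the human skeleton, a more efficient and compact representation of human structure, has gradually become a basic data source for action recognition, and skeleton-based action recognition has become a popular research area. It differs from traditional methods in that its input is the 3D coordinates of a given set of skeleton joints, and an action is represented by how those coordinates change over time. Skeleton-based recognition avoids the background and lighting problems of RGB video; compared with RGB video, it also represents action features more efficiently, requires very little computation, and achieves recognition accuracy close to that of RGB-based methods. Despite these advantages, several problems remain to be solved. This thesis proposes four methods to address them. (1) Because the 3D coordinates of skeleton joints are easily occluded or perturbed in real environments, the thesis designs an ordered multi-stream model that progressively expands the regions the model attends to, which effectively addresses occlusion and noise perturbation in skeleton-based action recognition. The scheme is implemented mainly through a richly activated module, which extracts the skeleton joints attended to by the current network stream and artificially hides them while the next stream is trained, forcing the network to capture discriminative information from joints it has not yet attended to. Experiments on four occlusion datasets and two noise-perturbation datasets show that, compared with traditional methods, the proposed method effectively improves the robustness of the recognition model to noisy skeleton data caused by occlusion and pose-estimation errors. (2) Because current high-performance skeleton-based models have large parameter counts and long training and inference times, the thesis proposes a new lightweight model comprising multi-branch early fusion, residual network modules, and a body-part attention module. With recognition accuracy comparable to the current best models (90.9% on the NTU60 dataset), it greatly reduces the number of model parameters, to as little as 1/34 of that of some large models. This lightweight model can serve as a baseline for action recognition and help future researchers develop high-performance models more efficiently. (3) Although skeleton-based action recognition has a low computational cost, the small number of skeleton joints means that some subtle actions are not adequately represented. To address this, the thesis proposes a dynamic skeleton augmentation scheme. The model uses a multilayer perceptron (MLP) to improve its ability to represent fine-grained actions, that is, it uses more skeleton joints to represent an action. To model the augmented skeleton data, the thesis also designs a fully connected graph attention module that captures the spatial information of the skeleton data. On two sub-datasets composed of similar actions, the augmentation scheme achieves higher recognition accuracy than traditional algorithms, with an average improvement of 2.5%. (4) To set the receptive-field parameters of graph convolutional networks automatically, the thesis draws on differentiable neural architecture search: the receptive-field configuration is treated as a set of learnable hyper-parameters, and the optimal configuration is searched for during training, so that the receptive field is set adaptively. The method stacks multiple search cells along the temporal and spatial dimensions, and the model structure in each search cell is determined automatically during the search. Experiments on large-scale public datasets show that the proposed method achieves higher recognition accuracy with fewer parameters (90.8% accuracy with 2.23 × 10^6 parameters) and offers a feasible approach to automatic hyper-parameter selection.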
English abstract	Human action recognition is an active research topic in computer vision. It requires analyzing and modeling input videos in order to recognize the category of the action performed by the human actors in the video, or to predict their actions over the next period of time. Traditional video-based action recognition methods fall into two main categories. The first treats RGB video sequences as three-dimensional (3D) tensors and uses a 3D CNN to extract features. The second uses a 2D CNN to model the video sequence frame by frame and then uses an RNN to extract temporal information. However, these traditional methods analyze RGB videos directly, so they are easily affected by complex backgrounds, lighting conditions, and other environmental factors, and their performance declines significantly in complex scenarios. Meanwhile, with the emergence of novel sensors and the development of pose estimation algorithms, the human skeleton, an efficient and compact representation of human structure, has gradually become a basic data source for action recognition, and skeleton-based action recognition has become a popular research field. Skeleton-based action recognition differs from traditional methods in that its input is the 3D coordinates of a set of predefined skeleton joints, and an action is represented by the movement of these coordinates. Methods based on skeleton joints not only avoid the background and illumination problems of RGB videos but also reduce the computational cost, while achieving recognition accuracy comparable to that of RGB-based methods. Although skeleton-based action recognition has many advantages, some problems remain to be solved.
We therefore propose four methods to address problems in skeleton-based action recognition. (1) Because the 3D coordinates of skeleton joints are often occluded or disturbed in real-world scenarios, we design an ordered multi-stream model that gradually expands the set of activated joints. The method relies mainly on a richly activated module, which extracts the skeleton joints activated by the current stream and artificially occludes them for the next stream, forcing the next stream to capture unexplored information from the joints that have not yet been activated. Experimental results on four occlusion datasets and two jittering datasets show that the method improves robustness to occluded and jittering skeleton data. (2) To address over-parameterization and long training and inference times, we propose a new lightweight model that contains early-fused multiple input branches, residual modules, and part-based attention modules. The proposed model achieves recognition performance similar to that of current state-of-the-art models (90.9% accuracy on the NTU 60 dataset) while significantly reducing the number of model parameters, to as little as 1/34 of that of some large models. This lightweight model can serve as a baseline for action recognition and help future researchers develop high-performance action recognition models more efficiently.
(3) Although skeleton-based action recognition has a low computational cost, the sparsity of skeleton joints limits its ability to distinguish subtle actions. We propose a dynamic skeleton augmentation method that uses a multilayer perceptron (MLP) to strengthen the model's representation of fine-grained actions, that is, it uses more skeleton joints to represent an action. To model the augmented skeleton data, we design a fully connected graph attention module that captures the spatial information of the skeleton data. On two sub-datasets consisting of fine-grained action categories, the skeleton augmentation method achieves higher recognition accuracy than traditional methods, with an average improvement of 2.5%. (4) To set the hyper-parameters of the GCN receptive field automatically, we draw on recent advances in differentiable neural architecture search: the receptive-field setting is treated as a set of hyper-parameters, and the optimal configuration is searched for jointly with training, so that the receptive field is selected adaptively. In this method, multiple search cells are stacked along the temporal and spatial dimensions, and the model structure in each search cell is determined during the search procedure. On two datasets, the proposed method obtains higher recognition accuracy (90.8%) with fewer parameters (2.23 × 10^6), and it also offers a feasible approach to hyper-parameter selection.
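The record gives no implementation details, so the following is only a minimal sketch of the masking idea described in item (1): joints that one stream has already activated are hidden from the next stream. The function name `mask_activated_joints`, the `(T, V, C)` array layout, the per-joint activation scores, and the threshold value are all illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def mask_activated_joints(skeleton, activation, threshold=0.5):
    """Hide joints whose normalized activation exceeds `threshold`.

    skeleton:   array of shape (T, V, C): frames, joints, coordinates
    activation: array of shape (V,): hypothetical per-joint activation
                scores produced by the current stream (assumed input)
    Returns the masked skeleton and a boolean mask (True = still visible).
    """
    act = np.asarray(activation, dtype=float)
    peak = act.max()
    if peak > 0:
        act = act / peak              # scale scores into [0, 1]
    keep = act <= threshold           # joints the current stream has not focused on
    masked = skeleton * keep[None, :, None]  # zero out the hidden joints
    return masked, keep

# Toy input: 4 frames, 5 joints, 3D coordinates.
rng = np.random.default_rng(0)
skeleton = rng.normal(size=(4, 5, 3))
activation = np.array([0.9, 0.1, 0.2, 0.8, 0.05])

masked, keep = mask_activated_joints(skeleton, activation)
```

In this toy input, joints 0 and 3 have the highest scores, so they are zeroed out and a following stream would see only joints 1, 2, and 4. How the activation scores are actually computed and how the streams are combined is specific to the thesis and is not reproduced here.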
Keywords	action recognition; skeleton sequence; graph convolutional network; attention mechanism; neural architecture search
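Subject area	Pattern recognition
Discipline category	Engineering::Computer Science and Technology (degrees may be conferred in engineering or science)
Language	Chinese
Seven major directions: sub-direction classification	Image and video processing and analysis
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/44959
Collection	Pattern Recognition Laboratory (模式识别实验室)
Corresponding author	Song Yifan (宋一帆)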
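Recommended citation (GB/T 7714)	Song Yifan (宋一帆). Research on Several Problems in Human Skeleton-Based Action Recognition with Graph Convolutional Networks (基于图卷积网络的人体骨架行为识别若干问题研究)[D]. Room 1610, 16th Floor, Intelligence Building (智能化大厦16层1610). School of Artificial Intelligence, University of Chinese Academy of Sciences, 2021.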

Files in this item
File name/size	Document type	Version type	Access type	License
Thesis.pdf (4569 KB)	Thesis		Open access	CC BY-NC-SA