Research on Human Action Recognition Based on Skeleton Sequences (基于骨骼点序列的人体行为识别研究)

Title	Research on Human Action Recognition Based on Skeleton Sequences (基于骨骼点序列的人体行为识别研究)
Author	Shi Lei (史磊)
Date	2021-05-30
Pages	132
Degree type	Doctoral
Chinese abstract	人体行为识别是计算机视觉的一个重要分支，在无人驾驶、人机交互和智能监控等领域都得到了广泛的应用。行为识别的研究内容是利用机器学习等方法使得计算机可以自动分析和理解图像或者视频中人体的行为和动作，从而辅助做出进一步的决策。传统的行为识别方法主要以RGB图像序列作为输入。然而，RGB图像无差别地记录了整个时空场景的信息，使得模型容易受到行为无关的背景信息的干扰。最近，随着姿态估计算法以及深度传感器的逐渐成熟，基于骨骼点序列的行为识别方法逐渐兴起。相比于RGB图像序列，骨骼点序列仅包含人体各个关节点在一段时间内的运动轨迹，可以有效地过滤样本中行为无关的背景信息，从而提高模型的精度和鲁棒性。目前，基于骨骼点序列的行为识别方法仍面临许多挑战，包括如何进行关节点间的时空关系建模，如何利用不同模态的信息对人体行为进行分析，以及如何提高模型速度使其更好地部署于实际应用场景。本文针对以上挑战进行了深入研究，研究内容和创新点可以归纳为以下三个方面：关系建模方法的优化。骨骼点序列分析的核心在于建模人体关节点间的时空关系，从而得到人体各部位的运动模式，并进一步识别人体的行为类别。近年来，使用图卷积神经网络进行关系建模的方法展现出良好的性能。但是，已有的基于图卷积网络的方法仅根据人体的物理结构来设计图的结构，而这种图结构对于行为识别任务并不是最优的。并且，单一的图结构无法适应样本的多样性，极大限制了模型的关系建模能力。针对这个问题，本文提出一种自适应的图卷积神经网络。该网络的图结构可以根据分类任务以及样本特征进行自适应更新，从而收敛到一个最优的图结构。此外，本文引入自注意力机制来进行关节点间的关系建模，并提出了一种仅包含自注意力模块的神经网络，有效提升了行为识别的精度。针对骨骼点序列的特殊性，本文在该网络的基础上进一步提出了全局关系正则模块、时空解耦的跨帧自注意力模块和多尺度时间窗模块，使得网络更适合于骨骼点序列的建模。多模态特征的提取和融合。骨骼点序列仅记录了人体关节点在一段时间内的位置信息，虽然语义性高，但数据量较少。如何从少量的高语义性数据中提取丰富的特征并加以利用，是提升识别精度的关键。基于此，本文从以下三个多模态融合的角度进行了深入探究。首先，人体的肢体信息相比于关节点信息更加直观，但在之前的工作中没有被有效地利用。本文提出从原始数据中提取人体的肢体特征，并设计了一种有向图神经网络将肢体特征与关节点特征进行融合。该网络可以自底向上地在两种特征间进行信息传递，从而更好地捕捉两种特征间的相关关系。其次，已有的工作大多基于语义视角对骨骼点序列进行建模，即基于关节点的语义类别来组织网络计算流。本文提出一种新颖的空间视角，基于关节点间的空间位置关系来挖掘局部区域的特征。由于骨骼点在空间视角下存在分布稀疏的问题，本文设计了一种四维稀疏卷积神经网络来提高模型的计算效率。基于两种视角的方法具有较强的互补性，通过融合可以进一步提升模型的识别精度。此外，由于骨骼点序列缺乏表观信息，许多行为仅使用骨骼点序列难以区分。针对这个问题，本文提出使用RGB数据来为骨骼点序列提供表观信息，并设计了一种关节点指导的图卷积神经网络和一种中继稠密的监督方法，将骨骼点数据和RGB数据统一到一个端到端的框架中进行学习和推断。模型加速。实际应用场景中，速度是模型部署的关键因素之一。为了提高骨骼点序列建模的效率，已有方法大多通过减少模型的参数量来提高模型的运行速度。然而，模型的输入数据量也是影响模型运行速度的重要因素。本文从减少模型输入数据量的角度出发，设计了一种自适应的关节点选择模块，使得模型可以根据样本特征进行决策，在保证精度的前提下使用更少的骨骼点数据作为输入，从而提高模型的运行速度。由于决策过程离散不可微，本文使用直通式的Gumbel估计器来反传决策模块的梯度，使得整个模型可以端到端地进行训练和参数更新。
English abstract	Human action recognition is an important branch of computer vision and has been widely used in applications such as autonomous driving, human-computer interaction, and intelligent surveillance. Action recognition aims to use machine learning methods so that computers can automatically analyze and understand human actions in images or videos, thereby assisting further decision making. Traditional action recognition methods mainly take RGB image sequences as input. However, RGB images record the information of the whole scene indiscriminately, so the model is easily disturbed by action-irrelevant background information. Recently, as pose estimation algorithms and depth sensors have matured, action recognition methods based on skeleton data have become increasingly popular. Compared with RGB images, skeleton data records only the trajectories of the human joints over a period of time, which effectively filters out action-irrelevant background information and improves the accuracy and robustness of the model. Skeleton-based action recognition still faces many challenges, including how to model the spatial-temporal correlations between human joints, how to exploit different data modalities to analyze human actions, and how to speed up the model so that it is better suited to real applications. This thesis studies these challenges in depth, and its contributions can be summarized in three aspects:
1. Optimization of relationship modeling. The core of skeleton sequence analysis is to model the spatial-temporal correlations between human joints, derive the movement patterns of the body parts, and then predict the action category. In recent years, graph convolutional networks have shown strong performance for relationship modeling. However, existing methods design the graph structure solely according to the physical structure of the human body, which is not optimal for the action recognition task. Moreover, a single graph structure cannot adapt to diverse samples, which limits the capacity of the model. To solve this problem, this thesis proposes an adaptive graph convolutional neural network whose graph structure is updated adaptively according to the classification task and the sample features, so that it converges to an optimal graph structure. In addition, this thesis employs the self-attention mechanism to model the correlations between human joints and proposes a pure self-attention-based network for skeleton-based action recognition, which effectively improves recognition accuracy. Building on this network, a global regularization module, a decoupled spatial-temporal cross-frame self-attention module, and a multi-scale temporal feed-forward module are further designed to match the characteristics of skeleton data, making the network better suited to modeling it.
2. Extraction and fusion of multi-modality features. A skeleton sequence records only the positions of the human joints over a period of time; although it is highly semantic, it contains little data. Extracting rich features from a small amount of highly semantic data and making use of them is the key to improving recognition accuracy. This thesis explores multi-modality fusion from three perspectives. First, bone information is more intuitive than joint information, but it has not been effectively exploited in previous work. This thesis proposes extracting bone features from the raw data and designs a directed graph neural network to integrate the bone features with the joint features. The network passes messages between the two kinds of features from bottom to top, so that it can better capture the correlations between them. Second, most existing work models skeletons from a semantic perspective, that is, the network's computation flow is organized according to the semantic categories of the joints. This thesis instead presents a novel spatial perspective, which extracts local features according to the spatial positions of the joints. Because the skeleton joints are sparse from the spatial perspective, a sparse 4-dimensional convolutional neural network is designed to make the computation more efficient. The two perspectives are complementary, and fusing them further improves recognition accuracy. Third, because skeleton data lacks appearance information, many actions are difficult to distinguish using skeleton data alone. This thesis proposes using RGB data to provide appearance information for the skeleton data and designs a pose-guided graph convolutional network and an intermediate dense supervision method, which integrate the skeleton data and the RGB data into a unified framework for end-to-end learning and inference.
3. Model acceleration. In practical applications, speed is one of the key factors in model deployment. To improve the efficiency of skeleton modeling, most existing methods reduce the number of model parameters to increase running speed. However, the amount of input data is also an important factor affecting running speed. This thesis approaches the problem by reducing the amount of input data and designs an adaptive joint selection module, which lets the model make decisions according to the sample characteristics and use fewer skeleton joints as input while preserving accuracy, thereby increasing running speed. Because the decision process is not differentiable, a Straight-Through Gumbel Estimator is used to back-propagate gradients through the policy module, so that the whole framework can be trained and updated in an end-to-end manner.
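The straight-through Gumbel estimator mentioned at the end of the abstract can be illustrated with a small, self-contained sketch. It is not the thesis implementation: the two-way keep/drop logits per joint, the temperature `tau`, and the function names `st_gumbel_forward` and `st_gumbel_backward` are assumptions made for illustration. The forward pass emits a hard one-hot decision, while the backward pass substitutes the gradient of the soft Gumbel-softmax relaxation.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse transform."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def st_gumbel_forward(logits, tau, rng):
    """Return (hard one-hot decision, soft relaxation) for one joint.

    Hypothetical setup: logits = [keep_score, drop_score].
    """
    noisy = [(logit + sample_gumbel(rng)) / tau for logit in logits]
    soft = softmax(noisy)
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    return hard, soft


def st_gumbel_backward(grad_out, soft, tau):
    """Straight-through backward pass.

    grad_out[i] is dLoss/d(hard[i]). The hard output is treated as if it
    were `soft`, so the softmax Jacobian
        d soft_i / d logit_j = soft_i * (delta_ij - soft_j) / tau
    is used to obtain dLoss/d(logits[j]).
    """
    dot = sum(g * s for g, s in zip(grad_out, soft))
    return [s * (g - dot) / tau for g, s in zip(grad_out, soft)]


if __name__ == "__main__":
    rng = random.Random(0)
    hard, soft = st_gumbel_forward([1.5, -0.5], tau=1.0, rng=rng)
    # Pretend the downstream loss gradient rewards keeping the joint.
    grad_logits = st_gumbel_backward([1.0, 0.0], soft, tau=1.0)
    print("decision:", hard, "grad wrt logits:", grad_logits)
```

In an automatic differentiation framework the backward pass would not be written by hand; for example, PyTorch's `torch.nn.functional.gumbel_softmax(logits, tau=..., hard=True)` applies the same straight-through trick.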
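Keywords	action recognition; relationship modeling; multi-modality fusion; graph convolutional neural network; self-attention mechanism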
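Language	Chinese
Seven major directions: sub-direction classification	Image and video processing and analysis
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/44898
Collection	Zidong Taichu Large Model Research Center_Image and Video Analysis
Recommended citation (GB/T 7714)	史磊. 基于骨骼点序列的人体行为识别研究[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2021.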

Files in this item
File name/size	Document type	Version type	Access type	License
大论文_史磊_final.pdf (9326 KB)	Thesis		Open access	CC BY-NC-SA