Institutional Repository of Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
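|Place of Conferral||Institute of Automation, Chinese Academy of Sciences|
|Keyword||action recognition; relationship modeling; multi-modal fusion; graph convolutional neural network; self-attention mechanism|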
Human action recognition is an important branch of computer vision and has been widely used in applications such as autonomous driving, human-computer interaction, and intelligent surveillance. It aims to use machine learning to let computers automatically analyze human actions in images or videos, and thereby to support further decision-making.
Traditional action recognition methods mainly take RGB image sequences as input. However, RGB images record the whole scene indiscriminately, so the model is easily disturbed by action-irrelevant background information. Recently, with the development of pose estimation algorithms and depth sensors, action recognition based on skeleton data has become increasingly popular. Compared with RGB images, skeleton data records only the trajectory of each joint over a period of time, which effectively filters out action-irrelevant background information and improves the accuracy and robustness of the model.
Skeleton-based action recognition still faces many challenges, including how to model the spatial-temporal correlations between human joints, how to exploit different data modalities to analyze human actions, and how to speed up the model so that it is better suited to real applications. This thesis studies these challenges in depth, and its contributions can be summarized in three aspects:
1. Optimization of the relationship modeling. The core of analyzing a skeleton sequence is to model the spatial-temporal correlations between the human joints, derive the movement patterns of the body parts, and then predict the action category. In recent years, graph convolutional networks have shown strong performance in relationship modeling. However, existing methods design the graph structure from the physical structure of the human body alone, which is not optimal for the action recognition task. Besides, a single graph structure cannot adapt to diverse samples, which limits the capacity of the model. To solve this problem, this thesis proposes an adaptive graph convolutional neural network. The graph structure of the model is updated adaptively according to the classification task and the sample features, so that it converges to an optimal graph structure. In addition, this thesis employs the self-attention mechanism to model the correlations between the human joints and proposes a pure self-attention-based network for skeleton-based action recognition, which effectively improves the recognition accuracy. On top of this network, a global regularization module, a decoupled spatial-temporal cross-frame self-attention module, and a multi-scale temporal feed-forward module are designed to fit the characteristics of skeleton data, making the network better suited to modeling it.
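To make the adaptive-graph idea in the first contribution concrete, the following Python sketch shows one way a graph could depend on both learned parameters and the current sample. The abstract does not give the formulation, so everything here is an assumption for illustration: the three-part sum (fixed skeleton graph, learned offset, sample-dependent graph), the softmax-normalized dot-product similarity, the toy three-joint chain, and all function names.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def adaptive_adjacency(physical, learned_offset, features):
    """Assumed form: fixed skeleton graph + learned offset + sample-dependent graph."""
    # Sample-dependent part: softmax-normalized similarity between joint features.
    transposed = [list(col) for col in zip(*features)]
    sample_graph = [softmax(row) for row in matmul(features, transposed)]
    n = len(physical)
    return [
        [physical[i][j] + learned_offset[i][j] + sample_graph[i][j] for j in range(n)]
        for i in range(n)
    ]


def graph_conv(adjacency, features):
    """Aggregate joint features over the graph (no weight matrix, for brevity)."""
    return matmul(adjacency, features)


# Toy chain of three joints (0-1-2), row-normalized with self-loops.
physical = [
    [0.5, 0.5, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
    [0.0, 0.5, 0.5],
]
learned_offset = [[0.0] * 3 for _ in range(3)]  # would be trained in practice
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # one 2-D feature per joint

adjacency = adaptive_adjacency(physical, learned_offset, features)
output = graph_conv(adjacency, features)
```

In this toy setup, joints 0 and 2 are not physically connected, yet the sample-dependent term gives them a non-zero edge weight, which is the kind of flexibility a purely physical graph lacks.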
2. Extraction and fusion of the multi-modality features. A skeleton sequence records only the positions of the human joints over a period of time. Although the data is highly semantic, it is small in volume. How to extract and exploit rich features from this small amount of highly semantic data is the key to improving the recognition accuracy. This thesis explores multi-modality fusion in depth from three perspectives. First, bone information is more intuitive than joint information but was not exploited in previous works. This thesis proposes to extract bone features from the raw data and designs a directed graph neural network to integrate the bone features and the joint features. The network passes messages between the two kinds of features from bottom to top, so that the model can better capture the correlations between them. Second, most existing works model skeletons from the semantic perspective, that is, they organize the network's computing flow by the semantic categories of the joints. Instead, this thesis presents a novel spatial perspective, which extracts local features according to the spatial positions of the joints. Because the skeleton joints are sparse from the spatial perspective, a sparse 4-dimensional convolutional neural network is designed to make the computation more efficient. The two perspectives are complementary, and fusing them further improves the recognition accuracy. Third, because skeleton data lacks appearance information, many actions are difficult to distinguish using skeleton data alone. This thesis proposes using RGB data to supply the appearance information and designs a pose-guided graph convolutional network with intermediate dense supervision, which integrates the skeleton data and the RGB data into a unified framework for end-to-end learning and inference.
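As a small illustration of extracting bone features from raw joint data, the sketch below assumes a bone is represented as the vector from a parent joint to its child joint. The abstract does not define the bone feature, so this representation, the parent-index encoding of the skeleton, and the function name `bone_vectors` are assumptions.

```python
def bone_vectors(joints, parents):
    """Return one bone vector per joint: child position minus parent position.

    joints:  list of (x, y, z) coordinates for a single frame.
    parents: parent index of each joint, or -1 for the root joint.
    """
    bones = []
    for index, parent in enumerate(parents):
        if parent < 0:
            # The root has no incoming bone; use a zero vector as a placeholder.
            bones.append((0.0, 0.0, 0.0))
        else:
            child_pos, parent_pos = joints[index], joints[parent]
            bones.append(tuple(c - p for c, p in zip(child_pos, parent_pos)))
    return bones


# Hypothetical three-joint chain: root -> joint 1 -> joint 2.
joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
parents = [-1, 0, 1]
bones = bone_vectors(joints, parents)
```

Each bone vector carries both the length and the direction of a body segment, which is why it can complement the raw joint positions when the two are fused.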
3. Model acceleration. In practical applications, speed is one of the key factors in model deployment. To make skeleton modeling more efficient, most existing methods try to reduce the number of model parameters. However, the amount of input data is also an important factor affecting the running speed. This thesis proposes to reduce the amount of input data and designs an adaptive joint selection module. The module decides, according to the characteristics of each sample, which input skeleton joints can be dropped while preserving accuracy, and thus improves the running speed of the model. Because the decision process is not differentiable, a Straight-Through Gumbel Estimator is used to back-propagate the gradient through the policy module, which makes the whole framework differentiable and trainable in an end-to-end manner.
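The straight-through Gumbel idea mentioned in the third contribution can be sketched without an autodiff library. In the forward pass a hard one-hot decision is emitted; in the backward pass the gradient is taken through the soft (Gumbel-softmax) sample instead of the non-differentiable argmax. The per-joint keep/drop logits, the temperature `tau`, the upstream gradient, and all function names below are assumptions for illustration, not the thesis's actual policy module.

```python
import math
import random


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_noise(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    u = max(rng.random(), 1e-12)  # keep u away from 0 so the logs stay finite
    return -math.log(-math.log(u))


def st_gumbel_forward(logits, tau, rng):
    """Return (hard one-hot decision, soft relaxed sample)."""
    perturbed = [(logit + gumbel_noise(rng)) / tau for logit in logits]
    soft = softmax(perturbed)
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    return hard, soft


def st_gumbel_backward(upstream, soft, tau):
    """Gradient w.r.t. the logits, using the softmax Jacobian of the soft sample."""
    dot = sum(g * s for g, s in zip(upstream, soft))
    return [s * (g - dot) / tau for g, s in zip(upstream, soft)]


rng = random.Random(0)
# One [keep, drop] logit pair per joint (hypothetical policy outputs).
joint_logits = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
results = [st_gumbel_forward(logits, tau=1.0, rng=rng) for logits in joint_logits]
kept_joints = [i for i, (hard, _) in enumerate(results) if hard[0] == 1.0]
# Pretend the loss gradient w.r.t. each decision vector is [1, 0].
grads = [st_gumbel_backward([1.0, 0.0], soft, tau=1.0) for _, soft in results]
```

With an autodiff framework the same effect is usually obtained by returning the hard sample in the forward pass while routing gradients through the soft one; the manual backward function above only spells out what that gradient is.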
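|Shi Lei (史磊). Research on Human Action Recognition Based on Skeleton Sequences (基于骨骼点序列的人体行为识别研究) [D]. Institute of Automation, Chinese Academy of Sciences, 2021.|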