Knowledge Commons of Institute of Automation,CAS
Research on Human Action Recognition Based on Skeleton Sequences (基于骨骼点序列的人体行为识别研究) | |
Shi Lei (史磊) | |
2021-05-30 | |
Pages | 132 |
Degree type | Doctoral |
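Chinese abstract | Human action recognition is an important branch of computer vision and has been widely applied in fields such as autonomous driving, human-computer interaction, and intelligent surveillance. Research on action recognition uses methods such as machine learning to enable computers to automatically analyze and understand human behaviors and actions in images or videos, thereby assisting further decision-making. Traditional action recognition methods mainly take RGB image sequences as input. However, RGB images record the information of the entire spatio-temporal scene indiscriminately, which makes models susceptible to interference from action-irrelevant background information. Recently, as pose estimation algorithms and depth sensors have matured, action recognition methods based on skeleton sequences have gradually emerged. Compared with RGB image sequences, a skeleton sequence contains only the motion trajectories of the human joints over a period of time, which effectively filters out action-irrelevant background information and thereby improves model accuracy and robustness. At present, skeleton-based action recognition still faces many challenges, including how to model the spatio-temporal relationships among joints, how to use information from different modalities to analyze human actions, and how to speed up models so that they can be better deployed in real application scenarios. This thesis investigates these challenges in depth; its research content and innovations can be summarized in the following three aspects: |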
English abstract | Human action recognition is an important branch of computer vision that has been widely used in applications such as autonomous driving, human-computer interaction, and intelligent surveillance. Action recognition aims to use machine learning methods to make computers automatically analyze human actions in images or videos, so as to assist further decision-making.
Traditional action recognition methods mainly use RGB image sequences as input. However, RGB images record the information of the whole scene indiscriminately, so the model is easily disturbed by action-irrelevant background information. Recently, with the development of pose estimation algorithms and depth sensors, action recognition based on skeleton data has become increasingly popular. Compared with RGB images, skeleton data records only the trajectory of each joint over a period of time, which effectively filters out the action-irrelevant background information in the samples and improves the accuracy and robustness of the model.
At present, skeleton-based action recognition still faces many challenges, including how to model the spatial-temporal correlations between human joints, how to exploit different data modalities to analyze human actions, and how to speed up the model to make it more suitable for real applications. This thesis studies these challenges in depth, and its contributions can be summarized in three aspects:
1. Optimization of the relationship modeling. The core of analyzing a skeleton sequence is to model the spatial-temporal correlations between the human joints, derive the movement patterns of the body parts, and then predict the action category. In recent years, graph convolutional networks have shown strong performance for relationship modeling. However, existing methods design the graph structure from the physical structure of the human body alone, which is not optimal for the action recognition task. Besides, a single graph structure cannot adapt to diverse samples, which limits the capacity of the model. To solve this problem, this thesis proposes an adaptive graph convolutional neural network. The graph structure of the model is updated adaptively according to the classification task and the sample features, thus converging to an optimal graph structure. In addition, this thesis employs the self-attention mechanism to model the correlations between the human joints and proposes a pure self-attention-based network for skeleton-based action recognition, which effectively improves the recognition accuracy. Based on the proposed network, a global regularization module, a decoupled spatial-temporal cross-frame self-attention module, and a multi-scale temporal feed-forward module are further designed to adapt to the characteristics of skeleton data, which makes the network more suitable for modeling it.
2. Extraction and fusion of the multi-modality features. A skeleton sequence records only the positions of the human joints over a period of time. Although the data is highly semantic, it is small in volume. How to extract and use rich features from a small amount of highly semantic data is the key to improving the recognition accuracy. This thesis explores multi-modality fusion in depth from three perspectives. Firstly, bone information is more intuitive than joint information, but it was not exploited in previous work. This thesis proposes to extract the bone features from the raw data and designs a directed graph neural network to integrate the bone features and the joint features. It passes messages between the two kinds of features from bottom to top, so that the model can better capture the correlations between them. Secondly, most existing works model skeletons from the semantic perspective, that is, they organize the network computing flow based on the semantic categories of the joints. Instead, this thesis presents a novel spatial perspective, which extracts local features according to the spatial positions of the joints. Because the skeleton joints are sparse from the spatial perspective, a sparse 4-dimensional convolutional neural network is designed to make the calculation more efficient. The two perspectives are complementary to each other, and fusing them further improves the recognition accuracy. Thirdly, because skeleton data lacks appearance information, many actions are difficult to distinguish using skeleton data alone. This thesis proposes using RGB data to provide the appearance information for the skeleton data, and designs a pose-guided graph convolutional network and an intermediate dense supervision, which integrate the skeleton data and the RGB data into a unified framework for end-to-end learning and inference.
3. Model acceleration. In practical applications, speed is one of the key factors in model deployment. To make skeleton modeling more efficient, most existing methods try to reduce the number of model parameters to improve the running speed. However, the amount of input data is also an important factor affecting the running speed. This thesis proposes to reduce the amount of input data and designs an adaptive joint selection module. It reduces the number of input skeleton joints by making decisions according to the sample characteristics while preserving accuracy, so as to improve the running speed of the model. Because the decision process is not differentiable, a Straight-Through Gumbel Estimator is exploited to back-propagate the gradient of the policy module, which makes the whole framework differentiable so that it can be updated in an end-to-end manner. |
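Item 3 of the abstract names a Straight-Through Gumbel Estimator but gives no details, so the following is only a minimal sketch of the general technique, not the thesis's implementation. It assumes each joint has two hypothetical logits, `[drop, keep]`; the function names (`st_gumbel_forward`, `st_gumbel_backward`) and the temperature `tau` are illustrative choices. The forward pass returns a hard one-hot decision; the backward pass uses the gradient of the soft Gumbel-softmax relaxation in its place.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    # Clamp away from 0 so that log(u) stays finite.
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def softmax(values, tau):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [v / tau for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def st_gumbel_forward(logits, tau, rng):
    """Forward pass: a hard one-hot decision plus the soft relaxation behind it."""
    noisy = [l + sample_gumbel(rng) for l in logits]
    soft = softmax(noisy, tau)
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    return hard, soft


def st_gumbel_backward(soft, tau, grad_output):
    """Backward pass: treat the output as `soft` and apply the softmax Jacobian.

    d soft_i / d logit_j = soft_i * (delta_ij - soft_j) / tau
    """
    dot = sum(g * s for g, s in zip(grad_output, soft))
    return [s * (g - dot) / tau for s, g in zip(soft, grad_output)]


# Hypothetical [drop, keep] logits for three joints (illustrative values only).
rng = random.Random(0)
joint_logits = [[0.2, 1.5], [1.0, -0.5], [0.0, 0.0]]
for logits in joint_logits:
    hard, soft = st_gumbel_forward(logits, tau=1.0, rng=rng)
    # Gradient of a loss that only depends on the "keep" entry.
    grad = st_gumbel_backward(soft, tau=1.0, grad_output=[0.0, 1.0])
    print(hard, [round(s, 3) for s in soft], [round(g, 3) for g in grad])
```

In an autodiff framework the same effect is usually obtained by returning `hard` in the forward pass while routing gradients through `soft`; the explicit Jacobian here only makes that substitution visible.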
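Keywords | action recognition; relationship modeling; multi-modality fusion; graph convolutional neural network; self-attention mechanism |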
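Language | Chinese |
Seven major directions: sub-direction classification | Image and video processing and analysis |
Document type | Thesis |
Item identifier | http://ir.ia.ac.cn/handle/173211/44898 |
Collection | Zidong Taichu Large Model Research Center_Image and Video Analysis |
Recommended citation (GB/T 7714) | Shi Lei (史磊). Research on Human Action Recognition Based on Skeleton Sequences (基于骨骼点序列的人体行为识别研究)[D]. Institute of Automation, Chinese Academy of Sciences. Institute of Automation, Chinese Academy of Sciences, 2021. |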
Files in this item | ||||||
File name/size | Document type | Version type | Access type | License | ||
大论文_史磊_final.pdf (9326KB) | Thesis | Open access | CC BY-NC-SA |
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.