Research on Human Pose Estimation Methods Based on Message Passing

	Research on Human Pose Estimation Methods Based on Message Passing
	Zhou Lu (周鲁)
	2021-05-29
Pages	138
Degree type	Doctoral
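Chinese abstract	With the development of imaging and storage technology, image and video resources are growing explosively. Extracting useful structured information from massive data is essential for understanding images and videos. People are the core element of image and video data, and the main target and expressive subject of visual content. Structured analysis of people in complex application scenarios helps accomplish high-level tasks such as action recognition and scene understanding, and has therefore received wide attention. Human pose estimation aims to estimate the positions of human keypoints given an image. It is one of the effective means of understanding human semantics and analyzing human structure, and is widely applied in action recognition, virtual reality, smart healthcare, public security, and other fields. Human pose estimation therefore has considerable academic value and practical significance, and has become a popular topic in computer vision in recent years. Deep learning based methods have recently achieved great success and effectively improved the performance of human pose estimation. However, human pose estimation is still far from ideal. First, human images exhibit scale variation. Second, the human body is a non-rigid structure, and different keypoints have different degrees of freedom of motion, which makes human poses complex and diverse. In addition, cluttered backgrounds, and the crowding and occlusion that occur in dense crowds, pose great challenges. This dissertation therefore builds on deep neural networks and uses message passing algorithms at different levels, together with sound network architecture design, to address these difficulties and improve human pose estimation. The main work and contributions of this dissertation are:
• Human pose estimation based on bidirectional message passing and spatial and channel-wise attention. To address the inability of human pose estimation networks to make full use of semantic and spatial detail information, and the large amount of redundancy and noise in features, a method based on bidirectional message passing and spatial and channel-wise attention is proposed. First, a multi-scale bidirectional message passing mechanism promotes message passing among features at multiple scales. The interaction between high-scale and low-scale features enriches the semantic and detail information of the features at each scale, and the fusion of multi-scale features further improves the scale robustness of the network. Second, to deal with feature redundancy and noise interference, the method introduces a semantics-enhanced channel-wise attention mechanism and a sharp spatial attention mechanism, which suppress feature noise along different dimensions and yield a cleaner feature representation. Experimental results on public datasets show that the method effectively improves model accuracy and achieved leading performance among contemporaneous methods on multiple datasets.
• Human pose estimation based on progressive pose grammar. To address the inability of convolutional neural networks to learn human structure information explicitly, a method based on progressive pose grammar is proposed. First, a human pose grammar is constructed to learn the relationships among human keypoints and promote message passing among the features of different keypoints. Second, the introduction of 3D convolution improves the effect of message passing, and the progressive learning scheme enables the network to capture human structure information at different levels, so that the human structure prior is used more effectively to refine keypoint features. In addition, adaptive direction information provides explicit direction guidance for message passing, and an attention-based bidirectional result fusion mechanism improves the fusion of bidirectional information. Experimental results on public datasets show that the method outperforms other structured modeling methods and improves the detection of hard keypoints.
• Human pose estimation based on a spatial transformation network. To address false positive predictions in the heatmaps produced by human pose estimation networks, a method based on a spatial transformation network is proposed. First, a spatial transformation network is introduced to promote message passing among the heatmaps of different keypoints. Second, to strengthen the transformation ability of the spatial transformation network, a limb guidance mechanism is introduced to provide explicit direction guidance for message passing. Adversarial learning is also used to improve the quality of limb predictions, which provides more accurate direction guidance and improves the performance of the spatial transformation network. In addition, to eliminate the empty transformation phenomenon, the spatial transformation network adopts a weighted mean square error loss that reduces the weight of the background loss, and a convolutional random walk is introduced to suppress prediction noise. Experimental results on public datasets show that the method effectively reduces false positive predictions in heatmaps and achieves a significant performance improvement over the baseline model.
• Human pose estimation with a siamese network for occlusion scenes. To address the self-occlusion and external occlusion that occur in natural scenes, a method based on a siamese network for occlusion scenes is proposed. First, occlusion prediction makes the network occlusion-aware. On this basis, an erasing and reconstruction module erases and reconstructs the features of occluded regions, thereby refining the occluded features. Second, a mimicking learning mechanism based on the siamese network is introduced, so that the erased-and-reconstructed features of the occluded branch approach those of the unoccluded branch, which strengthens the robustness of the features under occlusion. In addition, a mimicking learning loss based on the optimal transport divergence promotes message passing between the features of the two branches of the siamese network, while the very small increase in parameters and computation keeps the computational cost of the network low. Experimental results on public datasets show that the method improves performance on occluded keypoints by up to 1.72% and achieves leading performance among contemporaneous algorithms.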
English abstract	With the development of imaging and storage technology, image and video resources are growing explosively. Extracting useful structured information from massive data is essential for understanding images and videos. Humans are not only the core element of image and video data, but also the main target and expressive subject of visual content. Structured analysis of people in complex application scenarios helps accomplish high-level tasks such as action recognition and scene understanding, and has therefore received wide attention. Human pose estimation aims to estimate the positions of human keypoints given a single image. It is one of the effective means of understanding human semantics and analyzing human structure, and has been widely used in fields such as action recognition, virtual reality, intelligent medical treatment, and public security prevention and control. Human pose estimation therefore has considerable academic value and practical significance, and has become a hot topic in computer vision. In recent years, human pose estimation methods based on deep learning have achieved great success and greatly improved performance. However, human pose estimation is still far from ideal. Firstly, natural images exhibit scale variation. Secondly, the human body is a non-rigid structure, and different keypoints have different degrees of freedom, which makes human poses complex. In addition, cluttered backgrounds, crowding, and occlusion in dense crowds pose great challenges. Hence, based on the deep learning framework, this dissertation exploits message passing algorithms at different levels and designs suitable network structures to address these challenges, thus greatly improving the performance and efficiency of human pose estimation.
The main contributions of this dissertation are summarized as follows:
• A bidirectional message passing based spatial and channel-wise attention network is proposed to address the issue that human pose estimation networks cannot make full use of semantic and spatial details and that features contain a large amount of redundancy and noise. Firstly, message passing among multi-scale features is promoted by the multi-scale bidirectional message passing mechanism. The information interaction between high-scale and low-scale features enriches the semantic and detail information of the features at each scale. Besides, the fusion of multi-scale features further improves the scale robustness of the network. Secondly, to deal with feature redundancy and noise interference, the method introduces a semantics-enhanced channel-wise attention mechanism and a sharp spatial attention mechanism, which suppress feature noise in different dimensions and yield a cleaner feature representation. Experimental results on public datasets show that the proposed method effectively enhances the generalization ability of the model under scale change, complex background interference, crowding, and similar conditions, and that performance is also significantly improved.
• A progressive pose grammar based human pose estimation method is proposed to solve the problem that convolutional neural networks cannot learn human body structure information explicitly. Firstly, a pose grammar module is built to encode relationships among human keypoints, thus promoting message passing among the features of different keypoints. Secondly, 3D convolution is introduced to improve the effect of message passing, while the progressive learning manner enables the network to capture human structure information at different hierarchical levels, so that keypoint features can be refined effectively using the human structure prior. In addition, adaptive direction information is introduced to provide explicit direction guidance, and an attention-based bidirectional result fusion mechanism greatly improves the effectiveness of fusion. Experimental results on public benchmarks show that the proposed method outperforms other structure modeling methods and improves the detection performance on hard human keypoints.
• A spatial transformation network based human pose estimation method is proposed to address the problem of false positive predictions in heatmaps. Firstly, a spatial transformation network is introduced to promote message passing among the heatmaps of different human keypoints. Secondly, to enhance the transformation ability of the spatial transformation network, a limb guidance mechanism is introduced to provide explicit direction guidance for the network. Meanwhile, an adversarial learning mechanism is introduced to improve the quality of limb predictions, thus providing more accurate direction guidance and improving the performance of the spatial transformation. In addition, the spatial transformation network adopts a weighted mean square error loss to reduce the weight of the background loss and address the empty transformation problem. A convolutional random walk is introduced as well to suppress prediction noise. Experimental results on public datasets show that the proposed method effectively reduces false positive predictions in heatmaps and achieves a significant performance improvement over the baseline model.
• A siamese network for occlusion scenes is proposed to address self-occlusion and external occlusion problems. Firstly, occlusion prediction makes the network occlusion-aware. On this basis, features are erased and reconstructed by the erasing and reconstruction module, and the occluded features are thereby refined. Secondly, a mimicking learning mechanism based on the siamese network is introduced so that the features of the occluded branch approach those of the unoccluded branch, thus increasing feature robustness under occlusion. In addition, the mimicking learning loss based on the optimal transport divergence promotes information interaction between the features of the two branches of the siamese network, while the small increase in parameters and computation keeps the computational cost of the network low. Experimental results on public datasets show that the proposed method improves performance on occluded human keypoints by up to 1.72% and achieves leading performance among algorithms of the same period.
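The abstract mentions a weighted mean square error loss that reduces the weight of the background loss on keypoint heatmaps, but does not give its form. The following is only a minimal sketch of one plausible form, not the dissertation's implementation: the Gaussian target, the function names, and the parameters `bg_weight` and `fg_threshold` are all assumptions made for illustration.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """Target heatmap: a 2D Gaussian peak centred on keypoint (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def weighted_mse(pred, target, bg_weight=0.1, fg_threshold=0.01):
    """Mean squared error in which background pixels count less.

    A pixel is treated as foreground when its target value exceeds
    fg_threshold; every other pixel is background and its squared error
    is scaled by bg_weight. Both parameters are hypothetical choices.
    """
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            weight = 1.0 if t > fg_threshold else bg_weight
            total += weight * (p - t) ** 2
            count += 1
    return total / count


target = gaussian_heatmap(8, 8, cx=3, cy=4)
# An "empty" prediction: the heatmap is zero everywhere.
empty = [[0.0] * 8 for _ in range(8)]
# A prediction that keeps the peak but carries a uniform offset.
noisy = [[t + 0.05 for t in row] for row in target]

plain_empty = weighted_mse(empty, target, bg_weight=1.0)
plain_noisy = weighted_mse(noisy, target, bg_weight=1.0)
weighted_empty = weighted_mse(empty, target)
weighted_noisy = weighted_mse(noisy, target)
```

Under these assumptions, setting `bg_weight=1.0` recovers the ordinary mean square error. With a smaller `bg_weight`, errors spread over the background contribute less, so an all-zero heatmap is penalized more heavily relative to a prediction that keeps the keypoint peak.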
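Keywords	human pose estimation; message passing; pose grammar; spatial transformation; occlusion awareness
Language	Chinese
Seven major directions: sub-direction classification	Image and video processing and analysis
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/44916
Collection	Zidong Taichu Large Model Research Center_Image and Video Analysis
Recommended citation GB/T 7714	Zhou Lu. Research on Human Pose Estimation Methods Based on Message Passing[D]. Institute of Automation, Chinese Academy of Sciences, 2021.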

Files in this item
File name/size	Document type	Version type	Access type	License
基于信息传递的人体姿态估计方法研究.pd (28429KB)	Dissertation		Open access	CC BY-NC-SA