Research on Representation Learning for Human Pose Estimation

	Research on Representation Learning for Human Pose Estimation
	吴文竹
	2022-05-20
Pages	84
Degree type	Master's
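Chinese abstract	With the rapid development and spread of network technology, multimedia data such as images and videos have grown explosively. In particular, with the rise of social and entertainment applications, more and more content relates to people, and the number of images and videos recording people has increased sharply. Making good use of this human-centered data and extracting more valuable structured information from massive media data is important for analyzing human activities and understanding human behavior, and related research has attracted wide attention from academia and industry. Human pose estimation aims to predict the coordinates of a set of human keypoints from images or videos that contain people. As one of the classic problems in computer vision, it is important for describing human structural information and predicting human behavior. In recent years, deep learning has transformed human pose estimation algorithms, and the introduction of convolutional neural networks has significantly improved the accuracy of keypoint prediction. However, human pose estimation still faces many problems and challenges, such as prediction difficulties caused by occlusion, deformation, and changes in clothing or lighting, and existing solutions leave considerable room for improvement. Focusing on deep-learning-based human pose estimation, this thesis analyzes the existing problems from the perspective of enhancing feature representation and designs more accurate, efficient, and robust human pose estimation algorithms. Specifically, the discriminative power of the feature representation is improved from both bottom-up and top-down perspectives. On the one hand, a human pose estimation method based on keypoint context aggregation is proposed, which starts from low-level features to improve the feature map representation used for keypoint prediction. On the other hand, a human pose estimation method based on contrastive learning is proposed, which guides network learning through objective constraints to obtain more discriminative feature representations suited to keypoint prediction. The main work and contributions of the thesis are summarized as follows: Human pose estimation based on keypoint context aggregation. To address noise and interference in images, this thesis proposes an attention-based keypoint context aggregation method for human pose estimation. Every pixel of the feature map is related to all human keypoints, and these relations differ in strength. By modeling the relation between the feature map representation and the keypoint representations, each pixel can therefore obtain context from the keypoints most strongly related to it, which enhances the discriminative power of the feature map and ultimately yields more accurate pose estimation. Experiments show that the method effectively enhances the expressive power of the feature map and achieved the best results among contemporaneous methods on several metrics on public datasets. Human pose estimation based on hierarchical contrastive learning. To address the insufficient distinguishability and inaccurate localization of keypoints caused by clothing changes and distorted poses, a human pose estimation method based on contrastive learning is proposed. To better distinguish keypoints of different body parts, the model needs to strengthen intra-class consistency for keypoints of the same type and increase inter-class separability for keypoints of different types. This thesis therefore proposes an algorithm based on hierarchical contrastive learning that takes full account of the structural information and multi-scale nature of human keypoints, helping the network learn more distinguishable and discriminative keypoint representations and effectively improving pose estimation accuracy. Specifically, given that the number and types of human keypoints are relatively fixed, a positive and negative sampling scheme that incorporates human body structure is proposed, comprising three negative sampling strategies: adjacent keypoints, regional keypoints, and symmetric keypoints. To address scale variation across different person instances and different keypoints, a hierarchical contrastive loss is designed, in which a dense contrastive loss is computed on top of a hierarchical sample space. Experiments show that the method effectively improves localization accuracy in human pose estimation and achieved the best results among contemporaneous methods on several metrics on public datasets.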
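The keypoint context aggregation idea in the abstract above can be illustrated with a small sketch. This is not the thesis implementation: the scaled dot-product similarity, the concatenation of each pixel feature with its aggregated context, and the name `aggregate_keypoint_context` are all assumptions chosen for illustration, since the abstract only states that the relation between feature map pixels and keypoint representations is modeled with an attention mechanism.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_keypoint_context(pixel_features, keypoint_features):
    """Attend from every pixel feature to every keypoint representation.

    pixel_features: N vectors of length C, one per feature-map pixel.
    keypoint_features: K vectors of length C, one per keypoint type.
    Returns (enhanced, weights): enhanced holds N vectors of length 2*C
    (pixel feature followed by its context), weights holds N lists of K
    attention weights. Similarity and fusion choices are assumptions.
    """
    scale = math.sqrt(len(keypoint_features[0]))
    enhanced, weights = [], []
    for pixel in pixel_features:
        # Relation strength between this pixel and each keypoint.
        w = softmax([dot(pixel, kp) / scale for kp in keypoint_features])
        # Context: keypoint representations weighted by relation strength.
        context = [
            sum(wk * kp[c] for wk, kp in zip(w, keypoint_features))
            for c in range(len(pixel))
        ]
        enhanced.append(list(pixel) + context)
        weights.append(w)
    return enhanced, weights


pixels = [[1.0, 0.0], [0.0, 1.0]]
keypoints = [[4.0, 0.0], [0.0, 4.0]]
enhanced, weights = aggregate_keypoint_context(pixels, keypoints)
```

In this toy input each pixel places most of its attention on the keypoint whose representation points in the same direction, so its context vector is dominated by that keypoint.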
English abstract	With the rapid development and popularization of network technology, multimedia data such as images and videos are growing explosively. Especially with the development of various social and entertainment applications, more and more content relates to people, and the number of such images and videos is increasing dramatically. Making good use of these human-centered data and obtaining more informative structural information from massive media data is of great significance for analyzing human activities and understanding human behaviors, and related research has received extensive attention from academia and industry. Human pose estimation aims to locate a set of human keypoints in given image or video data. As one of the fundamental computer vision tasks, it is of great significance for describing human structural information and understanding human behavior. In recent years, the development of deep learning has brought significant progress to human pose estimation, and the use of convolutional neural networks has greatly improved the accuracy of keypoint prediction. However, human pose estimation still faces many challenges caused by occlusion, deformation, and changes in clothing or lighting, so existing solutions leave considerable room for improvement. This thesis therefore focuses on human pose estimation from the perspective of enhancing feature representation. It analyzes the current problems in human pose estimation and designs more accurate, efficient, and robust pose estimation algorithms based on deep neural networks. Specifically, the discriminative power of the feature representation is improved from bottom-up and top-down perspectives. On the one hand, a human pose estimation method with keypoint context aggregation is proposed, which starts from low-level features to improve the feature map representation used for keypoint prediction. 
On the other hand, a human pose estimation method based on contrastive learning is proposed, which guides network learning through objective constraints and yields more discriminative feature representations suited to human keypoint prediction. The main contributions are summarized as follows: Human pose estimation with keypoint context aggregation. To address noise and interference in images, this thesis proposes a human pose estimation method with keypoint context aggregation based on an attention mechanism. Each pixel on the feature map is correlated with all human keypoints, and these correlations differ in strength. By modeling the relationship between the feature map representation and the human keypoint representations, each pixel of the feature map can therefore obtain the keypoint context most relevant to it. This enhances the discriminative power of the feature map and ultimately yields more accurate human pose estimation. Experimental results show that the method effectively enhances the discriminative power of feature maps and achieved the best results among contemporaneous methods on several metrics of public benchmark datasets for human pose estimation. Human pose estimation with hierarchical contrastive learning. To address the insufficient distinguishability and inaccurate localization of human keypoints caused by clothing changes and distorted poses, a human pose estimation method based on contrastive learning is proposed. To better distinguish the keypoints of different body parts, the model needs to enhance intra-class consistency among keypoints of the same type and increase inter-class separability between keypoints of different types. 
Therefore, this thesis proposes an algorithm based on hierarchical contrastive learning, which fully considers the structural information and multi-scale characteristics of human keypoints, helps the network learn more discriminative keypoint representations, and effectively improves human pose estimation accuracy. Specifically, given the relatively fixed number and types of human keypoints, a positive and negative sampling scheme that incorporates human body structure is proposed, which mainly consists of three negative sampling strategies: adjacent keypoint sampling, regional keypoint sampling, and symmetric keypoint sampling. To address scale variation among different person instances and keypoints, a hierarchical contrastive loss is designed, which computes a dense contrastive loss on top of a hierarchical sample representation space. Experimental results show that the method improves the accuracy of human pose estimation and achieved the best results among contemporaneous methods on several metrics of public benchmark datasets for human pose estimation.
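The contrastive objective described in the abstract, pulling together keypoints of the same type and pushing apart keypoints of different types at several scales, might look roughly like the sketch below. It is not the loss defined in the thesis: the InfoNCE-style formulation, cosine similarity, the temperature value, the plain average over levels, and the names `info_nce` and `hierarchical_contrastive_loss` are assumptions made for illustration. Embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return num / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor keypoint embedding.

    positive: embedding of a keypoint of the same type as the anchor.
    negatives: embeddings of other keypoint types, for example adjacent,
    same-region, or symmetric keypoints (the sampling itself is not shown).
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def hierarchical_contrastive_loss(levels, temperature=0.1):
    """Average the per-level loss over feature levels (scales).

    levels: list of (anchor, positive, negatives) tuples, one per level.
    Equal weighting of levels is an assumption.
    """
    losses = [info_nce(a, p, n, temperature) for a, p, n in levels]
    return sum(losses) / len(losses)


anchor = [1.0, 0.0]
same_type = [0.9, 0.1]   # hypothetical embedding of the same keypoint type
other_type = [0.0, 1.0]  # hypothetical embedding of a different keypoint type

well_separated = info_nce(anchor, same_type, [other_type])
confused = info_nce(anchor, other_type, [same_type])
two_level = hierarchical_contrastive_loss(
    [(anchor, same_type, [other_type]), (anchor, same_type, [other_type])]
)
```

The loss is small when the anchor is close to its positive and far from its negatives, and grows when the roles are swapped, which is the behavior the intra-class consistency and inter-class separability goals call for.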
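Keywords	human pose estimation; keypoint context; contrastive learning; hierarchical loss
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/48558
Collection	Zidong Taichu (紫东太初) Large Model Research Center_Image and Video Analysis
Recommended citation (GB/T 7714)	吴文竹. 人体姿态估计的表示学习研究[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2022.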

Files in this item
File name/size	Document type	Version type	Access type	License
人体姿态估计的表示学习研究.pdf (7762KB)	Thesis		Open access	CC BY-NC-SA