Research on Pose Estimation Methods Based on Dynamic Attention Mechanisms (基于动态注意力机制的姿态估计方法研究)

邹嘉钰

2023-05

Degree Type

Master's

Chinese Abstract

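Pose estimation is a fundamental and challenging task in computer vision, with broad and deep applications in action recognition, pedestrian detection, autonomous driving, person re-identification, and human-computer interaction. In real-world scenes, keypoints of the same individual can occlude one another (self-occlusion), and keypoints of different individuals can occlude each other (mutual occlusion), which makes keypoint detection and localization difficult. As application scenarios become more diverse and complex, the accuracy demanded of pose estimation algorithms keeps rising, so exploring higher-accuracy pose estimation algorithms has become especially important.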

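Focusing on pose estimation methods based on dynamic attention networks and building on existing research, this thesis develops new theory and methods for keypoint semantic encoding, keypoint spatial interaction, and keypoint feature fusion. Its main work and contributions are as follows: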

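(1) To address the difficulty of making local detail semantics and global abstract semantics fully complement each other, a semantic encoding representation method based on dynamic attention is proposed, providing a high-quality, comprehensive semantic feature representation for the subsequent decoding network. Existing semantic networks based on convolutional neural networks struggle to model long-range dependencies, while Transformer-based semantic networks depend heavily on large-scale annotated datasets and are computationally expensive. This thesis proposes a semantic encoding method based on a dynamic attention network and designs three different dynamic semantic encoding structures that couple the local and global features at each stage. A mutual learning loss is introduced in the parallel interaction structure, providing better prior features for the subsequent decoding head network. The method's effectiveness is verified on multiple datasets, with a clear improvement in pose estimation accuracy.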

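(2) To address insufficient spatial information interaction among keypoints, a dual-branch spatial interaction mechanism based on dynamic attention is proposed, which effectively promotes spatial information interaction between keypoints. Existing work lacks a mechanism for interaction between network structures based on convolutional neural networks and those based on Transformers, making it hard to combine the strengths of both. This thesis proposes a dual-branch spatial interaction method based on a dynamic attention network that inherits both the translation invariance and local correlation of convolutional feature extraction and the long-range modeling ability of Transformer feature extraction, which helps improve spatial interaction. Comparative experiments and visual analysis on multiple public datasets show that the method effectively improves pose estimation performance in crowded scenes and occluded regions.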

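(3) To address inadequate multi-scale feature fusion and insufficient semantic association, a feature fusion method based on dynamic attention is proposed, which promotes the fusion of features at different granularities and the semantic association among keypoints of multiple categories. Existing methods either ignore feature fusion or use only multi-scale feature pre-fusion, which incurs a heavy computational burden and makes it hard to fully exploit the relationships between keypoints of different body parts. This thesis proposes feature pre-fusion and feature post-fusion methods based on a dynamic attention network. The feature pre-fusion module fuses low-level features rich in detail with high-level features rich in semantics. The feature post-fusion module uses attention scores so that the model adaptively attends to important features and improves detection performance for different keypoints. The proposed feature fusion method improves on the baseline model, verifying the effectiveness of the feature fusion module.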

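In summary, by studying the structural framework of pose estimation algorithms, this thesis proposes a network framework that classifies and localizes keypoints efficiently. Dynamic attention is used to optimize semantic encoding and spatial interaction, and features of keypoints from different body parts are combined through pre-fusion and post-fusion, achieving accurate keypoint classification and localization in complex scenes such as self-occlusion and mutual occlusion. Experiments on multiple public datasets show that the proposed method performs well and offers a useful reference for research on pose estimation.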

English Abstract

Pose estimation is a fundamental and challenging task in computer vision, with wide-ranging applications in action recognition, pedestrian detection, autonomous driving, person re-identification, and human-computer interaction. In practice, key points of the same individual are subject to self-occlusion, and key points of different individuals are subject to mutual occlusion, which makes key point detection and localization difficult. As application scenarios grow more diverse and complex, the accuracy requirements placed on pose estimation algorithms become more stringent, so it is particularly important to explore higher-accuracy pose estimation algorithms.

This thesis targets pose estimation methods based on dynamic attention networks. Building on existing research, it carries out theoretical and methodological innovation in key point semantic coding, key point spatial interaction, and key point feature fusion. The main work and contributions are as follows.

(1) To address the problem that local detail semantics and global abstract semantics are difficult to make fully complementary, a semantic coding representation method based on dynamic attention is proposed, which provides a high-quality, comprehensive semantic feature representation for the subsequent decoding network. Existing semantic networks based on convolutional neural networks have difficulty modeling long-distance dependencies, while Transformer-based semantic networks rely too heavily on large-scale annotated datasets and carry a heavy computational burden. This thesis proposes a semantic coding method based on a dynamic attention network and designs three different dynamic semantic coding structures that couple the local and global features of each stage. A mutual learning loss is introduced in the parallel interaction structure, which provides better prior features for the subsequent decoding head network. The effectiveness of the method is verified on multiple datasets, where it clearly improves pose estimation accuracy.

(2) To address the problem of insufficient spatial information interaction among key points, a dual-branch spatial interaction mechanism based on dynamic attention is proposed, which effectively promotes spatial information interaction between key points. Existing work lacks an interaction mechanism between network structures based on convolutional neural networks and those based on Transformers, so it is difficult to combine the advantages of the two. This thesis proposes a dual-branch spatial interaction method based on a dynamic attention network. It inherits the translation invariance and local correlation of convolutional feature extraction as well as the long-distance modeling ability of Transformer feature extraction, which helps improve spatial interaction. Comparative experiments and visual analysis on multiple public datasets show that the method effectively improves pose estimation performance in crowded scenes and occluded regions.

(3) To address the problems of inadequate multi-scale feature fusion and insufficient semantic association, a feature fusion method based on dynamic attention is proposed, which promotes both the fusion of features at different granularities and the semantic association among key points of multiple categories. Existing methods either ignore feature fusion or adopt only multi-scale feature pre-fusion, which brings a large computational burden and makes it difficult to fully exploit the correlation between key points of different body parts. This thesis proposes a feature pre-fusion and feature post-fusion method based on a dynamic attention network. The pre-fusion module fuses low-level features containing rich detail with high-level features containing rich semantic information. The post-fusion module uses attention scores so that the model adaptively focuses on important features and improves the detection performance for different key points. Compared with the baseline model, the proposed feature fusion method improves performance, which verifies the effectiveness of the feature fusion module.
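The attention-score weighting described for the post-fusion module in (3) can be illustrated with a minimal Python sketch. The abstract does not say how the scores are produced or normalized, so the softmax normalization, the function name `attention_fuse`, and the toy feature vectors below are all assumptions made for illustration, not the thesis's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, scores):
    """Fuse equally sized feature vectors with softmax-normalized scores.

    features: list of feature vectors (lists of floats).
    scores:   one raw attention score per feature vector. These are a
              hypothetical input here; a real model would predict them.
    Returns the normalized weights and the weighted-sum feature vector.
    """
    if len(features) != len(scores):
        raise ValueError("need exactly one score per feature vector")
    weights = softmax(scores)
    dim = len(features[0])
    fused = [0.0] * dim
    for w, feat in zip(weights, features):
        for i in range(dim):
            fused[i] += w * feat[i]
    return weights, fused


# Toy example: equal scores give an even blend of the two vectors.
low_level = [1.0, 0.0, 2.0]   # stands in for a detail-rich feature
high_level = [0.0, 4.0, 2.0]  # stands in for a semantics-rich feature
weights, fused = attention_fuse([low_level, high_level], [0.0, 0.0])
print(fused)  # [0.5, 2.0, 2.0]
```

Raising one score relative to the other shifts the blend toward that feature, which is the sense in which such a module can "adaptively focus" on the more important input.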

In summary, by studying the structural framework of pose estimation algorithms, this thesis proposes a network framework that can efficiently classify and localize key points. Dynamic attention is used to optimize semantic coding and spatial interaction, and the features of key points from different body parts are combined through pre-fusion and post-fusion, achieving accurate key point classification and localization in complex scenarios such as self-occlusion and mutual occlusion. Experiments on multiple public datasets show that the proposed method performs well and can serve as a reference for research in the field of pose estimation.
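Contribution (1) mentions a mutual learning loss between parallel branches without giving its form. One common form, used in deep mutual learning, is a KL divergence between the two branches' predicted distributions. The sketch below assumes that form and averages the two directions into a single symmetric term, which is a simplification. The function names and the toy logits are hypothetical and are not taken from the thesis.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_learning_loss(logits_a, logits_b):
    """Symmetric KL between two branches' softmax-normalized predictions.

    Assumption: each branch outputs one logit per location of a flattened
    keypoint heatmap. The loss is zero when the branches agree exactly.
    """
    p = softmax(logits_a)
    q = softmax(logits_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Toy example: two branches scoring four candidate locations.
branch_local = [2.0, 0.5, 0.1, 0.0]   # hypothetical local-branch logits
branch_global = [1.5, 1.0, 0.2, 0.0]  # hypothetical global-branch logits
loss = mutual_learning_loss(branch_local, branch_global)
```

In training, a term like this would typically be added to each branch's task loss so that the branches are pulled toward agreement while still fitting the ground truth.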

Keywords

pose estimation; dynamic attention mechanism; spatial interaction; feature fusion

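Language

Chinese

Seven Major Directions: Sub-direction Classification

Image and Video Processing and Analysis

State Key Laboratory Planned Direction Classification

Visual Information Processing

Is there a dataset associated with the thesis that needs to be deposited?

No

Document Type

Thesis

Item Identifier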

http://ir.ia.ac.cn/handle/173211/51671

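Collection

Graduates_Master's Theses (毕业生_硕士学位论文)
CAS Engineering Laboratory for Industrial Vision and Intelligent Equipment_Precision Sensing and Control (中科院工业视觉智能装备工程实验室_精密感知与控制)

Recommended Citation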
GB/T 7714

邹嘉钰. 基于动态注意力机制的姿态估计方法研究[D],2023.

Files in This Item
File Name/Size	Document Type	Version Type	Access Type	License
基于动态注意力机制的姿态估计方法研究.p (14145KB)	Thesis		Restricted Access	CC BY-NC-SA