Research on Image Keypoint Detection and Description Methods Based on Dense Feature Learning
王常维
2024-05-10
Pages: 126
Degree Type: Doctoral
Abstract (Chinese)

Image keypoint detection and description is a fundamental research topic in computer vision. Its task is to identify representative pixels in a given image and extract local descriptors at those pixel positions; by matching the local descriptors of different images, sparse pixel-level correspondences between images can then be established. Over more than two decades of development, research on image keypoint detection and description has moved from the era of knowledge-based, hand-crafted design to the data-driven era of deep learning. Fully Convolutional Networks (FCNs) are a special class of convolutional neural network architecture applied mainly to dense prediction tasks such as semantic image segmentation. By removing all fully connected layers of conventional convolutional neural networks and replacing them with convolutional layers, FCNs support end-to-end training and prediction on inputs of different scales. This design enables dense feature learning, in which features are extracted at every position of the image simultaneously. Recently, keypoint detection and description methods based on dense feature learning have shown excellent performance and broad application potential, because they can exploit the contextual information of the whole image and perform keypoint detection and description jointly in an end-to-end manner. This thesis studies keypoint detection and description based on dense feature learning along the following three lines.

First, this thesis proposes a cross-normalization-based method for extracting local image descriptors. For a long time, most hand-crafted and learning-based descriptor extraction methods have applied L2 normalization to local descriptors, projecting the descriptor space onto a fixed hypersphere. Although a hyperspherical descriptor space stabilizes model optimization and improves repeatability in descriptor matching, it also makes the distribution of descriptors denser, which reduces the distinguishability of neighboring descriptors and causes some false matches. To address this problem, this thesis proposes a learnable Cross Normalization technique as an alternative to L2 normalization, which stabilizes the optimization process while preserving more discriminative information, so that local descriptors are distributed more reasonably in the description space. This thesis also proposes a dense feature extraction architecture named the High-Efficiency Feature Reuse Backbone Network, which efficiently reuses the shallow features extracted by the backbone to improve the representational capacity of the network without significantly increasing its parameter count, and a loss function named the Image-level Distribution Consistency Loss, which imposes image-level consistency constraints on the distribution space of local descriptors to further improve their discriminability and robustness. Building on these innovations, this thesis presents a cross-normalization-based local descriptor extraction method whose effectiveness is thoroughly validated on several downstream tasks, including image matching, homography estimation, 3D reconstruction, and visual localization.

Second, this thesis proposes a keypoint detection and description method augmented with non-local information. Most current deep learning-based methods adopt ordinary convolutional neural networks as feature extractors. Constrained by the inherent locality inductive bias of convolutional networks, they can learn local descriptors only from the limited information within their receptive fields and therefore lack the ability to perceive larger surrounding context and global context. Moreover, during training, learned local descriptors are mostly optimized by point-to-point metric learning on sampled keypoints, without exploiting global image information, which makes the optimization process inflexible and poorly adapted to each image. To address these problems, this thesis proposes a non-local-information-augmented keypoint detection and description method that uses non-local information to let local descriptors "look wider to describe better", lifting them from a narrow keyhole view of the image to a commanding overview. Specifically, this thesis introduces non-local context augmentation and a consistent spatial attention mechanism, so that descriptors gain perception beyond local properties during both feature extraction and training. First, an Adaptive Global Context Augmentation module and a Diverse Surrounding Context Augmentation module are proposed to build a feature extraction architecture that aggregates context information at every level from global to surrounding. Second, a consistent spatial attention-weighted metric loss is proposed to integrate spatial attention awareness into the optimization and matching of local descriptors. Finally, a feature-pyramid-based local feature detection method is proposed to obtain more stable and accurate keypoint localization. The proposed method is validated in depth on tasks including image matching, homography estimation, and visual localization, and the results show that it reaches the current state of the art.

Third, this thesis proposes an efficient keypoint detection and description method based on knowledge distillation. Because keypoint detection and description is a key low-level technology for many vision applications, its matching accuracy and runtime efficiency affect both the performance and the practical deployment of those applications. However, most current deep learning research on keypoint detection and description concentrates on improving matching accuracy and pays relatively little attention to runtime efficiency, so current methods have large parameter counts and run inefficiently. To address this problem, this thesis proposes a powerful yet efficient knowledge-distillation-based keypoint detection and description method that seeks the best balance between matching accuracy and runtime efficiency. First, a very lightweight backbone is proposed to extract dense features efficiently for keypoint detection and description, reducing the overall parameter size of the network to 0.17 MB. To give the lightweight network higher matching performance, this thesis also introduces knowledge distillation to keypoint detection and description for the first time, explores the effect of different forms of distillation on this task, and proposes a second-order descriptor space distillation strategy to improve the matching performance of the lightweight model. The accuracy and efficiency of the proposed method are evaluated on several downstream tasks, including image matching, homography estimation, 3D reconstruction, and visual localization. The experimental results show that the proposed method is the first deep learning-based keypoint detection and description approach to exceed 100 FPS (frames per second) on a consumer-grade GPU while maintaining highly competitive matching accuracy.

In summary, through the research above, this thesis effectively improves both the matching accuracy and the runtime efficiency of keypoint detection and description methods based on dense feature learning.
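To make the fully convolutional design described in the abstract above concrete, the following is a minimal sketch, with purely illustrative layer sizes rather than the thesis architecture, of how replacing a fully connected head with a 1x1 convolution yields one descriptor at every pixel for inputs of arbitrary resolution:

```python
# Toy fully convolutional pipeline: no fully connected layers, so the network
# accepts any input resolution and emits a dense grid of descriptors.
import torch
import torch.nn as nn

backbone = nn.Sequential(                         # illustrative backbone
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
)
dense_head = nn.Conv2d(128, 128, kernel_size=1)   # 1x1 conv in place of nn.Linear

image = torch.randn(1, 3, 240, 320)               # arbitrary input size
features = dense_head(backbone(image))            # (1, 128, 240, 320)
print(features.shape)                             # one 128-D descriptor per pixel
```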

Abstract (English)

Keypoint detection and description is a fundamental research topic in computer vision. It aims to identify representative pixels in a given image and extract corresponding descriptors at those pixel positions, so that pixel-level sparse correspondences between images can be established by matching the descriptors. Over the past two decades, the paradigm of keypoint detection and description has evolved from traditional hand-crafted approaches to data-driven methods powered by deep learning. Fully Convolutional Networks (FCNs) constitute a distinctive architecture within the family of Convolutional Neural Networks (CNNs), primarily employed for dense prediction tasks such as semantic image segmentation. By entirely eliminating the fully connected layers typical of conventional CNNs and substituting convolutional layers for them, FCNs enable end-to-end training and prediction on inputs of varying scales. This architectural choice fosters dense feature learning, in which features are extracted concurrently at every location in the image. Recently, keypoint detection and description methods based on dense feature learning have shown excellent performance and wide application potential, because they can exploit the contextual information of the whole image and perform keypoint detection and description jointly in an end-to-end manner. This thesis carries out the following research on keypoint detection and description based on dense feature learning.

Local Descriptor Extraction Based on Cross Normalization. For a long time, the learning of local descriptors has profited from L2 normalization, a technique that maps the descriptor space onto a hypersphere. While a hyperspherical descriptor space stabilizes optimization and enhances the repeatability of the descriptors, it also condenses the descriptor distribution, diminishing discriminative power and potentially producing inaccurate matches. This thesis introduces a learnable Cross Normalization technique as an alternative to L2 normalization. It also introduces an architecture termed the High-Efficiency Feature Reuse Backbone Network, designed to effectively reuse the shallow features extracted by the backbone and thereby enhance the representational capacity of the network, and an Image-level Distribution Consistency Loss, which strengthens the discriminative power and robustness of local descriptors by imposing image-level consistency constraints on their distribution space. Building on these innovations, this thesis proposes a Cross Normalization-based approach for extracting local image descriptors, whose efficacy is extensively validated on a variety of downstream tasks, including image matching, homography estimation, 3D reconstruction, and visual localization.
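The abstract does not give the exact Cross Normalization formulation. As a rough illustration of the design space it targets, the sketch below contrasts plain L2 normalization with a hypothetical learnable relaxation that retains part of the descriptor magnitude; the blend rule and all names here are assumptions, not the thesis's method:

```python
# L2 normalization projects every descriptor onto the unit hypersphere and
# discards magnitude; a learnable blend can keep some of that information.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableNorm(nn.Module):
    """Hypothetical learnable relaxation of L2 normalization (illustrative)."""
    def __init__(self, eps: float = 1e-8):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable blend factor
        self.eps = eps

    def forward(self, desc: torch.Tensor) -> torch.Tensor:
        unit = desc / (desc.norm(p=2, dim=1, keepdim=True) + self.eps)
        # Mix the unit-sphere projection with the raw descriptor so that some
        # magnitude (hence discriminative) information survives normalization.
        return (1 - self.alpha) * unit + self.alpha * desc

descs = torch.randn(512, 128)               # 512 raw 128-D descriptors
on_sphere = F.normalize(descs, p=2, dim=1)  # classic L2: all norms equal 1
relaxed = LearnableNorm()(descs)            # norms now vary across descriptors
print(on_sphere.norm(dim=1).std(), relaxed.norm(dim=1).std())
```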
Keypoint Detection and Description Based on Non-Local Information Augmentation. Current deep learning-based approaches for keypoint detection and description predominantly employ conventional convolutional neural networks (CNNs) as feature extractors. Constrained by the locality inductive bias inherent in CNNs, these approaches learn local descriptors only from the limited information within their receptive fields, and thus lack the ability to perceive and incorporate larger surrounding context, as well as global context, into the learned descriptors. Moreover, during training, local descriptors are typically optimized via metric learning between individual keypoints, without exploiting global information across the entire image; the resulting optimization process is inflexible, poorly adapted to each image, and fails to harness image-wide contextual cues. To address these issues, this thesis proposes a Non-Local Information Augmented approach for keypoint detection and description, which leverages non-local information so that local descriptors can "look wider to describe better." Specifically, this thesis introduces Non-Local Context Augmentation and a Consistent Attention Mechanism, enabling descriptors to perceive beyond the local scope during both feature extraction and training. First, Adaptive Global Context Augmentation and Diverse Surrounding Context Augmentation modules are proposed to construct a feature extraction pipeline that aggregates context information at every level, from the global to the surrounding scale. Second, a Consistent Attention Weighted Triplet Loss is designed to integrate spatial attention awareness into the optimization and matching stages of local descriptors. Third, a Local Feature Detection method based on a Feature Pyramid is proposed to achieve more stable and accurate keypoint localization. The proposed method is thoroughly validated across multiple tasks, including image matching, homography estimation, and visual localization; the results demonstrate that it outperforms current state-of-the-art keypoint detection and description techniques.
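As background for the Non-Local Context Augmentation described above, the following is a generic non-local (self-attention) block in which every spatial position attends to every other, letting a descriptor draw on the whole feature map. It illustrates the mechanism only; it is not the thesis's Adaptive Global or Diverse Surrounding Context Augmentation module:

```python
# Generic non-local block: all-pairs spatial attention with a residual path.
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)           # (B, HW, C/2)
        k = self.key(x).flatten(2)                             # (B, C/2, HW)
        v = self.value(x).flatten(2).transpose(1, 2)           # (B, HW, C)
        attn = torch.softmax(q @ k / (c // 2) ** 0.5, dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                         # residual connection

feats = torch.randn(1, 128, 30, 40)
print(NonLocalBlock(128)(feats).shape)                         # (1, 128, 30, 40)
```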
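In the same spirit, here is a minimal sketch of the idea behind the Consistent Attention Weighted Triplet Loss: each keypoint's triplet term is weighted by an attention score. The abstract does not specify how consistency across views is enforced, so the plain weighting below is an assumption:

```python
# Triplet margin loss weighted per keypoint by an attention (saliency) score.
import torch
import torch.nn.functional as F

def attention_weighted_triplet(anchor, positive, negative, attn, margin=1.0):
    """anchor/positive/negative: (N, D) descriptors; attn: (N,) scores in [0, 1]."""
    d_pos = F.pairwise_distance(anchor, positive)          # distance to true match
    d_neg = F.pairwise_distance(anchor, negative)          # distance to non-match
    per_point = F.relu(d_pos - d_neg + margin)             # standard triplet term
    return (attn * per_point).sum() / (attn.sum() + 1e-8)  # attention-weighted mean

a, p, n = (F.normalize(torch.randn(256, 128), dim=1) for _ in range(3))
attn = torch.rand(256)                                     # e.g., predicted saliency
print(attention_weighted_triplet(a, p, n, attn))
```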
Efficient Keypoint Detection and Description Based on Knowledge Distillation. Keypoint detection and description underpins numerous visual applications, so its matching accuracy and runtime efficiency directly affect the performance and deployment feasibility of those applications. However, most current deep learning research on keypoint detection and description focuses on improving matching accuracy and pays relatively little attention to runtime efficiency, leaving current methods with large parameter counts and suboptimal runtime performance. To bridge this gap, this thesis presents a powerful yet computationally efficient keypoint detection and description method based on knowledge distillation, aiming to strike the optimal balance between matching accuracy and efficiency. It first introduces a highly compact backbone tailored for dense feature extraction in keypoint detection and description, condensing the parameter size to 0.17 MB. To endow the lightweight network with stronger matching capability, the thesis then pioneers the use of knowledge distillation in this domain, explores different forms of distillation, and proposes a second-order Descriptor Space Distillation strategy to boost the matching capability of the lightweight network. The precision and efficiency of the proposed method are evaluated across multiple downstream tasks, encompassing image matching, homography estimation, 3D reconstruction, and visual localization. The experimental results show that it is the first deep learning-based keypoint detection and description method to exceed 100 FPS (frames per second) on a consumer-grade GPU while maintaining highly competitive matching accuracy.

In summary, through the research above, this thesis effectively improves both the matching precision and the runtime efficiency of keypoint detection and description methods grounded in dense feature learning.
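One plausible reading of the second-order Descriptor Space Distillation named above is that the student matches the teacher's pairwise similarity structure rather than its raw descriptors, which also sidesteps any dimension mismatch between the two networks. The formulation below is a sketch under that assumption, not the thesis's exact objective:

```python
# Second-order distillation: align N x N descriptor similarity (Gram) matrices,
# which works even when student and teacher descriptor dimensions differ.
import torch
import torch.nn.functional as F

def second_order_distillation(student_desc, teacher_desc):
    """student_desc: (N, Ds), teacher_desc: (N, Dt), sampled at the same keypoints."""
    s = F.normalize(student_desc, dim=1)
    t = F.normalize(teacher_desc, dim=1)
    return F.mse_loss(s @ s.t(), t @ t.t())  # match pairwise similarity structure

student = torch.randn(512, 64, requires_grad=True)  # lightweight student (64-D)
teacher = torch.randn(512, 128)                     # frozen teacher (128-D)
loss = second_order_distillation(student, teacher)
loss.backward()                                     # gradients reach the student only
print(loss.item())
```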

Keywords: image keypoint detection and description; dense feature learning; normalization techniques; consistent attention mechanism; knowledge distillation
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/56644
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems_3D Visual Computing
Recommended Citation (GB/T 7714):
王常维. 基于密集特征学习的图像关键点检测与描述方法研究[D], 2024.
Files in This Item:
王常维_毕业论文.pdf (30096 KB); Document Type: Thesis; Access: Open Access; License: CC BY-NC-SA