Research on Image Local Feature Extraction and Depth Estimation Methods for 3D Scene Perception
Author: Zhang Yuyang (张宇阳)
Date: 2020-05-25
Pages: 120
Degree Type: Doctoral
Chinese Abstract

Image local feature extraction is a fundamental research problem in computer vision. Its goal is to detect representative pixel locations (keypoints) in a given image and to extract their corresponding descriptors; the detected keypoints and descriptors are widely used in a variety of 3D-vision tasks. Image depth estimation refers to predicting the depth value of every pixel in a given image; as essential structural information about 3D space, depth is likewise used across many 3D-vision tasks. Local feature extraction and depth estimation parse scene information at different levels, from sparse and dense pixels respectively, and are of great importance for computers to understand and model 3D spatial structure. This thesis studies deep-learning-based image local feature extraction and depth estimation methods:

1. An image local feature extraction method based on multi-scale information fusion. Most deep-learning-based local feature extraction models detect and describe keypoints directly from deep convolutional features. Although deep convolutional features have a large receptive field, they often lose important local structure information and therefore cannot adequately represent the local structure around a keypoint, which weakens model performance. To address this, the thesis proposes a local feature extraction method based on multi-scale information fusion and constructs a multi-scale feature shuffle module and a multi-scale feature blend module. The former uses sub-pixel convolution to upsample feature maps of different scales to the original image resolution before detecting keypoints, which greatly improves detection accuracy; the latter introduces multi-level convolutional features into descriptor construction, compensating for the lack of local structure information in deep features and significantly improving the robustness of descriptor matching. The effectiveness of the method is fully validated on the HPatches, FM-Bench, and Aachen-Day-Night public datasets.

2. A feature descriptor learning method based on semantic enhancement. Although mainstream local feature extraction methods have made notable progress, the training data lack effective high-level semantic supervision, so most methods can only build loss functions from low-level semantic constraints. As a result, the models cannot effectively learn the rich high-level semantics in images, which limits their performance. To address this, the thesis proposes a semantically enhanced descriptor learning method that splits the descriptor into a metric descriptor and a semantic descriptor. The metric descriptor is trained with standard metric learning so that it is discriminative and reliable, while the semantic descriptor is trained indirectly through image classification so that it captures high-level category semantics. Combining the two effectively improves matching accuracy and robustness in complex scenes. To cope with the absence of image category labels in the datasets, the thesis further designs two weakly supervised image classification strategies that let the model learn useful category semantics without ground-truth labels, greatly reducing the cost of data acquisition. The effectiveness of the method is fully validated on the HPatches, Aachen-Day-Night, and InLoc public datasets.

3. An unsupervised depth estimation method based on dense convolution and multi-view constraints. In image depth estimation, models built from residual modules cannot adequately parse local image structure and thus fail to produce accurate depth values; moreover, loss functions built on color consistency are sensitive to noise in the training data, which also degrades accuracy. The thesis therefore proposes an unsupervised depth estimation method based on dense convolution and multi-view constraints. Drawing on the idea of dense connections, it first designs a dense convolution module for high-precision depth estimation, which greatly improves accuracy. It then exploits the latent geometric constraints between views to construct a depth-consistency loss, which effectively reduces the negative impact of data noise on training and further strengthens the robustness of the model. The effectiveness of the method is fully validated on the KITTI and Cityscapes public datasets.

In summary, this work effectively addresses two problems in current image local feature extraction methods: large keypoint localization errors within local regions, and insufficient descriptor matching accuracy in scenes with large illumination or viewpoint differences. It also addresses two problems in current unsupervised depth estimation methods: inadequate parsing of low- and high-level image semantics, and unsupervised training losses that are strongly affected by data noise. The proposed local feature extraction and depth estimation methods can effectively improve the accuracy and robustness of 3D scene perception applications.

English Abstract

Image local feature extraction is a basic tool in the field of computer vision. It is defined as detecting representative pixel positions (keypoints) in a given image and extracting their corresponding descriptors. The extracted keypoints and descriptors are widely used in various 3D-vision tasks. Image depth estimation is defined as predicting the depth value of each pixel in a given image. Depth, as important structural information about 3D space, is also used in various 3D-vision tasks. Image local feature extraction and image depth estimation parse the scene information in an image at different levels, which is of great significance for computers to understand the 3D structure behind images. This thesis focuses on image local feature extraction and depth estimation:

1. Proposing an image local feature extraction method based on multi-level information fusion. In deep-learning-based image local feature extraction, most methods use deep convolutional features directly to detect and describe keypoints. Although these deep convolutional features have a large receptive field, they often lose important local structure information, which prevents them from characterizing the local structure surrounding the keypoints well and thus weakens the performance of the model. To solve this problem, this thesis proposes an image local feature extraction method based on multi-level information fusion, which constructs a novel Feature Shuffle Module and a novel Feature Blend Module. The former uses sub-pixel convolution to upsample the feature maps of different scales from low resolution to high resolution and then detects keypoints on the upsampled feature maps, which greatly improves the accuracy of the detected keypoints. The latter deploys multi-level convolutional feature vectors to construct the descriptors, which significantly improves their robustness. Comprehensive experiments on the HPatches, FM-Bench, and Aachen-Day-Night datasets show that the proposed method achieves state-of-the-art performance.
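To make the module design concrete, the following is a minimal PyTorch sketch of the two ideas just described: sub-pixel (PixelShuffle) upsampling of multi-scale feature maps for keypoint detection, and blending of multi-level features for description. All class names, channel counts, and strides are illustrative assumptions, not the thesis' actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureShuffleHead(nn.Module):
    """Upsamples multi-scale feature maps to input resolution with sub-pixel
    convolution (PixelShuffle) and fuses them into one keypoint score map."""
    def __init__(self, channels=(64, 128, 256), strides=(2, 4, 8)):
        super().__init__()
        # One sub-pixel branch per scale: 3x3 conv -> PixelShuffle(stride).
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, s * s, kernel_size=3, padding=1),
                          nn.PixelShuffle(s))
            for c, s in zip(channels, strides))
        self.score = nn.Conv2d(len(channels), 1, kernel_size=1)

    def forward(self, feats):
        # Each branch outputs a 1-channel map at the original resolution.
        maps = [branch(f) for branch, f in zip(self.branches, feats)]
        return torch.sigmoid(self.score(torch.cat(maps, dim=1)))

def blend_descriptors(feats, full_size):
    """Upsamples every feature level to full resolution and concatenates them,
    so that shallow (local-structure) and deep (semantic) features both enter
    the per-pixel descriptor; descriptors are L2-normalized."""
    up = [F.interpolate(f, size=full_size, mode="bilinear", align_corners=False)
          for f in feats]
    return F.normalize(torch.cat(up, dim=1), p=2, dim=1)

Sub-pixel convolution lets detection operate at full image resolution while avoiding the checkerboard artifacts that transposed convolutions can introduce, which matches the accuracy argument above.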

2. Proposing a feature descriptor learning method based on semantic enhancement. Due to the lack of high-level semantic supervision in the training data, most methods can only use low-level semantic information to train the model. As a result, the model cannot effectively learn the high-level semantics in the image, which limits the matching performance of the descriptors. To solve this problem, this thesis proposes a descriptor learning method based on semantic enhancement, which builds the descriptor from two sub-descriptors: a metric descriptor and a semantic descriptor. The metric descriptor is trained with commonly used metric learning, making it discriminative and reliable, while the semantic descriptor is trained indirectly through image classification, enabling it to capture the high-level semantics in the image. By combining the metric descriptor and the semantic descriptor, the matching accuracy of the resulting descriptor is largely improved. Furthermore, to handle the absence of image category labels in the training data, this thesis proposes two novel weakly supervised image classification learning strategies. They enable the model to learn image category semantics without ground-truth labels, which greatly reduces the cost of data acquisition. Comprehensive experiments on the HPatches, Aachen-Day-Night, and InLoc datasets show that the proposed method achieves state-of-the-art performance.
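As an illustration of this two-part objective, the sketch below combines a triplet-margin loss for the metric sub-descriptor with a cross-entropy classification loss on the semantic sub-descriptor, using pseudo category labels as a stand-in for the weakly supervised strategies. The loss weight and all names are assumptions for illustration, not the thesis' exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

triplet = nn.TripletMarginLoss(margin=1.0)   # metric-learning term
xent = nn.CrossEntropyLoss()                 # semantic (classification) term

def descriptor_loss(anchor, positive, negative,
                    sem_logits, pseudo_labels, w_sem=0.1):
    """anchor/positive/negative: metric sub-descriptors, shape (B, D_m).
    sem_logits: classifier outputs on the semantic sub-descriptor, shape (B, K).
    pseudo_labels: weakly supervised category targets, shape (B,), e.g. obtained
    by clustering image-level features when no ground-truth labels are available."""
    loss_metric = triplet(anchor, positive, negative)
    loss_semantic = xent(sem_logits, pseudo_labels)
    return loss_metric + w_sem * loss_semantic

def full_descriptor(d_metric, d_semantic):
    # At matching time the two sub-descriptors are concatenated and re-normalized.
    return F.normalize(torch.cat([d_metric, d_semantic], dim=-1), p=2, dim=-1)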

3. Proposing an unsupervised depth estimation method based on dense convolution and multi-view constraints. In image depth estimation, most models can effectively capture the global semantics of an image, but their ability to parse its local structure is insufficient, which produces blurred depth maps. Besides, the color-consistency loss is susceptible to noise and cannot provide effective gradients for model optimization on noisy training data. To solve these problems, this thesis proposes an unsupervised depth estimation method based on dense convolution and multi-view constraints. First, this thesis designs a novel dense convolution module for high-precision depth estimation, which greatly improves the accuracy of the estimated depth. At the same time, it constructs a novel depth-consistency loss based on the multi-view constraints, which effectively reduces the negative impact of noise in the training data and further enhances the robustness of the model. Comprehensive experiments on the KITTI and Cityscapes datasets show that the proposed method achieves state-of-the-art performance.
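The depth-consistency idea can be illustrated with a short warping-based loss: the depth predicted for one view is projected into a second view using the relative pose and compared against the depth predicted there. The helper below is a minimal sketch under assumed tensor shapes and a precomputed pixel grid; it is not the thesis' exact loss.

import torch
import torch.nn.functional as F

def depth_consistency_loss(depth_a, depth_b, K, T_ab, pix_a):
    """depth_a, depth_b: predicted depth maps of two views, shape (B, 1, H, W).
    K: camera intrinsics, (B, 3, 3); T_ab: relative pose from view a to b, (B, 4, 4).
    pix_a: homogeneous pixel grid of view a, (B, 3, H*W), row-major order."""
    B, _, H, W = depth_a.shape
    # Back-project view-a pixels to 3D and move them into view b's camera frame.
    cam_a = torch.linalg.inv(K) @ (pix_a * depth_a.view(B, 1, -1))
    cam_b = T_ab[:, :3, :3] @ cam_a + T_ab[:, :3, 3:4]
    z_proj = cam_b[:, 2:3, :].clamp(min=1e-6)      # depth induced by geometry
    uv = (K @ cam_b)[:, :2, :] / z_proj            # pixel coordinates in view b
    # Normalize to [-1, 1] and sample view b's predicted depth at those pixels.
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(B, H, W, 2)
    z_pred = F.grid_sample(depth_b, grid, align_corners=True).view(B, 1, -1)
    # Penalize disagreement between the two depth estimates (scale-normalized).
    return ((z_proj - z_pred).abs() / (z_proj + z_pred)).mean()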

In summary, the methods proposed in this thesis effectively improve the precision of keypoint detection and descriptor matching. In addition, they mitigate the adverse effects of the commonly used unsupervised training losses and effectively improve depth estimation accuracy. The proposed methods can further improve the accuracy and robustness of current 3D scene perception applications.

Keywords: image local features; depth estimation; unsupervised learning; keypoints and descriptors; depth consistency
Language: Chinese
Sub-direction Classification: 3D Vision
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/44711
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems - 3D Visual Computing
Recommended Citation (GB/T 7714):
张宇阳. 面向三维场景感知的图像局部特征提取与深度估计方法研究[D]. 中国科学院自动化研究所, 2020.
Files in This Item:
面向三维场景感知的图像局部特征提取与深度 (24088 KB) | Document type: Thesis | Access: Open Access | License: CC BY-NC-SA