Research on Monocular Depth Estimation Methods Based on Self-Supervised Learning
周正铭
2023-05-24
Pages: 88
Degree type: Master's
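Chinese abstract (English translation)

Monocular depth estimation aims to predict the scene depth map corresponding to a single input image. It is a popular research topic in computer vision and shows broad application prospects in fields such as robot navigation, autonomous driving, and augmented reality. Monocular depth estimation methods based on deep neural networks generally treat depth estimation as a pixel-level regression or classification task: they rely on large amounts of training data to learn highly discriminative pixel-level features from images and use these features to predict the depth map. To free monocular depth estimation models from their dependence on ground-truth scene depth data during training, monocular depth estimation methods based on self-supervised learning have received wide attention in recent years. A core problem in self-supervised monocular depth estimation is how to learn highly discriminative pixel-level features from unlabeled training images and thereby improve model performance. To address this problem, this thesis carries out research from three perspectives: the form of the scene depth constraint, the form of feature fusion, and the form of task fusion. The main work includes:

1. The respective strengths and weaknesses of the continuous and discrete depth constraints commonly used in the literature are analyzed, and a prototype-residual based self-supervised monocular depth estimation network is proposed that combines the strengths of both. The network uses two branches to learn a coarse-grained scene depth map and a scene depth residual map, respectively, and takes the sum of the two as the final fine-grained scene depth map. In addition, the network introduces an occlusion-aware module to further alleviate the negative influence of occluded regions in the training data. Experimental results on the public datasets KITTI and Make3D show that the proposed prototype-residual based network outperforms a variety of mainstream self-supervised monocular depth estimation methods in the literature.

2. To address the scene feature fusion problem, a self-supervised monocular depth estimation network based on self-distilled feature aggregation is proposed. The network contains multiple self-distilled feature aggregation modules for fusing scene features of different scales. Each module learns three feature offset vector maps through three branches: one is used to refine the small-scale feature, and the other two are used to refine the large-scale feature under self-distillation. So that the self-distilled feature aggregation modules can fuse multi-scale features more effectively while preserving the contextual consistency of the features, a new self-distilled training strategy is designed for training the network. Experimental results on the public dataset KITTI show that the proposed network outperforms many mainstream self-supervised monocular depth estimation methods in the literature.

3. The differences and similarities between the self-supervised monocular and self-supervised binocular depth estimation tasks are analyzed, and a self-supervised depth estimation network with monocular and binocular collaboration is proposed that combines the characteristics of both and can handle the two tasks at the same time. The network adopts a Siamese structure in which each sub-network can be used as a monocular depth estimation model. To process stereo images jointly and perform binocular depth estimation, the network introduces a monocular feature matching module that matches image features implicitly. To exploit the respective advantages of the two tasks during training, a stepwise joint-training strategy is introduced for training the network. Experimental results on the public datasets KITTI, DDAD, and Cityscapes show that the proposed network completes both monocular and binocular depth estimation effectively and achieves competitive performance on both tasks.

In summary, this thesis focuses on monocular depth estimation based on self-supervised learning and proposes a prototype-residual based self-supervised monocular depth estimation network, a self-supervised monocular depth estimation network based on self-distilled feature aggregation, and a self-supervised depth estimation network with monocular and binocular collaboration. The proposed methods effectively alleviate the problems of existing methods in depth constraints, feature fusion, and task fusion, help models learn highly discriminative pixel-level features under self-supervision, and improve their monocular depth estimation performance.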

English abstract

Monocular depth estimation, which aims to predict the scene depth map corresponding to a single input image, is an important topic in computer vision, with broad application prospects in fields such as robot navigation, autonomous driving, and augmented reality. Existing methods based on deep neural networks (DNNs) generally treat monocular depth estimation as a pixel-level regression or classification task: they rely on a large amount of training data to learn discriminative pixel-level features from images and then predict the depth map from these features. To remove the dependence of monocular depth estimation models on ground-truth depth data during training, self-supervised monocular depth estimation methods have received much attention in recent years. A central problem in self-supervised monocular depth estimation is how to learn discriminative pixel-level features from unlabeled training images and thereby improve model performance. This thesis studies that problem from three perspectives: scene depth constraints, feature fusion, and task fusion. The main contributions are:

1. The advantages and disadvantages of the continuous and discrete depth constraints commonly used in the literature are analyzed, and a prototype-residual based network for self-supervised monocular depth estimation is proposed that combines the advantages of both. The proposed network uses two branches to learn a coarse-level scene depth map and a scene depth residual map, respectively, and adds them together to obtain a fine-level scene depth map as the final output. In addition, an occlusion-aware module is introduced to further alleviate the negative influence of occluded regions in the training data. Experimental results on the public datasets KITTI and Make3D show that the proposed network outperforms multiple state-of-the-art methods in the literature.

2. To address the scene feature fusion problem, a self-supervised monocular depth estimation network based on self-distilled feature aggregation is proposed. The network contains multiple self-distilled feature aggregation modules for fusing scene features of different scales. Each module learns three feature offset maps through three branches: one refines the small-scale feature, and the other two refine the large-scale feature in a self-distilled manner. A new self-distilled training strategy is designed so that the modules not only fuse multi-scale features more effectively but also maintain the contextual consistency of the features. Experimental results on the public dataset KITTI show that the proposed network outperforms many state-of-the-art methods in the literature.

3. The differences and similarities between the self-supervised monocular and binocular depth estimation tasks are analyzed, and a self-supervised depth estimation network with monocular and binocular collaboration is proposed that handles both tasks by drawing on the characteristics of each. The network uses a Siamese structure in which each sub-network can serve as a monocular depth estimation model. To process stereo images jointly for binocular depth estimation, a monocular feature matching module is introduced that matches image features implicitly. To exploit the respective advantages of the two tasks during training, a stepwise joint-training strategy is introduced. Experimental results on the public datasets KITTI, DDAD, and Cityscapes show that the proposed network handles both monocular and binocular depth estimation effectively and achieves competitive performance on both tasks.

In summary, this thesis proposes three networks for self-supervised monocular depth estimation: a prototype-residual based network, a network based on self-distilled feature aggregation, and a network with monocular and binocular collaboration. The proposed methods effectively alleviate the problems of existing methods in scene depth constraints, feature fusion, and task fusion, helping models learn discriminative pixel-level features in a self-supervised manner and improving their monocular depth estimation performance.
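As a rough illustration of the prototype-residual idea in contribution 1, the per-pixel computation might look like the sketch below. Only the final step (fine depth = coarse depth + residual) comes from the abstract. How the coarse depth is obtained is not stated there, so the sketch assumes it is a softmax-weighted mean over a fixed set of discrete depth prototypes. The names `softmax`, `coarse_depth`, `fine_depth`, and `PROTOTYPES`, and all numeric values, are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_depth(logits, prototypes):
    """Coarse depth for one pixel.

    ASSUMPTION: modelled as the probability-weighted mean of discrete
    depth prototypes; the abstract does not specify this step.
    """
    probs = softmax(logits)
    return sum(p * d for p, d in zip(probs, prototypes))


def fine_depth(logits, residual, prototypes):
    """Fine depth = coarse depth + residual (the step stated in the abstract)."""
    return coarse_depth(logits, prototypes) + residual


# Hypothetical depth prototypes in metres (illustrative values only).
PROTOTYPES = [1.0, 5.0, 20.0, 80.0]

# Uniform logits: the coarse depth is the plain mean of the prototypes.
uniform = fine_depth([0.0, 0.0, 0.0, 0.0], 0.0, PROTOTYPES)

# A confident vote for the second prototype, nudged by a small residual.
confident = fine_depth([0.0, 50.0, 0.0, 0.0], 0.25, PROTOTYPES)
```

Under these assumptions, the discrete branch fixes the depth only to the resolution of the prototypes, and the continuous residual branch supplies the fine correction.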

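Keywords: monocular depth estimation; self-supervised learning; deep neural network
Language: Chinese
Seven major directions, sub-direction classification: Image and video processing and analysis
State Key Laboratory planned direction classification: Visual information processing
Thesis-associated dataset requiring deposit:
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/52054
Collection: Graduates_Master's theses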
Recommended citation
GB/T 7714
周正铭. 基于自监督学习的单目深度估计方法研究 [Research on Monocular Depth Estimation Methods Based on Self-Supervised Learning][D], 2023.
Files in this item
File name/size: Thesis_Zhouzhengming (22570 KB)
Document type: Thesis
Access type: Restricted access
License: CC BY-NC-SA