Learning Depth from Single Images
Lei He (何雷)1,2
2018-05-29
Degree type: Doctor of Engineering
Chinese abstract
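Depth information is an essential element of 3D scene understanding. Learning pixel-level depth from a single image requires no difficult operations such as matching corresponding points between images, so it has distinct advantages and application prospects over methods that extract depth from multiple images, such as stereo vision. However, the image formation model shows that, in theory, the depth lost during imaging cannot be uniquely recovered from a single 2D image. Recovering depth from a single image is therefore inherently an ill-posed problem, and solving it requires constraints from scene priors, global information, and local information. From the viewpoint of inferring single-image depth by machine learning, model learning aims to establish a mapping between "image representation" and "depth information". How to learn a suitable image representation and mapping, in a statistical sense, from a large number of images is thus the core problem of learning depth from a single image. Beyond the quality of depth inference, speeding up inference is a further goal. This thesis presents a systematic study of several key problems in learning depth from single images. The main contributions are as follows: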
 
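1. A fast non-parametric method for learning depth from a single image

Non-parametric depth estimation generally proceeds as follows: global features are first extracted from the database images and the input image; the global features are then used to search the database for the set of candidate images most similar to the input image; dense correspondences between the candidate images and the input image are established; and these correspondences are used to transfer, fuse, and optimize the candidate depths. A major problem of this approach is its low computational efficiency. To address it, this thesis proposes a computation strategy in which the "flow" at a fine scale is interpolated from the "flow" at a coarse scale, and gives a fast sparse SIFT flow method that achieves a speed-up of 2 to 3 times. In addition, by analyzing the per-pixel SIFT flow descriptors, a discriminative statistical factor is introduced as a weight in the data term of the energy function, which improves the reliability of depth estimation.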
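
The coarse-to-fine idea mentioned above can be illustrated with a minimal sketch. It only shows how a flow field on a coarse grid might be bilinearly interpolated onto a finer grid; it is not the thesis's sparse SIFT flow algorithm, and the function name `upsample_flow`, the list-of-tuples layout, and the clamped border handling are assumptions made for this example.

```python
def upsample_flow(coarse, factor=2):
    """Bilinearly interpolate a coarse flow field onto a grid `factor` times finer.

    `coarse` is a list of rows; each entry is a (dx, dy) displacement measured
    in coarse-grid pixels. Displacements are multiplied by `factor` so that the
    result is expressed in fine-grid pixels. (Hypothetical helper.)
    """
    h, w = len(coarse), len(coarse[0])
    fine = []
    for y in range(h * factor):
        # Position of this fine pixel in coarse-grid coordinates, clamped to the grid.
        cy = min(y / factor, h - 1)
        y0 = int(cy)
        y1 = min(y0 + 1, h - 1)
        ty = cy - y0
        row = []
        for x in range(w * factor):
            cx = min(x / factor, w - 1)
            x0 = int(cx)
            x1 = min(x0 + 1, w - 1)
            tx = cx - x0
            vec = []
            for c in range(2):  # c = 0 for dx, c = 1 for dy
                top = (1 - tx) * coarse[y0][x0][c] + tx * coarse[y0][x1][c]
                bottom = (1 - tx) * coarse[y1][x0][c] + tx * coarse[y1][x1][c]
                vec.append(factor * ((1 - ty) * top + ty * bottom))
            row.append(tuple(vec))
        fine.append(row)
    return fine
```

In a coarse-to-fine scheme, an interpolated field like this would typically serve as the initial estimate at the finer scale, so that only a limited refinement has to be computed there.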
 
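2. Analysis of the inherent ambiguity of single-image depth estimation and generation of varying-focal-length datasets

We analyze theoretically the ambiguity that an unknown focal length introduces when learning depth from a single image, and verify this ambiguity experimentally on captured real image data. To eliminate the focal-length-induced ambiguity and learn single-image depth reliably, camera intrinsic parameters such as the focal length need to be taken into account in both the training and the testing phases of the model. Because current datasets for learning depth from single images all have a fixed focal length, this thesis proposes a method for generating multi-focal-length datasets from fixed-focal-length datasets and generates two "varying-focal-length datasets". To deal with the holes in the newly generated images, it also proposes a fast hole-filling method that fuses neighborhood information, making the generated new-focal-length images closer to real images.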
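
One simple way to see how focal length and depth can trade off is the standard pinhole projection u = f·X/Z, v = f·Y/Z. The sketch below is an illustration under that model, not the thesis's own derivation: a point at depth Z seen with focal length f lands on the same pixel as a point with the same lateral position at depth 2Z seen with focal length 2f. The function name `project` and the numeric values are chosen only for this example.

```python
def project(point, focal, principal=(0.0, 0.0)):
    """Pinhole projection of a 3D point (X, Y, Z) given in camera coordinates."""
    X, Y, Z = point
    return (focal * X / Z + principal[0], focal * Y / Z + principal[1])


# A point at depth 4 seen with focal length 500 ...
near = project((1.0, 0.5, 4.0), focal=500.0)
# ... and a point with the same X, Y at depth 8 seen with focal length 1000.
far = project((1.0, 0.5, 8.0), focal=1000.0)

print(near, far)  # both are (125.0, 62.5)
```

Because the two configurations give identical pixel coordinates, an image alone cannot tell them apart unless the focal length is known.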
 
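3. A single-image depth estimation method that fully fuses the mid-level information of the network

Based on a thorough analysis of the strengths and weaknesses of current deep convolutional neural network architectures for pixel-wise depth estimation, a method is proposed that fully fuses the mid-level information of the network to compensate for the loss of spatial resolution, which improves the accuracy of depth inference. On fixed-focal-length datasets, the proposed method outperforms other methods with the same pre-trained architecture on every error metric and produces finer-grained single-image depth maps. Building on this, to eliminate the focal-length-induced ambiguity of single-image depth estimation, we embed the focal length into the model in the form of fully connected layers. Extensive tests on the generated multi-focal-length datasets show that, compared with the model without embedded focal-length information, the model with embedded focal-length information is significantly more accurate on all error metrics.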
 
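4. A spindle-shaped network architecture for inferring pixel-level depth

In terms of architecture, current deep neural networks that learn pixel-level labels generally adopt an Encoder-Decoder structure, and they are all obtained by transfer learning from networks for high-level vision tasks. To learn the pixel-level depth map of a single image directly, this thesis designs a spindle-shaped network: the input image is first lifted to a higher dimension, and features are then extracted in the high-dimensional space for depth estimation. To overcome the limits of GPU memory, the lifting operation is carried out with single-image super-resolution. To capture global information over a wider range, the idea of dilated convolution is extended to dilated convolution kernels. Comparison with existing methods shows that the proposed method still produces fairly credible depth estimates when the input image has low resolution. The spindle-shaped network and its implementation strategy offer a new route to inferring depth from a single image and are also of reference value for other pixel-level inference problems.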
 
English abstract
Depth is an important cue for 3D scene understanding. Compared with methods that recover depth from multiple images, such as stereo vision, depth recovery from a single 2D image is particularly attractive and promising for various applications, because it avoids establishing point correspondences across images, a difficult problem in itself. However, the imaging process tells us that, in theory, depth cannot be uniquely recovered from a single image alone; depth recovery from single images is an inherently ill-posed problem. Consequently, scene priors and various global and local features are needed to regularize the problem sufficiently for meaningful estimation. From the depth-learning point of view, model learning establishes a mapping from "image representation" to "depth", so how to design appropriate networks and learning algorithms that learn statistically good image representations and mapping models are two key issues. In addition to depth estimation accuracy, how to speed up the computation is another key aspect of the field. This work tackles these difficult problems in learning depth from single images, and our main contributions include:
            
1. A new fast non-parametric method of depth estimation from single images
 
Non-parametric methods for depth estimation generally proceed in four steps. First, global features are extracted from the input image and from the images in the dataset. Second, the dataset images most similar to the input image are retrieved. Third, dense point correspondences between the input image and the retrieved candidate images are established. Finally, the depths of the candidate images are transferred to the input image, fused, and optimized. A key problem of such non-parametric methods is their heavy computational load. To alleviate this problem, in the fine-scale flow computation stage the fine-scale flow is interpolated from the flow at the immediately preceding coarse scale, which significantly reduces the computational load. More specifically, we propose a novel sparse SIFT flow method that speeds up the computation by a factor of 2-3. In addition, based on a thorough analysis of the variance of the pixel-wise SIFT flow descriptors, a reweighting technique based on the variance statistics is introduced into the data term of the conditional Markov random field to further improve the estimation quality.
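
The retrieval and fusion steps of such a pipeline can be sketched in a few lines. This is a simplified illustration, not the thesis's method: the dense correspondence (warping) step and the final optimization are omitted, candidates are fused with a plain per-pixel median, and the names `retrieve` and `transfer_depth` as well as the dictionary layout with `"feature"` and `"depth"` keys are assumptions made for this example.

```python
from statistics import median


def retrieve(query_feat, database, k=3):
    """Return the k database entries whose global feature is closest to the query."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return sorted(database, key=lambda e: sq_dist(e["feature"], query_feat))[:k]


def transfer_depth(candidates):
    """Fuse candidate depth maps by a per-pixel median.

    A real pipeline would first warp each candidate depth map onto the input
    image using the dense correspondences; that step is left out here.
    """
    h = len(candidates[0]["depth"])
    w = len(candidates[0]["depth"][0])
    return [
        [median(c["depth"][y][x] for c in candidates) for x in range(w)]
        for y in range(h)
    ]
```

A caller would compute a global feature for the input image, pass it to `retrieve`, and hand the returned candidates to `transfer_depth`; the median is used here only because it is a simple fusion rule that tolerates an occasional poorly matched candidate.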
 
2. Inherent ambiguity in depth learning from single images and generation of varying-focal-length datasets
 
We showed theoretically that an inherent ambiguity, caused by the unknown focal length, exists in depth learning from single images, and verified it experimentally. We argue that including the focal length in the training and inference phases can reduce this ambiguity. Because the current datasets for monocular depth estimation are all captured with a fixed focal length, and capturing varying-focal-length images on a large scale is labor-intensive and expensive, we explored a new way to generate varying-focal-length datasets from fixed-focal-length datasets and obtained two varying-focal-length datasets. To fill the holes inevitably produced in the generation process, we proposed a new template-based hole-filling technique based on local feature similarity, which further improved the visual quality of the generated images.
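
The abstract does not spell out the generation procedure, so the following is only a sketch under a simplifying assumption: the camera centre stays fixed, so changing the focal length scales pixel coordinates about the image centre by f_new / f_old. Forward-mapping pixels this way leaves unfilled targets when zooming in, which shows why holes arise. The hole filling here is a plain average of valid neighbours, not the template-based technique described above, and the names `zoom_forward` and `fill_holes` are hypothetical.

```python
def zoom_forward(image, scale):
    """Forward-map every pixel about the image centre by scale = f_new / f_old.

    Target pixels that receive no source pixel stay None (holes).
    """
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny = round(cy + scale * (y - cy))
            nx = round(cx + scale * (x - cx))
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = image[y][x]
    return out


def fill_holes(image):
    """Fill each hole with the mean of its valid 8-neighbours (single pass)."""
    h, w = len(image), len(image[0])
    filled = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            if image[y][x] is not None:
                continue
            neighbours = [
                image[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
                if image[j][i] is not None
            ]
            if neighbours:
                filled[y][x] = sum(neighbours) / len(neighbours)
    return filled


image = [[float(5 * y + x) for x in range(5)] for y in range(5)]
warped = zoom_forward(image, scale=2.0)  # doubling the focal length
filled = fill_holes(warped)
```

On this 5x5 toy image, doubling the focal length leaves 16 of the 25 target pixels empty after the forward mapping, and the single averaging pass fills all of them.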
 
3. Depth learning from single images by fully exploiting mid-level information 
 
Based on a thorough analysis of the advantages and disadvantages of current networks for depth estimation from single images, a new method is introduced that fully integrates the mid-level information of the network to remedy the loss of spatial resolution and thereby increases the depth estimation accuracy. In extensive experiments on fixed-focal-length datasets, the proposed method outperformed all other methods under all the commonly used error metrics and produced finer-grained depth maps. Since the focal length is a kind of global information for monocular depth estimation, and in order to eliminate the ambiguity caused by the focal length, we embedded the focal length into the global features of the CNN model through fully connected layers. On the varying-focal-length datasets, extensive experiments showed that the models with embedded focal length were significantly more accurate than the models that do not encode focal-length information.
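
The abstract does not list which error metrics were used. As an assumption, the sketch below computes four measures that are widely reported in the monocular depth estimation literature: mean absolute relative error, root mean squared error, mean log10 error, and the fraction of pixels whose ratio max(pred/truth, truth/pred) is below 1.25. The function name `depth_metrics` and the flat-list input format are chosen for this example only.

```python
import math


def depth_metrics(pred, truth):
    """Compute common depth-estimation error measures over flat lists of depths.

    All depths are assumed to be strictly positive.
    """
    n = len(pred)
    pairs = list(zip(pred, truth))
    abs_rel = sum(abs(p - t) / t for p, t in pairs) / n
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n)
    log10 = sum(abs(math.log10(p) - math.log10(t)) for p, t in pairs) / n
    delta1 = sum(1 for p, t in pairs if max(p / t, t / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "log10": log10, "delta1": delta1}
```

Lower is better for the first three measures, while a higher threshold accuracy `delta1` is better.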
 
4. A novel deep CNN, Spindle-Net CNN, for pixel-wise depth prediction
 
Current deep convolutional neural networks for inferring pixel-level labels generally adopt the Encoder-Decoder architecture, which was originally designed for high-level vision tasks. In order to learn pixel-wise depth directly from single images, we designed a spindle-like network, called Spindle-CNN. First, the input image is lifted to a high-dimensional space; then features specific to monocular depth estimation are extracted. To overcome the limits of GPU memory, single-image super-resolution is used to carry out the lifting operation. In addition, dilated convolution is extended to a kernel dilation operation to capture global information from a wider region. Compared with existing methods, the proposed network can still achieve similar depth estimation accuracy on low-resolution images. The proposed Spindle-CNN and its implementation techniques provide a new way to learn pixel-wise depth from a single image and are of reference value for other single-image pixel-wise inference problems.
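
For readers unfamiliar with dilated convolution, the sketch below shows the standard operation in one dimension: the kernel taps are spaced `dilation` samples apart, so the same number of weights covers a wider region. It does not implement the thesis's kernel dilation extension, whose details the abstract does not give; the function name `dilated_conv1d` is an assumption for this example.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D cross-correlation with kernel taps spaced `dilation` apart."""
    # Number of input samples spanned by one output value (its receptive field).
    span = (len(kernel) - 1) * dilation + 1
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
dense = dilated_conv1d(signal, kernel, dilation=1)  # each output spans 3 samples
wide = dilated_conv1d(signal, kernel, dilation=2)   # each output spans 5 samples
```

With three weights in both cases, the dilation of 2 compares samples four positions apart instead of two, which is how dilation widens the context seen by each output without adding parameters.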
 
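Keywords: learning depth from single images; non-parametric depth estimation; deep convolutional neural networks; spindle-shaped network
Language: Chinese
Document type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/21071
Collection: Graduates_Doctoral Dissertations
Author affiliations: 1. Institute of Automation, Chinese Academy of Sciences
2. University of Chinese Academy of Sciences
First author affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation
GB/T 7714
何雷. 从单幅图像学习深度[D]. 北京. 中国科学院大学,2018.
Files in this item
File name/size: Thesis_Lei_He.pdf (10050KB)
Document type: Dissertation
Access type: Restricted access
License: CC BY-NC-SA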