像素级图像理解高效特征融合方法研究 (Research on Efficient Feature Fusion Methods for Pixel-wise Image Understanding)
武慧凯
2020-05-27
Pages: 132
Degree type: Doctoral
Chinese Abstract

        Pixel-wise image understanding, also known as pixel-wise image prediction, refers to fine-grained processing and analysis of an image, aiming to predict the category or value corresponding to every pixel. It covers a variety of tasks such as semantic segmentation (predicting the category of each pixel), monocular depth estimation (predicting the depth of each pixel), and image enhancement (predicting the value of each pixel to improve image quality), enabling computers to perceive, understand, and process images at a finer granularity. It plays an important role in autonomous driving, intelligent robots, video surveillance, and intelligent photography and videography. Research on pixel-wise image understanding is therefore of great theoretical significance and practical value.

        Pixel-wise image understanding requires an algorithm to extract high-level semantic features, so that categories or values can be predicted accurately, and at the same time to extract low-level spatial features (detail features such as edges and textures), so that every pixel can be distinguished precisely. However, extracting semantic and spatial features simultaneously is very difficult: extracting semantic features generally requires a large receptive field and many nonlinear transformations, whereas extracting spatial features requires a small receptive field and few nonlinear transformations. Mainstream methods therefore extract semantic features and spatial features separately and then fuse the two, so that every pixel can be predicted accurately and distinguished precisely. How to fuse high-level semantic features and low-level spatial features efficiently and accurately is thus essential for pixel-wise image understanding, and it is also very challenging. This dissertation focuses on designing effective feature fusion algorithms with low time complexity and, according to the level of data abstraction, studies three problems: how to fuse features between images, how to fuse features between feature maps, and how to automatically fuse features between multi-scale feature maps.

        The main work and contributions of this dissertation are summarized as follows:

        1. A joint-upsampling-based framework is proposed for fusing the high-level semantic features and low-level spatial features of images, and within this framework a guided filtering unit is proposed for feature fusion between images

        Existing methods for pixel-wise image understanding usually apply a fully convolutional network directly to the high-resolution input image. Although this achieves good performance, it consumes a large amount of computational resources. To reduce the time and space complexity, this dissertation first applies a fully convolutional network to a low-resolution input image to obtain a low-resolution output, and then uses joint upsampling to fuse the semantic features in the low-resolution output with the spatial features in the high-resolution input image. In particular, a neural network module, the guided filtering unit, is designed to better fuse the semantic and spatial features of images. The module has low computational complexity and learnable parameters, and can be trained end-to-end together with the fully convolutional network. Experiments show that, compared with existing methods, the proposed method achieves comparable or even better performance on multiple tasks while running 10-100 times faster, which is of great significance for applying pixel-wise image understanding algorithms in real-world scenarios, especially on embedded and mobile devices.

        2. It is shown that a dilated fully convolutional network can be approximated by a standard fully convolutional network followed by joint upsampling, and a joint pyramid upsampling module is proposed for feature fusion between feature maps

        Among pixel-wise image understanding methods, dilated fully convolutional networks can extract high-level semantic features while maintaining low-level spatial features, and therefore achieved state-of-the-art performance on multiple datasets. However, compared with standard fully convolutional networks, their computational complexity increases dramatically, which makes them difficult to deploy in practical applications. Through analysis, this dissertation finds that a dilated fully convolutional network can be approximated by a standard fully convolutional network plus a joint upsampling module. Based on this observation, a framework is proposed that uses a standard fully convolutional network to extract feature maps and uses joint upsampling to fuse features between feature maps; within this framework, a joint pyramid upsampling module is proposed to fuse the semantic and spatial features in the feature maps. Experiments show that the proposed method achieves state-of-the-art performance on multiple datasets while reducing the computation by more than a factor of three, allowing the algorithm to run in real time.

        3. An algorithm is proposed to automatically design neural network modules for feature fusion between multi-scale feature maps

        In pixel-wise image understanding, a fully convolutional network usually extracts feature maps at multiple levels, i.e., multi-scale feature maps. Which feature maps to select for fusion and how to fuse the selected feature maps are central questions in feature fusion research. Existing methods usually rely on extensive experiments to manually select feature maps and design fusion algorithms, which requires strong expert knowledge and considerable human effort. In contrast, this dissertation aims to design an algorithm that automatically selects the feature maps to be fused and automatically designs the fusion algorithm. To this end, a search space containing a large number of candidate solutions is constructed, and a sparse binarization constraint is proposed to guide the search process. Experiments show that the proposed method can, within a short time, automatically design a multi-scale feature fusion module that achieves state-of-the-art performance with low computational complexity. Moreover, the module can be directly transferred to other network architectures, datasets, and tasks, and achieves competitive performance.

English Abstract

Pixel-wise image understanding, also known as pixel-wise image prediction, is the processing and analysis of an image at the pixel level, aiming to predict the corresponding category or value for each pixel. It involves a variety of tasks, such as semantic segmentation (predicting the category of each pixel), monocular depth estimation (predicting the depth of each pixel), and image enhancement (predicting the value of each pixel to improve image quality), enabling computers to perceive, understand, and process images more precisely. Pixel-wise image understanding plays an important role in autonomous driving, intelligent robots, video surveillance, and intelligent photography. Therefore, research on pixel-wise image understanding is of great theoretical significance and practical value.

Pixel-wise image understanding requires extracting high-level semantic features for classification or regression while maintaining low-level spatial features (detail features such as edges and textures) to achieve accurate prediction for each pixel. However, extracting semantic and spatial features simultaneously is difficult: semantic features generally call for a large receptive field and many nonlinear transformations, whereas spatial features call for the opposite. Current methods therefore usually extract the semantic features and the spatial features separately and then fuse them. Thus, efficient fusion of high-level semantic features and low-level spatial features is essential for pixel-wise image understanding, and it is also very challenging. This dissertation focuses on designing effective feature fusion algorithms with low computational complexity. To this end, three issues are studied: feature fusion between images, feature fusion between feature maps, and automatic feature fusion between multi-scale feature maps.

The main work and contributions of this dissertation are summarized as follows:

1. A framework based on joint upsampling is proposed for fusing the high-level semantic features and low-level spatial features of images, in which a guided filtering layer is proposed for feature fusion between images

Existing methods for pixel-wise image understanding usually employ a fully convolutional network to process high-resolution input images directly. Although good performance can be achieved, they consume substantial computational and memory resources. To reduce the time and space complexity, this dissertation proposes a framework that first obtains a low-resolution output by applying a fully convolutional network to a low-resolution input image, and then fuses the semantic features in the low-resolution output with the spatial features in the high-resolution input image via joint upsampling. In particular, a neural network module, the guided filtering layer, is designed to better fuse the semantic and spatial features of images. The module has learnable parameters and low computational complexity, and can be trained end-to-end with the fully convolutional network. Experiments show that the proposed method obtains comparable or even better performance than existing methods on multiple tasks while running 10-100 times faster, which is of great significance for applying pixel-wise image understanding algorithms in real-world scenarios, especially on embedded and mobile devices.
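To make the joint-upsampling idea concrete, the sketch below implements classical (non-learned) guided joint upsampling in PyTorch: a local linear model is fitted between the low-resolution guidance and the low-resolution output, and its coefficients are upsampled and applied to the high-resolution guidance image. The function names, the radius and epsilon values, and the requirement that the output and guidance share the same number of channels are illustrative assumptions, not the dissertation's exact guided filtering layer.

```python
import torch
import torch.nn.functional as F

def box_filter(x, r):
    """Mean filter over a (2r+1) x (2r+1) window, applied per channel."""
    k = 2 * r + 1
    weight = torch.ones(x.size(1), 1, k, k, device=x.device, dtype=x.dtype) / (k * k)
    return F.conv2d(x, weight, padding=r, groups=x.size(1))

def guided_joint_upsample(lr_out, lr_guide, hr_guide, r=1, eps=1e-4):
    """Upsample lr_out to the resolution of hr_guide, guided by image content.

    lr_out:   low-resolution prediction       (N, C, h, w)
    lr_guide: low-resolution guidance image   (N, C, h, w)
    hr_guide: high-resolution guidance image  (N, C, H, W)
    """
    # Fit a local linear model q = a * I + b at low resolution.
    mean_I = box_filter(lr_guide, r)
    mean_p = box_filter(lr_out, r)
    cov_Ip = box_filter(lr_guide * lr_out, r) - mean_I * mean_p
    var_I = box_filter(lr_guide * lr_guide, r) - mean_I * mean_I
    a = cov_Ip / (var_I + eps)
    b = mean_p - a * mean_I

    # Smooth the coefficients, upsample them, and apply them to the high-res guidance.
    size = hr_guide.shape[-2:]
    mean_a = F.interpolate(box_filter(a, r), size=size, mode='bilinear', align_corners=False)
    mean_b = F.interpolate(box_filter(b, r), size=size, mode='bilinear', align_corners=False)
    return mean_a * hr_guide + mean_b
```

The learnable guided filtering layer described above goes further by making such fixed components trainable, but the local-linear-model structure conveys the core of how spatial detail from the high-resolution image is fused with the low-resolution semantics.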

2. It is shown that a dilated fully convolutional network can be approximated by a standard fully convolutional network followed by a joint upsampling module, leading to a joint pyramid upsampling module for feature fusion between feature maps

In pixel-wise image understanding, dilated fully convolutional networks can extract high-level semantic features while maintaining low-level spatial features, achieving state-of-the-art performance on multiple datasets. However, compared to standard fully convolutional networks, their computational complexity increases dramatically, making them difficult to deploy in real-world applications. Through analysis, this dissertation shows that a dilated fully convolutional network can be approximated by a standard fully convolutional network with a joint upsampling module. Based on this, a framework is proposed that employs a standard fully convolutional network to extract feature maps and uses joint upsampling for feature fusion between feature maps. Within this framework, the joint pyramid upsampling module is proposed to better fuse the semantic and spatial features in the feature maps. Experiments show that the proposed method achieves state-of-the-art performance on multiple datasets while reducing the computation by more than a factor of three, allowing the algorithm to run in real time.
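The sketch below shows one plausible form of joint pyramid upsampling in PyTorch: the feature maps of the last backbone stages are reduced to a common width, upsampled to the highest of their resolutions, concatenated, and fused by parallel dilated convolutions. The channel widths, dilation rates, and conv-BN-ReLU blocks are assumptions for illustration and need not match the dissertation's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch, dilation=1):
    """3x3 convolution with batch norm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class JointPyramidUpsampling(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), width=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        # Reduce each backbone stage to a common width.
        self.reduce = nn.ModuleList(conv_block(c, width) for c in in_channels)
        # Parallel dilated convolutions over the merged features.
        self.dilated = nn.ModuleList(
            conv_block(width * len(in_channels), width, d) for d in dilations)

    def forward(self, feats):
        """feats: backbone feature maps ordered from high to low resolution."""
        size = feats[0].shape[-2:]
        feats = [reduce(f) for reduce, f in zip(self.reduce, feats)]
        feats = [F.interpolate(f, size=size, mode='bilinear', align_corners=False)
                 for f in feats]
        merged = torch.cat(feats, dim=1)
        # Fuse semantic and spatial information at multiple receptive fields.
        return torch.cat([d(merged) for d in self.dilated], dim=1)
```

A prediction head can then operate on the fused map at the resolution of the earliest of these stages, which is what allows the standard backbone to approximate its much more expensive dilated counterpart.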

3. An algorithm is proposed to automatically design neural network modules for feature fusion between multi-scale feature maps

In pixel-wise image understanding, a fully convolutional network usually produces feature maps at multiple levels, i.e., multi-scale feature maps. Which feature maps to select for fusion and how to fuse the selected feature maps are the focus of current research. Existing methods usually require a large number of experiments to manually select feature maps and design fusion algorithms, which demands strong expert knowledge and a lot of time. In contrast, this dissertation aims at designing an algorithm that automatically selects feature maps and designs the fusion method. To this end, a search space with a large number of candidates is designed, and a sparse binarization constraint is proposed to guide the search process. Experiments show that the proposed method can automatically design, within a short time, a multi-scale feature fusion module that achieves state-of-the-art performance with low computational complexity. Moreover, the module can be directly transferred to other network architectures, datasets, and tasks, and achieves competitive performance.
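As a rough sketch of how such automatic selection can be formulated, the snippet below attaches a learnable gate to every candidate feature map and penalizes the gates so that they become sparse and close to binary; after the search, only the connections whose gates end up near one are kept. The gate parameterization, the penalty form, and the module names are illustrative assumptions, not the dissertation's exact sparse binarization constraint or search space.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse candidate feature maps through learnable, sparsity-regularized gates."""

    def __init__(self, num_candidates, width):
        super().__init__()
        # One scalar gate per candidate connection, squashed into (0, 1).
        self.logits = nn.Parameter(torch.zeros(num_candidates))
        self.project = nn.Conv2d(width * num_candidates, width, kernel_size=1)

    def forward(self, feats):
        """feats: candidate maps already resized to a common (N, width, H, W)."""
        gates = torch.sigmoid(self.logits)
        fused = torch.cat([g * f for g, f in zip(gates, feats)], dim=1)
        return self.project(fused)

    def sparsity_loss(self):
        # Push each gate toward 0 or 1, and the overall pattern toward mostly 0.
        gates = torch.sigmoid(self.logits)
        near_binary = (gates * (1.0 - gates)).mean()
        mostly_zero = gates.mean()
        return near_binary + mostly_zero
```

During a search of this kind, the task loss and the sparsity penalty are optimized jointly, and the surviving connections define the fusion module that is finally retrained; the dissertation's method additionally searches how the selected maps are combined, but the gate-plus-penalty pattern captures the selection side of the problem.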

Keywords: pixel-wise image understanding; joint upsampling; feature fusion between images; feature fusion between feature maps; automatic feature fusion between multi-scale feature maps
Language: Chinese
Sub-direction classification: Image and Video Processing and Analysis
Document type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/39092
Collection: Graduates, Doctoral Dissertations
Recommended citation (GB/T 7714):
武慧凯. 像素级图像理解高效特征融合方法研究[D]. 北京: 中国科学院大学, 2020.
Files in this item:
武慧凯毕业论文.pdf (14140 KB) — dissertation, restricted access, license: CC BY-NC-SA