CASIA OpenIR > Graduates > Master's Theses
Visual Scene Parsing Based on Multi-View Deep Network Models (基于多视图深度网络模型的视觉场景解析)
关赫 (Guan He)
2018-05-25
Degree type: Master of Engineering
Chinese abstract: Scene parsing is a family of high-level tasks in visual analysis and the foundation on which artificial agents perceive and interact with the real physical world; it has broad application prospects in autonomous navigation, self-driving, augmented reality, and related fields. However, existing visual models have relatively rigid structures, simplistic integration mechanisms, and insufficient prior knowledge at inference time, which hinders robust and efficient visual scene parsing. To this end, this thesis proposes a theoretical framework driven jointly by data and cognition, makes meaningful improvements on core problems such as semantic segmentation and depth estimation, and realizes visual scene parsing enhanced by multi-view deep networks. The main contributions of this thesis are:
1. A top-down structured dynamic inference method is proposed that effectively extracts semantic prior knowledge and reuses ground-truth annotations. A recent trend in network design confirms that Inception-style factorized convolution groups are effective, because they aggregate strong inter-neighborhood correlations in a low-dimensional space without losing much representational power. Building on Inception-factorized dilated (atrous) convolution groups, the method organically combines bottom-up feature extraction with top-down prior guidance, and further improves segmentation performance with the advantages of low cost and orthogonality.
2. A multi-path adversarial training method based on view decomposition is proposed, which discretizes the coarse semantic segmentation result by category into a set of subviews. Trained end to end, the method decomposes the coarse segmentation map into class-independent subviews and performs per-channel adversarial training to obtain more targeted gradient feedback.
3. A self-supervised cyclic adversarial network is proposed and applied to monocular depth estimation. The model is jointly optimized along three lines: cyclic architecture design, self-supervised monocular depth estimation, and patch-sampled adversarial training. By sharing and cascading the base network, it performs self-supervised cyclic estimation of stereo disparity pairs; a patch sampling strategy lowers the difficulty of adversarial training on full samples and encourages richer local detail in the synthesized views.
Finally, we summarize the key issues and coping strategies in the methods proposed above, along with research directions that remain to be explored.
English abstract: Scene parsing is a family of high-level tasks in visual analysis, and it is also the basis for artificial agents to perceive and interact with the real physical world. It has broad application prospects in areas such as robot navigation, autonomous driving, and augmented reality. However, existing visual model architectures are relatively rigid, their integration mechanisms are simplistic, and prior knowledge is insufficient during inference, which hinders robust and efficient visual scene parsing. To this end, we propose a novel theoretical framework for visual scene parsing that combines data-driven and cognition-driven approaches, making a series of meaningful improvements on core problems such as semantic segmentation and depth estimation, and realizing scene understanding enhanced by multi-view deep networks. The main contributions of this work include:
1. We propose a top-down structured dynamic inference method, which addresses prior extraction and the reuse of ground-truth annotations. A recent trend in network design confirms that Inception-style dilated (atrous) convolution groups are efficient, since they aggregate spatial context in a low-dimensional space without reducing representational power too much. The method combines the benefits of bottom-up feature learning and top-down prior modeling by leveraging Inception-factorized dilated convolution groups, improving segmentation performance with the twin advantages of low cost and orthogonality.
2. We propose a multi-channel adversarial training method based on view decomposition, which splits the coarse semantic segmentation map by category into class-independent subviews. The adversarial network decomposes the coarse segmentation map into these subviews and then performs per-channel adversarial training to obtain more targeted feedback gradients.
3. We propose a self-supervised cyclic adversarial framework and apply it to monocular depth estimation. The framework is jointly optimized in terms of cyclic architecture design, forward-warping reconstruction, and an image-patch sampling strategy, achieving both high efficiency and high accuracy. By sharing and cascading the base network, self-supervised cyclic estimation is performed on stereo disparity pairs. The patch sampling scheme reduces the difficulty of full-resolution adversarial training and encourages richer local detail in the synthesized views.
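The dilated (atrous) convolutions underlying the first contribution can be illustrated with a minimal sketch. This is not code from the thesis; it is a pure-Python, 1-D, valid-mode convolution (the hypothetical helper `dilated_conv1d` is ours) showing the key property the abstract relies on: spacing the kernel taps `dilation` samples apart enlarges the receptive field without adding parameters.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1-D convolution with a dilated (atrous) kernel.

    The taps are spaced `dilation` samples apart, so the receptive
    field grows to (len(kernel) - 1) * dilation + 1 while the number
    of learned weights stays at len(kernel).
    """
    span = (len(kernel) - 1) * dilation + 1
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ]

# A 3-tap kernel: dilation 1 sees 3 samples, dilation 3 sees 7.
print(dilated_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 1, 1], 1))  # [6, 9, 12, 15, 18]
print(dilated_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 1, 1], 3))  # [12]
```

Stacking several such layers with different dilation rates, as Inception-style groups do, covers a large context cheaply.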
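The view decomposition in the second contribution can likewise be sketched. Again this is an illustrative assumption, not the thesis implementation: a coarse segmentation map (a 2-D grid of class ids) is split into one binary subview per class, so that a discriminator can be applied per channel.

```python
def decompose_views(label_map, num_classes):
    """Split a coarse segmentation map into class-independent subviews.

    Subview k is a binary mask that is 1 where the pixel is labelled
    class k and 0 elsewhere; per-channel adversarial training then
    operates on each mask separately.
    """
    return [
        [[1 if pixel == k else 0 for pixel in row] for row in label_map]
        for k in range(num_classes)
    ]

coarse = [[0, 1],
          [2, 1]]
views = decompose_views(coarse, 3)
print(views[1])  # [[0, 1], [0, 1]] — the mask for class 1
```

Since every pixel carries exactly one label, the subviews sum to 1 at each location, i.e. the decomposition is lossless.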
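Finally, the patch sampling strategy of the third contribution amounts to feeding a discriminator small random crops instead of full-resolution images. The sketch below is our own assumption of how such a sampler might look (images as plain lists of rows; a real system would crop tensors), not the thesis code.

```python
import random

def sample_patches(image, patch_size, count, seed=None):
    """Draw `count` random square crops from a 2-D image.

    Judging small crops is an easier adversarial task than judging the
    full-resolution image, and it focuses the discriminator's gradient
    on local texture detail.
    """
    rng = random.Random(seed)  # seedable for reproducible sampling
    rows, cols = len(image), len(image[0])
    patches = []
    for _ in range(count):
        r = rng.randrange(rows - patch_size + 1)
        c = rng.randrange(cols - patch_size + 1)
        patches.append([row[c:c + patch_size] for row in image[r:r + patch_size]])
    return patches

img = [[r * 10 + c for c in range(8)] for r in range(6)]
for p in sample_patches(img, 3, 4, seed=0):
    print(p)  # each is a contiguous 3x3 crop of img
```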
In the end, we summarize the key issues and coping strategies in the methods proposed above, as well as the research directions that remain to be explored.
Keywords: multi-view input; top-down prior learning; adversarial training; semantic segmentation; monocular depth estimation
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/21598
Collection: Graduates — Master's Theses
Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
关赫. 基于多视图深度网络模型的视觉场景解析[D]. 北京: 中国科学院研究生院, 2018.
Files in this item:
File name / size: Master Paper.pdf (8873 KB) · Document type: Thesis · Access: Restricted · License: CC BY-NC-SA

Unless otherwise noted, all content in this system is protected by copyright, with all rights reserved.