Research on Visual Scene Parsing Based on Multi-Domain Learning (基于多域学习的视觉场景解析研究)
王玉玺 (Wang Yuxi)
2021-12
Pages: 138
Degree type: Doctoral
Chinese Abstract

With the rapid development of the information society and the spread of intelligent terminal devices, data worldwide has grown explosively and we have entered the era of big data. In this era, data exhibits pronounced multi-modal characteristics: almost every Internet news report, for example, contains several modalities such as images, textual descriptions, and videos, each of which can represent the news item to some extent. In addition, owing to the diversity of imaging devices and shooting viewpoints, image and video data also exhibit pronounced multi-view characteristics; the same object, for example, can be captured in images from different viewpoints. We generically refer to the representations of the same object across different modalities and viewpoints as multi-domain data, where the different domains contain different expressions of the same or similar data. Visual scene parsing refers to the recognition and understanding of images, covering computer vision tasks such as image classification, object detection, semantic segmentation, person re-identification, robot control, and image style transfer. Targeting visual scene parsing problems such as image recognition, semantic segmentation, and image-to-image translation, this thesis takes the domain adaptation perspective of multi-domain learning and studies adaptive learning methods for the single-source to single-target, source data-free, and multi-source to multi-target settings, working toward a multi-domain-learning-based visual scene parsing paradigm for open environments. The main contributions of this thesis are as follows.

1. Cross-domain semantic segmentation with uncertainty-aware pseudo-label rectification. Semantic segmentation predicts a category for every pixel and therefore requires pixel-level label information, yet pixel-level semantic annotation of natural scenes demands enormous human effort. Since synthetic data can be semantically annotated at almost zero cost, this thesis uses the supervision from synthetic data together with unlabeled real data to learn a semantic segmentation model for real data. First, a soft resampling scheme is designed to address class imbalance among segmentation samples. Second, for categories whose target-domain predictions are uncertain, an uncertainty-based rectification algorithm is proposed that improves adaptation performance progressively. Finally, an uncertainty-driven adaptive pseudo-label generation algorithm is proposed that effectively reduces label noise. Results on several standard cross-domain semantic segmentation benchmarks demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.

2. Source data-free domain adaptive semantic segmentation. Although the method above performs well on cross-domain semantic segmentation, existing domain adaptive segmentation methods all assume that both the source and target data are accessible. Owing to data privacy and data protection, in certain situations the source data cannot be obtained. Without source data, traditional adaptation methods based on distribution alignment or on image translation no longer apply. To address this problem, this thesis proposes a cross-domain adaptive semantic segmentation framework for the case where source data is unavailable, consisting mainly of implicit feature alignment, bidirectional pseudo-label learning, and information propagation modules. Extensive experiments and ablation studies validate the effectiveness of the proposed method. On standard adaptation tasks it achieves state-of-the-art results and performs comparably to settings where source data is available. Furthermore, the proposed method also achieves satisfactory results in the black-box setting where the source model itself is not visible.

3. Attention-based domain adaptive image classification from multiple source domains to multiple target domains. Most existing domain adaptation algorithms focus on the single-source to single-target or multi-source to single-target settings, whereas in real scenarios, because data is inherently multi-domain, the multi-source to multi-target setting is more common. Building on existing single-source to single-target work, this thesis therefore proposes a solution to the multi-source to multi-target domain adaptation problem. First, a multi-domain alignment algorithm based on adversarial learning is constructed to learn domain-invariant and class-discriminative information. To further improve cross-domain transfer, intra-domain and inter-domain attention modules are built on the attention mechanism to learn domain-invariant information within and across domains. Experiments on several standard cross-domain image classification datasets confirm the effectiveness of the proposed method.

4. Multi-domain image translation with a multi-domain knowledge sharing mechanism. Image translation tries to learn a high-quality mapping function that converts style information between different distributions (domains) while keeping the content information of the original image unchanged. For the more challenging multi-domain image translation task on natural scenes, this thesis builds a knowledge sharing module across domains to model the translation patterns common to different domains, thereby strengthening the translation performance of each domain. In addition, to further enhance the learning of details in translated images, a symmetric absolute consistency constraint is proposed that makes the details of translated images more realistic. Experimental results on several multi-domain image translation tasks demonstrate the effectiveness of the proposed method.

In summary, this thesis conducts a systematic and in-depth study of the theory and methods of multi-domain learning, proposes four multi-domain learning algorithms targeted at realistic visual scene parsing problems, and applies them to visual scene parsing tasks such as multi-domain image translation, multi-domain image classification, and multi-domain semantic segmentation, with good practical results.

English Abstract

With the rapid development of the information society and the popularization of intelligent terminal devices, we have entered the era of big data. In this era, data exhibits significant multi-modal characteristics: for example, almost all Internet news reports contain multiple modalities such as images, text descriptions, and videos. In addition, image and video data show significant multi-view characteristics due to the diversity of imaging devices and shooting viewpoints; for example, the same object can be captured from different viewpoints. We generically refer to the representations of the same object across different modalities and viewpoints as multi-domain data, where different domains contain different representations of the same or similar data. Visual scene parsing refers to the recognition and understanding of images, covering computer vision tasks such as image classification, object detection, semantic segmentation, person re-identification, robot control, and image style transfer. From the domain adaptation perspective of multi-domain learning, this thesis studies domain adaptation methods under different settings, including single-source to single-target, source data-free, and multi-source to multi-target, and applies them to mainstream computer vision problems such as image classification, semantic segmentation, and image-to-image translation. The main contributions of this thesis are summarized as follows.

1. Uncertainty-aware pseudo-label rectification for domain adaptive semantic segmentation. Semantic segmentation provides pixel-level predictions and therefore requires pixel-level annotations, which are labor-intensive and time-consuming to produce. To address this problem, existing methods use synthetic data, whose annotations come essentially for free. This thesis presents a domain adaptation method that uses labeled synthetic data and unlabeled target data to learn an effective semantic segmentation model for real-world target data. First, a soft resampling method is proposed to address the class imbalance problem in semantic segmentation. Second, considering the uncertain classes in target predictions, an uncertainty-aware rectification technique is proposed to improve adaptation progressively. Finally, an adaptive pseudo-label assignment method is introduced that effectively reduces label noise. Results on multiple standard cross-domain semantic segmentation benchmarks demonstrate the effectiveness of these methods, which achieve state-of-the-art performance.
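As a concrete illustration of the pseudo-label assignment step, the sketch below shows one plausible form of uncertainty-gated pseudo-labelling in PyTorch. It is a minimal sketch under assumed details (a global confidence threshold, a normalized-entropy gate, and an ignore index of -1); the thesis's actual rectification and adaptive assignment rules may differ.

```python
# Hypothetical sketch of uncertainty-gated pseudo-label assignment for
# self-training on unlabeled target images; the thresholds and helper
# name are illustrative assumptions, not the thesis implementation.
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def assign_pseudo_labels(logits: torch.Tensor, conf_thresh: float = 0.9):
    """logits: (B, C, H, W) raw scores from the segmentation network.

    Returns pseudo labels of shape (B, H, W) in which uncertain pixels
    are set to -1 (the conventional 'ignore' index), so that training
    simply skips them instead of fitting noisy labels.
    """
    probs = F.softmax(logits, dim=1)         # per-pixel class posteriors
    conf, labels = probs.max(dim=1)          # max probability and argmax class
    ent = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    ent = ent / math.log(logits.size(1))     # normalize entropy to [0, 1]

    # Keep a pixel only if its prediction is both confident and
    # low-entropy; everything else is marked as 'ignore'.
    keep = conf.ge(conf_thresh) & ent.le(1.0 - conf_thresh)
    return labels.masked_fill(~keep, -1)
```

Training then proceeds with F.cross_entropy(logits, pseudo_labels, ignore_index=-1), and relaxing the threshold over epochs is one way to make adaptation progressive, as the paragraph above describes.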

2. Source data-free domain adaptation for semantic segmentation. Existing domain adaptation methods assume that both the source data and target data are available. However, in some scenarios the source data is inaccessible due to data privacy or data protection. Traditional domain adaptation methods based on distribution alignment or image translation are no longer applicable in the absence of source data. To solve this problem, this thesis proposes a source data-free domain adaptive semantic segmentation framework built around three schemes: implicit feature alignment, bidirectional pseudo-label learning, and information propagation. Extensive experiments and ablation studies validate the effectiveness of the proposed method. On standard adaptation tasks, the proposed approach achieves new state-of-the-art results and performs comparably to adaptation methods that rely on source data. It also performs well in the black-box setting where the source model is inaccessible.
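The abstract does not spell out how features can be aligned implicitly without source images. The sketch below illustrates one well-known family of source-free objectives (information maximization against a frozen source-trained classifier head) purely as an assumed stand-in for how such adaptation can proceed; it is not the thesis's actual modules.

```python
# Hypothetical source-free adaptation step: the frozen source classifier
# anchors the label space while the target feature extractor adapts.
# This is an information-maximization illustration, not the thesis code.
import torch
import torch.nn.functional as F

def source_free_step(backbone, classifier, images, optimizer):
    """classifier is the source-trained head, frozen via requires_grad_(False);
    optimizer holds only the backbone parameters."""
    logits = classifier(backbone(images))
    probs = F.softmax(logits, dim=1)

    # (a) Entropy minimization: each target sample should commit to a class.
    ent = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    # (b) Diversity: the batch marginal should stay near uniform, which
    #     rules out the degenerate everything-one-class solution.
    marginal = probs.mean(dim=0)
    div = (marginal * marginal.clamp_min(1e-8).log()).sum()

    loss = ent + div
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```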

3. Attention-guided multi-source, multi-target domain adaptation for image classification. Existing domain adaptation methods focus on the single-source to single-target or multi-source to single-target settings. However, adaptation from multiple source domains to multiple target domains is more common in practice. Building on previous single-source to single-target work, this thesis proposes a method for domain adaptation from multiple source domains to multiple target domains. On the one hand, a multi-domain alignment algorithm based on adversarial learning is constructed to learn domain-invariant and class-discriminative information. On the other hand, intra-domain and inter-domain attention modules are constructed to learn domain-invariant information within and between domains. Experiments on multiple standard cross-domain datasets confirm the effectiveness of the proposed method.
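For the adversarial alignment part, a standard building block is a domain discriminator trained through a gradient reversal layer. The sketch below shows that block under the assumption that the thesis's multi-domain alignment resembles it; the module names and dimensions are illustrative.

```python
# Hypothetical sketch of adversarial multi-domain alignment via gradient
# reversal: a discriminator identifies which domain a feature came from,
# while reversed gradients push the backbone toward domain-invariant
# features. Names and sizes are illustrative, not the thesis code.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        # Flip and scale the gradient on the way back to the backbone.
        return -ctx.lam * grad, None

class DomainDiscriminator(nn.Module):
    def __init__(self, feat_dim: int, num_domains: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_domains),  # one logit per source/target domain
        )

    def forward(self, feats, lam: float = 1.0):
        return self.net(GradReverse.apply(feats, lam))
```

With K source and target domains, the discriminator is trained with K-way cross-entropy on domain labels; the reversed gradients make the backbone minimize domain discriminability while the task head preserves class discrimination on the labeled source domains.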

4. Multi-domain image-to-image translation with a knowledge sharing module. Image-to-image translation aims to learn a high-quality mapping function that exchanges style information between images sampled from different domains while keeping the content information unchanged. To tackle the challenging multi-domain image-to-image translation problem, this thesis constructs a multi-domain knowledge sharing module that models the translation patterns shared across domains and thereby enhances translation performance on each specific domain. In addition, to further improve the details of translated images, a symmetric absolute consistency loss is proposed to constrain translation learning. Extensive experiments demonstrate the effectiveness of the proposed method.
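The symmetric absolute consistency loss is not defined in the abstract; a natural reading is an L1 round-trip reconstruction applied in both translation directions, sketched below with an assumed domain-conditioned generator G(x, domain).

```python
# Hypothetical sketch of a symmetric absolute (L1) consistency term for
# multi-domain translation: translating to another domain and back should
# reproduce the input, enforced symmetrically in both directions.
# G(x, dom) is an assumed domain-conditioned generator, not the thesis API.
import torch
import torch.nn.functional as F

def symmetric_consistency_loss(G, x_a, x_b, dom_a, dom_b):
    """x_a, x_b: image batches from domains dom_a and dom_b."""
    fake_b = G(x_a, dom_b)      # a -> b
    fake_a = G(x_b, dom_a)      # b -> a
    rec_a = G(fake_b, dom_a)    # a -> b -> a round trip
    rec_b = G(fake_a, dom_b)    # b -> a -> b round trip
    # Absolute (L1) reconstruction in both directions, which tends to
    # preserve fine detail better than squared error.
    return F.l1_loss(rec_a, x_a) + F.l1_loss(rec_b, x_b)
```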

In general, this thesis conducts a systematic and in-depth study of multi-domain learning theory and methods. It proposes four multi-domain learning algorithms for realistic visual scene parsing problems and applies them to visual scene parsing tasks such as multi-domain image-to-image translation, multi-source, multi-target domain adaptation for image classification, and domain adaptation for semantic segmentation.

Keywords: multi-domain learning; visual scene parsing; unsupervised domain adaptation; semantic segmentation; image classification
Language: Chinese
Sub-direction classification: Image and Video Processing and Analysis
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/46598
Collection: 类脑智能研究中心 (Research Center for Brain-inspired Intelligence)
Recommended citation (GB/T 7714):
王玉玺. 基于多域学习的视觉场景解析研究[D]. 中科院自动化所, 2021.
Files in this item:
File name/size                 Document type   Version type   Access        License
YX_Wang_final.pdf (13008 KB)   Thesis                         Open access   CC BY-NC-SA