CASIA OpenIR > Graduates > Doctoral Dissertations
视觉自监督学习关键技术研究 (Research on Key Technologies of Visual Self-Supervised Learning)
李朝闻
2024
Pages: 122
Degree type: Doctoral
Chinese Abstract

Against the backdrop of the rapid development of internet technology and the widespread adoption of mobile devices, digital information has become an important productive force across industries. The explosive growth of information brings challenges as well as opportunities, and the intelligent processing of massive data to achieve digital transformation has become an increasingly important topic. In computer vision, self-supervised learning trains models by mining the intrinsic information of visual data without relying on manual annotation; it effectively addresses the high cost of labeling while improving performance on a wide range of downstream tasks. Visual self-supervised learning is an important and fundamental part of computer vision. Its core goal is to build visual self-supervised foundation models, which already play a central role in fields such as autonomous driving, medical imaging, and military reconnaissance. However, compared with language self-supervised foundation models, which have achieved breakthrough progress, visual self-supervised foundation models have not yet reached the same level and face a number of key scientific problems; solving these problems is of significant practical value for advancing visual self-supervised foundation models.

 

In building visual self-supervised foundation models, the key challenge researchers face is how to conduct efficient self-supervised pre-training on large-scale, uncurated open scenes. At its core, this challenge requires designing and implementing self-supervised learning techniques that can deeply mine and efficiently exploit the intrinsic information of massive unlabeled data, thereby improving the model's generalization ability and performance without expensive annotation. To meet this challenge, the following key scientific problems must be solved:

 

1. Effective learning of local semantics from unlabeled data. How can local semantics be learned effectively from unlabeled data, so that the model captures the core features and structure of the data without supervisory signals? This is essential for learning general-purpose representations.

 

2. Self-supervised learning on large-scale open scenes. How can models learn from large-scale open data and extract stable, discriminative feature representations from complex and varied data? This is essential for generalization across scenes.

 

3. Efficient and reliable training of self-supervised models. How can models be trained efficiently and reliably, so as to meet the strict requirements on training cost and prediction reliability in large-scale, practical application scenarios? This is essential for real-world deployment.

 

Targeting these three key scientific problems in building visual self-supervised foundation models, this thesis proposes a series of systematic solutions that significantly improve model performance and practicality. The main research results and contributions can be summarized as follows:

 

1. To address the limited ability of previous visual self-supervised algorithms to learn local semantics from unlabeled data, this thesis proposes an attention-guided masked self-supervised algorithm. Drawing on masked language models from natural language processing, it designs a masked image self-supervised model suited to computer vision, overcoming the limitations of previous mainstream methods in local information extraction and context learning. It further proposes an attention-guided masking strategy to replace traditional random masking: the model's own attention mechanism guides the masking process, preserving the diversity of random masking while maintaining the semantic integrity of the image and effectively capturing contextual relationships between local regions. Models pre-trained with this technique exploit fine-grained information in the data and improve accuracy on multiple computer vision downstream tasks.

        

2. To address the insufficient learning ability of previous visual self-supervised algorithms on large-scale multi-object data, this thesis proposes a hierarchical self-supervised algorithm for multi-object data. Since previous self-supervised methods struggle to extract robust, general representations in multi-object scenes, this thesis explores self-supervised techniques that learn general representations effectively from multi-object images. Starting from the characteristics of multi-object data, the technique divides multi-object self-supervised learning into three levels (scene-scene, scene-instance, and instance-instance) and uses the unsupervised Selective Search method to extract instances for representation learning. Pre-training at these three levels ensures robustness and generality in multi-object scenes. Extensive experiments show strong pre-training results on public datasets and significant accuracy gains on multiple downstream tasks, along with good adaptability to different network architectures. The technique is thus able to learn from large-scale multi-object data and provides theoretical support for subsequent work.

        

3. To better meet the demands of large-scale real-world and open environments, this thesis proposes a unified representation algorithm for training on arbitrary scenes. It analyzes, from a theoretical perspective, the conflict between algorithmic assumptions and data properties that existing self-supervised algorithms face on large-scale open scenes, then formally simplifies the algorithm to circumvent this conflict, avoiding complex pre-processing steps and dependence on additional unsupervised algorithms. Technically, the conflict is resolved by performing explicit local-level learning, and a local feature augmentation method is proposed to replace traditional global data augmentation, reducing global interference. The technique enables the model to learn non-homogeneous feature representations from images and performs well in practical training. Extensive experiments on multiple public datasets and mixed datasets demonstrate its effectiveness on arbitrary scenes. The technique thus both exploits fine-grained information in the data and is able to learn in large-scale arbitrary scenes.

 

4. To address the low training efficiency and unreliable predictions of visual self-supervised algorithms, this thesis proposes an efficient self-supervised algorithm based on a self-consistency mechanism. It first analyzes the underlying mechanisms of current mainstream algorithms and identifies the high masking ratio as the main cause of inefficiency and unreliable predictions. Based on this finding, it designs a parallel masking strategy that raises data utilization and thus significantly improves pre-training efficiency. In addition, by introducing a self-consistency mechanism, the method learns consistency in local predictions and accumulates it into global predictions, substantially improving prediction consistency and reliability. Experiments show that the technique needs only 13% of the pre-training time and half the computation of the previous best-performing model to match its performance on the same network architecture, demonstrating exceptional efficiency. Pre-trained models based on this technique also show strong predictive ability on diverse public datasets and excellent performance across visual downstream tasks. The technique thus exploits fine-grained information in the data, learns in large-scale arbitrary scenes, and is efficient and reliable.

English Abstract

Against the backdrop of rapid internet technology development and widespread mobile device adoption, digital information has become a vital productive force across industries. The explosive growth of information presents both challenges and opportunities, and the intelligent processing of massive data to achieve digital transformation has become an increasingly important issue. In the field of computer vision, self-supervised learning leverages the intrinsic information in visual data for model training without relying on manual annotations; this not only effectively addresses the high cost of labeling but also enhances performance on various downstream tasks. Visual self-supervised learning is a fundamental and significant aspect of computer vision. Its core goal is to construct visual self-supervised foundation models, which already play a pivotal role in domains such as autonomous driving, medical imaging, and military reconnaissance. However, compared with the breakthroughs achieved by language self-supervised foundation models, progress on visual self-supervised foundation models has not yet reached the level seen in natural language processing and faces numerous key scientific problems. Addressing these problems is crucial for advancing visual self-supervised foundation models and is of significant practical value.

 

In the process of building visual self-supervised foundation models, the key challenge researchers face is how to efficiently conduct self-supervised pre-training in large-scale open scenarios. The essence of this challenge lies in designing and implementing self-supervised learning techniques capable of deeply mining and efficiently utilizing the intrinsic information of vast amounts of unlabeled data, thereby enhancing the model's generalization ability and performance without costly labeling. To address this challenge, researchers need to solve several key scientific problems:

 

1. The problem of effectively learning local semantics from unlabeled data. How to efficiently capture local semantics from unlabeled data, enabling models to grasp the core features and structures of data without supervisory signals, is crucial for learning generalizable representations.

 

2. The problem of self-supervised learning from large-scale open scenarios. It is crucial to develop methods for learning from vast, open datasets, allowing models to extract stable and distinctive feature representations from complex and varied data, which is vital for the model's generalization across different scenes.

 

3. The problem of efficient and reliable self-supervised model training. Developing methods for efficient and reliable training of models is essential to meet the stringent demands for training costs and prediction reliability in large-scale and practical application scenarios, which is significant for the practical implementation of the model.

 

This paper proposes a series of systematic solutions to three key scientific problems in the construction process of visual self-supervised foundation models, significantly improving model performance and practicality. The main research findings and contributions of this paper can be summarized as follows:

 

1. To address the inadequacy of previous visual self-supervised algorithms in effectively learning local semantics of unlabeled data, this paper introduces an attention-guided masked self-supervised pre-training algorithm. Drawing from masked language models in natural language processing, this paper designs a masked image self-supervised model suitable for computer vision, effectively overcoming the limitations of mainstream methods in local information extraction and contextual learning. Furthermore, an attention-guided masking strategy is proposed as an alternative to traditional random masking. This strategy uses the model's own attention mechanism to guide the masking process, maintaining the diversity of random masking while preserving the semantic integrity of images and effectively capturing the contextual relationships between image regions. Therefore, models pre-trained with this technique can effectively utilize the fine-grained information in the data and improve accuracy on multiple computer vision downstream tasks.
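As a minimal illustrative sketch (not the thesis's actual formulation), the idea of biasing mask selection by attention while retaining some randomness could look like the following; the function name, the additive noise term, and the mask ratio are assumptions for illustration:

```python
import numpy as np

def attention_guided_mask(attn, mask_ratio=0.6, noise_scale=0.5, rng=None):
    """Pick patch indices to mask, biased toward high-attention patches.

    attn: (N,) attention scores for N patches (e.g. CLS-to-patch attention).
    Gaussian noise keeps some of the diversity of purely random masking;
    with noise_scale=0 this degenerates to masking the top-attention patches.
    """
    rng = np.random.default_rng(rng)
    n_mask = int(len(attn) * mask_ratio)
    # Rank patches by attention plus noise; mask the highest-scoring ones.
    score = attn + noise_scale * rng.standard_normal(len(attn))
    return np.argsort(score)[-n_mask:]

attn = np.array([0.9, 0.8, 0.05, 0.7, 0.02, 0.6, 0.01, 0.85])
masked = attention_guided_mask(attn, mask_ratio=0.5, noise_scale=0.0, rng=0)
# With zero noise, the four highest-attention patches are masked.
```

Raising `noise_scale` interpolates back toward uniform random masking, which is one plausible way to trade semantic coverage against mask diversity.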

 

2. To address the insufficient learning capability of previous visual self-supervised algorithms when dealing with large-scale multi-object data, this paper proposes a hierarchical self-supervised algorithm for multi-object data training. Given the difficulty of previous self-supervised methods in achieving data robustness and general representation extraction in multi-object scenarios, this paper explores self-supervised techniques capable of effectively learning general representations in multi-object image scenes. The technique starts from the characteristics of multi-object data itself, dividing multi-object self-supervised learning into scene-scene, scene-instance, and instance-instance levels, and employs the unsupervised Selective Search method to extract instances for representation learning. This three-level pre-training ensures the algorithm's robustness and universality in multi-object scenes. Extensive experimental results demonstrate significant pre-training effects on public datasets and improvements in multiple downstream tasks. Additionally, the technique exhibits good adaptability to network structures. Therefore, this technique is able to learn from large-scale multi-object data and provides theoretical support for subsequent work.
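The three-level decomposition can be sketched with a standard InfoNCE contrastive objective applied at each granularity. This is a hedged illustration, not the thesis's implementation: in the thesis, instance features would come from regions proposed by Selective Search, whereas here random vectors stand in for encoder outputs, and `info_nce` is the generic contrastive loss, not necessarily the exact loss used:

```python
import numpy as np

def info_nce(q, k, temp=0.2):
    """InfoNCE over L2-normalized rows: q[i] should match k[i]."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
# Stand-ins for encoder features of two views of scenes and of instances.
scene_a, scene_b = rng.standard_normal((4, 16)), rng.standard_normal((4, 16))
inst_a,  inst_b  = rng.standard_normal((4, 16)), rng.standard_normal((4, 16))

# Three levels: scene-scene, scene-instance, and instance-instance.
loss = (info_nce(scene_a, scene_b)    # two views of the same scene
        + info_nce(scene_a, inst_a)   # a scene and an instance cropped from it
        + info_nce(inst_a, inst_b))   # two views of the same instance
```

Summing the three terms is one simple way to combine the levels; weighting them differently would be an equally plausible design choice.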

 

3. To better adapt to the demands of large-scale real-world scenarios and open environments, this paper introduces a unified representation algorithm trained on arbitrary scenes. This paper theoretically analyzes the conflicts between algorithmic assumptions and data attributes faced by existing self-supervised algorithms when dealing with large-scale open scenarios. It further formally simplifies the algorithms and circumvents these conflicts, thereby eliminating the need for complex pre-processing steps and reliance on additional unsupervised algorithms. In terms of technical implementation, this paper addresses existing algorithmic conflicts through explicit patch-level learning and proposes a patch feature augmentation method. This approach replaces traditional global data augmentation strategies, reducing global interference. The technique enables the model to learn heterogeneous feature representations from images and achieves significant results in actual training. Extensive experiments on various public and mixed datasets prove the effectiveness of this technique in arbitrary scenes. Therefore, this technique not only effectively utilizes fine-grained information in the data but is also able to learn in large-scale arbitrary scenarios.
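One way to read "patch feature augmentation replacing global augmentation" is to perturb each patch embedding independently rather than transforming the whole image. The sketch below is an assumption-laden illustration: the per-patch random scale and additive noise are hypothetical augmentations chosen for simplicity, not the operations described in the thesis:

```python
import numpy as np

def augment_patch_features(patches, scale_range=(0.8, 1.2), noise_std=0.05, rng=None):
    """Perturb each patch embedding independently, instead of applying one
    global transform to the whole image.

    patches: (N, D) array of patch features.
    Each patch gets its own random scale plus small Gaussian noise, so the
    perturbation is local and avoids globally correlated interference.
    """
    rng = np.random.default_rng(rng)
    scales = rng.uniform(*scale_range, size=(len(patches), 1))
    noise = noise_std * rng.standard_normal(patches.shape)
    return patches * scales + noise

x = np.ones((4, 8))            # 4 patches, 8-dim features
view1 = augment_patch_features(x, rng=1)
view2 = augment_patch_features(x, rng=2)  # a second, independent view
```

Because the randomness is drawn per patch, two patches of the same image are perturbed differently, which is the property that distinguishes this from a global crop-and-color-jitter pipeline.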

 

4. To address the problem of low training efficiency and unreliable predictions in visual self-supervised algorithms, this paper proposes an efficient self-supervised algorithm based on a self-consistency mechanism. This paper begins with an in-depth analysis of the underlying mechanisms of mainstream algorithms and identifies the high masking ratio as the primary cause of low efficiency and unreliable predictions. In response to this finding, a parallel masking strategy is designed to increase data utilization, significantly improving pre-training efficiency. Moreover, by incorporating the self-consistency mechanism, the method learns consistency in patch predictions and accumulates this consistency into global predictions, substantially enhancing the consistency and reliability of predictions. Experimental results show that this technique requires only 13% of the pre-training time and half the computational resources of the previous best-performing model to achieve equivalent performance on the same network architecture, demonstrating its exceptional efficiency. Additionally, pre-trained models based on this technique show strong predictive capabilities on various types of public datasets and achieve outstanding performance in a variety of visual downstream tasks. Therefore, this technique can effectively utilize fine-grained information in data, learn in large-scale arbitrary scenarios, and is efficient and reliable.
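A minimal sketch of the two ingredients described above, under stated assumptions: "parallel masking" is read here as splitting the patches into disjoint masks that jointly cover every patch (so each patch is reconstructed once per step, raising data utilization), and the consistency term is illustrated as a simple squared disagreement between two predictions of the same patches. Both the partition scheme and the loss form are illustrative, not the thesis's exact design:

```python
import numpy as np

def parallel_masks(n_patches, n_views, rng=None):
    """Split patches into n_views disjoint masks that jointly cover every
    patch, so each patch is predicted exactly once per training step.
    Returns a boolean (n_views, n_patches) array; True = masked."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(n_patches)
    masks = np.zeros((n_views, n_patches), dtype=bool)
    for v, chunk in enumerate(np.array_split(perm, n_views)):
        masks[v, chunk] = True
    return masks

def consistency_loss(pred_a, pred_b):
    """Mean squared disagreement between two predictions of the same patches."""
    return float(np.mean((pred_a - pred_b) ** 2))

masks = parallel_masks(12, 3, rng=0)
# Every patch falls into exactly one of the 3 masks: 100% data utilization.
zero = consistency_loss(masks[0].astype(float), masks[0].astype(float))
# Identical predictions incur zero consistency penalty.
```

Contrast this with a single random 60% mask, where 40% of the patches contribute no reconstruction target in that step; the complementary-mask view is one plausible reading of why parallel masking improves efficiency.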

Language: Chinese
Sub-direction classification: Image and Video Processing and Analysis
State Key Laboratory planning direction: Visual Information Processing
Document type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/56508
Collection: Graduates / Doctoral Dissertations
Recommended citation (GB/T 7714): 李朝闻. 视觉自监督学习关键技术研究[D], 2024.
Files in this item: 视觉自监督学习关键技术研究——李朝闻05 (42567 KB), dissertation, restricted access, license CC BY-NC-SA.