基于序列自注意机制的通用视觉模型研究 (Research on a General Vision Model Based on the Sequential Self-Attention Mechanism)
陈志扬
2024-05-13
Pages: 132
Subtype: Doctoral (博士)
Abstract

With the rapid development of deep neural networks, artificial general intelligence has achieved major research breakthroughs in recent years. In particular, general language models built on the sequential self-attention mechanism, such as ChatGPT, have demonstrated excellent performance and broad applicability across numerous natural language processing tasks after being trained on large-scale datasets. Meanwhile, images are among the most common information carriers in the natural world, so accurately perceiving and understanding image content is essential for realizing more comprehensive artificial general intelligence. Consequently, building a general vision model that can broadly understand image information and carry out diverse visual tasks has become a core challenge in artificial general intelligence research.

 

This dissertation is devoted to building a general vision model capable of handling a wide range of visual tasks. In particular, it explores how, on the basis of the sequential self-attention mechanism and through innovative model architectures and training algorithms, to realize a general vision model that maintains high performance on common visual tasks while flexibly handling diverse requirements. Building such a model faces the following main challenges. First, different visual tasks differ markedly in task definitions and input image content, and there is currently neither a general image encoding network that adapts to diverse visual inputs nor a decoding structure that covers diverse output formats. Second, the general vision model's perception of image details remains insufficient: it is difficult to guarantee the accuracy of its outputs in complex scenes, and the model exhibits "hallucination". Third, the computational complexity of general vision models is usually high, making them hard to deploy in resource-constrained practical scenarios. To address these challenges, this dissertation first uses the sequential self-attention mechanism as the basic tool to build a general vision encoder-decoder model capable of handling a wide range of visual tasks. It then investigates model designs and training strategies for this model in complex open scenarios and resource-constrained environments, so as to improve its efficiency and practicality.

 

The main research results and contributions of this dissertation are summarized as follows:

 

1. To address the problem that the sequential self-attention mechanism struggles to perceive two-dimensional spatial structure and thus extracts relatively weak image features, this dissertation proposes a general feature-encoding structure based on deformable self-attention, which aims to provide more accurate image feature representations for the general vision model. By introducing a deformable patch-splitting method, the proposed self-attention vision encoder dynamically adjusts the size and position of image patches during feature-sequence encoding, capturing detailed local semantic information and thereby effectively extracting general image features. In experiments on visual tasks such as image classification and object detection, the deformable-self-attention encoding structure significantly improves performance, verifying its broad applicability and effectiveness.

 

2. To address the problem that the output formats of computer vision tasks are complex and varied, so a single model struggles to decode the outputs of different tasks in a unified way, this dissertation proposes a general task definition that covers most visual tasks: treating objects as the basic units, a multi-sequence generation task conditioned on the input image and class prompts. Based on this definition, the dissertation designs a language-guided, object-level self-attention decoding structure. It takes class text labels as input and, using sequences as a common output format, generates results for every object in the image that satisfy the requirements of different tasks. By adjusting the input classes and the specific content of the output sequences, this decoding structure can decode the outputs of different visual tasks. Experiments show that the proposed visual decoding structure achieves performance comparable to specialized models on several classic visual tasks, and the additional class inputs further improve the model's controllability and generality in open environments.

 

3. To address the weak perception of local, fine-grained image information in general vision models, this dissertation conducts an in-depth study of the "hallucination" phenomenon that is common in such models, and improves their fine-grained understanding from two directions: constructing high-quality training data and introducing pixel-level visual supervision. On the one hand, based on annotations of detailed object relationships in images, the dissertation builds image-text instruction-tuning data that focus more on fine-grained and abstract relational features. On the other hand, it introduces a pixel-level object mask prediction loss as auxiliary supervision, explicitly guiding the model to focus on regions highly relevant to the context. Both methods strengthen the model's fine-grained image understanding under multimodal image-text inputs. In addition, the dissertation distinguishes different types of "hallucination" and proposes an evaluation dataset that provides more detailed hallucination metrics for general vision models. Experiments show that the proposed methods effectively improve the accuracy of the outputs of general vision models and alleviate hallucination.

 

4. To address the problem that general vision models have large parameter counts and high computational complexity, making them difficult to deploy in resource-constrained practical scenarios, this dissertation proposes a general lightweighting method for vision self-attention models. It first analyzes the complexity of general vision models in depth and points out that high-dimensional fully connected modules generally account for a large share of the computation. The dissertation therefore proposes a lightweight sparse fully connected module, a plug-and-play component that can reduce the computation of various vision self-attention models. On the one hand, the module introduces group linear layers and channel-shuffle operations to reduce the inter-channel computation of high-dimensional fully connected layers; on the other hand, it downsamples in the spatial dimension to fuse local features so that similar feature computations can be shared across image patches, substantially reducing parameters and computation without degrading model performance. Experiments on a series of vision models fully verify the effectiveness and generality of the module and its ability to improve the efficiency of general vision models.

Other Abstract

As deep neural networks have progressed rapidly, artificial general intelligence has witnessed a surge of groundbreaking research in recent years. Notably, large language models such as ChatGPT, which leverage the self-attention mechanism and extensive training corpora, have exhibited exceptional performance and versatility across a wide spectrum of linguistic tasks. Concurrently, since images are among the most prevalent information carriers in the real world, the ability to accurately perceive and comprehend image content is essential for advancing more generalized artificial intelligence. Consequently, developing a general vision model that can interpret arbitrary images and execute a broad array of visual tasks stands as a pivotal challenge in research on artificial general intelligence.

 

This dissertation is dedicated to the construction of a general vision model capable of handling a wide range of visual tasks. Specifically, it delves into innovative model architectures and training algorithms based on the self-attention mechanism, culminating in a model with enhanced performance on existing visual tasks and increased adaptability to more flexible requirements. In this pursuit, the dissertation confronts several challenges. First, there is a pronounced disparity in task definitions and image content among different visual tasks, and there is currently neither a universal image encoder capable of extracting common features from diverse visual inputs nor a general decoder that can accommodate various output formats. Second, the general vision model has a limited capacity for perceiving and understanding fine-grained image features, so its outputs in complex scenarios can be inaccurate and prone to hallucination. Finally, the computational complexity of the general vision model is prohibitive, rendering it unsuitable for deployment in resource-limited environments. To address these challenges, the dissertation introduces an encoder-decoder framework based on self-attention as a general vision model capable of managing a wide range of visual tasks. It further explores model structures and training strategies for complex, open scenarios and resource-constrained environments, aiming to improve the model's effectiveness and efficiency.

 

The main contributions of this dissertation are summarized as follows:

 

1. To address the weakness that the self-attention mechanism may overlook local spatial structures and extract inferior image features, this dissertation introduces a general image encoder based on deformable self-attention, which aims to provide more accurate and robust image features for the general vision model. By introducing a deformable patch-splitting module, the encoder dynamically modulates the size and position of image patches during the embedding process, capturing comprehensive local semantics and thus extracting more powerful image features. The proposed encoder significantly enhances performance across a variety of visual tasks, including image classification and object detection, thereby validating its broad applicability and efficacy.
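
To illustrate the idea of content-dependent patch splitting, below is a minimal PyTorch sketch of a patch-embedding layer that predicts a 2-D offset for every patch and resamples features at the shifted locations. The module and parameter names (`DeformablePatchEmbed`, `offset`) are illustrative assumptions rather than the dissertation's actual implementation, and for brevity only patch positions (not sizes) are made deformable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformablePatchEmbed(nn.Module):
    """Patch embedding whose sampling locations are predicted from image
    content instead of lying on a fixed regular grid."""

    def __init__(self, in_chans=3, embed_dim=96, patch_size=4):
        super().__init__()
        # Regular patch projection; its output also drives offset prediction.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Predicts a (dx, dy) offset for every patch; zero-initialized so the
        # layer starts out identical to standard patch embedding.
        self.offset = nn.Conv2d(embed_dim, 2, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset.weight)
        nn.init.zeros_(self.offset.bias)

    def forward(self, x):
        feat = self.proj(x)                                  # (B, C, Hp, Wp)
        b, c, hp, wp = feat.shape
        offsets = self.offset(feat).permute(0, 2, 3, 1)      # (B, Hp, Wp, 2)

        # Base sampling grid in normalized [-1, 1] coordinates, (x, y) order.
        ys = torch.linspace(-1, 1, hp, device=x.device)
        xs = torch.linspace(-1, 1, wp, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1).unsqueeze(0)    # (1, Hp, Wp, 2)

        # Resample each patch feature at its learned, shifted location.
        sampled = F.grid_sample(feat, base + offsets, align_corners=True)
        return sampled.flatten(2).transpose(1, 2)            # (B, Hp*Wp, C)
```

Because the offsets start at zero, this layer initially reproduces ordinary patch embedding and only learns to shift patches where doing so helps capture local structure.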

 

2. To overcome the difficulty of uniformly decoding the outputs of various visual tasks with a single model, this dissertation proposes a general definition that encompasses a wide range of visual tasks: treating objects as the fundamental units and generating multiple sequences conditioned on the input image and class prompts. Following this definition, a language-guided, object-centric visual decoder is proposed. It incorporates class labels as additional inputs and generates outputs in a common sequence format for each object in the image, thereby meeting diverse task requirements. By tailoring the input categories and the specific content of the output sequences, the decoder can produce outputs for varied visual tasks. Experiments demonstrate that the proposed visual decoder achieves performance comparable to specialized models on a range of traditional visual tasks, and the additional class prompts further enhance the model's adaptability and universality in open environments.
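
A minimal sketch of how such a language-guided, object-level decoder could be wired together in PyTorch is shown below. The class `ObjectSeqDecoder`, the discrete coordinate vocabulary, and the parallel (non-autoregressive) sequence head are illustrative assumptions; the dissertation's decoder is not necessarily implemented this way.

```python
import torch
import torch.nn as nn

class ObjectSeqDecoder(nn.Module):
    """Object queries cross-attend to image tokens and class-prompt embeddings,
    then emit one short token sequence per object, e.g. [x1, y1, x2, y2, cls]."""

    def __init__(self, dim=256, num_queries=100, num_bins=1000, seq_len=5):
        super().__init__()
        self.num_bins, self.seq_len = num_bins, seq_len
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # Predicts every position of the object sequence in parallel; a real
        # system could instead generate the sequence autoregressively.
        self.head = nn.Linear(dim, seq_len * num_bins)

    def forward(self, img_feats, class_embeds):
        # img_feats:    (B, N, dim) flattened image tokens from the encoder
        # class_embeds: (B, K, dim) text embeddings of the prompted class names
        b = img_feats.size(0)
        memory = torch.cat([img_feats, class_embeds], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        objects = self.decoder(q, memory)                    # (B, Q, dim)
        logits = self.head(objects)                          # (B, Q, L*V)
        return logits.view(b, -1, self.seq_len, self.num_bins)
```

Because the class names enter only through `class_embeds` (for example, text features from a pretrained text encoder), swapping the prompted categories re-targets the decoder without changing the rest of the model, which is what enables the open-vocabulary controllability described above.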

 

3. To address the challenge that the general vision model has limited ability to discern fine-grained image information, this dissertation undertakes an in-depth investigation of hallucination, a phenomenon commonly observed in general vision models, and then enhances the model's fine-grained understanding from two directions: constructing a high-fidelity dataset and introducing pixel-level supervision during training. On one hand, the dissertation builds data from annotations detailing object relationships, yielding vision instructions that accentuate fine-grained and high-level semantic features within the image. On the other hand, a pixel-level mask loss is integrated as auxiliary supervision, explicitly guiding the model to focus on regions that are highly relevant to the input context. These training techniques refine the model's fine-grained image comprehension when presented with both images and text in the input context. Additionally, the dissertation categorizes hallucinations into several types and introduces a benchmark that provides more nuanced hallucination metrics. Experiments confirm that this approach significantly improves the accuracy of the general vision model and mitigates hallucination.
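
The sketch below shows, under assumed tensor shapes and names, how a pixel-level mask loss could be attached to the usual token-prediction objective in PyTorch; the weighting factor and the dot-product mask head are illustrative choices, not the exact formulation used in the dissertation.

```python
import torch
import torch.nn.functional as F

def grounded_training_loss(token_logits, token_targets,
                           object_hidden, image_feats, gt_masks,
                           mask_weight=0.5):
    # token_logits:  (B, T, V)    next-token logits from the multimodal model
    # token_targets: (B, T)       ground-truth token ids (-100 = ignore)
    # object_hidden: (B, Q, C)    hidden states for the mentioned objects
    # image_feats:   (B, C, H, W) spatial features from the vision encoder
    # gt_masks:      (B, Q, H, W) binary masks of the mentioned objects
    lm_loss = F.cross_entropy(token_logits.flatten(0, 1),
                              token_targets.flatten(), ignore_index=-100)

    # Predict a mask per object by correlating its hidden state with every
    # spatial location, then supervise it with the annotated mask.
    pred_masks = torch.einsum("bqc,bchw->bqhw", object_hidden, image_feats)
    mask_loss = F.binary_cross_entropy_with_logits(pred_masks, gt_masks.float())

    # The auxiliary term explicitly pushes the model to attend to regions
    # that are relevant to the text it generates.
    return lm_loss + mask_weight * mask_loss
```

Keeping the auxiliary term weighted leaves the language-modeling objective dominant while still grounding the generated text in annotated object regions.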

 

4. To solve the problem that general vision models suffer from large parameter counts and high computational complexity, which hinders their deployment in resource-constrained scenarios, the dissertation presents a versatile, lightweight design for vision transformers. An exhaustive computational analysis is first conducted, highlighting the substantial computational demands of high-dimensional feed-forward networks. The dissertation therefore introduces a sparse feed-forward network, SparseFFN, a lightweight, plug-and-play module that reduces complexity across various vision transformers. On one hand, SparseFFN sparsifies connections among high-dimensional channels through group linear layers and channel-shuffle operations; on the other hand, it consolidates local features and shares computation among tokens by downsampling in the spatial dimension, significantly reducing complexity without compromising model performance. Comprehensive experiments across a broad spectrum of vision models substantiate the effectiveness and universality of this method, underscoring its potential to improve the efficiency of the general vision model.
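
The following PyTorch sketch illustrates the two ideas behind such a sparse feed-forward block: grouped channel projections with a channel shuffle, and spatial downsampling so neighboring tokens share computation. The class names, pooling factor, and residual wiring are assumptions made for this illustration rather than the exact SparseFFN design.

```python
import torch
import torch.nn as nn

class GroupLinear(nn.Module):
    """Linear layer whose channels are split into groups, cutting the dense
    channel-to-channel computation by a factor of `groups`."""
    def __init__(self, in_dim, out_dim, groups=4):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, out_dim, kernel_size=1, groups=groups)

    def forward(self, x):                                    # x: (B, N, C_in)
        return self.proj(x.transpose(1, 2)).transpose(1, 2)

def channel_shuffle(x, groups):
    # Interleave channels so information still flows across groups.
    b, n, c = x.shape
    return x.view(b, n, groups, c // groups).transpose(2, 3).reshape(b, n, c)

class SparseFFN(nn.Module):
    def __init__(self, dim=384, hidden=1536, groups=4, pool=2):
        super().__init__()
        self.factor = pool
        self.pool = nn.AvgPool2d(pool)        # nearby tokens share one computation
        self.up = nn.Upsample(scale_factor=pool, mode="nearest")
        self.fc1 = GroupLinear(dim, hidden, groups)
        self.fc2 = GroupLinear(hidden, dim, groups)
        self.act = nn.GELU()
        self.groups = groups

    def forward(self, x, h, w):                              # x: (B, H*W, C)
        b, n, c = x.shape
        feat = self.pool(x.transpose(1, 2).reshape(b, c, h, w))   # downsample
        tokens = feat.flatten(2).transpose(1, 2)
        tokens = channel_shuffle(self.act(self.fc1(tokens)), self.groups)
        tokens = self.fc2(tokens)
        feat = tokens.transpose(1, 2).reshape(b, c, h // self.factor, w // self.factor)
        out = self.up(feat).flatten(2).transpose(1, 2)
        return x + out                                       # residual keeps token count
```

In a transformer block this would stand in for the standard two-layer MLP, with `h * w` equal to the number of tokens and the spatial resolution assumed divisible by the pooling factor.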

Keywords: Image Recognition; Attention Mechanism; Foundation Vision Model; General Vision Model
Language: Chinese (中文)
Document Type: Doctoral dissertation (学位论文)
Identifier: http://ir.ia.ac.cn/handle/173211/56643
Collection: 毕业生_博士学位论文 (Graduates / Doctoral Dissertations)
Recommended Citation (GB/T 7714):
陈志扬. 基于序列自注意机制的通用视觉模型研究[D], 2024.
Files in This Item:
基于序列自注意机制的通用视觉模型-v05 (16306 KB), dissertation (学位论文), restricted access (限制开放), CC BY-NC-SA