Research on Key Technologies for Visual Recognition under Long-Tailed Distributions (面向长尾分布的视觉识别关键技术研究)
李俊
2024-05
Pages: 78
Subtype: Master's thesis
Abstract

    Achieving good performance on datasets that reflect real-world data distributions is an important step toward deploying a model in practice. Real-world data, however, typically follow a long-tailed distribution, which poses challenges for many existing algorithms. The problems that long-tailed data cause for model training fall broadly into two aspects: first, the large gap in training-sample counts between head and tail categories leaves the model's prediction accuracy on tail categories noticeably low; second, tail categories have so few samples that they are poorly representative and provide little information. Starting from these two aspects, this thesis takes a deep learning approach, studies several open problems in current research, and proposes corresponding solutions. Its contributions are summarized as follows:
    1. To address the model's divergent learning preferences on head and tail categories, this thesis proposes a nested collaborative learning framework. Specifically, the framework trains multiple expert models collaboratively to better exploit the limited data and mine the visual features that matter most for classification. Collaborative learning operates both within each expert and across experts; both forms effectively reduce the uncertainty of model predictions and let knowledge learned by one expert propagate to the others, clearly improving every individual expert. The nested structure comes from the proposed hard category mining method, which selects negative categories with high prediction scores as hard categories, forming a nested relationship between a local category set and the full category set (a toy sketch of this step follows the abstract). Under this nested relationship the model learns not only from a global view over all categories but also from a local view, which helps it capture global, stable features while also distinguishing finer-grained ones, greatly strengthening its ability to separate easily confused categories. The method achieves state-of-the-art performance on multiple long-tailed datasets, and extensive analysis experiments confirm its effectiveness.
    2. To address the scarcity and weak representativeness of tail-category samples, this thesis proposes drawing richer side information from a pre-trained vision-language multimodal large model. Since applying such a large model directly consumes substantial resources, the thesis further proposes a text-guided prompt tuning method that transfers the pre-trained multimodal model to downstream long-tailed classification tasks at a much lower computational cost. The method follows the prompt tuning paradigm and cuts a large amount of GPU memory by deferring the learning of class centers. Because this deferral reduces the number of learnable prompts, the thesis introduces compound text supervision to improve the quality of prompt generation: the supervision is split into a category level, which enforces inter-class separability, and a content level, which captures intra-class variation (a minimal sketch of this scheme also follows the abstract). The proposed method removes the model's reliance on predefined class names at inference time, enabling more flexible prompt generation, and it shrinks the text encoder's input, sharply reducing GPU memory consumption. It delivers significant gains across multiple experimental settings, including long-tailed recognition, few-shot recognition, and domain generalization, demonstrating both its effectiveness and generality.
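The abstract describes hard category mining only at a high level. The PyTorch sketch below illustrates one plausible reading of it, assuming the local class set is built from the ground-truth class plus the k top-scoring negative classes; hard_category_loss, k, and the toy logits are illustrative names, not the thesis's actual implementation.

```python
import torch
import torch.nn.functional as F

def hard_category_loss(logits: torch.Tensor, target: torch.Tensor, k: int = 30) -> torch.Tensor:
    """Cross-entropy over a nested local class set: the ground-truth class
    plus the k highest-scoring negative classes (the "hard" categories)."""
    # Mask the ground-truth class so top-k selects only negative classes.
    neg_logits = logits.scatter(1, target.unsqueeze(1), float("-inf"))
    hard_idx = neg_logits.topk(k, dim=1).indices                 # (B, k) hard negatives
    subset = torch.cat([target.unsqueeze(1), hard_idx], dim=1)   # (B, k+1) local class set
    sub_logits = logits.gather(1, subset)                        # restrict logits to the subset
    # The ground-truth class sits at position 0 of every local class set.
    return F.cross_entropy(sub_logits, torch.zeros_like(target))

B, C = 16, 1000
logits = torch.randn(B, C)                  # one expert's predictions
target = torch.randint(0, C, (B,))
# Global view (all classes) nested with the local view (hard-category subset).
loss = F.cross_entropy(logits, target) + hard_category_loss(logits, target)
# Inter-expert collaboration could then be added as a mutual KL term between
# two experts' predictions, e.g.
# F.kl_div(F.log_softmax(logits_a, 1), F.softmax(logits_b, 1), reduction="batchmean")
```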

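The text-guided prompt tuning pipeline is likewise only summarized above. The sketch below shows the general shape of such a method under stated assumptions: a short sequence of learnable prompt vectors is the only trainable state, a frozen text encoder maps it to a prompt feature, and a compound loss supervises that feature at the category level and the content level. PromptTuner, compound_text_loss, and the toy frozen encoder are hypothetical stand-ins rather than the thesis's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptTuner(nn.Module):
    """Learnable prompt vectors in front of a frozen text encoder; the short
    prompt sequence is the encoder's only input, keeping GPU memory low."""
    def __init__(self, text_encoder: nn.Module, n_prompts: int = 8, dim: int = 512):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)  # trainable
        self.text_encoder = text_encoder
        for p in self.text_encoder.parameters():                         # frozen
            p.requires_grad_(False)

    def forward(self) -> torch.Tensor:
        # Encode the prompt sequence and mean-pool it into one prompt feature.
        out = self.text_encoder(self.prompts.unsqueeze(0))  # (1, n_prompts, dim)
        return out.squeeze(0).mean(dim=0)                   # (dim,)

def compound_text_loss(prompt_feat, category_feat, content_feat):
    """Compound supervision: category-level text features push inter-class
    separability; content-level text features capture intra-class variation."""
    cat_loss = 1 - F.cosine_similarity(prompt_feat, category_feat, dim=-1)
    content_loss = 1 - F.cosine_similarity(prompt_feat, content_feat, dim=-1)
    return cat_loss + content_loss

# Toy frozen "text encoder" standing in for a real pre-trained one.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=1)
tuner = PromptTuner(encoder)
category_feat = torch.randn(512)   # e.g. embedding of a class-level description
content_feat = torch.randn(512)    # e.g. embedding of an image-specific caption
loss = compound_text_loss(tuner(), category_feat, content_feat)
loss.backward()                    # gradients reach only tuner.prompts
```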

Keywords: long-tailed distribution; collaborative learning; vision-language multimodal large models; efficient prompt tuning
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57129
Collection: 毕业生_硕士学位论文 (Graduates / Master's Theses)
Recommended Citation
GB/T 7714
李俊. 面向长尾分布的视觉识别关键技术研究[D], 2024.
Files in This Item:
File Name/Size: 面向长尾分布的视觉识别关键技术研究-fi (9043KB)
DocType: Thesis
Access: Restricted Access
License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.