English Abstract

Image recognition is one of the most fundamental and important tasks in the field of computer vision, and it is vital to helping people understand and manage visual big data. This dissertation focuses on image classification and object detection within the image recognition task. The former aims to assign a category label to an entire image, while the latter further classifies and locates the object instances within the image. Nowadays, image recognition can be accurately achieved by deep learning models, but training such models usually requires a large amount of labeled data. However, in real-life scenarios, annotating training images for the image recognition task is costly and time-consuming. Under budget constraints, practitioners have to conduct model training with limited labeled data.
The success of deep learning rests on the assumption that the training set shares the same distribution as the test set. Since the image recognition model is trained to fit the training set, it can generalize well to the test set and achieve satisfactory performance. However, when the labeled training set is small, this assumption is violated, causing the model to perform poorly on the test set. To prevent the model from overfitting to the training set, a series of related works either follow the semi-supervised training paradigm, resorting to additional unlabeled data to capture the inherent distribution information, or follow the transfer-learning paradigm, lowering the difficulty of training by directly utilizing a pretrained feature extractor with good generalization capability from other visual tasks. However, the performance of these existing works still leaves room for improvement.
This dissertation focuses on image recognition with limited labeled data. Four problem settings with increasing difficulty are considered in sequence: the semi-supervised setting, the long-tailed semi-supervised setting, the semi-supervised few-shot setting, and the few-shot setting. The main technical contributions are summarized as follows:
1. A semi-supervised image recognition method based on the CrossRectify framework is proposed. In the semi-supervised setting, the training set consists of hundreds or thousands of labeled samples plus a larger number of unlabeled samples, forming the labeled and unlabeled training sets, respectively; both contain the same categories. On the unlabeled training set, the commonly used self-labeling-based semi-supervised training framework can neither detect nor rectify incorrect pseudo labels, which severely suppresses the effectiveness of semi-supervised learning. To deal with this issue, a semi-supervised learning framework named CrossRectify is proposed in this dissertation. Specifically, two image recognition models with different initial parameters are trained simultaneously. In the training stage, CrossRectify regards disagreements between the two sets of model predictions as potentially incorrect pseudo labels, and further leverages a cross-rectifying mechanism to decide the actual categories of the pseudo labels. In the inference stage, the two predictions can be merged to further boost the final performance. Experimental results on the Stanford Cars, CUB, PASCAL VOC, and MS COCO datasets validate the effectiveness of the CrossRectify framework. For example, the semi-supervised performance of SSD300 outperforms its fully-supervised counterpart by 3.24% mAP on PASCAL VOC.
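The cross-rectifying mechanism can be sketched as follows. This is a simplified illustration: the function name, the rule for resolving disagreements (trusting the more confident of the two models), and the confidence threshold are assumptions for exposition, not necessarily the dissertation's exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_rectify(logits_a, logits_b, threshold=0.95):
    """Sketch: produce pseudo labels from two differently initialized models.

    Where the models agree, the shared label is kept; where they disagree,
    the more confident model's prediction is adopted (a simplification of
    the cross-rectifying rule). Low-confidence pseudo labels are masked out.
    """
    pa, pb = softmax(logits_a), softmax(logits_b)
    conf_a, lab_a = pa.max(axis=1), pa.argmax(axis=1)
    conf_b, lab_b = pb.max(axis=1), pb.argmax(axis=1)
    # on disagreement, trust the more confident prediction
    pseudo = np.where(conf_a >= conf_b, lab_a, lab_b)
    # retain only pseudo labels whose winning confidence passes the threshold
    keep = np.maximum(conf_a, conf_b) >= threshold
    return pseudo, keep
```

In the full framework each model is then trained on the rectified pseudo labels produced with the help of its peer, rather than on its own potentially erroneous self-labels.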
2. A long-tailed semi-supervised image classification method based on Mixture-of-Experts (MoE) is proposed. In the regular semi-supervised setting, the label distributions of both the labeled and unlabeled training sets are uniform. In the long-tailed semi-supervised setting, however, the label distribution of the labeled training set is imbalanced, and that of the unlabeled training set is unknown. The model tends to predict unlabeled training data as head classes while suffering from low recall on tail classes, which degrades the feature extractor within the model. To address this issue, an MoE-based long-tailed semi-supervised image classification method named ComPlementary Experts (CPE) is proposed in this dissertation. CPE trains three classifier heads with different intensities of logit adjustment to recall the head, medium, and tail classes, respectively. Besides, a classwise batch normalization mechanism is proposed to handle the inconsistent feature distributions between head and tail classes and thereby stabilize the training process. Experimental results on CIFAR-10-LT, CIFAR-100-LT, and STL-10-LT validate the effectiveness of the CPE algorithm. For example, CPE outperforms the state-of-the-art algorithm by about 1% on CIFAR-10-LT.
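The role of logit adjustment at different intensities can be illustrated with the following sketch. In the actual method the three experts are separately trained classifier heads; here, for brevity, a single set of logits is adjusted at three hypothetical intensities to show how a larger tau shifts the prediction toward rarer classes.

```python
import numpy as np

def cpe_adjusted_logits(logits, class_counts, taus=(0.0, 1.0, 2.0)):
    """Sketch of complementary logit adjustment (taus are assumed values).

    Subtracting tau * log(prior) from the logits compensates for the
    imbalanced label distribution: tau = 0 leaves the head-biased logits
    untouched, while larger tau increasingly boosts medium and tail classes.
    """
    prior = np.asarray(class_counts, dtype=float)
    log_prior = np.log(prior / prior.sum())
    # one adjusted view of the logits per "expert"
    return [logits - tau * log_prior for tau in taus]
```

With a long-tailed prior such as 100:10:1, the unadjusted view keeps its head-class bias, whereas the strongest adjustment makes the tail class the most likely prediction for an otherwise uninformative input.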
3. A semi-supervised few-shot image classification method based on feature reconstruction is proposed. The semi-supervised few-shot setting is harder than the semi-supervised setting: the training set is significantly smaller, with only a few labeled samples and dozens of unlabeled samples per category. Most existing semi-supervised few-shot algorithms directly utilize a feature extractor pretrained on other visual tasks and only train a simple generalized linear classifier on top of it. However, these algorithms exploit only the supervision signal from pseudo labels, so the performance of the classifier is restricted by the quality of the pseudo labels. To mitigate the overfitting issue, a semi-supervised few-shot image classification method based on feature reconstruction is proposed in this dissertation, which treats the distribution of the training data as an unsupervised signal to benefit the training process. Experiments validate that this method achieves higher accuracy than existing methods given pseudo labels of the same quality. Furthermore, experimental results on miniImageNet, tieredImageNet, CIFAR-FS, and CUB validate the effectiveness of the proposed method. For example, it outperforms the state-of-the-art method by 1% on CUB.
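A combined objective of this kind can be sketched as below. The linear classifier `W`, the linear decoder `D` that reconstructs features from the classifier's soft assignments, and the weighting `alpha` are all illustrative assumptions; the dissertation's precise reconstruction formulation may differ.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fewshot_losses(W, D, feats_l, labels, feats_u, pseudo, alpha=0.5):
    """Sketch: supervised + pseudo-label cross-entropy, plus an
    unsupervised feature-reconstruction term that also fits the
    classifier to the data distribution (names W, D, alpha assumed).
    W: (d, c) linear classifier; D: (c, d) linear decoder.
    """
    def ce(feats, y):
        p = softmax(feats @ W)
        return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))
    # reconstruct unlabeled features from the classifier's soft assignments
    recon = softmax(feats_u @ W) @ D
    rec_loss = np.mean((recon - feats_u) ** 2)
    return ce(feats_l, labels) + ce(feats_u, pseudo) + alpha * rec_loss
```

The reconstruction term depends only on the unlabeled features themselves, so it remains informative even when the pseudo labels are noisy, which is the intuition behind using the data distribution as an unsupervised signal.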
4. A few-shot image recognition method based on a low-rank subspace is proposed. The regular few-shot setting is more difficult than the semi-supervised few-shot setting: the training set still consists of only a few labeled samples, but no unlabeled samples are available. Due to its limited size, the training set can hardly represent the distribution of the test set, so the image recognition model is prone to overfitting the training set and fails to generalize. As a result, mitigating the overfitting issue is the key to improving few-shot image recognition. This dissertation focuses on few-shot image recognition based on large-scale pretrained vision-language models. According to empirical findings, the model tends to learn generalizable features in the early training stage and non-generalizable features in the later stage. Correspondingly, a few-shot image recognition method named SubPT is proposed in this dissertation. SubPT projects the gradient vector onto a pre-defined low-rank subspace during back-propagation, discarding the adverse gradient components to prevent overfitting. Experimental results on 11 image classification datasets and 1 object detection dataset validate the effectiveness of SubPT. For example, SubPT outperforms the baseline algorithm by 5.42% on the image classification task.
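The gradient-projection step of SubPT can be sketched as follows, assuming an orthonormal basis of the low-rank subspace is already available (e.g. estimated from early-stage gradients; how the basis is constructed is not shown here).

```python
import numpy as np

def project_gradient(grad, basis):
    """Sketch: project a gradient onto a pre-defined low-rank subspace.

    'basis' is an orthonormal (d, k) matrix spanning the subspace.
    Components of the gradient outside the subspace -- the presumed
    non-generalizable directions -- are discarded before the update.
    """
    coords = basis.T @ grad.ravel()            # coordinates inside the subspace
    return (basis @ coords).reshape(grad.shape)
```

Applying this projection to every gradient before the optimizer step restricts the parameter update to the retained directions, which is how the adverse gradient components are suppressed.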