English Abstract

Image recognition is one of the most fundamental and important tasks in the field of computer vision, and it is vital to helping people understand and manage visual big data. This dissertation focuses on image classification and object detection within the image recognition task. The former aims to assign a category label to an entire image, while the latter further classifies and locates the object instances within the image. Nowadays, image recognition can be accurately achieved by deep learning models, but training such models usually requires a large amount of labeled data. However, in real-life scenarios, annotating training images for the image recognition task is costly and time-consuming. Under budget constraints, practitioners have to conduct model training with limited labeled data.
The success of deep learning rests on the assumption that the training set shares the same distribution as the test set. Since the image recognition model is trained to fit the training set, it can generalize well to the test set and achieve satisfactory performance. However, when the labeled training set is small, this assumption is violated, causing the model to perform poorly on the test set. To prevent the model from overfitting to the training set, a series of related works either follow the semi-supervised training paradigm, resorting to additional unlabeled data to capture the inherent distribution information, or follow the transfer-learning paradigm, lowering the difficulty of training by directly utilizing a pretrained feature extractor with good generalization capability from other visual tasks. However, the performance of these existing works still leaves room for improvement.
This dissertation focuses on image recognition with limited labeled data. Four problem settings with increasing difficulty are considered in sequence: the semi-supervised setting, the long-tailed semi-supervised setting, the semi-supervised few-shot setting, and the few-shot setting. The main technical contributions are summarized as follows:
1. A semi-supervised image recognition method based on the CrossRectify framework is proposed. In the semi-supervised setting, the training set consists of hundreds or thousands of labeled samples plus a larger number of unlabeled samples, forming the labeled and unlabeled training sets, respectively; both contain the same categories. On the unlabeled training set, the commonly used self-labeling-based semi-supervised training framework can neither detect nor rectify incorrect pseudo labels, which severely suppresses the effectiveness of semi-supervised learning. To deal with this issue, a semi-supervised learning framework named CrossRectify is proposed in this dissertation. Specifically, two image recognition models with different initial parameters are trained simultaneously. In the training stage, CrossRectify regards disagreements between the two sets of model predictions as potentially incorrect pseudo labels, and further leverages a cross-rectifying mechanism to decide the actual categories of the pseudo labels. In the inference stage, the two predictions can be merged to further boost the final performance. Experimental results on the Stanford Cars, CUB, PASCAL VOC, and MS COCO datasets validate the effectiveness of the CrossRectify framework. For example, the semi-supervised performance of SSD300 outperforms its fully-supervised counterpart by 3.24% mAP on PASCAL VOC.
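The cross-rectifying mechanism can be sketched as follows. This is a simplified illustration: the function name, the rule for resolving disagreements (trusting the more confident of the two models), and the confidence threshold are assumptions for exposition, not necessarily the dissertation's exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_rectify(logits_a, logits_b, threshold=0.95):
    """Sketch: produce pseudo labels from two differently initialized models.

    Where the models agree, the shared label is kept; where they disagree,
    the more confident model's prediction is adopted (a simplification of
    the cross-rectifying rule). Low-confidence pseudo labels are masked out.
    """
    pa, pb = softmax(logits_a), softmax(logits_b)
    conf_a, lab_a = pa.max(axis=1), pa.argmax(axis=1)
    conf_b, lab_b = pb.max(axis=1), pb.argmax(axis=1)
    # on disagreement, trust the more confident prediction
    pseudo = np.where(conf_a >= conf_b, lab_a, lab_b)
    # retain only pseudo labels whose winning confidence passes the threshold
    keep = np.maximum(conf_a, conf_b) >= threshold
    return pseudo, keep
```

In the full framework each model is then trained on the rectified pseudo labels produced with the help of its peer, rather than on its own potentially erroneous self-labels.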
2. A long-tailed semi-supervised image classification method based on Mixture-of-Experts (MoE) is proposed. In the regular semi-supervised setting, the label distributions of both the labeled and unlabeled training sets are uniform. In the long-tailed semi-supervised setting, however, the label distribution of the labeled training set is imbalanced, and that of the unlabeled training set is unknown. The model tends to predict unlabeled training data as head classes while suffering from low recall on tail classes, which degrades the feature extractor within the model. To address this issue, an MoE-based long-tailed semi-supervised image classification method named ComPlementary Experts (CPE) is proposed in this dissertation. CPE trains three classifier heads with different intensities of logit adjustment to recall the head, medium, and tail classes, respectively. Besides, a classwise batch normalization mechanism is proposed to handle the inconsistent feature distributions between head and tail classes and thereby stabilize the training process. Experimental results on CIFAR-10-LT, CIFAR-100-LT, and STL-10-LT validate the effectiveness of the CPE algorithm. For example, CPE outperforms the state-of-the-art algorithm by about 1% on CIFAR-10-LT.
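The role of logit adjustment at different intensities can be illustrated with the following sketch. In the actual method the three experts are separately trained classifier heads; here, for brevity, a single set of logits is adjusted at three hypothetical intensities to show how a larger tau shifts the prediction toward rarer classes.

```python
import numpy as np

def cpe_adjusted_logits(logits, class_counts, taus=(0.0, 1.0, 2.0)):
    """Sketch of complementary logit adjustment (taus are assumed values).

    Subtracting tau * log(prior) from the logits compensates for the
    imbalanced label distribution: tau = 0 leaves the head-biased logits
    untouched, while larger tau increasingly boosts medium and tail classes.
    """
    prior = np.asarray(class_counts, dtype=float)
    log_prior = np.log(prior / prior.sum())
    # one adjusted view of the logits per "expert"
    return [logits - tau * log_prior for tau in taus]
```

With a long-tailed prior such as 100:10:1, the unadjusted view keeps its head-class bias, whereas the strongest adjustment makes the tail class the most likely prediction for an otherwise uninformative input.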
3. A semi-supervised few-shot image classification method based on feature reconstruction is proposed. The semi-supervised few-shot setting is harder than the semi-supervised setting: the training set is significantly smaller, with only a few labeled samples and dozens of unlabeled samples per category. Most existing semi-supervised few-shot algorithms directly utilize a feature extractor pretrained on other visual tasks and only train a simple generalized linear classifier on top of it. However, these algorithms exploit only the supervision signal from pseudo labels, so the performance of the classifier is restricted by the quality of the pseudo labels. To mitigate the overfitting issue, a semi-supervised few-shot image classification method based on feature reconstruction is proposed in this dissertation, which treats the distribution of the training data as an unsupervised signal to benefit the training process. Experiments validate that this method achieves higher accuracy than existing methods given pseudo labels of the same quality. Furthermore, experimental results on miniImageNet, tieredImageNet, CIFAR-FS, and CUB validate the effectiveness of the proposed method. For example, it outperforms the state-of-the-art method by 1% on CUB.
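A combined objective of this kind can be sketched as below. The linear classifier `W`, the linear decoder `D` that reconstructs features from the classifier's soft assignments, and the weighting `alpha` are all illustrative assumptions; the dissertation's precise reconstruction formulation may differ.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fewshot_losses(W, D, feats_l, labels, feats_u, pseudo, alpha=0.5):
    """Sketch: supervised + pseudo-label cross-entropy, plus an
    unsupervised feature-reconstruction term that also fits the
    classifier to the data distribution (names W, D, alpha assumed).
    W: (d, c) linear classifier; D: (c, d) linear decoder.
    """
    def ce(feats, y):
        p = softmax(feats @ W)
        return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))
    # reconstruct unlabeled features from the classifier's soft assignments
    recon = softmax(feats_u @ W) @ D
    rec_loss = np.mean((recon - feats_u) ** 2)
    return ce(feats_l, labels) + ce(feats_u, pseudo) + alpha * rec_loss
```

The reconstruction term depends only on the unlabeled features themselves, so it remains informative even when the pseudo labels are noisy, which is the intuition behind using the data distribution as an unsupervised signal.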
4. A few-shot image recognition method based on a low-rank subspace is proposed. The regular few-shot setting is more difficult than the semi-supervised few-shot setting: the training set still consists of only a few labeled samples, but no unlabeled samples are available. Due to its limited size, the training set can hardly represent the distribution of the test set, so the image recognition model is prone to overfitting the training set and fails to generalize. As a result, mitigating the overfitting issue is the key to improving few-shot image recognition. This dissertation focuses on few-shot image recognition based on large-scale pretrained vision-language models. According to empirical findings, the model tends to learn generalizable features in the early training stage and non-generalizable features in the later stage. Correspondingly, a few-shot image recognition method named SubPT is proposed in this dissertation. SubPT projects the gradient vector onto a pre-defined low-rank subspace during back-propagation, discarding the adverse gradient components to prevent overfitting. Experimental results on 11 image classification datasets and 1 object detection dataset validate the effectiveness of SubPT. For example, SubPT outperforms the baseline algorithm by 5.42% on the image classification task.
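The gradient-projection step of SubPT can be sketched as follows, assuming an orthonormal basis of the low-rank subspace is already available (e.g. estimated from early-stage gradients; how the basis is constructed is not shown here).

```python
import numpy as np

def project_gradient(grad, basis):
    """Sketch: project a gradient onto a pre-defined low-rank subspace.

    'basis' is an orthonormal (d, k) matrix spanning the subspace.
    Components of the gradient outside the subspace -- the presumed
    non-generalizable directions -- are discarded before the update.
    """
    coords = basis.T @ grad.ravel()            # coordinates inside the subspace
    return (basis @ coords).reshape(grad.shape)
```

Applying this projection to every gradient before the optimizer step restricts the parameter update to the retained directions, which is how the adverse gradient components are suppressed.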