Abstract | How to learn image representations as efficiently as humans do has long been one of the most compelling problems in artificial intelligence research. Image representation learning automatically acquires high-quality, interpretable representations from images. Its purpose is to transform images from low-level pixel representations into high-level semantic representations using deep learning algorithms, which in turn supports the understanding and processing of a wide range of downstream visual tasks. Inspired by the learning mechanisms at different stages of development in the human brain, this thesis conducts brain-inspired research on image representation learning that follows the development of the human visual system from structural growth through unsupervised learning to supervised learning. Accordingly, the thesis covers the design of network structure, unsupervised learning, and supervised learning. Specifically, the network-structure work builds on the contrastive learning framework and aims to increase the diversity of information the network preserves during representation learning. The unsupervised-learning work focuses on autoencoders and masked image modeling, aiming to improve the learning and training efficiency of these algorithms. The supervised-learning work builds on classification training tasks, aiming to improve the supervised modeling of categories and enhance generalization performance. The main contributions of this thesis are summarized as follows:
1. The multiple-projector information-divergence contrastive learning framework. Contrastive learning is one of the most widely used unsupervised learning frameworks: a projector is appended to the end of the backbone network, and representations are learned by optimizing a contrastive loss. However, the standard architecture uses a single projector, which may limit the network's ability to capture diverse image features. Inspired by the human brain's capacity to process different information in parallel through multiple visual pathways, this thesis proposes a contrastive learning framework based on multiple-projector information divergence. The framework introduces multiple projectors for contrastive learning and reduces the feature similarity between different projectors, encouraging the network to learn comprehensive visual representations through the cooperative learning of multiple pathways. In addition, this work proposes a balanced projector-training strategy that coordinates the training progress of the different projectors, so that different image features are learned in a balanced way. The framework is architecture-agnostic and can be applied to various network structures. It retains richer image feature information in the learned visual representations and improves performance on several downstream tasks such as classification and detection.
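The thesis does not give implementation details here, but the core idea can be sketched as follows. This is an illustrative sketch, not the thesis's actual method: all function names, the InfoNCE-style contrastive loss, and the use of absolute cosine similarity as the divergence penalty are my own assumptions.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(za, zb, tau=0.1):
    # za, zb: (N, D) L2-normalised embeddings of two augmented views;
    # matching rows are the positive pairs.
    logits = za @ zb.T / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

def multi_projector_loss(h1, h2, projectors, lam=0.1, tau=0.1):
    """Sum of per-projector contrastive losses plus a divergence penalty
    that discourages different projectors from encoding the same features."""
    z1 = [l2norm(h1 @ W) for W in projectors]
    z2 = [l2norm(h2 @ W) for W in projectors]
    loss = sum(info_nce(a, b, tau) for a, b in zip(z1, z2))
    div, pairs = 0.0, 0
    for i in range(len(projectors)):
        for j in range(i + 1, len(projectors)):
            # mean absolute cosine similarity between projector outputs
            div += np.abs(np.sum(z1[i] * z1[j], axis=1)).mean()
            pairs += 1
    return loss + lam * div / max(pairs, 1)

rng = np.random.default_rng(0)
h1 = rng.normal(size=(8, 32))                # backbone features, view 1
h2 = h1 + 0.05 * rng.normal(size=(8, 32))    # slightly perturbed view 2
projectors = [rng.normal(size=(32, 16)) for _ in range(3)]  # 3 linear projectors
loss = multi_projector_loss(h1, h2, projectors)
```

Minimizing the divergence term pushes different projectors toward orthogonal feature subspaces, which is one plausible way to realize the "information divergence" described above.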
2. The local-activity-contrast learning algorithm. The autoencoder is one of the earliest deep unsupervised learning algorithms. Although its task is simple, the encoder gradients produced by the backpropagation algorithm can be small, which limits learning efficiency. To address this issue, this thesis proposes an autoencoder training algorithm based on local errors computed from two forward passes, inspired by the feedback connections that propagate information along the ventral visual pathway. The algorithm successfully generates reconstructed images and trains the network on simple image-generation tasks. It also improves the quality of the representations learned by the encoder, achieving comparable or even better performance than mainstream unsupervised learning algorithms on downstream evaluation tasks.
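One minimal sketch of a two-pass, local-error scheme on a single-layer autoencoder is shown below. This is my own toy reconstruction of the idea, not the thesis's algorithm: the specific update rules (decoder trained on its local reconstruction error, encoder trained to pull the re-encoded reconstruction toward the clean code, with no gradient flowing through the decoder) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 64, 16, 8
X = rng.normal(size=(N, D))               # toy "images" as flat vectors
We = rng.normal(scale=0.3, size=(D, H))   # encoder weights
Wd = rng.normal(scale=0.3, size=(H, D))   # decoder weights
lr = 0.1

def mse(a, b):
    return float(((a - b) ** 2).mean())

mse_before = mse(X, np.tanh(X @ We) @ Wd)

for _ in range(300):
    # forward pass 1: encode and reconstruct the clean input
    h1 = np.tanh(X @ We)
    Xrec = h1 @ Wd
    # forward pass 2: re-encode the reconstruction
    h2 = np.tanh(Xrec @ We)
    # decoder learns from its layer-local reconstruction error
    Wd += lr * h1.T @ (X - Xrec) / N
    # encoder learns from the activity contrast between the two passes:
    # a local gradient step pulls the re-encoded code h2 toward the clean
    # code h1, so no error is backpropagated through the decoder
    We += lr * Xrec.T @ ((h1 - h2) * (1.0 - h2 ** 2)) / N

mse_after = mse(X, np.tanh(X @ We) @ Wd)
```

Even this crude version reduces the reconstruction error while giving the encoder its own local error signal, which is the property the paragraph above attributes to the proposed algorithm.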
3. The semantic-attention masked image modeling framework. Masked image modeling is a recently emerged unsupervised learning framework that greatly improves representation learning performance. However, its random masks may limit the training efficiency of the encoder. Inspired by the eye-movement mechanism by which humans locate points of visual attention, this thesis proposes a semantic-attention masked image modeling framework. Combined with contrastive learning tasks, the framework uses the self-attention maps of the vision transformer to generate semantic activation maps and proposes a sampling strategy that encourages the network to learn different semantic information, greatly reducing the number of training epochs required. In addition, this study finds that semantic activation maps contain heavily redundant patterns of semantic information, such as objects and backgrounds. This work therefore proposes a pattern sampling strategy that balances the learning of different semantic patterns according to each pattern's contribution to the reconstruction loss. The approach significantly improves performance on various downstream tasks, such as classification and detection.
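The masking step can be sketched as attention-weighted sampling of patches. This is a hedged illustration, not the thesis's exact strategy: the softmax temperature, the function name, and the use of CLS-to-patch attention scores are my own assumptions.

```python
import numpy as np

def semantic_mask(attn, mask_ratio=0.75, temperature=0.1, rng=None):
    """Sample a patch mask biased toward high-attention (semantic) patches.

    attn: (P,) per-patch attention scores, e.g. CLS-to-patch self-attention
    averaged over heads. Returns a boolean mask with True = masked."""
    rng = rng or np.random.default_rng()
    w = np.exp((attn - attn.max()) / temperature)  # softmax sampling weights
    p = w / w.sum()
    n_mask = int(round(mask_ratio * attn.size))
    idx = rng.choice(attn.size, size=n_mask, replace=False, p=p)
    mask = np.zeros(attn.size, dtype=bool)
    mask[idx] = True
    return mask

rng = np.random.default_rng(0)
attn = np.linspace(0.0, 1.0, 196)   # 14x14 patch grid with rising attention
mask = semantic_mask(attn, rng=rng)
```

Compared with uniform random masking, this biases the reconstruction task toward semantically salient regions, which is the intuition behind the reduced training epochs claimed above.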
4. The uncertainty-aware classification loss function. Traditional supervised learning usually employs a classification loss to learn representations. However, this framework ignores high-level properties of categories such as uncertainty, which may lead to poor generalization. To address this issue, this thesis proposes a classification loss function based on category uncertainty, inspired by the mechanism by which humans use uncertainty information to adjust decision boundaries. The method models the uncertainty of each category with a Gaussian distribution and modifies the classification decision according to this uncertainty information. A likelihood loss derived from maximum likelihood estimation allows the network to learn the distribution parameters of each category automatically. In addition, this work introduces sort-distance and hard-distance margins to increase inter-class difference and intra-class similarity during training, achieving better generalization performance on both standard and long-tailed classification tasks.
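One common way to realize such a loss is to use per-class isotropic Gaussian log-likelihoods as logits and apply a softmax negative log-likelihood. The sketch below follows that pattern as an assumption; it is not the thesis's formulation, and the sort-distance and hard-distance margins are omitted.

```python
import numpy as np

def gaussian_logits(x, mu, log_var):
    """Log-likelihood of x under an isotropic Gaussian per class, used as logits.

    x: (N, D) features; mu: (K, D) class means; log_var: (K,) log variances.
    Larger variance widens that class's region of the decision boundary."""
    d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
    var = np.exp(log_var)
    return -0.5 * d2 / var - 0.5 * x.shape[1] * log_var    # log N(x; mu_k, var_k I) + const

def uncertainty_nll(x, y, mu, log_var):
    """Softmax cross-entropy over the Gaussian logits (maximum-likelihood loss);
    gradients w.r.t. mu and log_var would let a network learn both."""
    logits = gaussian_logits(x, mu, log_var)
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(y)), y].mean())

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, size=(20, 4)),
                    rng.normal(5.0, 2.0, size=(20, 4))])   # class 1 is noisier
y = np.array([0] * 20 + [1] * 20)
mu = np.array([[0.0] * 4, [5.0] * 4])
log_var = np.log(np.array([1.0, 4.0]))                     # per-class uncertainty
loss_matched = uncertainty_nll(x, y, mu, log_var)
loss_swapped = uncertainty_nll(x, y, mu[::-1].copy(), log_var)  # wrong means
```

With the correct means and variances, `loss_matched` is small, while swapping the class means makes the loss large; in training, both `mu` and `log_var` would be learned parameters, so the decision boundary shifts toward the less uncertain class.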
In summary, the studies in this thesis follow three steps of visual representation learning: designing the network architecture, unsupervised pre-training, and supervised learning. To address typical problems at each step, the thesis draws inspiration from the mechanisms and characteristics of the human brain and proposes several algorithms that improve image representation learning. These algorithms increase the diversity of retained image feature information, enhance the efficiency of unsupervised learning, and improve the generalization performance of supervised learning.