Abstract | How to learn image representations as efficiently as humans do has long been one of the most compelling problems in artificial intelligence research. Image representation learning automatically acquires high-quality, interpretable representations from images. Its purpose is to transform images from low-level pixel representations into high-level semantic representations using deep learning algorithms, which in turn supports the understanding and processing of a wide range of downstream visual tasks. Inspired by the learning mechanisms at different stages of development in the human brain, this thesis conducts brain-inspired research on image representation learning that follows the development of the human visual system from structural growth through unsupervised learning to supervised learning. Accordingly, the thesis covers the design of network structure, unsupervised learning, and supervised learning. Specifically, the network-structure work builds on the contrastive learning framework and aims to increase the diversity of information the network preserves during representation learning. The unsupervised-learning work focuses on autoencoders and masked image modeling, aiming to improve the learning and training efficiency of these algorithms. The supervised-learning work builds on classification training tasks, aiming to improve the supervised modeling of categories and enhance generalization performance. The main contributions of this thesis are summarized as follows:
1. The multiple-projector information-divergence contrastive learning framework. Contrastive learning is one of the most widely used unsupervised learning frameworks: a projector is appended to the end of the backbone network, and representations are learned by optimizing a contrastive loss. However, the standard architecture uses a single projector, which may limit the network's ability to capture diverse image features. Inspired by the human brain's capacity to process different information in parallel through multiple visual pathways, this thesis proposes a contrastive learning framework based on multiple-projector information divergence. The framework introduces multiple projectors for contrastive learning and reduces the feature similarity between different projectors, encouraging the network to learn comprehensive visual representations through the cooperative learning of multiple pathways. In addition, this work proposes a balanced projector-training strategy that coordinates the training progress of the different projectors, so that different image features are learned in a balanced way. The framework is architecture-agnostic and can be applied to various network structures. It retains richer image feature information in the learned visual representations and improves performance on several downstream tasks such as classification and detection.
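The thesis does not give implementation details here, but the core idea can be sketched as follows. This is an illustrative sketch, not the thesis's actual method: all function names, the InfoNCE-style contrastive loss, and the use of absolute cosine similarity as the divergence penalty are my own assumptions.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(za, zb, tau=0.1):
    # za, zb: (N, D) L2-normalised embeddings of two augmented views;
    # matching rows are the positive pairs.
    logits = za @ zb.T / tau
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

def multi_projector_loss(h1, h2, projectors, lam=0.1, tau=0.1):
    """Sum of per-projector contrastive losses plus a divergence penalty
    that discourages different projectors from encoding the same features."""
    z1 = [l2norm(h1 @ W) for W in projectors]
    z2 = [l2norm(h2 @ W) for W in projectors]
    loss = sum(info_nce(a, b, tau) for a, b in zip(z1, z2))
    div, pairs = 0.0, 0
    for i in range(len(projectors)):
        for j in range(i + 1, len(projectors)):
            # mean absolute cosine similarity between projector outputs
            div += np.abs(np.sum(z1[i] * z1[j], axis=1)).mean()
            pairs += 1
    return loss + lam * div / max(pairs, 1)

rng = np.random.default_rng(0)
h1 = rng.normal(size=(8, 32))                # backbone features, view 1
h2 = h1 + 0.05 * rng.normal(size=(8, 32))    # slightly perturbed view 2
projectors = [rng.normal(size=(32, 16)) for _ in range(3)]  # 3 linear projectors
loss = multi_projector_loss(h1, h2, projectors)
```

Minimizing the divergence term pushes different projectors toward orthogonal feature subspaces, which is one plausible way to realize the "information divergence" described above.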
2. The local-activity-contrast learning algorithm. The autoencoder is one of the earliest deep unsupervised learning algorithms. Although its task is simple, the encoder gradients produced by the backpropagation algorithm can be small, which limits learning efficiency. To address this issue, this thesis proposes an autoencoder training algorithm based on local errors computed from two forward passes, inspired by the feedback connections that propagate information along the ventral visual pathway. The algorithm successfully generates reconstructed images and trains the network on simple image-generation tasks. It also improves the quality of the representations learned by the encoder, achieving comparable or even better performance than mainstream unsupervised learning algorithms on downstream evaluation tasks.
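One minimal sketch of a two-pass, local-error scheme on a single-layer autoencoder is shown below. This is my own toy reconstruction of the idea, not the thesis's algorithm: the specific update rules (decoder trained on its local reconstruction error, encoder trained to pull the re-encoded reconstruction toward the clean code, with no gradient flowing through the decoder) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 64, 16, 8
X = rng.normal(size=(N, D))               # toy "images" as flat vectors
We = rng.normal(scale=0.3, size=(D, H))   # encoder weights
Wd = rng.normal(scale=0.3, size=(H, D))   # decoder weights
lr = 0.1

def mse(a, b):
    return float(((a - b) ** 2).mean())

mse_before = mse(X, np.tanh(X @ We) @ Wd)

for _ in range(300):
    # forward pass 1: encode and reconstruct the clean input
    h1 = np.tanh(X @ We)
    Xrec = h1 @ Wd
    # forward pass 2: re-encode the reconstruction
    h2 = np.tanh(Xrec @ We)
    # decoder learns from its layer-local reconstruction error
    Wd += lr * h1.T @ (X - Xrec) / N
    # encoder learns from the activity contrast between the two passes:
    # a local gradient step pulls the re-encoded code h2 toward the clean
    # code h1, so no error is backpropagated through the decoder
    We += lr * Xrec.T @ ((h1 - h2) * (1.0 - h2 ** 2)) / N

mse_after = mse(X, np.tanh(X @ We) @ Wd)
```

Even this crude version reduces the reconstruction error while giving the encoder its own local error signal, which is the property the paragraph above attributes to the proposed algorithm.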
3. The semantic-attention masked image modeling framework. Masked image modeling is a recently emerged unsupervised learning framework that greatly improves representation learning performance. However, its random masks may limit the training efficiency of the encoder. Inspired by the eye-movement mechanism by which humans locate points of visual attention, this thesis proposes a semantic-attention masked image modeling framework. Combined with contrastive learning tasks, the framework uses the self-attention maps of the vision transformer to generate semantic activation maps and proposes a sampling strategy that encourages the network to learn different semantic information, greatly reducing the number of training epochs required. In addition, this study finds that semantic activation maps contain heavily redundant patterns of semantic information, such as objects and backgrounds. This work therefore proposes a pattern sampling strategy that balances the learning of different semantic patterns according to each pattern's contribution to the reconstruction loss. The approach significantly improves performance on various downstream tasks, such as classification and detection.
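The masking step can be sketched as attention-weighted sampling of patches. This is a hedged illustration, not the thesis's exact strategy: the softmax temperature, the function name, and the use of CLS-to-patch attention scores are my own assumptions.

```python
import numpy as np

def semantic_mask(attn, mask_ratio=0.75, temperature=0.1, rng=None):
    """Sample a patch mask biased toward high-attention (semantic) patches.

    attn: (P,) per-patch attention scores, e.g. CLS-to-patch self-attention
    averaged over heads. Returns a boolean mask with True = masked."""
    rng = rng or np.random.default_rng()
    w = np.exp((attn - attn.max()) / temperature)  # softmax sampling weights
    p = w / w.sum()
    n_mask = int(round(mask_ratio * attn.size))
    idx = rng.choice(attn.size, size=n_mask, replace=False, p=p)
    mask = np.zeros(attn.size, dtype=bool)
    mask[idx] = True
    return mask

rng = np.random.default_rng(0)
attn = np.linspace(0.0, 1.0, 196)   # 14x14 patch grid with rising attention
mask = semantic_mask(attn, rng=rng)
```

Compared with uniform random masking, this biases the reconstruction task toward semantically salient regions, which is the intuition behind the reduced training epochs claimed above.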
4. The uncertainty-aware classification loss function. Traditional supervised learning usually employs a classification loss to learn representations. However, this framework ignores high-level properties of categories such as uncertainty, which may lead to poor generalization. To address this issue, this thesis proposes a classification loss function based on category uncertainty, inspired by the mechanism by which humans use uncertainty information to adjust decision boundaries. The method models the uncertainty of each category with a Gaussian distribution and modifies the classification decision according to this uncertainty information. A likelihood loss derived from maximum likelihood estimation allows the network to learn the distribution parameters of each category automatically. In addition, this work introduces sort-distance and hard-distance margins to increase inter-class difference and intra-class similarity during training, achieving better generalization performance on both standard and long-tailed classification tasks.
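One common way to realize such a loss is to use per-class isotropic Gaussian log-likelihoods as logits and apply a softmax negative log-likelihood. The sketch below follows that pattern as an assumption; it is not the thesis's formulation, and the sort-distance and hard-distance margins are omitted.

```python
import numpy as np

def gaussian_logits(x, mu, log_var):
    """Log-likelihood of x under an isotropic Gaussian per class, used as logits.

    x: (N, D) features; mu: (K, D) class means; log_var: (K,) log variances.
    Larger variance widens that class's region of the decision boundary."""
    d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
    var = np.exp(log_var)
    return -0.5 * d2 / var - 0.5 * x.shape[1] * log_var    # log N(x; mu_k, var_k I) + const

def uncertainty_nll(x, y, mu, log_var):
    """Softmax cross-entropy over the Gaussian logits (maximum-likelihood loss);
    gradients w.r.t. mu and log_var would let a network learn both."""
    logits = gaussian_logits(x, mu, log_var)
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(y)), y].mean())

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, size=(20, 4)),
                    rng.normal(5.0, 2.0, size=(20, 4))])   # class 1 is noisier
y = np.array([0] * 20 + [1] * 20)
mu = np.array([[0.0] * 4, [5.0] * 4])
log_var = np.log(np.array([1.0, 4.0]))                     # per-class uncertainty
loss_matched = uncertainty_nll(x, y, mu, log_var)
loss_swapped = uncertainty_nll(x, y, mu[::-1].copy(), log_var)  # wrong means
```

With the correct means and variances, `loss_matched` is small, while swapping the class means makes the loss large; in training, both `mu` and `log_var` would be learned parameters, so the decision boundary shifts toward the less uncertain class.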
In summary, the studies in this thesis follow three steps of visual representation learning: designing the network architecture, unsupervised pre-training, and supervised learning. To address typical problems at each step, the thesis draws inspiration from the mechanisms and characteristics of the human brain and proposes several algorithms that improve image representation learning. These algorithms increase the diversity of retained image feature information, enhance the efficiency of unsupervised learning, and improve the generalization performance of supervised learning.