Research on Deep Neural Network Transfer for Image Recognition


	Nie Xing (聂兴)
	2024-05-14
Pages	138
Degree type	Doctoral
Abstract	Image recognition models based on deep learning are widely used in fields such as autonomous driving, industrial manufacturing, and intelligent security. However, training these models usually requires substantial computational resources, and existing models struggle to adapt directly to new data distributions or scenarios. To address these problems, researchers have proposed deep neural network transfer, which mines the latent knowledge in existing models to avoid the resource waste of repeated training and to help models learn and adapt faster in new scenarios. In recent years, deep neural network transfer methods for image recognition have made significant progress, providing important support for the processing, understanding, and application of visual data. However, these methods typically face the following applicability issues:

1) Parameter efficiency. Existing methods usually fine-tune all or part of the pre-trained model's parameters, so the computational overhead grows linearly with the scale of the pre-trained model, and fine-tuning is prone to overfitting, which degrades performance.
2) Scalability. Most existing methods consider only a single target domain and can hardly meet the need to adapt continuously to multiple target domains.
3) Multi-modal scenarios. Existing methods focus mainly on the image domain and rarely explore how to use the complementary information among multi-modal data to meet the transfer requirements of cross-modal image recognition.

Effectively addressing these problems, and thereby broadening the application scope and improving the performance of deep neural network transfer methods, is a challenging topic that is crucial to the development of related fields. This thesis studies deep neural network transfer for image recognition and provides a solution for each of the issues above. The main contributions are summarized as follows:

1. For the parameter efficiency challenge, a deep neural network transfer method based on prompt learning is proposed. The key idea is to keep the parameters of the pre-trained model frozen and, with an appropriate prompt design, adapt the model to downstream tasks in the target domain by improving the information flow of its intermediate feature maps rather than by updating its parameters directly. The method introduces a prompt-based transfer strategy that learns discriminative visual prompts for each input image. Specifically, a simple and lightweight prompt learning module is added to the pre-trained model. At multiple semantic levels of the pre-trained model, this module extracts task-specific prompt features from the input image. By adaptively aggregating the learned prompts with the multi-level intermediate feature maps of the pre-trained model, the method yields a parameter-efficient target domain model in which each downstream task requires training only a small number of additional parameters. Extensive experiments on benchmark datasets for image classification and semantic segmentation verify the effectiveness of the method.

2. For the scalability challenge, a deep neural network transfer method based on bilateral memory consolidation is proposed. The key idea is to design a memory interaction mechanism inside the model, giving it the ability to keep learning new target domains. Specifically, the method introduces a bilateral memory consolidation mechanism based on feature distillation and momentum-based parameter update. First, the model parameters are decoupled into a short-term memory branch and a long-term memory branch. The short-term memory branch focuses on the representation ability of the model and forms short-term memory through rapid adaptation to recently learned tasks. The long-term memory branch focuses on the anti-forgetting ability of the model and forms long-term memory by learning from a small number of task-balanced samples. Then, feature distillation and momentum-based parameter update drive a dynamic interaction between the two branches to produce rich feature representations, so that the model forms structured knowledge of all previously learned tasks without adding parameters for each target domain. Extensive experiments on public datasets for image classification and semantic segmentation verify the effectiveness of the method.

3. For the multi-modal scenario challenge, a deep neural network transfer method based on differentiable gating is proposed. The key idea is to adaptively extract cross-modal features from the pre-trained model and align the different modalities precisely, in order to transfer knowledge in audio-visual multi-modal scenarios. Specifically, the method develops a differentiable gating framework that performs a feature search on the pre-trained model through a bidirectional gating guidance module and learns a set of dynamic gating masks during training through joint differentiable optimization. These learnable masks retrieve specific knowledge by adaptively activating features of the pre-trained model. To make further use of the knowledge transferred to the target domain model, the method then introduces a dynamic query enhancement module that adaptively enhances the query vectors according to the features extracted from the different modalities. This mitigates the imbalance between the sounding object and the background regions and helps generate visual masks specific to the sounding object. Extensive experiments on benchmark datasets for audio-visual segmentation verify the effectiveness of the method.
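
As an illustration of the second contribution, the interaction between the two memory branches can be sketched in a few lines of plain Python. This is only a minimal sketch under stated assumptions, not the thesis's implementation: the abstract does not say which branch receives the momentum update, what the momentum value is, or which distillation loss is used. The sketch assumes that the long-term branch tracks the short-term branch as an exponential moving average and that feature distillation is a mean squared error between feature vectors. The names `Params`, `momentum_update`, and `feature_distillation_loss` are hypothetical.

```python
from typing import Dict, List

# Hypothetical container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def momentum_update(long_term: Params, short_term: Params,
                    momentum: float = 0.99) -> Params:
    """Return new long-term parameters as an exponential moving average.

    Assumption (not stated in the abstract): the long-term branch follows the
    short-term branch, new_long = momentum * long + (1 - momentum) * short.
    The inputs are not modified.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    if long_term.keys() != short_term.keys():
        raise ValueError("both branches must hold the same parameter names")
    updated: Params = {}
    for name, long_values in long_term.items():
        short_values = short_term[name]
        if len(long_values) != len(short_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        updated[name] = [
            momentum * l + (1.0 - momentum) * s
            for l, s in zip(long_values, short_values)
        ]
    return updated


def feature_distillation_loss(student: List[float],
                              teacher: List[float]) -> float:
    """Mean squared error between two feature vectors.

    Assumption: MSE stands in for whatever distillation loss the thesis uses.
    """
    if not student or len(student) != len(teacher):
        raise ValueError("feature vectors must be non-empty and equal in length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
```

With a momentum close to 1, the long-term parameters change slowly, which is one common way to trade fast adaptation against forgetting. The value 0.99 above is a placeholder, not a setting reported in the thesis.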
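Keywords	image recognition; deep neural network transfer; prompt learning; bilateral memory consolidation; differentiable gating
Language	Chinese
Document type	Thesis
Identifier	http://ir.ia.ac.cn/handle/173211/57426
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Nie Xing. Research on Deep Neural Network Transfer for Image Recognition [D], 2024.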

Files in this item
File name/size	Document type	Version type	Access type	License
面向图像识别的深度神经网络迁移研究.pd (14000KB)	Thesis		Restricted access	CC BY-NC-SA