基于知识对齐与蒸馏的持续学习方法研究 (Research on Continual Learning Methods Based on Knowledge Alignment and Distillation)
李焜炽
2024-05-16
Pages: 140
Subtype: Doctoral
Abstract

A key step toward general artificial intelligence is to build AI systems that, like humans, can continually learn new knowledge flexibly and efficiently and rapidly adapt to dynamically changing real-world environments. However, because current data-driven artificial neural networks rest on the assumption of a static, closed learning environment, they can neither retain and accumulate previously learned knowledge when learning new tasks nor apply that knowledge to future learning. This phenomenon, in which prior memories are forgotten while new tasks are learned, is known as catastrophic forgetting. In the continual learning setting, new task data are usually not independent and identically distributed with respect to the old data the intelligent system has previously learned; that is, the distribution, features, or relationships of the new data may differ markedly from those of the data learned before. Because the old model has been trained only on past tasks and has never seen the new data, its learned representations lack the adaptability to accommodate them. This deficiency causes significant representation drift when the model is updated, which further aggravates catastrophic forgetting. To address this problem, this work focuses on dynamically improving the adaptability of the model's learned representations to new data and aligning the representations of old and new classes, so as to reduce the differences between tasks, alleviate the feature drift that arises after learning new tasks, and thereby mitigate catastrophic forgetting. By handling the differences between new and old data in a continuously changing, non-stationary data stream, an intelligent system can better cope with changes in the real world and keep learning new knowledge, improving both its continual learning performance and its adaptability to open environments. The main work and innovations of this dissertation are as follows:

(1) This dissertation first analyzes the knowledge differences between non-i.i.d. new and old data and the adverse effect of these differences on the overall performance of continual learning methods. Drawing on the continual learning behavior of human children, it proposes a framework with three training stages. First, a feature calibration network is trained by combining replay of mixed images and stored features with knowledge distillation, producing a knowledge-aligned teacher network. Then, while learning the new task, various knowledge distillation-based continual learning methods are used to transfer the calibrated old-task knowledge from this teacher network to the new model. Finally, a feature adaptation technique is applied to update the stored features of old tasks. Because both the first and second stages use knowledge distillation, the method forms a cascaded knowledge distillation framework. Experimental results show that the main method achieved state-of-the-art performance at the time on three datasets, confirming that knowledge alignment can significantly improve the effectiveness of knowledge transfer. In addition, the cascaded knowledge distillation framework is shown to be a general learning framework that applies to mainstream knowledge distillation-based continual learning methods and markedly improves their overall performance. This finding provides important guidance and insight for research on continual learning.

(2) The cascaded knowledge distillation framework uses only the features extracted by the old model as input to the feature calibration network. When the old model does not generalize well to both old and new classes, this choice degrades knowledge alignment and hence the overall performance of continual learning. To address this issue, and to reduce the number of training stages and thereby save training time and resources, this dissertation proposes an improved method that integrates a dynamic model with block-wise knowledge distillation. The improved framework consists of two training stages. In the first stage, an expansion module is trained to discriminate between new and old classes; it helps discover key features that belong to old classes and distinguish them from new classes but were missed by the old model. These missed features are used to calibrate the representations of old classes and enhance their generalization. In the second stage, the proposed block-wise knowledge distillation method, combined with long-tailed data learning techniques, transfers the calibrated old-task knowledge to the new model while the new classes are learned. Block-wise knowledge distillation partitions the softmax layer into multiple blocks according to the number of samples per class and distills each block separately. Because the class frequencies within a block are identical or similar, the softmax outputs within a block exhibit little or no bias, which provides balanced supervision for both old and new classes and effectively addresses the data imbalance problem. Extensive experiments confirm that the improved framework significantly boosts the performance of the original method and achieves state-of-the-art results on four continual learning benchmark datasets; in particular, it improves class-incremental average accuracy by 1.2% to 19.0% over the latest methods.

(3) Traditional knowledge distillation computes the distillation loss with one-to-one information matching. With insufficient old-task samples and highly imbalanced data in continual learning, this matching biases the decision boundary towards new tasks. To address this, this dissertation proposes a one-to-many information matching method based on data mixing. The method first obtains mixed data through data-mixing augmentation. It then decouples the mixed data from the corresponding supervision signals and places greater emphasis on old classes: supervision computed from old-class samples is preferentially assigned to the mixed data, which is equivalent to treating these mixed samples as training samples of old classes. This pushes the decision boundary of old classes towards the new classes so that it encompasses the mixed data, thereby expanding the decision boundary of old classes and alleviating the bias. Experimental results show that the method achieves state-of-the-art performance on four continual learning benchmark datasets and can be combined with the previously proposed cascaded knowledge distillation framework that integrates feature-enhancement calibration and block-wise distillation, improving class-incremental average accuracy by 2.2% to 4.7% on the four datasets.

Other Abstract

One key to achieving general artificial intelligence is to build an artificial intelligence (AI) system that can continually learn new knowledge and adapt quickly to dynamically changing real-world environments with the same flexibility and efficiency as humans. However, the assumption of a static, closed learning environment means that current data-driven artificial neural networks can neither retain and accumulate previously learned knowledge when learning new tasks nor apply that knowledge to future learning. This phenomenon is known as catastrophic forgetting in artificial neural networks: prior memories are forgotten when new tasks are learned. In the context of continual learning, new task data are usually not independent and identically distributed (non-i.i.d.) with respect to the old data the intelligent system has previously learned, meaning that the distribution, features, or relationships of the new data may differ significantly from the data learned before. Since the old model has been trained only on past tasks and has not encountered the new data, its learned representations lack the adaptability to accommodate these new data. This deficiency leads to significant representation drift when the model is updated, further exacerbating catastrophic forgetting. To address this issue, this study focuses on dynamically enhancing the adaptability of the model's learned representations to new data and aligning the representations of old and new categories to reduce differences between tasks, thereby alleviating feature drift after learning new tasks and mitigating catastrophic forgetting. By handling the differences between new and old data in a continuously changing, non-stationary data stream, intelligent systems can better cope with changes in the real world, continually learn new knowledge, and improve both their continual learning performance and their adaptability to open environments. The main work and innovations of this study are as follows:

(1) This work first analyzes the knowledge differences present in non-independent and identically distributed new and old data, and the adverse effects of these differences on the overall performance of continual learning methods. To address this problem, we draw inspiration from the continual learning behavior of human children and propose a three-stage training framework. First, by combining replay of mixed images and stored features with knowledge distillation, we train a feature calibration network to generate a knowledge-aligned teacher network. Then, using various knowledge distillation-based continual learning methods, we transfer the calibrated old-task knowledge from this teacher network to the new model while learning new tasks. Finally, we apply a feature adaptation technique to update the stored features of old tasks. Since both the first and second stages use knowledge distillation, the method forms a cascaded knowledge distillation framework. Experimental results show that our main method achieved the best performance at the time on three datasets, verifying that knowledge alignment can significantly improve the effectiveness of knowledge transfer. In addition, the cascaded knowledge distillation framework is shown to be a general framework applicable to mainstream knowledge distillation-based continual learning methods, significantly improving their overall performance. This finding provides important guidance and insight for research on continual learning.
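
The following PyTorch sketch illustrates the distillation (second) stage of such a cascaded framework under simplified assumptions: a frozen old backbone `old_net` and classifier head `old_head`, a stage-1 feature calibration network `calib`, toy dimensions, and a standard temperature-scaled KL distillation term. All names, architectures, and hyperparameters are illustrative stand-ins, not the dissertation's actual implementation.

```python
# Hypothetical sketch of the second (distillation) stage of a cascaded framework:
# the knowledge-aligned teacher is obtained by passing the frozen old backbone's
# features through the stage-1 calibration network, and the new model is trained
# with cross-entropy on the current batch plus a KD loss against that teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, n_old, n_new, T = 64, 10, 5, 2.0

old_net  = nn.Sequential(nn.Linear(32, feat_dim), nn.ReLU())      # frozen old backbone (toy)
calib    = nn.Linear(feat_dim, feat_dim)                          # stage-1 feature calibration network
old_head = nn.Linear(feat_dim, n_old)                             # frozen old classifier head
student  = nn.Sequential(nn.Linear(32, feat_dim), nn.ReLU(),
                         nn.Linear(feat_dim, n_old + n_new))      # new model being trained
for p in list(old_net.parameters()) + list(calib.parameters()) + list(old_head.parameters()):
    p.requires_grad_(False)

opt = torch.optim.SGD(student.parameters(), lr=0.01)
x = torch.randn(8, 32)                        # mini-batch: new-task samples plus replayed exemplars
y = torch.randint(0, n_old + n_new, (8,))     # labels over all classes seen so far

with torch.no_grad():
    teacher_logits = old_head(calib(old_net(x)))   # calibrated (knowledge-aligned) teacher outputs

logits = student(x)
ce = F.cross_entropy(logits, y)                    # learn the new task
kd = F.kl_div(F.log_softmax(logits[:, :n_old] / T, dim=1),
              F.softmax(teacher_logits / T, dim=1),
              reduction="batchmean") * T * T       # transfer calibrated old-task knowledge
loss = ce + kd
opt.zero_grad()
loss.backward()
opt.step()
```

Stage 1 (training `calib` with replay of mixed images and stored features plus a distillation objective) and stage 3 (feature adaptation of the stored old-task features) are omitted here; the sketch only shows how a calibrated teacher would supervise the new model.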

(2) The cascaded knowledge distillation framework uses only features extracted from the old model as input to the feature calibration network. When the old model does not generalize well to both old and new classes, this choice degrades knowledge alignment and consequently the overall performance of continual learning. To address this issue, and to reduce the number of training stages and thereby save training time and resources, we propose an improved method that combines a dynamic model with block-wise knowledge distillation. The enhanced framework consists of two training stages. In the first stage, we train an expansion module to discriminate between new and old classes. This module helps discover key features that belong to old classes and distinguish them from new classes but were missed by the old model. We use these missed features to calibrate the old-class representations and enhance their generalization. In the second stage, we adopt the proposed block-wise knowledge distillation method, combined with long-tailed data learning techniques, to transfer the calibrated old-task knowledge to the new model while learning new classes. Block-wise knowledge distillation divides the softmax layer into multiple blocks according to the number of samples per class and distills each block separately. Because the class frequencies within each block are identical or similar, the softmax outputs within a block exhibit little or no bias. This provides balanced supervision for both old and new classes and effectively addresses the data imbalance problem. Extensive experiments confirm that the improved framework significantly enhances the performance of the original method and achieves state-of-the-art results on four continual learning benchmark datasets. In particular, it improves class-incremental average accuracy by 1.2% to 19.0% over the latest methods.
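
A minimal sketch of how such a block-wise distillation loss could look, assuming the class-to-block assignment has already been derived from per-class sample counts; the function name, block layout, and temperature are illustrative assumptions rather than the dissertation's implementation.

```python
# Hypothetical block-wise distillation: classes are grouped into blocks whose
# members have the same or similar sample counts, and the KL distillation is
# computed on a softmax taken *within* each block, so frequently seen new
# classes cannot overwhelm the soft targets of rarely replayed old classes.
import torch
import torch.nn.functional as F

def blockwise_kd(student_logits, teacher_logits, blocks, T=2.0):
    """blocks: list of index tensors, each holding classes of similar frequency."""
    loss = student_logits.new_zeros(())
    for idx in blocks:
        s = student_logits[:, idx] / T            # renormalize only within this block
        t = teacher_logits[:, idx] / T
        loss = loss + F.kl_div(F.log_softmax(s, dim=1),
                               F.softmax(t, dim=1),
                               reduction="batchmean") * T * T
    return loss / len(blocks)

# Toy example: 10 old classes with few replayed samples, 5 new classes with many.
student_logits = torch.randn(8, 15, requires_grad=True)
teacher_logits = torch.randn(8, 15)
blocks = [torch.arange(0, 10), torch.arange(10, 15)]   # grouped by per-class sample count
blockwise_kd(student_logits, teacher_logits, blocks).backward()
```

Because the softmax is renormalized inside each block, the teacher's old-class probabilities are not crushed by the much larger logits of the data-rich new classes, which is what yields the balanced supervision described above.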

(3) Traditional knowledge distillation methods use one-to-one information matching to compute the distillation loss. With insufficient old-task training data and highly imbalanced data in continual learning, this matching biases the decision boundary towards new tasks. To address this, we propose a one-to-many information matching method based on data mixing. The method first obtains mixed data through data-mixing augmentation. It then decouples the mixed data from the corresponding supervision signals and places greater emphasis on old categories: supervision computed from old-category samples is preferentially assigned to the mixed data, which is equivalent to treating the mixed samples as training samples of old categories. This pushes the old-category decision boundary towards the new categories so that it encompasses the mixed data, thereby expanding the decision boundary of old categories and mitigating the bias. Experimental results show that the proposed method achieves state-of-the-art performance on four continual learning benchmark datasets. Moreover, it can be combined with the previously proposed cascaded knowledge distillation framework that integrates feature-enhancement calibration and block-wise knowledge distillation, improving class-incremental average accuracy by 2.2% to 4.7% on the four datasets.
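
The sketch below shows one way the one-to-many matching could be realized, under the assumption that it reduces to mixup-style input mixing whose supervision (label and teacher response) is taken preferentially from the old-class exemplar; all names and the Beta mixing coefficient are illustrative.

```python
# Hypothetical one-to-many matching with data mixing: a new-task sample is mixed
# with an old-class exemplar, but the mixed input is supervised with the OLD
# side's label and with the teacher's response to the old exemplar, so the
# mixed point is pulled inside the old-class decision region.
import torch
import torch.nn.functional as F
from torch.distributions import Beta

def mix_and_supervise(x_old, y_old, x_new, student, teacher, T=2.0, alpha=1.0):
    lam = Beta(alpha, alpha).sample()                 # mixup-style mixing coefficient
    x_mix = lam * x_old + (1.0 - lam) * x_new
    logits = student(x_mix)
    with torch.no_grad():
        t_old = teacher(x_old)                        # supervision computed from the old sample
    ce = F.cross_entropy(logits, y_old)               # treat the mixed sample as an old-class sample
    kd = F.kl_div(F.log_softmax(logits[:, :t_old.size(1)] / T, dim=1),
                  F.softmax(t_old / T, dim=1),
                  reduction="batchmean") * T * T
    return ce + kd

# Toy usage with linear stand-ins for the new model and the old (teacher) model.
student = torch.nn.Linear(32, 15)    # 10 old + 5 new classes
teacher = torch.nn.Linear(32, 10)    # old model covers only the 10 old classes
x_old, x_new = torch.randn(8, 32), torch.randn(8, 32)
y_old = torch.randint(0, 10, (8,))
mix_and_supervise(x_old, y_old, x_new, student, teacher).backward()
```

Assigning the old-class supervision to the mixed sample effectively asks the student to classify points lying between old and new examples as old, which is the boundary-expanding effect the paragraph describes.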

Keyword: Continual Learning; Catastrophic Forgetting; Knowledge Alignment; Cascaded Knowledge Distillation Framework; One-to-Many Information Matching
Subject Area: Artificial Intelligence
Language: Chinese
Document Type: Doctoral Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/57245
Collection: 毕业生_博士学位论文 (Doctoral dissertations of graduates)
Recommended Citation
GB/T 7714
李焜炽. 基于知识对齐与蒸馏的持续学习方法研究[D],2024.
Files in This Item:
File: 李焜炽学位论文.pdf (116614 KB); DocType: Dissertation; Access: Restricted; License: CC BY-NC-SA