

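(1) When old data is unavailable, this thesis proposes a regularization method that requires no storage of the original data: the knowledge distillation continual representation method. The method builds on the classic continual learning method LWF (Learning Without Forgetting) and adds a representation alignment constraint on real speech samples. While computing the distillation loss, it also computes the cosine similarity between the feature vector of real speech learned by the new model and the feature vector of the same real speech in the old model. If the two are close, the new model has inherited the knowledge of the old model. The distillation loss in LWF, in turn, transfers knowledge from the old model to the new model through the outputs of all data on the two models. To flexibly control the relative importance of the distillation loss and the real-speech cosine similarity loss during training, the method assigns a weight coefficient to each. Five continual learning experiments on English and Chinese datasets show that the method reduces AvgEER by 19.12% to 82.56% relative to direct fine-tuning and outperforms the classic LWF method.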

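(2) When old data is available, this thesis proposes an exemplar replay method that requires a small amount of data storage: the boundary generated-speech replay method. Inspired by the classic continual learning method iCaRL, it considers only the generated speech that the model classifies correctly and uses the K-nearest-neighbor algorithm to select and store the m generated samples closest to the class mean vector of real speech. This keeps outliers out of the replay samples while picking generated speech samples that lie at the class boundary, and storing only the generated-speech class saves storage space. Two continual learning experiments on English and Chinese datasets show that the method reduces AvgEER by 38.83% and 14.24%, respectively, relative to the ER method, and by 37.30% and 8.41%, respectively, relative to iCaRL, while needing only half the storage space.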


Other Abstract

Intelligent speech technology has become indispensable in daily life. Voice guidance, audiobooks, intelligent customer service, and similar applications have made people's lives more convenient. However, the misuse of synthetic speech can harm political, economic, and social security, for example by smearing public figures, damaging corporate image and financial market stability, or causing property losses to citizens. Fake audio detection has therefore become a research hotspot in recent years. Because fake audio comes in diverse types and synthesis technology keeps changing, detection models struggle with unknown types of fake audio and with audio from other datasets. Improving the generalization of the model to unknown types of fake audio and across datasets is a difficult problem. This paper proposes a solution from the perspective of model updates, based on continual learning. By designing incremental learning algorithms suited to fake audio detection, the model can continuously learn the features of fake audio and respond promptly to new types. The work and innovations of this paper can be summarized in the following two aspects:

(1) When old data is not available, this paper proposes a regularization-based method that does not store the original data: the detecting fake without forgetting (DFWF) method. It builds on the classical continual learning method LWF (Learning Without Forgetting) and adds a representation alignment constraint on real audio samples: it uses cosine similarity to compare the embeddings of real audio in the new model with the embeddings of the same audio in the old model. If the similarity is high, the new model has inherited the knowledge of the old model. The distillation loss in LWF transfers knowledge from the old model to the new model through the outputs (logits) of all data. To flexibly control the importance of the distillation loss and the real-sample cosine similarity loss, a weight coefficient is added to each loss. Five continual learning experiments on English and Chinese datasets show that DFWF reduces AvgEER by 19.12% to 82.56% compared with direct fine-tuning and outperforms the classical LWF method.
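The combination of losses described in (1) can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the classification term, the temperature `T`, the form `1 - cosine similarity` for the alignment term, the label convention (`real_label=0`), and the names `dfwf_loss`, `alpha`, and `beta` are all assumptions made for illustration. Only the overall structure (a distillation loss on all data plus a weighted cosine alignment term on real samples) comes from the abstract.

```python
import numpy as np


def softmax(z, T=1.0):
    """Row-wise softmax with temperature T (numerically stabilized)."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def dfwf_loss(new_logits, old_logits, new_emb, old_emb, labels,
              real_label=0, alpha=1.0, beta=1.0, T=2.0):
    """Hypothetical DFWF-style objective on one batch.

    new_*/old_* are the outputs of the new and the frozen old model
    on the same inputs. alpha and beta are the assumed weight
    coefficients for the two regularization terms.
    """
    n = len(labels)

    # Assumed supervised term: cross-entropy of the new model on the new data.
    p_new = softmax(new_logits)
    ce = -np.log(p_new[np.arange(n), labels] + 1e-12).mean()

    # LWF-style distillation on all samples: the old model's softened
    # outputs act as soft targets for the new model.
    p_old = softmax(old_logits, T)
    log_p_new = np.log(softmax(new_logits, T) + 1e-12)
    kd = -(p_old * log_p_new).sum(axis=1).mean()

    # Alignment on real samples only: 1 - cosine similarity between the
    # new-model and old-model embeddings of the same real utterance.
    real = labels == real_label
    if real.any():
        a, b = new_emb[real], old_emb[real]
        norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
        cos = (a * b).sum(axis=1) / (norms + 1e-12)
        align = (1.0 - cos).mean()
    else:
        align = 0.0

    total = ce + alpha * kd + beta * align
    return {"total": total, "ce": ce, "kd": kd, "align": align}
```

Under these assumptions, the alignment term is close to zero when the new model reproduces the old model's embeddings of real audio and grows as the two drift apart, while `alpha` and `beta` trade off how strongly each constraint is enforced.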

(2) When old data is available, this paper proposes an experience-replay-based method that requires a small amount of data storage: the boundary forgeries replay method. The method is inspired by the classic continual learning method iCaRL. Among the fake audio that the model detects correctly, the fake samples closest to the average embedding of the real audio are selected by the K-nearest-neighbor algorithm and stored in a buffer. This keeps outliers out of the replay samples while selecting fake audio samples that lie near the class boundary. Because only fake samples are stored, it also saves storage space. Two continual learning experiments on English and Chinese datasets show that the boundary forgeries replay method reduces AvgEER by 38.83% and 14.24%, respectively, compared with the ER method, and by 37.30% and 8.41%, respectively, compared with the iCaRL method.
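The selection step described in (2) can be sketched as follows. This is an illustrative reading of the abstract, not the thesis code: the Euclidean distance, the label convention (`real_label=0`, `fake_label=1`), and the function name `select_boundary_fakes` are assumptions, and the sketch assumes the batch contains at least one real sample.

```python
import numpy as np


def select_boundary_fakes(emb, labels, preds, m, real_label=0, fake_label=1):
    """Return indices of up to m fake samples to keep in the replay buffer.

    emb:    (n, d) embeddings from the current model
    labels: (n,) ground-truth labels
    preds:  (n,) labels predicted by the current model
    """
    # Mean embedding of the real class (assumes at least one real sample).
    real_mean = emb[labels == real_label].mean(axis=0)

    # Candidates: fake samples that the model already detects correctly.
    candidates = np.flatnonzero((labels == fake_label) & (preds == fake_label))

    # Distance of each candidate to the real-class mean (Euclidean is an
    # assumption; the abstract only says "closest").
    dist = np.linalg.norm(emb[candidates] - real_mean, axis=1)

    # Keep the m nearest candidates, i.e. fakes lying near the boundary
    # with the real class.
    nearest = np.argsort(dist, kind="stable")[:m]
    return candidates[nearest]
```

Restricting the candidates to correctly detected fakes is what excludes outliers and mislabeled-looking samples, and ranking by distance to the real-class mean favors fakes near the decision boundary. Only fake samples are returned, which matches the abstract's point about saving storage.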

The above work has not only achieved effective results but has also been granted national patents and software copyrights, and it has been applied in the work of the Ministry of Public Security and the Ministry of Industry and Information Technology.

Document Type: Thesis
Recommended Citation
GB/T 7714
马浩鑫. 基于连续学习的生成语音检测方法研究[D]. 中科院自动化研究所. 中国科学院大学,2022.
Files in This Item:
File Name: 马浩鑫-毕业论文0525-加签名-终终.
Size: 4018KB
DocType: Thesis
Access: Restricted
License: CC BY-NC-SA
