Research on Generated Speech Detection Methods Based on Continual Learning (基于连续学习的生成语音检测方法研究)
马浩鑫
2022-05-17
Pages: 58
Degree type: Master's
Abstract (translated from Chinese)

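Intelligent speech technology has become indispensable in daily life: voice navigation, smart speakers, intelligent customer service, and similar applications have brought people many conveniences, and synthetic speech can now sound close to a human voice. This is, however, a double-edged sword. Misuse of synthetic speech can harm political, economic, and social security, for example by smearing public figures, damaging corporate image and the stability of financial markets, or causing citizens to lose property. Detection of generated speech has therefore become a research hotspot in recent years. Because the types of generated speech are diverse and generation techniques change rapidly, detection models struggle with unknown types of generated speech outside the training set and with generated speech whose features are mismatched across datasets. Improving generalization to unknown types and across datasets has become a major challenge. This thesis proposes a solution from the perspective of model updating. Based on continual learning, it designs incremental learning algorithms suited to generated speech detection so that the model learns generated speech features continually, evolves alongside speech generation technology, and stays up to date, responding quickly to new generated speech and improving generalization. The work and contributions of this thesis can be summarized in the following two aspects: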

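(1) For the case where old data cannot be accessed, this thesis proposes a regularization method that requires no storage of the original data: the knowledge distillation continual representation method. It builds on the classical continual learning method LWF (Learning Without Forgetting) and adds a representation alignment constraint on real speech samples. While the distillation loss is computed, the cosine similarity is also computed between the feature vector that the new model produces for a real speech sample and the feature vector that the old model produces for the same sample; if the two are close, the new model has inherited the knowledge of the old model. The distillation loss in LWF, by contrast, transfers knowledge from the old model to the new model through the outputs of both models on all data. To control flexibly how much the distillation loss and the real speech cosine similarity loss matter during training, the method attaches a weight coefficient to each. Five continual learning experiments on English and Chinese datasets show that the method reduces AvgEER by 19.12% to 82.56% compared with direct fine-tuning and outperforms the classical LWF method.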

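(2) For the case where old data can be accessed, this thesis proposes an exemplar replay method that requires storing only a small amount of data: the boundary generated speech replay method. Inspired by the classical continual learning method iCaRL, it considers the generated speech that the model classifies correctly and uses the K-nearest-neighbor algorithm to select, for storage, the m generated speech samples closest to the class mean vector of real speech. This avoids selecting outliers as replay samples while picking generated speech samples that lie at the class boundary, and storing only the generated speech class saves storage space. Two continual learning experiments on English and Chinese datasets show that the method reduces AvgEER by 38.83% and 14.24%, respectively, compared with the ER method, and by 37.30% and 8.41%, respectively, compared with iCaRL, while needing only half the storage space.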

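This work has produced effective results; the related research has also been granted national patents and software copyrights and has been applied in the operations of bodies such as the Ministry of Public Security and the Ministry of Industry and Information Technology.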

English abstract

Intelligent speech technology is indispensable in daily life. Voice guidance, audiobooks, intelligent customer service, and similar applications have brought a great deal of convenience to people's lives. The misuse of synthetic speech, however, can harm political, economic, and social security, for example by smearing public figures, damaging corporate image and financial market stability, or causing property loss to citizens. Fake audio detection has therefore become a research hotspot in recent years. Because the types of fake audio are diverse and synthesis technology is ever-changing, detection models have difficulty dealing with unknown types of fake audio and with audio from other datasets. Improving the generalization of the model to unknown types of fake audio and across datasets has become a difficult problem. This thesis proposes a solution from the perspective of model updates. The solution is based on continual learning: by designing incremental learning algorithms suitable for fake audio detection, the model can continually learn the features of fake audio and respond to new types in a timely way. The work and innovations of this thesis can be summarized in the following two aspects:

(1) When old data is not available, this thesis proposes a regularization-based method that requires no storage of the original data: the detecting fake without forgetting (DFWF) method. It builds on the classical continual learning method LWF (Learning Without Forgetting) and adds representation alignment on real audio samples: cosine similarity is used to compare the embedding of each real audio sample in the new model with its embedding in the old model. If the similarity is high, the new model has inherited the knowledge of the old model. The distillation loss in LWF, by contrast, transfers knowledge from the old model to the new model through the logits of all data. To control the relative importance of the distillation loss and the real sample cosine similarity loss flexibly, weight coefficients are added to both losses. Five continual learning experiments on English and Chinese datasets show that the DFWF method reduces AvgEER by 19.12% to 82.56% compared with direct fine-tuning and outperforms the classical LWF method.
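As a rough illustration of how the loss terms described in (1) might combine, the NumPy sketch below adds a cross-entropy term on the new data, an LWF-style distillation term computed on all samples, and a cosine alignment term computed on real samples only. The function names, the temperature, the label convention (0 = real), the use of one minus cosine similarity as the alignment penalty, and the weights `alpha` and `beta` are assumptions made for this sketch; the abstract does not give the exact formulation used in the thesis.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax with an optional temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean cross-entropy between model logits and integer class labels."""
    p = softmax(logits)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))


def distillation_loss(new_logits, old_logits, temperature=2.0):
    """Cross-entropy between temperature-softened old and new outputs.

    Computed on all samples, real and fake (the temperature value is an assumption).
    """
    p_old = softmax(old_logits, temperature)
    log_p_new = np.log(softmax(new_logits, temperature) + 1e-12)
    return float(-np.mean(np.sum(p_old * log_p_new, axis=1)))


def real_alignment_loss(new_emb, old_emb, labels, real_label=0):
    """Mean (1 - cosine similarity) between new and old embeddings of real samples."""
    mask = labels == real_label
    if not mask.any():
        return 0.0  # no real samples in this batch
    a, b = new_emb[mask], old_emb[mask]
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    cos = np.sum(a * b, axis=1) / norms
    return float(np.mean(1.0 - cos))


def dfwf_style_loss(new_logits, old_logits, new_emb, old_emb, labels,
                    alpha=1.0, beta=1.0, temperature=2.0):
    """Weighted sum of the three terms; alpha and beta are the weight coefficients."""
    ce = cross_entropy(new_logits, labels)
    kd = distillation_loss(new_logits, old_logits, temperature)
    align = real_alignment_loss(new_emb, old_emb, labels)
    return ce + alpha * kd + beta * align
```

In this sketch the old model is assumed to be frozen, so `old_logits` and `old_emb` act as fixed targets; setting `beta=0` leaves only the cross-entropy and distillation terms, which is the LWF-style part of the sketch.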

(2) When old data is available, this thesis proposes an experience-replay-based method that requires only a small amount of data storage: the boundary forgeries replay method. It is inspired by the classical continual learning method iCaRL. Among the fake audio that the model detects correctly, the fake samples closest to the average embedding of the real audio are selected by the K-nearest-neighbor algorithm and stored in a buffer. This keeps outliers out of the replay samples while selecting fake audio samples that lie near the class boundary, and because only fake samples are stored, it saves storage space. Two continual learning experiments on English and Chinese datasets show that the boundary forgeries replay method reduces AvgEER by 38.83% and 14.24%, respectively, compared with the ER method, and by 37.30% and 8.41%, respectively, compared with the iCaRL method, while requiring only half the storage space.
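The selection step described in (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function name `select_boundary_fakes`, the label convention (0 = real, 1 = fake), and the use of Euclidean distance are hypothetical choices, and the nearest-neighbor query is done here with a plain sort of distances to the real-class mean embedding.

```python
import numpy as np


def select_boundary_fakes(embeddings, labels, predictions, m,
                          real_label=0, fake_label=1):
    """Return indices of up to m correctly detected fake samples that lie
    closest to the mean embedding of the real class.

    embeddings:  (n, d) array of model embeddings
    labels:      (n,) ground-truth labels
    predictions: (n,) labels predicted by the current model
    """
    # Class mean vector of the real audio embeddings.
    real_mean = embeddings[labels == real_label].mean(axis=0)

    # Candidates are fake samples that the model classifies correctly.
    candidates = np.flatnonzero((labels == fake_label) & (predictions == fake_label))

    # Distance from each candidate to the real-class mean (Euclidean is an assumption).
    dists = np.linalg.norm(embeddings[candidates] - real_mean, axis=1)

    # Keep the m nearest candidates; only these fake samples go into the buffer.
    nearest = np.argsort(dists)[:m]
    return candidates[nearest]
```

Restricting the candidates to correctly detected fakes reflects the abstract's point about keeping outliers out of the buffer, and ranking by distance to the real-class mean favors fakes near the class boundary. How the stored samples are mixed with new data during training is not described in the abstract and is left out of the sketch.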

The above work has achieved effective results; it has also been granted national patents and software copyrights and has been applied in the work of the Ministry of Public Security and the Ministry of Industry and Information Technology.

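Keywords: generated speech detection, continual learning, knowledge distillation, exemplar replay
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/48826
Collection: Graduates_Master's theses
Recommended citation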
GB/T 7714
马浩鑫. 基于连续学习的生成语音检测方法研究[D]. 中科院自动化研究所. 中国科学院大学,2022.
Files in this item
File name/size: 马浩鑫-毕业论文0525-加签名-终终.(4018KB)
Document type: Thesis
Access type: Restricted
License: CC BY-NC-SA