Research on Model Fingerprint Analysis for Generated Speech


	ZHANG, CHU YUAN
	2024-06
Pages	64
Degree type	Master's
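Abstract (translated from Chinese)	Advances in speech synthesis have enabled a wide range of applications, but they have also raised concerns about potential misuse. These challenges are especially pressing in forensics and intellectual property protection. Existing forensic techniques for generated speech concentrate on binary discrimination of generated speech and on tracing its source; they do not use the traces that the individual modules of a speech generation system leave in its output to identify and analyze the generation method at a finer granularity. To address this limitation, this thesis focuses on the acoustic model and vocoder modules of mainstream speech generation pipelines and, through feature extraction and analysis of generated speech, seeks to extract the fingerprint features of these modules in order to understand their roles in speech synthesis. Within this framework, a series of experiments was designed and carried out to analyze acoustic model and vocoder fingerprints and to understand how the fingerprints left at different stages of synthesis interact. The main contributions are as follows. (1) Verifying that acoustic model and vocoder fingerprints exist: current detection and attribution methods for generated speech focus on binary discrimination and on tracing the source as a whole, without further identifying and analyzing the traces left by the acoustic model and vocoder modules of mainstream pipelines. To fill this gap, the thesis first verifies that such traces exist by extracting and analyzing the fingerprint features of both modules. Using acoustic feature extraction and a trained Res2Net fingerprint extraction model, experiments on the LibriTTS dataset achieve 98.5% accuracy on acoustic model fingerprint classification and 99.2% accuracy on vocoder fingerprint classification, showing that acoustic models and vocoders leave distinctive fingerprints during speech synthesis. (2) Exploring the relationship between acoustic model and vocoder fingerprints: the thesis analyzes how the fingerprints of different modules affect the overall synthesized output. In essence, it asks whether the acoustic model leaves a more prominent fingerprint in the final synthesized speech than the vocoder, or the reverse. To test these two competing hypotheses, a series of experiments targeting the different modules shows that vocoder fingerprints can still be classified with high accuracy (98.4%) when the acoustic model is unknown, whereas the converse does not hold. The conclusion is that the vocoder may leave a more prominent fingerprint in the final synthesized speech than the acoustic model. (3) Analyzing the robustness of the model fingerprint extraction system: the thesis explores and analyzes the robustness of the fingerprint extraction system under adversarial attacks. The ability to extract identifiable fingerprints under varying conditions is essential to the reliability and security of synthetic speech detection systems and of the speech synthesis process. To this end, the thesis studies how background noise, reverberation, changes in audio speed, and other disturbances that may arise in speech synthesis systems affect whether fingerprints can be extracted and classified. The experiments indicate that although these disturbances affect fingerprint identifiability to some degree, the extraction system retains a degree of robustness and can still extract highly identifiable fingerprint features.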
Abstract (English)	Advances in speech synthesis technology have enabled widespread applications but have also raised concerns about potential misuse. These challenges are especially urgent in forensics and intellectual property protection. Current forensic techniques for synthetic speech focus primarily on binary discrimination of generated speech and on tracing its origin; they do not use the traces left by the individual modules of a speech generation system to identify and analyze the generation method at a finer granularity. To address this limitation, this thesis focuses on the acoustic model and vocoder modules of mainstream speech generation pipelines and aims to extract and analyze their fingerprint features in order to understand their roles in speech synthesis. Within this framework, a series of experiments was designed and implemented to analyze acoustic model and vocoder fingerprints and to understand how the fingerprints left at different stages of speech synthesis interact. The main contributions of this thesis are: (1) Verification of the existence of acoustic model and vocoder fingerprints: current methods for identifying and tracing synthetic speech focus primarily on binary discrimination and on tracing the origin as a whole, without further identifying and analyzing the traces left by the acoustic model and vocoder modules of mainstream speech generation pipelines. To fill this gap, the thesis first verifies that these traces exist by extracting and analyzing the fingerprint features of the acoustic model and the vocoder. Using acoustic feature extraction and a trained Res2Net fingerprint extraction model, verification experiments on the LibriTTS dataset achieved an accuracy of 98.5% in acoustic model fingerprint classification and 99.2% in vocoder fingerprint classification, showing that acoustic models and vocoders leave distinctive fingerprint features during speech synthesis. (2) Examination of the relationship between acoustic model and vocoder fingerprints: the thesis explores how the fingerprints of different modules interact in the overall synthesis process and what each contributes. In essence, it addresses the following question: does the acoustic model leave a more prominent fingerprint in the final synthesized speech than the vocoder, or vice versa? To test these two competing hypotheses, a series of experiments targeting the different modules found that vocoder fingerprints could still be classified with high accuracy (98.4%) when the acoustic model was unknown, whereas the reverse was not true. This leads to the conclusion that the vocoder may leave a more prominent fingerprint in the final synthesized speech than the acoustic model. (3) Analysis of the robustness of the fingerprint extraction system: the thesis explores and analyzes the robustness of the fingerprint extraction system when facing adversarial attacks. The ability to extract identifiable fingerprints under different conditions is crucial to the reliability and security of synthetic speech detection systems and of the speech synthesis process. To this end, the thesis studies the impact on fingerprint attributability of background noise, reverberation, changes in audio speed, and other disturbances that may occur in speech synthesis systems. The experiments indicate that although these disturbances affect the identifiability of fingerprints to some degree, the fingerprint extraction system retains a degree of robustness and can still extract fingerprints with strong identifiability.
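Two of the disturbances named in contribution (3), background noise and audio speed change, can be illustrated with a minimal sketch. This record does not describe the thesis's actual perturbation pipeline, so the function names (`add_noise_at_snr`, `change_speed`), the SNR and speed values, and the sine-tone stand-in for a synthesized utterance are all assumptions made for illustration.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech`, scaled so the mixture has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    # Choose gain g so that power(speech) / power(g * noise) = 10 ** (snr_db / 10).
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def change_speed(speech, factor):
    """Resample by linear interpolation; factor > 1 shortens (speeds up) the signal."""
    n_out = max(1, int(round(len(speech) / factor)))
    out = []
    for i in range(n_out):
        pos = min(i * factor, len(speech) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(speech) - 1)
        frac = pos - lo
        out.append((1 - frac) * speech[lo] + frac * speech[hi])
    return out


# Hypothetical stand-in for a synthesized utterance: one second of a 220 Hz tone at 16 kHz.
rng = random.Random(0)
sr = 16000
speech = [math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sr)]

noisy = add_noise_at_snr(speech, noise, snr_db=10)   # assumed SNR value
fast = change_speed(speech, factor=1.1)              # assumed speed factor
```

In a robustness study of this kind, perturbed copies such as `noisy` and `fast` would be passed to the trained fingerprint classifier and its accuracy compared with the clean condition. Note that speed change by plain resampling also shifts pitch; a tempo change that preserves pitch would need a different technique.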
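Keywords	Generated speech; speech generation method identification; acoustic model; vocoder; model fingerprint analysis
Subject area	Pattern recognition; other disciplines of artificial intelligence
Discipline category	Engineering::Computer Science and Technology (degrees in engineering or science may be awarded)
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/57602
Collection	Graduates_Master's theses
Recommended citation (GB/T 7714)	ZHANG, CHU YUAN. 面向生成语音的模型指纹分析研究[D], 2024.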

Files in this item
File name/size	Document type	Version type	Access type	License
133619671366058076.p (2152KB)	Thesis		Restricted access	CC BY-NC-SA