基于受限样本的语音合成方法研究 (Research on Speech Synthesis Methods Based on Limited Samples)
Author: 汪涛
Date: 2023-06
Pages: 134
Degree type: Doctoral
Abstract (Chinese)

With the continuous development of artificial intelligence, the demand for human-computer interaction is growing rapidly. As an important output channel of human-computer interaction, speech synthesis has received extensive attention from both academia and industry. With finely annotated large-scale corpora, current speech synthesis technology can already produce speech that is remarkably close to human speech. Speech synthesis based on limited samples, in contrast, aims to customize a speech synthesis system with only a small amount of training data from the target speaker (speech duration ranging from a few seconds to tens of minutes). This technology lowers the difficulty of building speech synthesis systems and further improves the interaction experience, with applications in personalized voice assistants, intelligent robots, natural language interaction, and other fields. However, when samples are limited, the target speaker's training data cannot provide the model with sufficient information, which leads to problems with the style controllability, style consistency, and style generalization of the speech synthesis system. Research on speech synthesis methods based on limited samples is therefore of great value, and also highly challenging. Taking deep-learning-based speech synthesis as its theoretical foundation, and aiming to improve the overall quality of speech synthesis systems under limited samples, this thesis studies limited-sample speech synthesis from three aspects, organized by the scale of the available data: acoustic modeling based on style-parameter decoupling with few samples (multiple utterances from the target speaker, with a total duration between tens of seconds and tens of minutes); acoustic modeling based on contextual style awareness with a single sample (only one utterance from the target speaker, lasting between a few seconds and a dozen or so seconds); and, for few or single samples, a multi-style vocoder based on a neural reformulation of the source-filter model. The main research results are as follows:

 

For acoustic modeling with few samples, this thesis addresses the key problems of poor style controllability and model overfitting in the mainstream approach of adapting an acoustic model with a few samples. Starting from the perspective of increasing the degree of style-parameter decoupling in the acoustic model, it investigates in depth how to decouple the spoken content, timbre, and prosodic information of the acoustic model, in two progressive works. First, a few-shot acoustic modeling method based on decoupling spoken content and timbre is proposed. By introducing phoneme prior probability features as an intermediate representation, the method decomposes the acoustic model into a spoken-content prediction module and a voice conversion module that are jointly optimized, thereby decoupling the style parameters of the end-to-end acoustic model. On top of this framework, two few-shot style transfer strategies are proposed, enabling fast and stable speaker style transfer in real-world scenarios regardless of whether the speech samples have text annotations. Second, to decouple the style parameters of the acoustic model more thoroughly, a few-shot acoustic modeling method based on decoupling prosody, timbre, and spoken content is proposed. By automatically learning a set of phoneme-level transition tokens, the method can effectively control the prosodic information in speech. When few-shot style transfer is required, only the speaker-related prosody module and decoder parameters need to be fine-tuned, which effectively reduces the number of fine-tuned parameters and prevents overfitting. Experiments show that the proposed methods effectively decouple the entangled parameters of the acoustic model by style; in addition, a system built on the proposed framework won first place internationally in the extremely-few-sample (only five utterances) cloning task of the 2021 Multi-Speaker Multi-Style Cloning Challenge.

 

For acoustic modeling with a single sample, this thesis addresses the key problem of poor style consistency in the mainstream approach of extracting a style representation from the single sample and embedding it into the speech synthesis system. Starting from the perspective of improving the acoustic model's ability to perceive the contextual style of the single sample, it investigates in depth how to introduce the contextual style information of a single sample and improve the style consistency of speech synthesized under single-sample conditions, in two progressive works. First, inspired by text-based speech editing, an acoustic modeling method based on contextual style awareness is proposed. During training, the input audio is randomly masked and the masked segments are re-predicted from the text and the masked speech, which effectively enables the model to perceive both the semantic information in the text and the contextual style of the masked segments, so that the rich contextual style information in a single sample can be introduced without any style representation vector as guidance. Second, to address the long-text generation and single-sample adaptation problems faced by the proposed context-aware acoustic model in single-sample scenarios, a word-level autoregressive generation method and a single-sample adaptation method are proposed, respectively, which further improve the performance and practicality of the proposed method under single-sample conditions. Experiments show that the proposed method effectively introduces contextual style information and clearly outperforms the commonly used style-representation-based single-sample acoustic modeling methods in both objective metrics and subjective evaluations of speech quality and similarity.

         

For vocoder modeling with few or single samples, this thesis addresses the poor generalization of neural vocoders to unseen styles. Starting from the perspective of combining the strong style generalization of traditional vocoders with the high speech quality of neural vocoders, it investigates in depth a multi-style vocoder based on a neural reformulation of the source-filter model. Specifically, a neural vocoder based on decoupling deterministic and stochastic components is proposed, consisting of four modules: a deterministic source module, a stochastic source module, a neural voiced/unvoiced decision module, and a neural filter module. In addition, to model the source signal more precisely, a multi-band excitation strategy is proposed to further enrich the excitation source. The proposed method combines the principles of the traditional source-filter model with the strong fitting capability of neural networks; compared with purely neural vocoders, it effectively reduces the number of model parameters and improves interpretability, thereby improving the model's generalization and speech quality. Experiments show that the proposed method clearly outperforms the baseline systems in both objective and subjective metrics as well as in runtime efficiency, and detailed ablation studies verify its effectiveness, improving the expressiveness and performance of speech synthesis systems under limited samples.

Abstract (English)

With the development of artificial intelligence technology, the demand for human-computer interaction is growing rapidly. As an important output of human-computer interaction, speech synthesis has received extensive attention from academia and industry. With finely annotated large-scale corpora, current speech synthesis technology can produce speech that is extremely similar to human speech. Speech synthesis based on limited samples, in contrast, aims to customize speech synthesis systems using a small amount of training data from the target speaker (speech duration ranging from a few seconds to tens of minutes). This technology can reduce the difficulty of building speech synthesis systems and further improve the interaction experience, and can be applied to personalized voice assistants, intelligent robots, natural language interaction, and other fields. However, when samples are limited, the training data of the target speaker cannot provide sufficient information to the model, resulting in problems with the style controllability, style consistency, and style generalization of speech synthesis systems. Research on speech synthesis methods based on limited samples therefore has significant value, while also being full of challenges. Taking deep-learning-based speech synthesis methods as its theoretical foundation, and aiming to improve the overall quality of speech synthesis systems under limited samples, this thesis studies limited-sample speech synthesis from three aspects, organized by the scale of the available data: acoustic modeling based on style-parameter decoupling with few samples (multiple utterances from the target speaker, with a total duration of tens of seconds to tens of minutes), acoustic modeling based on contextual style awareness with a single sample (one utterance from the target speaker, lasting from a few seconds to a dozen or so seconds), and a multi-style vocoder based on a neural reformulation of the source-filter model for few or single samples. The main research results are as follows:

 

In the field of acoustic modeling based on few samples, this thesis addresses the key issues of poor style controllability and model overfitting in mainstream approaches that adapt an acoustic model with a few samples. From the perspective of increasing the degree of style-parameter decoupling in the acoustic model, it conducts in-depth research on decoupling the spoken content, timbre, and prosodic information of the acoustic model, including two progressive works. Firstly, a few-shot acoustic modeling method based on decoupling spoken content and timbre is proposed. This method introduces phoneme prior probability features as intermediate features and decomposes the acoustic model into a spoken-content prediction module and a voice conversion module that are jointly optimized, thereby decoupling the style parameters of the end-to-end acoustic model. Based on this framework, two few-shot style transfer strategies are proposed, which enable fast and stable speaker style transfer in real-world scenarios regardless of whether the speech samples have text annotations. Secondly, to achieve a more thorough decoupling of the style parameters of the acoustic model, a few-shot acoustic modeling method based on decoupling prosody, timbre, and spoken content is proposed. By automatically learning a set of phoneme-level transition tokens, this method can effectively control the prosodic information in speech. When the model needs to perform few-shot style transfer, only the speaker-related prosody module and decoder parameters need to be fine-tuned, which effectively reduces the number of fine-tuned parameters and prevents overfitting. Experiments show that the proposed methods effectively decouple the entangled parameters of the acoustic model by style; in addition, a system built on the proposed framework won first place internationally in the extremely-few-sample (only five utterances) cloning task of the 2021 Multi-Speaker Multi-Style Cloning Challenge.
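To make the decoupling concrete, below is a minimal PyTorch sketch of an acoustic model split into a speaker-independent spoken-content predictor (producing phoneme probability features) and a speaker-dependent voice conversion module, with only the conversion part left trainable for few-shot adaptation. Module names, network sizes, and the omission of duration modeling and prosody tokens are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class ContentPredictor(nn.Module):
    """Speaker-independent: text -> phoneme probability features."""
    def __init__(self, n_phonemes=80, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        self.to_ppg = nn.Linear(2 * d_model, n_phonemes)

    def forward(self, phoneme_ids):
        x = self.embed(phoneme_ids)
        x, _ = self.encoder(x)
        return self.to_ppg(x).softmax(dim=-1)      # phoneme prior probability features

class VoiceConverter(nn.Module):
    """Speaker-dependent: phoneme probabilities + speaker embedding -> mel-spectrogram."""
    def __init__(self, n_phonemes=80, n_speakers=100, d_model=256, n_mels=80):
        super().__init__()
        self.spk_embed = nn.Embedding(n_speakers, d_model)
        self.proj = nn.Linear(n_phonemes, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, ppg, speaker_id):
        x = self.proj(ppg) + self.spk_embed(speaker_id).unsqueeze(1)
        x, _ = self.decoder(x)
        return self.to_mel(x)

class DecoupledAcousticModel(nn.Module):
    """Content prediction and voice conversion joined through the intermediate features."""
    def __init__(self):
        super().__init__()
        self.content = ContentPredictor()
        self.converter = VoiceConverter()

    def forward(self, phoneme_ids, speaker_id):
        return self.converter(self.content(phoneme_ids), speaker_id)

model = DecoupledAcousticModel()
# Few-shot adaptation: freeze the speaker-independent content predictor and
# fine-tune only the speaker-dependent conversion/decoder parameters.
for p in model.content.parameters():
    p.requires_grad = False
mel = model(torch.randint(0, 80, (1, 20)), torch.tensor([3]))
print(mel.shape)   # torch.Size([1, 20, 80])
```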

 

In the field of acoustic modeling based on single samples, this thesis addresses the key issue of poor style consistency in mainstream approaches that use a single sample to extract style representations and embed them into speech synthesis systems. From the perspective of improving the acoustic model's ability to perceive the contextual style of the single sample, it conducts in-depth research on how to introduce contextual style information from a single sample and improve the style consistency of synthesized speech under single-sample conditions, consisting of two progressive works. Firstly, inspired by text-based speech editing, an acoustic modeling method based on contextual style awareness is proposed. During training, this method randomly masks the input speech and re-predicts the masked segments from the text and the masked speech, effectively enabling the model to perceive both the semantic information in the text and the contextual style of the masked segments, so that the rich contextual style information in a single sample can be introduced without any style representation vector as guidance. Secondly, to address the long-text generation and single-sample adaptation problems faced by the proposed context-aware acoustic model in single-sample scenarios, a word-level autoregressive generation method and a single-sample adaptation method are proposed, respectively, further improving the performance and practicality of the proposed method under single-sample conditions. Experiments show that the proposed method effectively introduces contextual style information and outperforms the widely used style-representation-based single-sample acoustic modeling methods in both objective metrics and subjective evaluations of speech quality and similarity.
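The masked-reconstruction training objective can be sketched as follows: a contiguous span of mel-spectrogram frames is hidden, and the model re-predicts it from the text plus the surrounding unmasked frames, so the loss is computed only on the masked span. The architecture, masking ratio, and loss choice below are illustrative assumptions rather than the thesis configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedContextAcousticModel(nn.Module):
    """Re-predict masked speech frames from text plus the unmasked speech context."""
    def __init__(self, n_phonemes=80, n_mels=80, d_model=256):
        super().__init__()
        self.text_embed = nn.Embedding(n_phonemes, d_model)
        self.mel_proj = nn.Linear(n_mels, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids, mel, mask):
        # mask: (B, T_mel) boolean, True where frames are hidden from the model.
        text = self.text_embed(phoneme_ids)                     # (B, T_txt, D)
        speech = self.mel_proj(mel)                             # (B, T_mel, D)
        speech = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(speech), speech)
        hidden = self.backbone(torch.cat([text, speech], dim=1))
        return self.to_mel(hidden[:, text.size(1):])            # predictions for speech frames

def random_span_mask(batch, frames, ratio=0.5):
    """Hide one contiguous span covering roughly `ratio` of the frames."""
    mask = torch.zeros(batch, frames, dtype=torch.bool)
    span = max(1, int(frames * ratio))
    for b in range(batch):
        start = torch.randint(0, frames - span + 1, (1,)).item()
        mask[b, start:start + span] = True
    return mask

model = MaskedContextAcousticModel()
phonemes = torch.randint(0, 80, (2, 30))
mel = torch.randn(2, 120, 80)
mask = random_span_mask(2, 120)
pred = model(phonemes, mel, mask)
loss = F.l1_loss(pred[mask], mel[mask])      # reconstruct only the masked span
loss.backward()
```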

 

In the field of vocoder modeling based on few or single samples, this thesis addresses the poor generalization of neural vocoders to unseen styles. From the perspective of combining the strong style generalization of traditional vocoders with the high speech quality of neural vocoders, it conducts in-depth research on multi-style vocoder modeling based on a neural reformulation of the source-filter model. Specifically, a neural vocoder based on the decoupling of deterministic and stochastic components is proposed, consisting of four modules: a deterministic source module, a stochastic source module, a neural voiced/unvoiced decision module, and a neural filter module. In addition, to model the source signal more precisely, a multi-band excitation strategy is proposed to further enrich the excitation source. The proposed method combines the principles of the traditional source-filter model with the powerful fitting capability of neural networks; compared with purely neural vocoders, it effectively reduces the number of model parameters and improves interpretability, thereby enhancing the generalization and speech quality of the model. Experiments demonstrate that the proposed method significantly outperforms the baseline systems in both objective and subjective metrics as well as runtime efficiency, and detailed ablation studies verify its effectiveness, improving the expressiveness and performance of speech synthesis systems with limited samples.
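A minimal sketch of the four-module layout described above follows: a deterministic sine source driven by F0, a stochastic noise source, a neural voiced/unvoiced decision that mixes the two, and a neural filter conditioned on mel features. Network sizes, the single-band sine excitation (the thesis additionally uses multi-band excitation), and the simple convolutional filter are illustrative assumptions, not the proposed model.

```python
import math
import torch
import torch.nn as nn

class SourceFilterVocoder(nn.Module):
    def __init__(self, n_mels=80, sample_rate=22050, hop=256, hidden=64):
        super().__init__()
        self.sample_rate, self.hop = sample_rate, hop
        # Neural voiced/unvoiced decision: a soft per-frame gate from mel features.
        self.vuv_net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 1, 3, padding=1), nn.Sigmoid())
        # Neural filter: shapes the mixed excitation, conditioned on mel features.
        self.filter_net = nn.Sequential(
            nn.Conv1d(1 + n_mels, hidden, 5, padding=2), nn.Tanh(),
            nn.Conv1d(hidden, hidden, 5, padding=2), nn.Tanh(),
            nn.Conv1d(hidden, 1, 5, padding=2))

    def deterministic_source(self, f0):
        """Sine excitation from frame-level F0, upsampled to sample level."""
        f0 = f0.repeat_interleave(self.hop, dim=-1)                 # (B, T_samples)
        phase = torch.cumsum(2 * math.pi * f0 / self.sample_rate, dim=-1)
        return torch.sin(phase)

    def forward(self, mel, f0):
        harmonic = self.deterministic_source(f0)                    # deterministic source
        noise = 0.1 * torch.randn_like(harmonic)                    # stochastic source
        vuv = self.vuv_net(mel.transpose(1, 2))                     # (B, 1, frames)
        vuv = vuv.repeat_interleave(self.hop, dim=-1).squeeze(1)    # sample-level gate
        excitation = vuv * harmonic + (1 - vuv) * noise             # mix by V/UV decision
        cond = mel.transpose(1, 2).repeat_interleave(self.hop, dim=-1)
        x = torch.cat([excitation.unsqueeze(1), cond], dim=1)
        return self.filter_net(x).squeeze(1)                        # waveform

voc = SourceFilterVocoder()
mel = torch.randn(1, 50, 80)        # 50 frames of mel features
f0 = torch.full((1, 50), 150.0)     # frame-level F0 in Hz
wav = voc(mel, f0)
print(wav.shape)                    # torch.Size([1, 12800]) = 50 frames * 256 hop
```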

Keywords: speech synthesis, acoustic modeling, style parameter decoupling, contextual style awareness, multi-style vocoder
Language: Chinese
Sub-direction classification (seven major research directions): Intelligent Interaction
State Key Laboratory planned direction classification: Speech and Language Processing
Associated dataset to be deposited:
Document type: Dissertation
Item identifier: http://ir.ia.ac.cn/handle/173211/51923
Collection: Graduates / Doctoral Dissertations
Recommended citation (GB/T 7714):
汪涛. 基于受限样本的语音合成方法研究[D], 2023.
Files in this item:
File name/size: 202018014628087汪涛-se (10568 KB) | Document type: Dissertation | Access: Restricted | License: CC BY-NC-SA