Abstract | Although speech synthesis has reached a certain degree of maturity, synthesized voices still lack variety of style, and the high cost of building a system cannot meet users' growing demand for personalized applications. Personalized speech synthesis aims to provide a multi-style, customized synthesis system and to improve the user's interaction experience in application scenarios such as education, entertainment, navigation, and intelligent hardware; it therefore has important research value. Achieving high-quality personalized speech synthesis is challenging because of limited data and the lack of professional annotation. Speech synthesis technology is undergoing a transformation from cascaded frameworks to end-to-end frameworks; both have advantages and disadvantages, and both play an important role in personalized speech synthesis. Building on deep-learning speech synthesis methods, this dissertation studies personalized speech synthesis within both the cascaded and the end-to-end frameworks. The shortcomings of existing personalized speech synthesis systems fall mainly into three areas: unclear pronunciation, easily distorted timbre, and relatively flat prosody. To improve the overall quality of personalized speech synthesis, this dissertation therefore explores methods from three aspects: acoustic modeling, speaker feature space modeling, and prosody modeling. The main contributions are as follows:
Acoustic modeling is the basis of this dissertation. To improve the intelligibility and robustness of personalized speech synthesis, this dissertation studies acoustic models in the two speech synthesis frameworks. In the cascaded framework, to avoid catastrophic forgetting of the model during adaptive fine-tuning, a progressive neural network based acoustic model is proposed. Knowledge transfer is realized by establishing lateral connections between the speaker-dependent hidden layers, and a step-by-step learning strategy ensures that each task is well optimized. Subjective and objective evaluations show that the proposed method effectively improves the accuracy of acoustic modeling. In the end-to-end framework, to address poor generalization, over-fitting, and the lack of an effective stopping criterion for model optimization, this dissertation proposes a model optimization strategy based on a matching-degree recognition network. Built on an attention-based encoder-decoder, the method deploys a convolutional network to model the matching degree of each text-audio data pair. Three strategies are proposed: automatic offline optimization of the speech database; an adaptive learning rate for online personalized modeling; and a stopping criterion for model optimization based on alignment quality. Experiments show that the matching-degree recognition network is effective, recalling 89.8% of erroneous data pairs, and that the proposed optimization strategy improves the performance of the personalized speech synthesis system while remaining highly scalable.
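The lateral-connection idea behind the progressive network can be illustrated with a minimal numpy sketch: a frozen source-speaker column computes its hidden activation, and the target-speaker column combines its own transformation of the input with a lateral projection of that activation. Layer sizes, weights, and the single-layer structure here are illustrative placeholders, not the dissertation's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Frozen source-speaker column: weights learned on multi-speaker data.
W_src = rng.standard_normal((16, 8)) * 0.1   # input -> source hidden

# Target-speaker column: its own weights plus a lateral connection
# that taps the frozen source hidden layer (knowledge transfer).
W_tgt = rng.standard_normal((16, 8)) * 0.1   # input -> target hidden
U_lat = rng.standard_normal((8, 8)) * 0.1    # source hidden -> target hidden

x = rng.standard_normal(16)                  # one frame of linguistic features
h_src = relu(x @ W_src)                      # source column (frozen, never updated)
h_tgt = relu(x @ W_tgt + h_src @ U_lat)      # target column with lateral input

print(h_tgt.shape)
```

Because only `W_tgt` and `U_lat` would be trained during adaptation while `W_src` stays frozen, the source speaker's knowledge cannot be overwritten, which is what prevents catastrophic forgetting.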
Speaker feature space modeling is the core of this dissertation. To improve the similarity of personalized speech synthesis, this dissertation studies speaker feature modeling in the two frameworks. Speaker features extracted by traditional methods are text independent and do not model the acoustic differences caused by the text; moreover, because the extraction follows speaker recognition methods and is performed independently of synthesis, it is not optimal for the personalized speech synthesis task. In the cascaded framework, this dissertation proposes a multi-level, phoneme-dependent speaker feature modeling method. Features are extracted at both the sentence and the phoneme level to realize text-dependent speaker feature modeling, and an attention mechanism integrates the multi-level features to ensure the acoustic model is well optimized. Subjective and objective evaluations verify the improvement in modeling accuracy, which is about 20% higher than the baseline on objective metrics. In the end-to-end framework, this dissertation proposes speaker feature shift modeling based on a gating network. The method decomposes the text-dependent speaker embedding into a global, text-independent speaker feature vector and a local, text-dependent speaker feature shift vector. Within the gating network, an attention mechanism dynamically adjusts the shift vector to improve the controllability of timbre. Similarity MOS improved by more than 0.28 points across multiple experiments, and acceptable quality is achieved with only 50 sentences of speech data for personalized model training.
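The decomposition into a global speaker vector plus a text-dependent shift can be sketched as follows. This is a hedged illustration of the general mechanism only: the candidate shift vectors, the dot-product attention, and all dimensions are hypothetical stand-ins for the dissertation's gating network.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
s_global = rng.standard_normal(d)        # global, text-independent speaker vector
# Candidate local shift vectors (e.g., one per phoneme class) -- hypothetical.
shifts = rng.standard_normal((5, d)) * 0.1

# Gating via attention: the current text/encoder state selects how much
# of each candidate shift to apply.
h_text = rng.standard_normal(d)          # encoder state for the current phoneme
scores = shifts @ h_text                 # similarity of each shift to the text state
alpha = softmax(scores)                  # attention weights over candidate shifts
s_shift = alpha @ shifts                 # local, text-dependent shift vector

s = s_global + s_shift                   # final text-dependent speaker embedding
print(s.shape)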
Prosody modeling is the highlight of this dissertation, aiming to improve the prosodic naturalness of personalized speech synthesis. Under the end-to-end framework, the research covers two aspects: rhythm stability and controllability, and duration style transfer. For rhythm, the prosodic boundary labels of the training data suffer from low accuracy and poor consistency, which easily leads to unstable rhythm in the synthesized speech. This dissertation fuses prosodic boundary information into the end-to-end synthesis framework: recurrent neural networks are trained as sub-models on the text channel and the audio channel respectively, and word-level silence durations are extracted, which have a clearer physical meaning than traditional acoustic features. Decision-level fusion of the two models improves automatic prosodic boundary labeling and, in turn, the rhythm stability of the synthesized speech, raising the naturalness MOS score by 0.29 points on average. For duration, limited training data causes duration modeling to regress toward the average. This dissertation embeds a feedback-based duration control module into the encoder-decoder structure, which strengthens the modeling and control of decoder state transitions, improves the stability of the synthesis system, and reduces the whole-sentence pronunciation error rate from 29.52% (baseline) to 8.82%. In addition, by adding a duration style embedding vector, duration style transfer is realized, improving the overall prosodic performance of personalized speech synthesis.
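The decision-level fusion of the text-channel and audio-channel boundary predictors can be sketched as a weighted combination of per-word boundary posteriors. The posteriors, the equal fusion weight, and the 0.5 threshold below are all illustrative assumptions, not values from the dissertation.

```python
import numpy as np

# Hypothetical boundary posteriors after each of four words, from a
# text-channel model and an audio-channel model (word-level silence durations).
p_text  = np.array([0.1, 0.8, 0.2, 0.9])   # text-based boundary probability
p_audio = np.array([0.2, 0.7, 0.6, 0.95])  # silence-duration-based probability

# Decision fusion: weighted average of the two channels, then threshold.
w = 0.5                                    # fusion weight (assumed equal here)
p_fused = w * p_text + (1 - w) * p_audio
boundaries = p_fused > 0.5                 # final prosodic boundary decisions
print(boundaries.tolist())  # [False, True, False, True]
```

A word like the third one, where the channels disagree (0.2 vs. 0.6), shows why fusion helps: neither unreliable channel alone decides the label, which is how the method compensates for inconsistent manual boundary annotation.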