Research on Data-Driven Hybrid Speech Synthesis Methods (基于数据驱动的混合语音合成方法研究)

	Research on Data-Driven Hybrid Speech Synthesis Methods (基于数据驱动的混合语音合成方法研究)
Alternative title	Research on Data Driven based Hybrid Unit Selection Speech Synthesis
	刘善峰
	2015-05-26
Degree type	Doctor of Engineering
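Abstract (translated from Chinese)	In recent years, with the development of speech synthesis, unit selection systems based on statistical acoustic modeling have attracted growing attention from researchers. Hidden Markov Model (HMM) based hybrid speech synthesis combines the stability of HMM-based statistical parametric speech synthesis with the high naturalness of traditional concatenative synthesis, and its synthesized speech improves on both. At present, research on statistical-parametric hybrid speech synthesis is still at an early stage, has many shortcomings, and has not yet produced a mature, market-ready system. This thesis takes hybrid speech synthesis as its subject and, aiming to improve prosody, synthesized speech quality, and system speed, investigates pre-selection algorithms, unit selection algorithms, cost calculation methods, and guiding-model modeling methods. The work and results are as follows.
Chapter 1 is the introduction. It reviews the research history of speech synthesis, outlines the development of speech synthesis methods, and states the detailed research objectives.
Chapter 2 first introduces the concepts and workflow of waveform concatenation systems. To improve the prosody of concatenatively synthesized speech, it proposes a waveform concatenation method based on text features. Departing from the traditional framework in which machine learning predicts acoustic parameters to guide unit selection, the method guides selection with the text features obtained from text analysis of the sentence to be synthesized, and uses the M5P algorithm, which combines decision trees with linear regression, to predict the weights of those features. In the pre-selection stage, a hierarchical pre-selection method speeds up the system; in particular, the duration prediction model in the hierarchical pre-selection further ensures the stability of the selected units' durations. Experiments show that the text-feature-based waveform concatenation system greatly improves naturalness.
Chapter 3 focuses on HMM-based hybrid speech synthesis and presents our own system. It first introduces HMM-based hybrid speech synthesis, including HMM acoustic modeling and the system framework, and describes several typical HMM-based hybrid systems in detail. On this basis, it proposes a data-driven hybrid speech synthesis method. Building on the experimental results of the previous chapter, the method uses the text-feature-based multiple linear regression model for unit pre-selection, producing a pre-selection cost. For the target cost, a model is estimated from the true acoustic parameters of each original unit, and the Kullback-Leibler divergence (KLD) between this estimated model and the guiding model is computed and combined with the pre-selection cost to form the final target cost. Experiments show that the method considerably improves speech quality and naturalness over traditional HMM-based hybrid systems. The system is then further optimized into a unit selection system based on KLD and likelihood, whose target cost has three parts: the KLD between models, the text-feature-based pre-selection cost, and the likelihood of the candidate unit under the guiding model. This further improves the naturalness of the synthesized speech.
Chapter 4 starts from the guiding model in statistical-model-based hybrid speech synthesis and proposes a hybrid system based on deep learning. Its modeling accuracy improves on the HMM-GMM models used in traditional hybrid synthesis, and the synthesized speech also improves to some degree when it serves as the guiding model for unit selection.
Chapter 5 summarizes the work and suggests directions for future work.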
English abstract	With the development of speech synthesis, unit selection speech synthesis systems based on statistical parametric models have recently attracted considerable attention from researchers. The Hidden Markov Model (HMM) based hybrid unit selection (HUS) speech synthesis system combines the advantages of statistical parametric speech synthesis and unit selection speech synthesis, and improves the quality of the synthesized speech. HUS systems are still at an early stage and have many shortcomings; so far, no hybrid unit selection system fully meets market demand. To improve the quality of synthesized speech, prosody, and system speed, this thesis studies pre-selection, unit selection, and cost calculation methods. The main work and results are as follows. The first chapter is the introduction. It reviews the research history of speech synthesis, introduces classical speech synthesis methods in detail, and presents the research objectives. The second chapter introduces the concepts and framework of the speech concatenation system. To improve prosody in the speech concatenation system, it puts forward a unit selection system based on context features. Rather than using traditional machine learning methods to predict acoustic parameters, the system uses the context features obtained from text analysis to guide unit selection. The M5P algorithm, which combines linear regression and decision trees, is applied to calculate the target cost. A hierarchical pre-selection method is proposed to improve the speed of the system; in particular, the duration prediction model added to the hierarchical pre-selection ensures the stability of the selected units' durations. Experiments show that the context-feature-based unit selection system greatly improves the naturalness of the synthesized speech.
The third chapter focuses on the HMM-based hybrid unit selection system and presents a novel system. First, HMM-based hybrid unit selection is introduced in detail, including the acoustic modeling and system framework, together with several typical HMM-based hybrid unit selection systems. On this basis, a data-driven hybrid unit selection system is proposed. This approach builds on the results of the previous chapter. Context feature based multiple linear regression ...
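As a rough illustration of the KLD-based target cost that the abstract describes for the third chapter, the sketch below estimates a model from a candidate unit's acoustic frames, measures its Kullback-Leibler divergence from a guiding model, and adds a pre-selection cost. It is not the thesis's implementation. The single diagonal-covariance Gaussian per unit, the direction of the divergence, the weights `w_kld` and `w_pre`, and all function names are assumptions made only for this example.

```python
import math
from statistics import fmean, pvariance


def fit_diag_gaussian(frames, var_floor=1e-6):
    """Estimate a per-dimension mean and variance from a unit's frames.

    `frames` is a list of equal-length feature vectors (hypothetical input
    format). Variances are floored so the divergence stays finite.
    """
    dims = list(zip(*frames))  # transpose: one tuple per feature dimension
    means = [fmean(d) for d in dims]
    variances = [max(pvariance(d), var_floor) for d in dims]
    return means, variances


def kld_diag(p, q):
    """KL(p || q) for two diagonal Gaussians given as (means, variances)."""
    mu_p, var_p = p
    mu_q, var_q = q
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def target_cost(frames, guide, preselect_cost, w_kld=1.0, w_pre=1.0):
    """Weighted sum of the unit-to-guide KLD and a pre-selection cost.

    The weights are illustrative assumptions, not values from the thesis.
    """
    unit_model = fit_diag_gaussian(frames)
    return w_kld * kld_diag(unit_model, guide) + w_pre * preselect_cost
```

A candidate whose estimated model matches the guiding model contributes zero divergence, so its target cost reduces to the weighted pre-selection cost. A real system would work with HMM state sequences rather than one Gaussian per unit.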
Keywords	Speech Synthesis; Hidden Markov Models; Data-driven; Hybrid Unit Selection System; Deep Learning
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6692
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	刘善峰. 基于数据驱动的混合语音合成方法研究[D]. 中国科学院自动化研究所. 中国科学院大学,2015.

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20121801462805 (4154KB)			Not yet open access	CC BY-NC-SA