Research on HMM-based Hybrid Speech Synthesis System

Other title	Research on HMM-based hybrid speech synthesis system
Author	Zhang Ran (张冉)
Date	2014-05-28
Degree type	Doctor of Engineering
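Chinese abstract	In recent years, hybrid speech synthesis systems based on the hidden Markov model (HMM) have attracted growing attention from researchers. A hybrid speech synthesis system combines the strengths of two approaches. On the one hand, it synthesizes speech by concatenating segments of real recorded speech, so the synthetic speech has high sound quality. On the other hand, it uses HMMs trained with statistical parametric methods to guide unit selection, so the overall prosody of the synthetic speech is relatively stable. However, research on this synthesis method is still at an early stage, and many open questions and shortcomings remain. Taking the HMM-based hybrid speech synthesis method as its subject, this dissertation studies it in depth from three aspects: the choice of units for modeling and for building the speech database, the unit selection algorithm, and the dynamic search algorithm. The specific research work and results are as follows.
The units used for building the speech database and for modeling in an HMM-based hybrid speech synthesis system are analyzed in depth. First, the optimal number of states for syllable models is determined; then the accuracy and generalization ability of syllable HMMs and initial/final HMMs are compared, yielding a set of instructive conclusions. On this basis, a Mandarin Chinese hybrid speech synthesis system is proposed that uses initials/finals for modeling and syllables for the speech database. It preserves the accuracy of the guiding models while improving the naturalness of the synthetic speech.
A unit selection algorithm based on similarity evaluation is proposed. During unit selection, a similarity classifier excludes candidate units that are not similar to the target syllable, which makes the selection results more stable. To train the similarity classifier, the target syllables are generated by an HMM-based parametric synthesis system, and the input feature vectors are the likelihoods of the candidate units' parameters under the corresponding guiding models. Experimental results show that the classifier-based selection method removes candidate units that do not match human subjective perception, effectively improving the stability and intelligibility of the synthesized speech.
A hierarchical Viterbi search algorithm is proposed. The method first searches for a locally optimal path within every consecutive voiced speech segment (CVS) region, and then searches for a globally optimal path outside the CVS regions that connects all the sub-paths found in the first pass. The conventional Viterbi algorithm minimizes a global cost function. Although its result is globally optimal, it may select units with large local errors inside CVS regions, and listeners find such units hard to tolerate, so synthesis quality drops sharply. By selecting units for the CVS regions first, the proposed method reduces selection errors and perceptible synthesis errors inside CVS regions, thereby improving the overall naturalness and intelligibility of the speech.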
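The similarity-based pre-selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation. Each candidate unit is described by a hypothetical feature vector of per-frame average log-likelihoods of its spectrum, F0 and duration parameters under the guiding HMM of the target syllable. A plain perceptron stands in for the similarity classifier, because the abstract does not say which classifier type is used. Candidates the classifier rejects are dropped before the search. If every candidate is rejected, the sketch keeps the single highest-scoring one so that the search lattice never has an empty column; this fallback is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    unit_id: str
    # Hypothetical features: per-frame average log-likelihoods of the candidate's
    # spectrum, F0 and duration parameters under the target's guiding HMM.
    loglik: tuple[float, float, float]


def score(w: list[float], b: float, x) -> float:
    """Linear decision value; > 0 means 'similar to the target'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(xs, labels, epochs: int = 1000):
    """Stand-in similarity classifier; labels are +1 (similar) or -1 (dissimilar)."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(xs, labels):
            if y * score(w, b, x) <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training example is now on the correct side
            break
    return w, b


def prefilter(candidates: list[Candidate], w, b) -> list[Candidate]:
    """Drop candidates the classifier rejects; keep at least the best-scoring one."""
    kept = [c for c in candidates if score(w, b, c.loglik) > 0]
    if not kept:
        kept = [max(candidates, key=lambda c: score(w, b, c.loglik))]
    return kept
```

In use, `train_perceptron` would be fitted on candidates labeled by listeners as similar or dissimilar to the target. At synthesis time, `prefilter` is applied to the candidate list of each target syllable, and the survivors form that position's column in the dynamic search.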
English abstract	Recently, HMM-based hybrid speech synthesis systems have grown in popularity and attracted increasing research interest. The hybrid approach has several advantages. On the one hand, concatenating natural speech segments preserves natural variation that is hard to model. On the other hand, the underlying HMM-based prediction ensures the smoothness and consistency of the generated trajectories, which can guide unit selection to match features such as spectrum, pitch and duration. Currently, the main drawback of HMM-based hybrid speech synthesis is that the synthetic voice is not stable enough. This dissertation studies the hybrid speech synthesis system from three aspects, i.e., the choice of basic units for model training and for concatenation, the unit selection method, and the dynamic search algorithm. The detailed research work and achievements are as follows.
The HMM-based hybrid speech synthesis method is thoroughly reviewed. Several basic factors that influence the quality of synthetic speech are studied in depth, including the number of states in the HMM topology, the basic unit for HMM modeling, the size of the training data and the basic unit for concatenation, and a set of useful conclusions is drawn. We then propose a new hybrid Mandarin TTS system that uses initials/finals for model training and syllables for concatenation. With this method, the synthetic speech is more natural and expressive.
A novel unit selection method using a similarity measure is proposed. In the training stage, a group of classifiers is trained on human perceptual judgments. The classifier outputs are then used to make a similar/dissimilar decision about each candidate, instead of relying on a traditional continuous-valued cost. To obtain better classification results, different feature combinations are tried as input vectors, and the similarity rating scheme is carefully designed. Listening tests on a Mandarin female corpus show that the proposed classifier-based speech synthesis system outperforms the traditional unit selection system.
A hierarchical Viterbi algorithm for dynamic search is proposed. It involves two rounds of Viterbi search: one for the sub-paths in the consecutive voiced speech segment (CVS) regions, and the other for the utterance path that connects all the sub-paths. We define a CVS region as a region formed by two or more consecutive voiced phones that contains no silence, or only a very short one (less than 2 frames). Subjec...
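The two-pass hierarchical search can be sketched as follows. This is a minimal sketch under several assumptions. Each lattice column is taken to be a list of hypothetical `(unit_id, target_cost)` pairs, and `join` is a caller-supplied concatenation cost. `cvs_regions` marks maximal runs of at least two voiced units whose inter-unit silence is shorter than 2 frames, following the CVS definition above. The per-unit voicing flags and pause lengths are assumed to come from the front end. Pass 1 runs Viterbi inside each CVS region in isolation and pins those positions to the locally optimal units. Pass 2 reruns Viterbi over the whole utterance, so that only positions outside CVS regions remain free; this is an assumed reading of "connecting all the sub-paths", not necessarily the dissertation's exact procedure.

```python
from typing import Callable, Sequence

Lattice = Sequence[Sequence[tuple[str, float]]]  # column t: (unit_id, target_cost)
JoinCost = Callable[[str, str], float]


def viterbi(cands: Lattice, join: JoinCost) -> list[str]:
    """Conventional Viterbi: minimise total target + join cost over the lattice."""
    # best[t][j] = (accumulated cost, back-pointer) for candidate j at position t
    best = [[(tc, -1) for _, tc in cands[0]]]
    for t in range(1, len(cands)):
        row = []
        for uid, tc in cands[t]:
            cost, back = min(
                (best[t - 1][i][0] + join(pid, uid), i)
                for i, (pid, _) in enumerate(cands[t - 1])
            )
            row.append((cost + tc, back))
        best.append(row)
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for t in range(len(cands) - 1, -1, -1):  # trace back from the cheapest end
        path.append(cands[t][j][0])
        j = best[t][j][1]
    return path[::-1]


def cvs_regions(voiced: Sequence[bool], gaps: Sequence[int], max_gap: int = 2):
    """(start, end) of maximal runs of >= 2 voiced units joined by pauses < max_gap frames.

    gaps[t] is the silence, in frames, between unit t and unit t + 1.
    """
    regions, start, n = [], None, len(voiced)
    for t in range(n):
        if not voiced[t]:
            start = None
            continue
        if start is None:
            start = t
        linked = t + 1 < n and voiced[t + 1] and gaps[t] < max_gap
        if not linked:
            if t > start:  # at least two voiced units in the run
                regions.append((start, t))
            start = None
    return regions


def hierarchical_viterbi(cands: Lattice, join: JoinCost,
                         voiced: Sequence[bool], gaps: Sequence[int]) -> list[str]:
    cols = [list(c) for c in cands]  # copy so the caller's lattice is untouched
    # Pass 1: search each CVS region in isolation and pin its locally optimal units.
    for s, e in cvs_regions(voiced, gaps):
        for t, uid in zip(range(s, e + 1), viterbi(cols[s:e + 1], join)):
            cols[t] = [c for c in cols[t] if c[0] == uid][:1]
    # Pass 2: global search; only positions outside CVS regions are still free.
    return viterbi(cols, join)


if __name__ == "__main__":
    # Toy lattice: positions 0-1 form a CVS region, position 2 is unvoiced.
    J = {("a1", "b1"): 0.0, ("a1", "b2"): 3.0, ("a2", "b1"): 3.0,
         ("a2", "b2"): 1.0, ("b1", "c1"): 10.0, ("b2", "c1"): 0.0}

    def join(p: str, u: str) -> float:
        return J[(p, u)]

    lattice = [[("a1", 0.0), ("a2", 0.0)], [("b1", 0.0), ("b2", 0.0)], [("c1", 0.0)]]
    print(viterbi(lattice, join))  # ['a2', 'b2', 'c1']
    print(hierarchical_viterbi(lattice, join, [True, True, False], [0, 0]))  # ['a1', 'b1', 'c1']
```

In the toy run, plain Viterbi accepts a join cost of 1 inside the CVS region in exchange for a cheap join afterwards. The hierarchical search instead keeps the region's zero-cost join and pays a higher cost outside the region. That is the trade-off the method deliberately makes, since errors inside voiced stretches are the ones listeners notice most.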
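Keywords (Chinese)	Speech synthesis; Hidden Markov model; Hybrid speech synthesis system; Unit selection algorithm; Dynamic search algorithm
Keywords (English)	Speech Synthesis; Hybrid Speech Synthesis; Unit Selection; Hierarchical Viterbi; CVS Region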
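Language	Chinese
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/6628
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Zhang Ran. Research on HMM-based hybrid speech synthesis system[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2014.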

Files in this item
File name/size	Document type	Version	Access	License
CASIA_20111801462807 (1569 KB)			Not yet open access	CC BY-NC-SA