Recently, HMM-based hybrid speech synthesis systems have grown in popularity and attracted increasing interest. The hybrid approach has several advantages. On the one hand, concatenating natural speech segments preserves natural variation that is hard to model; on the other hand, the underlying HMM-based prediction ensures the smoothness and consistency of the generated trajectories, which can guide unit selection to match several features such as spectrum, pitch and duration. Currently, the main drawback of HMM-based hybrid speech synthesis is that the synthetic voice is not stable enough. This dissertation studies the hybrid speech synthesis system from three aspects, i.e., the choice of basic units for model training and concatenation, the unit selection method, and the dynamic search algorithm. The detailed research work and achievements are as follows:

The HMM-based hybrid speech synthesis method is fully reviewed. Several basic factors that influence the quality of synthetic speech are studied in depth, including the number of states in the HMM topology, the basic unit for HMM modeling, the size of the training data and the basic unit for concatenation. A set of useful conclusions is then drawn. We then propose a new hybrid Mandarin TTS system, which uses initials/finals for model training and syllables for concatenation. The synthetic speech is more natural and expressive with this method.

A novel unit selection method using a similarity measure is proposed. In the training stage, a group of classifiers is trained on human perceptual judgments. The outputs of the classifiers, rather than a traditional continuous-valued cost, are used to distinguish between candidate units. To obtain better classification results, different combinations of features are tried as input vectors, and the similarity rating is carried out carefully.
Listening tests on a Mandarin female corpus show that the proposed classifier-based speech synthesis system outperforms the traditional unit-selection system.

A hierarchical Viterbi algorithm for dynamic search is proposed. The method involves two rounds of Viterbi search: one for the sub-paths within the CVS regions, and the other for the utterance path that connects all the sub-paths. A CVS region is defined as a region formed by two or more voiced phones with no silence, or only very short silence (less than 2 frames), within it. Subjec...
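
As an illustration only, the CVS region definition given above (two or more voiced phones, with any silence between them shorter than 2 frames) can be sketched in Python as follows. The `Phone` structure, its field names and the function `find_cvs_regions` are hypothetical and are not taken from the dissertation; the two rounds of Viterbi search themselves are not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: "less than 2 frames" means a gap of 0 or 1 frame is tolerated.
MAX_SILENCE_FRAMES = 2


@dataclass
class Phone:
    """Hypothetical phone record used only for this sketch."""
    label: str
    voiced: bool
    silence_before: int = 0  # frames of silence between the previous phone and this one


def find_cvs_regions(phones: List[Phone]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for every run of two or
    more voiced phones whose internal silences are shorter than MAX_SILENCE_FRAMES."""
    regions: List[Tuple[int, int]] = []
    start = None  # index where the current voiced run began, if any
    for i, phone in enumerate(phones):
        if phone.voiced:
            if start is None:
                start = i
            elif phone.silence_before >= MAX_SILENCE_FRAMES:
                # The silence is too long: close the current run and start a new one.
                if i - start >= 2:
                    regions.append((start, i))
                start = i
        else:
            # An unvoiced phone ends the current run.
            if start is not None and i - start >= 2:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the utterance.
    if start is not None and len(phones) - start >= 2:
        regions.append((start, len(phones)))
    return regions
```

Under these assumptions, the first round of search would operate inside each returned index range, and the second round would connect the resulting sub-paths across the whole utterance.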