In recent years, statistical parametric speech synthesis has grown in popularity and attracted increasing interest because of advantages such as the high stability and smoothness of the synthetic voice, the rapid and automatic construction of systems, and the flexible control of voice characteristics and speaking styles. One representative instance of these techniques is hidden Markov model (HMM)-based speech synthesis. Currently, the main drawback of HMM-based speech synthesis is that the synthetic voice is not natural enough: the speech quality is unsatisfactory and the prosody is flat. This dissertation investigates statistical parametric speech synthesis with high naturalness, using the HMM-based speech synthesis method as the working instance. The research work and its results are as follows.

The HMM-based speech synthesis method is reviewed in full. Several key causes of the loss of naturalness in synthetic speech are analyzed and discussed in depth from three aspects: the accuracy of HMM modeling, speech parameter generation, and vocoder-based synthesis. This analysis guides the later research.

Several basic factors that influence the naturalness of synthetic speech are studied in depth, including the number of states in the HMM topology, the basic unit for HMM modeling, the size of the training data, and the MDL (Minimum Description Length) factor. A set of useful conclusions is then drawn. The evaluation method applied, which combines HMM likelihood, generation error, and subjective evaluation, proved useful for assessing HMM accuracy.

The speech parameter generation method considering global variance (GV) is extended in two ways. One is a global variance model on the frequency-domain delta LSP, proposed for HMM-based speech synthesis. A speech parameter generation algorithm that considers this new global variance model is given in detail.
With this method, the over-smoothing problem of the generated spectral parameters is better alleviated and the naturalness of the synthetic speech is improved. The other is a state duration generation method considering global variance, proposed to alleviate the over-averaging problem of the generated state durations. With this method, the synthetic speech is more natural and expressive.

An HMM-based unit selection (HUS) method is proposed. In this method, a CART (Classification and Regression Tree)-based boundary F0 dependency model is built to model the relationship between the boundary F0 values of adjacent units...