Prosodic structure prediction plays an important role in text-to-speech systems. It is a prerequisite for generating prosodic parameters such as silence, fundamental frequency, and duration, and its accuracy largely determines both the naturalness and the intelligibility of the synthesized voice. This dissertation defines Chinese rhythm as a three-tier hierarchy consisting of the prosodic word, the prosodic phrase, and the intonation phrase. On the basis of detailed statistics and analysis of Chinese rhythm, it compares a variety of statistical machine learning models for predicting Chinese prosodic structure and selects a maximum entropy based framework. The focus of the dissertation is how to make effective use of reliable information to improve the performance of prosodic structure prediction. The main work includes the following: (1) A large-scale rhythm-tagged corpus is constructed. Statistical analysis shows a tight correlation between shallow syntactic information and the low-level rhythm units, whereas deep syntactic information, whether the level of grammatical structure or the phrase type, cannot provide precise information for the high-level rhythm units. (2) The three levels of rhythm units are each modeled statistically. Because prosodic words are generated by merging and splitting lexicon words, the prosodic word prediction model is divided into a merging model and a splitting model. The prosodic phrase and intonation phrase prediction models consider not only grammatical constraints but also the phrase length distribution. A variety of feature selection methods are compared on the basis of the maximum entropy model. Experimental results show that, as long as the features are statistically stable enough, different feature selection methods perform similarly.
Intonation phrase prediction in this dissertation tries to use deep syntactic information, but the effect is not obvious. The dissertation also proposes a variety of length constraint models, analyzes in detail the contribution of length information to intonation phrase prediction, and draws some interesting conclusions: people tend to alternate the interval length between pauses when they speak; rhythm planning is short-term, local planning; and independent modeling of phrase length can effectively inhibit error propagation, so its performance is better than directly addi...