English abstract | As speech synthesis technology develops, users place more demands on text-to-speech (TTS) systems, especially for diversity in the synthetic speech. Concatenative speech synthesis based on a large speech corpus performs well, but building the corpus takes a long time and the approach extends poorly, which limits its use for diversified speech. In recent years, the HMM-based speech synthesis system (HTS) has been proposed. It can be constructed automatically in a short time without human intervention, and its voice characteristics, speaking style, or emotions can be controlled flexibly by transforming the HMM parameters appropriately, so it has high research significance and application value. This thesis therefore studies HMM-based speech synthesis in depth and systematically, covering both framework construction and improvements to key technologies. The main research work can be summarized as follows: 1. Based on the available HMM training method and parameter generation algorithm, the whole technical framework of an HMM-based speech synthesis system is constructed, which includes an automatic training procedure and a synthesis back-end. To meet a user's requirements, a corresponding synthesis system can be constructed quickly under this framework by training on the input speech data. Moreover, based on this framework, we construct a Chinese HMM-based speech synthesis system: when the user inputs arbitrary text, the system outputs the synthesized speech in real time. 2. In the traditional HTS there is an inconsistency: although speech is synthesized from HMMs with explicit state duration probability distributions, the HMMs are trained without them. Researchers at NIT therefore introduced the hidden semi-Markov model (HSMM) and constructed an HSMM-based speech synthesis system. To verify the effect of this method, we re-derive the parameter re-estimation formulae and construct a Chinese HSMM-based speech synthesis system. 3. In the HSMM-based speech synthesis system there is still an inconsistency: although the HSMM has explicit state duration probability distributions, its state transition probabilities are duration-invariant. In addition, in the model training stage, much detailed information, especially the timescale distortion at particular instants of an utterance, is lost through extensive statistical processing. To resolve the problem, we introduce duration-dependent state transit... |