English Abstract

The recent boom in deep learning has greatly advanced acoustic modeling for speech synthesis. Building on deep learning, this work carries out an in-depth investigation of multi-information-fusion-based end-to-end speech synthesis. Compared with concatenative speech synthesis and statistical parametric speech synthesis (SPSS), end-to-end neural text-to-speech (TTS) has become a new trend due to its simpler pipeline, lower cost, and good performance. To further improve the overall quality of synthesized speech, including pronunciation accuracy, naturalness, and sound quality, this thesis studies end-to-end speech synthesis technology in depth. The main contributions are as follows:
1. Based on the end-to-end TTS model Tacotron2, this thesis studies the influence of different modeling units on Mandarin Chinese speech synthesis. An attention-based encoder-decoder architecture serves as the end-to-end model, integrating the prosody prediction model, duration model, and acoustic model into a single network. The end-to-end model simplifies the conventional speech synthesis pipeline and reduces reliance on complex data annotation, since it can implicitly learn the prosodic patterns embedded in the input text. This thesis focuses on the effects of three modeling units: character, pinyin, and phoneme. Experimental results show that both the pinyin-based and phoneme-based models significantly outperform the character-based model, indicating that building a character-based TTS system for Chinese is challenging.
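The three modeling units differ mainly in vocabulary size and sequence length, which is what makes the character-based setting hard. A minimal sketch for one sentence (the pinyin and phoneme mappings below are hand-written illustrative assumptions, not the thesis's actual text front end):

```python
# Toy illustration of the three input modeling units for Mandarin TTS.
# A real system would use a lexicon / grapheme-to-phoneme front end;
# the mappings here are written by hand for a single example sentence.

sentence = "你好世界"  # "Hello, world"

# 1. Character units: one token per Chinese character.
#    Open vocabulary with thousands of types, so data is sparse per type.
chars = list(sentence)

# 2. Pinyin units: one toned syllable per character (~1300 types).
pinyin = ["ni3", "hao3", "shi4", "jie4"]

# 3. Phoneme units: initial/final decomposition with tone (~100 types).
phonemes = ["n", "i3", "h", "ao3", "sh", "i4", "j", "ie4"]

# Smaller unit inventories trade vocabulary size for longer sequences.
print(len(chars), len(pinyin), len(phonemes))  # 4 4 8
```

The phoneme sequence is twice as long as the character sequence here, but each phoneme type is seen far more often in training data, which is one intuition for why the smaller-inventory units are easier to model.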
2. With the goal of improving the character-based Chinese TTS system, an end-to-end TTS model that incorporates pronunciation information is proposed to alleviate data sparsity and the mispronunciation of polyphonic characters. The model employs two novel and simple methods: multi-task learning and dictionary tutoring. The multi-task learning method supplements pinyin-domain knowledge by adding an auxiliary pinyin-prediction task that helps the encoder learn better feature representations. The dictionary tutoring method leverages the rich information in an external dictionary to correct the pronunciation of polyphonic and uncommon Chinese characters. Experimental results show that, compared with the character-based baseline, the proposed methods clearly enhance the naturalness and intelligibility of the synthesized speech, enabling the system to synthesize speech directly from Chinese character sequences.
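The multi-task setup can be summarized as a weighted sum of the TTS reconstruction loss and an auxiliary pinyin-prediction cross-entropy on the encoder states. A minimal numpy sketch (the task weight `lam` and the shape of the pinyin head are illustrative assumptions, not values from the thesis):

```python
import numpy as np

def softmax_cross_entropy(logits, target_idx):
    """Cross-entropy of one auxiliary pinyin-prediction step."""
    z = logits - logits.max()                 # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_idx]

def multi_task_loss(mel_loss, pinyin_logits, pinyin_targets, lam=0.1):
    """Total loss = TTS mel loss + lam * mean auxiliary pinyin loss.

    `pinyin_logits` is (T, num_pinyin_classes): one prediction per
    encoder step; `lam` is an assumed weighting hyperparameter.
    """
    aux = np.mean([softmax_cross_entropy(l, t)
                   for l, t in zip(pinyin_logits, pinyin_targets)])
    return mel_loss + lam * aux

# Toy example: 2 encoder steps, 5 candidate pinyin classes each.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5))
total = multi_task_loss(mel_loss=1.5, pinyin_logits=logits,
                        pinyin_targets=[3, 1], lam=0.1)
print(float(total))
```

Because the auxiliary loss only shapes the encoder representations, the pinyin head can be discarded at synthesis time; the deployed model still takes raw character input.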
3. With the goal of improving the naturalness and prosody of synthesized speech, an end-to-end TTS model that explicitly uses information from pre-trained text embeddings is proposed. The model feeds text embeddings extracted by pre-trained BERT as an additional input to a Tacotron2-based TTS model. These embeddings carry linguistic and semantic information that helps the system produce more natural speech. This thesis compares two approaches to using the pre-trained text information: a feature-based approach and a fine-tuning approach. For the feature-based approach, further experiments compare adding the text information at different places (input-side enhancement versus output-side enhancement). Experimental results show that text embeddings from pre-trained BERT enhance the naturalness and prosody of the synthesized speech, and that the feature-based approach with input-side enhancement works best.
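Input-side enhancement in the feature-based approach can be pictured as concatenating frozen pre-trained text embeddings with the TTS encoder's features along the channel axis before attention. A shape-level numpy sketch (the dimensions and the assumption that BERT tokens are already aligned one-to-one with encoder steps are illustrative, not the thesis's actual configuration):

```python
import numpy as np

def input_side_enhance(encoder_feats, bert_embs):
    """Feature-based use of BERT: concatenate frozen pre-trained text
    embeddings with the TTS encoder features along the channel axis.

    Both inputs are (T, dim) arrays over the same token sequence of
    length T; BERT's weights are not updated in this approach.
    """
    assert encoder_feats.shape[0] == bert_embs.shape[0], \
        "token sequences must be aligned to the same length"
    return np.concatenate([encoder_feats, bert_embs], axis=-1)

T = 12                      # tokens in the input sequence
enc = np.zeros((T, 512))    # assumed Tacotron2 encoder output dim
bert = np.zeros((T, 768))   # BERT-base hidden size
fused = input_side_enhance(enc, bert)
print(fused.shape)          # (12, 1280)
```

In practice BERT's subword tokens must be aligned or pooled to match the TTS encoder's unit sequence; this sketch assumes the two sequences already match, and the fused features then feed the attention-based decoder as usual.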