The rapid rise of deep learning has greatly advanced modeling methods for speech synthesis. Based on deep learning, this dissertation carries out an in-depth investigation of modeling methods for speech synthesis, covering both the conventional multi-stage pipeline and the end-to-end architecture. There is still a clear gap between synthesized speech and recordings, one that human listeners can easily perceive. This gap manifests mainly in three aspects: flat prosody, unsatisfactory speech quality, and a lack of diversity in speaking style. Accordingly, this dissertation focuses on the design and improvement of deep learning based speech synthesis methods from three corresponding aspects: prosody modeling, acoustic modeling, and speech style diversity modeling. The main contributions are as follows:
With the goal of improving prosodic expressiveness, prosody modeling methods are studied in detail, covering both prosodic boundaries and phone duration. For prosodic boundary modeling, a novel architecture that combines character-enhanced embedding features with model fusion is proposed, effectively improving the accuracy of prosodic boundary prediction. Building on this, a language-independent end-to-end prosodic boundary prediction method is further proposed. This method unifies feature extraction and modeling in a single framework and does not rely on any language-dependent linguistic features; it not only greatly simplifies the modeling process but also further improves prediction accuracy, thereby enhancing the prosodic expressiveness of synthesized speech. For duration modeling, a discrete duration modeling method based on multi-task learning is proposed, improving the feature representation, loss function, duration decoding, and modeling method to alleviate the ``over-averaging'' effect in generated phone durations; a sketch of the multi-task objective is given below. Both objective and subjective evaluation results show that the proposed method improves the phone duration distribution and thus significantly improves the prosodic expressiveness of synthesized speech.
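To make the duration-modeling idea concrete, the following is a minimal sketch assuming a PyTorch implementation: phone durations are quantized into discrete bins and predicted by a classification head, trained jointly with an auxiliary regression head. The layer sizes, bin count, and loss weight `alpha` are illustrative assumptions, not the dissertation's actual configuration.

```python
# A minimal PyTorch sketch of multi-task discrete duration modeling.
# All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskDurationModel(nn.Module):
    def __init__(self, in_dim=256, hidden=512, n_bins=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Classification head over discrete duration bins.
        self.cls_head = nn.Linear(hidden, n_bins)
        # Auxiliary regression head predicting continuous duration.
        self.reg_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.encoder(x)
        return self.cls_head(h), self.reg_head(h).squeeze(-1)

def multitask_loss(logits, dur_pred, bin_target, dur_target, alpha=0.5):
    # Cross-entropy on discrete bins discourages collapse toward the
    # mean duration; MSE on raw durations keeps predictions calibrated.
    ce = nn.functional.cross_entropy(logits, bin_target)
    mse = nn.functional.mse_loss(dur_pred, dur_target)
    return ce + alpha * mse
```

Treating duration as classification over bins lets the model place probability mass away from the average, which is one plausible way the ``over-averaging'' effect could be mitigated.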
With the goal of improving speech quality and alleviating the ``exposure bias'' caused by autoregressive generation, two end-to-end acoustic modeling methods based on forward and backward decoding regularization are proposed. Both methods learn to predict future information by improving the agreement between forward and backward decoding sequences. The first operates at the model level and reduces the mismatch between two directional models, L2R (which generates targets from left to right) and R2L (which generates targets from right to left). The second operates at the decoder level and reduces the mismatch between the forward and backward decoders within a single model; a sketch of this agreement regularizer is given below. Experimental results show that both proposed methods significantly improve robustness and overall speech quality over the baseline (a revised version of Tacotron 2). The second method achieves the best performance, with a MOS of 4.55 on a general test set (close to the 4.65 of natural recordings) and a clear preference advantage on a challenging test set.
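The following is a minimal sketch of the decoder-level agreement regularizer, assuming PyTorch and mel-spectrogram tensors of shape (batch, time, n_mels). The use of L1 distance, the stop-gradient on the opposite branch, and the weight `lambda_agree` are illustrative assumptions rather than the dissertation's exact formulation.

```python
# A minimal sketch of bidirectional decoding regularization.
import torch
import torch.nn.functional as F

def bidirectional_agreement_loss(mel_fwd, mel_bwd, mel_target,
                                 lambda_agree=1.0):
    """Reconstruction losses for both decoders plus an agreement term
    that pulls each decoder toward the other's (time-aligned) output."""
    # Time-reverse the backward decoder's output so both sequences align.
    mel_bwd_rev = torch.flip(mel_bwd, dims=[1])
    loss_fwd = F.l1_loss(mel_fwd, mel_target)
    loss_bwd = F.l1_loss(mel_bwd_rev, mel_target)
    # Agreement: each branch is regularized toward a detached copy of
    # the other, encouraging the forward decoder to anticipate future
    # frames that the backward decoder has already seen.
    agree = F.l1_loss(mel_fwd, mel_bwd_rev.detach()) \
          + F.l1_loss(mel_bwd_rev, mel_fwd.detach())
    return loss_fwd + loss_bwd + lambda_agree * agree
```

Because the backward decoder conditions on future context, penalizing disagreement gives the forward decoder an indirect training signal about frames it has not yet generated, which is the intuition behind alleviating exposure bias here.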
With the goal of improving speech style diversity modeling, the generation of interrogative, exclamatory, and declarative speech is studied on both the conventional multi-stage pipeline and the end-to-end architecture. On the conventional pipeline, to overcome the shortcomings of the HMM-based multi-style speech synthesis pipeline, a deep neural network based multi-style speech synthesis method is proposed. By sharing hidden layers across styles, the network learns the knowledge common to all styles, thereby assisting the modeling of each individual style; a sketch of this shared-hidden-layer architecture is given below. Experimental results show that the proposed method effectively improves multi-style modeling. On the end-to-end architecture, an end-to-end multi-style speech synthesis method is proposed, which not only greatly simplifies the traditional multi-style pipeline but also explicitly models style or emotion categories through unsupervised learning. Experimental results show that the proposed method greatly improves multi-style modeling, achieving a MOS of 4.13 and 4.21 for exclamatory and interrogative speech, respectively.
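The following is a minimal sketch of the shared-hidden-layer idea, assuming a PyTorch acoustic model; the layer sizes and the choice of per-style output layers are illustrative assumptions, not the dissertation's actual topology.

```python
# A minimal sketch of a shared-hidden-layer multi-style acoustic model.
import torch
import torch.nn as nn

class SharedHiddenMultiStyleDNN(nn.Module):
    def __init__(self, in_dim=400, hidden=1024, out_dim=187, n_styles=3):
        super().__init__()
        # Hidden layers are shared across declarative, interrogative,
        # and exclamatory data, so every style's examples update them.
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # One output layer per style captures style-specific acoustics.
        self.style_heads = nn.ModuleList(
            nn.Linear(hidden, out_dim) for _ in range(n_styles)
        )

    def forward(self, linguistic_feats, style_id):
        h = self.shared(linguistic_feats)
        return self.style_heads[style_id](h)
```

Pooling all styles' data in the shared layers is what lets scarce styles (e.g. exclamatory speech) borrow statistical strength from the abundant declarative data, while the per-style heads preserve each style's distinct acoustics.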