Abstract | Although speech synthesis has reached a certain degree of maturity, synthesized voices still lack variety of style, and the high cost of building a system cannot meet users' growing demand for personalized applications. Personalized speech synthesis aims to provide a multi-style, customized synthesis system and to improve the user's interaction experience in application scenarios such as education, entertainment, navigation, and intelligent hardware; it therefore has important research value. Achieving high-quality personalized speech synthesis is challenging because of limited data and the lack of professional annotation. Speech synthesis technology is undergoing a transformation from cascaded frameworks to end-to-end frameworks; both have advantages and disadvantages, and both play an important role in personalized speech synthesis. Building on deep-learning speech synthesis methods, this dissertation studies personalized speech synthesis within both the cascaded and the end-to-end frameworks. The shortcomings of existing personalized speech synthesis systems fall mainly into three areas: unclear pronunciation, easily distorted timbre, and relatively flat prosody. To improve the overall quality of personalized speech synthesis, this dissertation therefore explores methods from three aspects: acoustic modeling, speaker feature space modeling, and prosody modeling. The main contributions are as follows:
Acoustic modeling is the basis of this dissertation. To improve the intelligibility and robustness of personalized speech synthesis, this dissertation studies acoustic models in the two speech synthesis frameworks. In the cascaded framework, to avoid catastrophic forgetting of the model during adaptive fine-tuning, a progressive neural network based acoustic model is proposed. Knowledge transfer is realized by establishing lateral connections between the speaker-dependent hidden layers, and a step-by-step learning strategy ensures that each task is well optimized. Subjective and objective evaluations show that the proposed method effectively improves the accuracy of acoustic modeling. In the end-to-end framework, to address poor generalization, over-fitting, and the lack of an effective stopping criterion for model optimization, this dissertation proposes a model optimization strategy based on a matching-degree recognition network. Built on an attention-based encoder-decoder, the method deploys a convolutional network to model the matching degree of each text-audio data pair. Three strategies are proposed: automatic offline optimization of the speech database; an adaptive learning rate for online personalized modeling; and a stopping criterion for model optimization based on alignment quality. Experiments show that the matching-degree recognition network is effective, recalling 89.8% of erroneous data pairs, and that the proposed optimization strategy improves the performance of the personalized speech synthesis system while remaining highly scalable.
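The lateral-connection idea behind the progressive network can be illustrated with a minimal numpy sketch: a frozen source-speaker column computes its hidden activation, and the target-speaker column combines its own transformation of the input with a lateral projection of that activation. Layer sizes, weights, and the single-layer structure here are illustrative placeholders, not the dissertation's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Frozen source-speaker column: weights learned on multi-speaker data.
W_src = rng.standard_normal((16, 8)) * 0.1   # input -> source hidden

# Target-speaker column: its own weights plus a lateral connection
# that taps the frozen source hidden layer (knowledge transfer).
W_tgt = rng.standard_normal((16, 8)) * 0.1   # input -> target hidden
U_lat = rng.standard_normal((8, 8)) * 0.1    # source hidden -> target hidden

x = rng.standard_normal(16)                  # one frame of linguistic features
h_src = relu(x @ W_src)                      # source column (frozen, never updated)
h_tgt = relu(x @ W_tgt + h_src @ U_lat)      # target column with lateral input

print(h_tgt.shape)
```

Because only `W_tgt` and `U_lat` would be trained during adaptation while `W_src` stays frozen, the source speaker's knowledge cannot be overwritten, which is what prevents catastrophic forgetting.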
Speaker feature space modeling is the core of this dissertation. To improve the similarity of personalized speech synthesis, this dissertation studies speaker feature modeling in the two frameworks. Speaker features extracted by traditional methods are text independent and do not model the acoustic differences caused by the text; moreover, because the extraction follows speaker recognition methods and is performed independently of synthesis, it is not optimal for the personalized speech synthesis task. In the cascaded framework, this dissertation proposes a multi-level, phoneme-dependent speaker feature modeling method. Features are extracted at both the sentence and the phoneme level to realize text-dependent speaker feature modeling, and an attention mechanism integrates the multi-level features to ensure the acoustic model is well optimized. Subjective and objective evaluations verify the improvement in modeling accuracy, which is about 20% higher than the baseline on objective metrics. In the end-to-end framework, this dissertation proposes speaker feature shift modeling based on a gating network. The method decomposes the text-dependent speaker embedding into a global, text-independent speaker feature vector and a local, text-dependent speaker feature shift vector. Within the gating network, an attention mechanism dynamically adjusts the shift vector to improve the controllability of timbre. Similarity MOS improved by more than 0.28 points across multiple experiments, and acceptable quality is achieved with only 50 sentences of speech data for personalized model training.
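The decomposition into a global speaker vector plus a text-dependent shift can be sketched as follows. This is a hedged illustration of the general mechanism only: the candidate shift vectors, the dot-product attention, and all dimensions are hypothetical stand-ins for the dissertation's gating network.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
s_global = rng.standard_normal(d)        # global, text-independent speaker vector
# Candidate local shift vectors (e.g., one per phoneme class) -- hypothetical.
shifts = rng.standard_normal((5, d)) * 0.1

# Gating via attention: the current text/encoder state selects how much
# of each candidate shift to apply.
h_text = rng.standard_normal(d)          # encoder state for the current phoneme
scores = shifts @ h_text                 # similarity of each shift to the text state
alpha = softmax(scores)                  # attention weights over candidate shifts
s_shift = alpha @ shifts                 # local, text-dependent shift vector

s = s_global + s_shift                   # final text-dependent speaker embedding
print(s.shape)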
Prosody modeling is the highlight of this dissertation, aiming to improve the prosodic naturalness of personalized speech synthesis. Under the end-to-end framework, the research covers two aspects: rhythm stability and controllability, and duration style transfer. For rhythm, the prosodic boundary labels of the training data suffer from low accuracy and poor consistency, which easily leads to unstable rhythm in the synthesized speech. This dissertation fuses prosodic boundary information into the end-to-end synthesis framework: recurrent neural networks are trained as sub-models on the text channel and the audio channel respectively, and word-level silence durations are extracted, which have a clearer physical meaning than traditional acoustic features. Decision-level fusion of the two models improves automatic prosodic boundary labeling and, in turn, the rhythm stability of the synthesized speech, raising the naturalness MOS score by 0.29 points on average. For duration, limited training data causes duration modeling to regress toward the average. This dissertation embeds a feedback-based duration control module into the encoder-decoder structure, which strengthens the modeling and control of decoder state transitions, improves the stability of the synthesis system, and reduces the whole-sentence pronunciation error rate from 29.52% (baseline) to 8.82%. In addition, by adding a duration style embedding vector, duration style transfer is realized, improving the overall prosodic performance of personalized speech synthesis.
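The decision-level fusion of the text-channel and audio-channel boundary predictors can be sketched as a weighted combination of per-word boundary posteriors. The posteriors, the equal fusion weight, and the 0.5 threshold below are all illustrative assumptions, not values from the dissertation.

```python
import numpy as np

# Hypothetical boundary posteriors after each of four words, from a
# text-channel model and an audio-channel model (word-level silence durations).
p_text  = np.array([0.1, 0.8, 0.2, 0.9])   # text-based boundary probability
p_audio = np.array([0.2, 0.7, 0.6, 0.95])  # silence-duration-based probability

# Decision fusion: weighted average of the two channels, then threshold.
w = 0.5                                    # fusion weight (assumed equal here)
p_fused = w * p_text + (1 - w) * p_audio
boundaries = p_fused > 0.5                 # final prosodic boundary decisions
print(boundaries.tolist())  # [False, True, False, True]
```

A word like the third one, where the channels disagree (0.2 vs. 0.6), shows why fusion helps: neither unreliable channel alone decides the label, which is how the method compensates for inconsistent manual boundary annotation.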