Research on Speech Synthesis Methods Based on Deep Learning (基于深度学习的语音合成方法研究)
Author: 郑艺斌
Subtype: Doctoral Dissertation
Thesis Advisor: 陶建华
Date: 2019-05-24
Degree Grantor: University of Chinese Academy of Sciences
Place of Conferral: Institute of Automation, Chinese Academy of Sciences
Degree Name: Doctor of Engineering
Degree Discipline: Pattern Recognition and Intelligent Systems
Keywords: Speech Synthesis; Deep Learning; Prosody Modeling; End-to-End Acoustic Modeling; Multi-Style Modeling
Abstract

The booming development of deep learning has greatly advanced speech synthesis technology. Taking deep learning as its theoretical foundation, this dissertation conducts an in-depth investigation of deep learning based speech synthesis methods, covering both the conventional pipeline architecture and the end-to-end architecture. The remaining, clearly audible gap between synthesized speech and real recordings lies mainly in three aspects: overly flat prosody, insufficient speech quality, and a lack of diversity in speaking style. To improve the overall quality of synthesized speech, this dissertation improves speech synthesis methods from the three corresponding aspects of prosody modeling, acoustic modeling, and multi-style modeling. The main contributions are as follows:
    In prosody modeling, with the goal of improving the prosodic expressiveness of synthesized speech, both prosodic boundary modeling and duration modeling are studied in depth. For prosodic boundary modeling, improvements are made at both the feature-representation level and the model level: a prosodic boundary prediction method based on character-enhanced word embeddings and model fusion is proposed, which effectively improves prediction accuracy. On this basis, a language-independent end-to-end prosodic boundary prediction method is further proposed; it unifies feature extraction and modeling in a single framework and does not depend on language-dependent linguistic features. The proposed method not only greatly simplifies the modeling pipeline but also further improves prediction accuracy, thereby improving the rhythm and expressiveness of synthesized speech. For duration modeling, to address the "over-averaging" of durations predicted by existing models, improvements are made at the feature-representation, loss-function, duration-decoding, and modeling levels, yielding a discrete duration modeling method based on multi-task learning. Objective and subjective evaluations show that the proposed method improves the duration distribution of synthesized speech and clearly improves its prosodic expressiveness.
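
The abstract gives no code; as a minimal, hypothetical PyTorch sketch of the character-enhanced embedding idea, the model below concatenates each word's embedding with a pooled embedding of its characters before a BiLSTM tagger predicts per-word prosodic boundary labels. All vocabulary sizes, dimensions, the four-label boundary set, and the name BoundaryTagger are invented for illustration; the model-fusion component of the proposed method is omitted here.

```python
# Hypothetical sketch of a character-enhanced prosodic boundary tagger.
# All sizes and labels are illustrative, not the dissertation's configuration.
import torch
import torch.nn as nn

class BoundaryTagger(nn.Module):
    def __init__(self, n_words, n_chars, n_labels,
                 word_dim=128, char_dim=32, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Character-enhanced word representation: word vector concatenated
        # with the mean of the word's character vectors.
        self.rnn = nn.LSTM(word_dim + char_dim, hidden,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)  # e.g. {none, PW, PPH, IPH}

    def forward(self, words, chars):
        # words: (batch, seq_len); chars: (batch, seq_len, max_word_len)
        w = self.word_emb(words)
        c = self.char_emb(chars).mean(dim=2)   # pool characters per word
        h, _ = self.rnn(torch.cat([w, c], dim=-1))
        return self.out(h)                     # per-word boundary logits

tagger = BoundaryTagger(n_words=5000, n_chars=4000, n_labels=4)
logits = tagger(torch.randint(0, 5000, (2, 10)),
                torch.randint(0, 4000, (2, 10, 4)))
print(logits.shape)  # torch.Size([2, 10, 4])
```
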
    In acoustic modeling, with the goals of improving the quality of synthesized speech and alleviating the exposure bias introduced by the autoregressive decoding of end-to-end acoustic models, two end-to-end speech synthesis methods based on forward-backward decoding regularization are proposed. Both learn to anticipate future information by encouraging the forward and backward decoded sequences to agree. The first method operates at the model level: through model-level regularization, the decoded sequences of an L2R model (decoding left to right) and an R2L model (decoding right to left) are pushed toward agreement. The second operates at the decoder level: through decoder-level regularization, the outputs of a forward decoder and a backward decoder are pushed toward agreement. In both in-domain and out-of-domain tests, the speech quality of both proposed methods surpasses that of the strong Tacotron2 baseline. The decoder-level regularization method performs best, achieving a MOS of 4.55 in the in-domain subjective evaluation, very close to the recording score of 4.65, and also obtaining the higher preference score in the out-of-domain preference test.
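
As a schematic of the decoder-level forward-backward regularization idea (a sketch under assumed shapes, not the dissertation's actual implementation), the loss below trains a forward decoder and a backward decoder against the same mel-spectrogram target and adds a term penalizing disagreement between the forward output and the time-reversed backward output. The weight lambda_agree and all tensor shapes are illustrative assumptions.

```python
# Schematic of decoder-level forward/backward regularization for an
# autoregressive TTS decoder; shapes and `lambda_agree` are assumptions.
import torch
import torch.nn.functional as F

def fb_regularized_loss(fwd_out, bwd_out, target, lambda_agree=0.5):
    # fwd_out, bwd_out, target: (batch, frames, n_mels). The backward
    # decoder generated frames right-to-left, so flip its output back
    # onto the forward time axis before comparing.
    bwd_aligned = torch.flip(bwd_out, dims=[1])
    loss_fwd = F.l1_loss(fwd_out, target)          # forward reconstruction
    loss_bwd = F.l1_loss(bwd_aligned, target)      # backward reconstruction
    loss_agree = F.l1_loss(fwd_out, bwd_aligned)   # make the decoders agree
    return loss_fwd + loss_bwd + lambda_agree * loss_agree

B, T, M = 2, 100, 80
target = torch.randn(B, T, M)
fwd, bwd = torch.randn(B, T, M), torch.randn(B, T, M)
print(fb_regularized_loss(fwd, bwd, target).item())
```

Presumably only the forward decoder would be kept at inference time, so the agreement term provides training-time guidance about future frames without adding runtime cost.
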
    In multi-style modeling, with the goal of improving the modeling accuracy of multi-style speech synthesis, the synthesis of styles common in spoken Chinese, such as interrogative, exclamatory, and declarative sentences, is studied under both the pipeline framework and the end-to-end framework. Under the pipeline framework, to alleviate the shortcomings of HMM-based multi-style pipeline synthesis, this dissertation proposes, for the first time, a deep learning based multi-style pipeline synthesis method. By sharing hidden layers across speech of different styles, the model learns global knowledge common to all styles, which in turn assists the training of each style-specific model; experiments verify that this effectively improves multi-style modeling. Under the end-to-end framework, an end-to-end multi-style speech synthesis method is further proposed; it not only greatly reduces the modeling complexity of the traditional multi-style pipeline, but also explicitly models style or emotion types through unsupervised learning. Experiments show that it greatly improves multi-style modeling, with the generated exclamatory and interrogative sentences reaching MOS scores of 4.13 and 4.21, respectively, in subjective evaluation.
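
To make the shared-hidden-layer idea concrete, here is a hypothetical PyTorch sketch of a pipeline-style acoustic model with a trunk shared across styles and one output head per style. The style set, layer sizes, and feature dimensions are invented for illustration and are not the dissertation's actual configuration.

```python
# Hypothetical multi-style acoustic model: hidden layers shared across
# styles capture style-independent knowledge, while one output layer per
# style captures style-specific behavior. Sizes are illustrative only.
import torch
import torch.nn as nn

STYLES = ["declarative", "interrogative", "exclamatory"]

class SharedTrunkModel(nn.Module):
    def __init__(self, in_dim=300, hidden=256, out_dim=180):
        super().__init__()
        self.trunk = nn.Sequential(           # shared across all styles
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleDict({          # style-specific output layers
            s: nn.Linear(hidden, out_dim) for s in STYLES
        })

    def forward(self, linguistic_feats, style):
        return self.heads[style](self.trunk(linguistic_feats))

model = SharedTrunkModel()
x = torch.randn(4, 300)                       # frame-level linguistic features
y = model(x, style="interrogative")           # style-specific acoustic params
print(y.shape)  # torch.Size([4, 180])
```
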

Pages: 144
Language: Chinese
Document Type: Doctoral Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/23884
Collection: National Laboratory of Pattern Recognition / Speech Interaction
Recommended Citation (GB/T 7714):
郑艺斌. 基于深度学习的语音合成方法研究[D]. 中国科学院自动化研究所, 中国科学院大学, 2019.
Files in This Item:
Thesis.pdf (8630 KB): Dissertation, Open Access, CC BY-NC-SA license

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.