CASIA OpenIR > Graduates > Doctoral Dissertations
Research on Acoustic Modeling Techniques for Speech Synthesis (语音合成声学建模技术研究)
Wang Wenfu (王文富)
Degree type: Doctor of Engineering
Advisor: Xu Bo (徐波)
Date: 2018-05
Degree-granting institution: Graduate University of Chinese Academy of Sciences
Place: Beijing
Keywords: speech synthesis; acoustic modeling; gated recurrent mixture density network; convolutional output layer; adversarial learning; end-to-end
Abstract: The vigorous development of deep learning has greatly driven innovation in acoustic modeling for speech synthesis. Taking deep learning as its theoretical foundation, this thesis conducts an in-depth study and exploration of acoustic modeling techniques for speech synthesis. Speech synthesis is undergoing a transition from the pipeline framework to the end-to-end framework, and in both pipeline and end-to-end synthesis the acoustic model plays a pivotal role. This thesis focuses on model design and improvement under both system frameworks, and also explores more effective training strategies for acoustic models. The research proceeds along three lines: improving and simplifying acoustic models, improving and optimizing training methods, and improving the system framework. The main contributions are as follows:
1. Within the pipeline speech synthesis framework, aiming to improve acoustic modeling accuracy, a deep mixture density network that integrates a gated recurrent network with a mixture density model is proposed, termed the Gated Recurrent Mixture Density Network (GRMDN). GRMDN combines the ability of gated recurrent networks to model long-term dependencies with the ability of mixture density models to fully describe the conditional probability density of the target data, making it a general-purpose conditional sequence generator. GRMDN is therefore well suited to sequence generation tasks such as the acoustic modeling task studied here. On the one hand, GRMDN exploits the long-term modeling capacity of gated recurrent structures to capture long-range dependencies in the linguistic input; on the other hand, it uses the mixture density model to fully capture the multimodal nature of acoustic features and generate acoustic features with rich variability. Compared with single-model baselines, the GRMDN-based acoustic model synthesizes speech with higher naturalness and richer variability.
2. Within the pipeline speech synthesis framework, aiming to improve acoustic modeling accuracy and alleviate the over-smoothing effect in parameter generation, a high-performance acoustic architecture combining a unidirectional long short-term memory network (ULSTM) and a convolutional output layer (COL) is proposed, abbreviated ULSTM-COL. The convolutional output layer is implemented with asymmetric up-sampling convolution. Its "high performance" is reflected in three aspects: 1) Strong modeling capacity: the unidirectional LSTM and the asymmetric convolutional output layer are complementary, and their combined modeling capacity clearly exceeds that of comparably configured unidirectional- and bidirectional-LSTM acoustic models. 2) Alleviation of over-smoothing in speech parameter generation: dynamic (delta) features are no longer needed, because the up-sampling convolutional output layer itself acts as an effective smoother of speech parameter trajectories, so the maximum likelihood parameter generation (MLPG) smoothing algorithm is no longer required. 3) Low latency: avoiding MLPG simplifies the synthesis procedure, and since both the unidirectional LSTM and the convolutional output layer are unidirectional structures, ULSTM-COL can be readily deployed in low-latency real-time synthesis systems. Experiments show that ULSTM-COL significantly improves acoustic model performance and synthesizes more natural speech.
3. Within the pipeline speech synthesis framework, unsupervised generative adversarial networks (GANs) are proposed to further alleviate the over-smoothing problem, investigated from two directions: adversarial post-filtering of speech parameters and adversarial acoustic modeling. GANs require no assumptions about the conditional distribution of speech parameters; through unsupervised adversarial training they drive the model to generate speech parameter trajectories closer to the natural distribution, thereby improving perceived naturalness. Both subjective and objective evaluations confirm the effectiveness of adversarial learning, with better synthesis quality than acoustic models trained with supervised criteria.
4. Targeting end-to-end speech synthesis, an end-to-end method for Mandarin Chinese is proposed. It implements the end-to-end system with an attention-based encoder-decoder framework that integrates the prosody prediction model, duration model, and acoustic model of the pipeline system, implicitly learning the prosodic patterns in the input sequence; this both simplifies the existing pipeline framework and reduces the dependence on data annotation. Concretely, the proposed end-to-end model takes a Mandarin tonal pinyin sequence directly as input, generates the corresponding sequence of short-time Fourier transform magnitude spectra, and finally synthesizes speech with the Griffin-Lim algorithm. The proposed end-to-end method achieves a mean opinion score (MOS) of 3.81 in subjective evaluation, exceeding the naturalness of our best internal pipeline system. Building on this, multi-speaker end-to-end synthesis and speaker adaptation are further studied; the proposed method can reproduce the timbre and speaking style of each in-set speaker, and can synthesize speech of acceptable quality for an out-of-set speaker from only a small amount of data, offering a fast and simple way to build speech synthesis systems.
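The content-based attention at the core of the encoder-decoder model in point 4 can be sketched in a few lines. This is a generic dot-product attention step in NumPy with illustrative shapes and names, not the specific attention variant used in the thesis:

```python
import numpy as np

def attention_context(encoder_states, decoder_state):
    """One decoder step of dot-product attention.

    encoder_states: (T, d) encoder outputs, one row per input symbol.
    decoder_state:  (d,)   current decoder state, used as the query.
    Returns the context vector (d,) and the alignment weights (T,).
    """
    scores = encoder_states @ decoder_state            # (T,) similarities
    scores = scores - scores.max()                     # stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()    # sums to 1
    context = weights @ encoder_states                 # weighted sum, (d,)
    return context, weights
```

At each output frame the decoder re-computes the alignment over the whole pinyin sequence, which is what lets the model learn duration and prosody implicitly instead of relying on an explicit duration model.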
Abstract (English): The recent boom in deep learning has greatly advanced acoustic modeling for speech synthesis. Based on deep learning, this work carries out an in-depth investigation and exploration of acoustic modeling techniques for speech synthesis. At present, speech synthesis is undergoing a transition from the conventional complex pipeline to end-to-end modeling in terms of system architecture, but regardless of the system, the acoustic model plays a dominant role. This work focuses on the design and improvement of acoustic models under both system architectures, and explores more powerful training strategies to improve modeling accuracy. The investigation proceeds along the route of improving and simplifying acoustic models, optimizing training methods, and improving system architectures. The main contributions are as follows:
 
1. With the goal of improving acoustic modeling accuracy, the gated recurrent mixture density network (GRMDN), which integrates a gated recurrent network and a mixture density model, is proposed and employed as the acoustic model. GRMDN combines the capability of gated recurrent networks to model long-term dependencies with the advantage of mixture density models in modeling the conditional density of the target data. Hence, GRMDN is well suited to sequence generation tasks, e.g., acoustic modeling for speech synthesis. On the one hand, GRMDN captures long-term dependencies across the linguistic input through its gated recurrent architecture; on the other hand, the multimodal acoustic output features can be completely described by incorporating a mixture density model. Compared with baselines based on each single model, the proposed GRMDN-based acoustic model synthesizes speech with higher naturalness and richer variability.
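As a rough illustration of the mixture-density half of GRMDN, the sketch below (NumPy, illustrative names, a single feature dimension) converts a raw network output vector into valid Gaussian-mixture parameters and draws a sample; the thesis's actual multivariate formulation is not reproduced here:

```python
import numpy as np

def mdn_split_params(raw, n_mix):
    """Split a raw output vector into Gaussian-mixture parameters.

    raw: (3 * n_mix,) unconstrained outputs of the network's last layer.
    Returns (weights, means, stds); the softmax/exp mappings enforce the
    constraints weights.sum() == 1 and stds > 0 by construction.
    """
    logits, means, log_stds = np.split(raw, [n_mix, 2 * n_mix])
    logits = logits - logits.max()                   # numerically stable softmax
    weights = np.exp(logits) / np.exp(logits).sum()
    return weights, means, np.exp(log_stds)

def mdn_sample(weights, means, stds, rng):
    """Sample the mixture: choose a component, then sample its Gaussian."""
    k = rng.choice(len(weights), p=weights)
    return rng.normal(means[k], stds[k])
```

Sampling from the mixture, rather than taking a single conditional mean, is what lets the model preserve the multimodal variability of the acoustic features.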
 
2. With the goal of improving acoustic modeling accuracy and alleviating the over-smoothing effect of parameter generation, a novel model architecture combining a unidirectional long short-term memory network (ULSTM) and a convolutional output layer (COL) is proposed, named ULSTM-COL for short, where the COL is an up-sampling convolutional layer. The proposed model achieves high-performance speech synthesis; specifically, its advantages are threefold. First, it significantly improves acoustic modeling accuracy over both ULSTM and bidirectional LSTM, and also achieves the best perceived naturalness. Second, the unique operation of the convolutional output layer makes it serve as a fine smoother of parameter trajectories across consecutive frames of acoustic parameters; hence, dynamic feature constraints and the maximum likelihood parameter generation (MLPG) algorithm used to produce smooth trajectories are no longer required. Third, the unidirectional nature of the proposed architecture, with negligible latency, allows low-latency synthesis in real-time applications. Experiments demonstrate that the ULSTM-COL model synthesizes speech with higher naturalness.
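The up-sampling convolutional output layer of point 2 can be illustrated with a naive 1-D transposed convolution. This is a minimal NumPy sketch under assumed shapes (a single feature track and a single filter); the thesis's asymmetric multi-channel layer is more elaborate, and all names here are illustrative:

```python
import numpy as np

def upsample_conv1d(x, kernel, stride):
    """Naive 1-D transposed ('up-sampling') convolution.

    x:      (T,) input sequence, e.g. one acoustic feature track.
    kernel: (K,) filter taps.
    stride: up-sampling factor; output length is (T - 1) * stride + K.
    Each input sample spreads its value over K output positions, so
    neighbouring frames overlap and the output trajectory comes out
    smoothed without any MLPG post-processing.
    """
    T, K = len(x), len(kernel)
    out = np.zeros((T - 1) * stride + K)
    for t in range(T):
        out[t * stride : t * stride + K] += x[t] * kernel
    return out
```

Because the kernel only looks at past and current frames, the layer stays unidirectional, which is what enables the low-latency deployment claimed above.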
 
3. In the pipeline speech synthesis system, the unsupervised generative adversarial network (GAN) is proposed to alleviate the over-smoothing effect. Adversarial post-filtering of speech parameters and adversarial acoustic modeling are investigated respectively. A GAN can generate speech trajectories closer to the natural distribution in an unsupervised adversarial manner, without assuming any conditional distribution of the speech parameters. Experiments demonstrate the effectiveness of adversarial learning both subjectively and objectively: compared with supervised criteria (e.g., mean squared error), GAN-based training strategies generate speech with better perceived quality.
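The adversarial training of point 3 can be illustrated with the standard GAN objective. The NumPy sketch below, with hypothetical helper names, computes the discriminator's binary cross-entropy loss and the non-saturating generator loss; the thesis's exact loss formulation is not reproduced here, so treat this as a generic sketch rather than the author's implementation:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy for the discriminator.

    d_real, d_fake: discriminator output probabilities in (0, 1) for
    natural and generated parameter trajectories, respectively.
    """
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_adv_loss(d_fake):
    """Non-saturating generator loss: push D(G(x)) towards 1."""
    return -np.mean(np.log(d_fake))
```

In adversarial acoustic modeling the adversarial term is commonly combined with a conventional regression term, e.g. `total = mse + w * generator_adv_loss(d_fake)`; the weighting used in the thesis is not specified here.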
 
4. With the goal of simplifying the existing complex pipeline, a neural approach to end-to-end text-to-speech (TTS) synthesis for Mandarin Chinese is proposed. An attention-based encoder-decoder architecture is employed as the end-to-end model, integrating the prosodic prediction model, duration model, and acoustic model. The end-to-end model simplifies the conventional synthesis pipeline and reduces the reliance on complex data annotation, since it implicitly learns the prosodic patterns embedded in the input text. Concretely, the proposed end-to-end model generates short-time Fourier transform (STFT) magnitude spectrograms directly from Mandarin tonal syllables (a.k.a. pinyin), and the Griffin-Lim algorithm is finally used to recover speech. The proposed method achieves a mean opinion score (MOS) of 3.81 in naturalness, outperforming the best internal parametric system. Based on the end-to-end model, multi-speaker end-to-end TTS is also explored; with only some simple modifications to the decoder, the multi-speaker variant synthesizes speech with each speaker's timbre and speaking style well retained. Furthermore, speaker adaptation is investigated based on the trained multi-speaker model. Experiments show it can transfer to a new voice with acceptable quality using a small amount of data, making it a promising way to quickly build a TTS system.
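The Griffin-Lim step mentioned in point 4 recovers a waveform from a magnitude-only spectrogram by alternating projections: keep the given magnitudes, re-estimate the phase from the resulting signal. The NumPy sketch below uses a simple Hann-window STFT with assumed parameters; production systems use more careful windowing and many more iterations:

```python
import numpy as np

def stft(x, n_fft, hop):
    """Hann-windowed STFT: rows are frames, columns are frequency bins."""
    frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames) * np.hanning(n_fft), axis=1)

def istft(S, n_fft, hop, length):
    """Overlap-add inverse STFT with window-energy normalization."""
    x, w = np.zeros(length), np.zeros(length)
    win = np.hanning(n_fft)
    for i, f in enumerate(np.fft.irfft(S, n=n_fft, axis=1)):
        start = i * hop
        x[start:start + n_fft] += f * win
        w[start:start + n_fft] += win ** 2
    return x / np.maximum(w, 1e-8)

def griffin_lim(mag, n_fft, hop, length, n_iter=32, seed=0):
    """Phase recovery: alternate between the time domain and the
    set of spectrograms having the target magnitude."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))   # random init
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop, length)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * phase, n_fft, hop, length)
```

Because only magnitudes are predicted by the end-to-end model, this iterative phase estimation is what turns the spectrogram output into an audible waveform.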
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/21125
Collection: Graduates / Doctoral Dissertations
Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714): 王文富. 语音合成声学建模技术研究[D]. 北京: 中国科学院研究生院, 2018.
Files in this item: 王文富博士论文.pdf (4177 KB), dissertation; access: not yet open; license: CC BY-NC-SA (full text on request)
 

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.