English Abstract

The recent boom in deep learning has greatly advanced acoustic modeling for speech synthesis. Building on deep learning, this work carries out an in-depth investigation of multi-information-fusion-based end-to-end speech synthesis. Compared with concatenative speech synthesis and statistical parametric speech synthesis (SPSS), end-to-end neural text-to-speech (TTS) has become a new trend due to its simpler pipeline, lower cost, and good performance. To further improve the overall quality of synthesized speech, including pronunciation accuracy, naturalness, and sound quality, this thesis studies end-to-end speech synthesis technology in depth. The main contributions are as follows:
1. Based on the end-to-end TTS model Tacotron2, this thesis studies the influence of different modeling units on Mandarin Chinese speech synthesis. An attention-based encoder-decoder architecture serves as the end-to-end model, integrating the prosody prediction model, duration model, and acoustic model into a single network. The end-to-end model simplifies the conventional speech synthesis pipeline and reduces reliance on complex data annotation, since it can implicitly learn the prosodic patterns embedded in the input text. This thesis focuses on the effects of three modeling units: character, pinyin, and phoneme. Experimental results show that both the pinyin-based and phoneme-based models significantly outperform the character-based model, indicating that building a character-based TTS system for Chinese is challenging.
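The three modeling units differ mainly in vocabulary size and sequence length, which is what makes the character-based setting hard. A minimal sketch for one sentence (the pinyin and phoneme mappings below are hand-written illustrative assumptions, not the thesis's actual text front end):

```python
# Toy illustration of the three input modeling units for Mandarin TTS.
# A real system would use a lexicon / grapheme-to-phoneme front end;
# the mappings here are written by hand for a single example sentence.

sentence = "你好世界"  # "Hello, world"

# 1. Character units: one token per Chinese character.
#    Open vocabulary with thousands of types, so data is sparse per type.
chars = list(sentence)

# 2. Pinyin units: one toned syllable per character (~1300 types).
pinyin = ["ni3", "hao3", "shi4", "jie4"]

# 3. Phoneme units: initial/final decomposition with tone (~100 types).
phonemes = ["n", "i3", "h", "ao3", "sh", "i4", "j", "ie4"]

# Smaller unit inventories trade vocabulary size for longer sequences.
print(len(chars), len(pinyin), len(phonemes))  # 4 4 8
```

The phoneme sequence is twice as long as the character sequence here, but each phoneme type is seen far more often in training data, which is one intuition for why the smaller-inventory units are easier to model.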
2. With the goal of improving the character-based Chinese TTS system, an end-to-end TTS model that incorporates pronunciation information is proposed to alleviate data sparsity and the mispronunciation of polyphonic characters. The model employs two novel and simple methods: multi-task learning and dictionary tutoring. The multi-task learning method supplements pinyin-domain knowledge by adding an auxiliary pinyin-prediction task that helps the encoder learn better feature representations. The dictionary tutoring method leverages the rich information in an external dictionary to correct the pronunciation of polyphonic and uncommon Chinese characters. Experimental results show that, compared with the character-based baseline, the proposed methods clearly enhance the naturalness and intelligibility of the synthesized speech, enabling the system to synthesize speech directly from Chinese character sequences.
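The multi-task setup can be summarized as a weighted sum of the TTS reconstruction loss and an auxiliary pinyin-prediction cross-entropy on the encoder states. A minimal numpy sketch (the task weight `lam` and the shape of the pinyin head are illustrative assumptions, not values from the thesis):

```python
import numpy as np

def softmax_cross_entropy(logits, target_idx):
    """Cross-entropy of one auxiliary pinyin-prediction step."""
    z = logits - logits.max()                 # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_idx]

def multi_task_loss(mel_loss, pinyin_logits, pinyin_targets, lam=0.1):
    """Total loss = TTS mel loss + lam * mean auxiliary pinyin loss.

    `pinyin_logits` is (T, num_pinyin_classes): one prediction per
    encoder step; `lam` is an assumed weighting hyperparameter.
    """
    aux = np.mean([softmax_cross_entropy(l, t)
                   for l, t in zip(pinyin_logits, pinyin_targets)])
    return mel_loss + lam * aux

# Toy example: 2 encoder steps, 5 candidate pinyin classes each.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5))
total = multi_task_loss(mel_loss=1.5, pinyin_logits=logits,
                        pinyin_targets=[3, 1], lam=0.1)
print(float(total))
```

Because the auxiliary loss only shapes the encoder representations, the pinyin head can be discarded at synthesis time; the deployed model still takes raw character input.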
3. With the goal of improving the naturalness and prosody of synthesized speech, an end-to-end TTS model that explicitly uses information from pre-trained text embeddings is proposed. The model feeds text embeddings extracted by pre-trained BERT as an additional input to a Tacotron2-based TTS model. These embeddings carry linguistic and semantic information that helps the system produce more natural speech. This thesis compares two approaches to using the pre-trained text information: a feature-based approach and a fine-tuning approach. For the feature-based approach, further experiments compare adding the text information at different places (input-side enhancement versus output-side enhancement). Experimental results show that text embeddings from pre-trained BERT enhance the naturalness and prosody of the synthesized speech, and that the feature-based approach with input-side enhancement works best.
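Input-side enhancement in the feature-based approach can be pictured as concatenating frozen pre-trained text embeddings with the TTS encoder's features along the channel axis before attention. A shape-level numpy sketch (the dimensions and the assumption that BERT tokens are already aligned one-to-one with encoder steps are illustrative, not the thesis's actual configuration):

```python
import numpy as np

def input_side_enhance(encoder_feats, bert_embs):
    """Feature-based use of BERT: concatenate frozen pre-trained text
    embeddings with the TTS encoder features along the channel axis.

    Both inputs are (T, dim) arrays over the same token sequence of
    length T; BERT's weights are not updated in this approach.
    """
    assert encoder_feats.shape[0] == bert_embs.shape[0], \
        "token sequences must be aligned to the same length"
    return np.concatenate([encoder_feats, bert_embs], axis=-1)

T = 12                      # tokens in the input sequence
enc = np.zeros((T, 512))    # assumed Tacotron2 encoder output dim
bert = np.zeros((T, 768))   # BERT-base hidden size
fused = input_side_enhance(enc, bert)
print(fused.shape)          # (12, 1280)
```

In practice BERT's subword tokens must be aligned or pooled to match the TTS encoder's unit sequence; this sketch assumes the two sequences already match, and the fused features then feed the attention-based decoder as usual.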