低资源语言的多语言语音识别建模方法研究
Alternative Title: Research on Multilingual Speech Recognition for Low-resource Languages
Author: 周世玉
Date: 2018-12
Degree Type: Doctoral
Chinese Abstract

Speech recognition systems are currently available for only the dozen or so most widely spoken languages in the world. For resource-constrained languages, building a speech recognition system remains very difficult. The goal of this thesis is to improve the speech recognition performance of resource-constrained languages using multilingual speech recognition techniques. The main contributions are as follows:

1. We propose a shared-hidden-layer multilingual LSTM model (Shared-Hidden-Layer Multilingual LSTM, SHL-MLSTM) for multilingual speech recognition on low-resource data. Because LSTMs model temporal structure better than DNNs, they are better suited as shared hidden layers for extracting the acoustic characteristics common to multiple languages. On the six languages of the resource-constrained multilingual CALLHOME datasets, experimental results show that SHL-MLSTM obtains a 2.1-6.8% relative reduction in word error rate (WER) over the DNN-based shared-hidden-layer multilingual model (Shared-Hidden-Layer Multilingual Deep Neural Network, SHL-MDNN) and a 2.6-7.3% relative WER reduction over monolingual LSTMs. In addition, we introduce residual learning into the SHL-MLSTM model, which effectively alleviates the degradation problem of deep SHL-MLSTM networks and brings an additional 2% relative WER reduction over the plain SHL-MLSTM, further improving performance (a minimal sketch of this architecture follows the list).

2. We propose the self-attention-based end-to-end model, the ASR Transformer, and compare the performance of five modeling units for Mandarin Chinese speech recognition: context-independent phonemes, toned syllables (pinyin with tones), words, sub-words, and Chinese characters. First, the ASR Transformer achieves good recognition results on several Chinese and English datasets, verifying that this end-to-end model is well suited to continuous speech recognition. Second, experiments show that on Mandarin speech recognition tasks, modeling units that do not require a pronunciation lexicon outperform lexicon-dependent modeling units, which lays the foundation for using lexicon-free modeling units in the subsequent multilingual speech recognition work (a sketch of the ASR Transformer also follows the list).

3. We propose a multilingual Transformer model (Multilingual ASR Transformer, Multi-Transformer). First, the model uses sub-word modeling units produced by byte pair encoding (BPE), which completely removes the dependency on pronunciation lexicons and avoids the construction of a complicated universal phone set; this is crucial for minority languages that lack pronunciation dictionaries. Second, the model unifies language identification and speech recognition in a single model, so preprocessing steps such as language identification and language segmentation are no longer needed and multilingual speech recognition is supported directly. Finally, the model adopts an end-to-end framework and no longer requires the GMM alignment and decision-tree clustering steps of the traditional hybrid framework, greatly simplifying the multilingual speech recognition pipeline. On the six languages of the resource-constrained multilingual CALLHOME datasets, experimental results show that, given a pre-trained model, the Multi-Transformer has a significant advantage over the LSTM-based shared-hidden-layer SHL-MLSTM model. In addition, we apply the model to English-Mandarin code-switching speech recognition, a special case of multilingual speech recognition, and experimental results show that it performs well on both intra-sentential and inter-sentential code-switching.

4. To address the shortage of training data for resource-constrained languages, we propose jointly training the Multi-Transformer on languages with abundant labeled data and languages with scarce labeled data, which compensates for the insufficient acoustic-model training of low-resource languages. We further use cold fusion to integrate an external language model and use a GMM-HMM system to generate pseudo-labeled data, which compensates for the insufficient language-model training of low-resource languages. Experimental results show that both methods improve the speech recognition performance of low-resource languages to varying degrees.
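For contribution 1, the following is a minimal PyTorch sketch of the shared-hidden-layer idea with residual connections. The layer sizes, the placement of the residual connections, and all class and variable names (e.g. SHLMultilingualLSTM, output_dims) are illustrative assumptions, not the thesis's exact configuration.

```python
import torch
import torch.nn as nn

class SHLMultilingualLSTM(nn.Module):
    """Shared LSTM hidden layers with one output (softmax) layer per language.

    The residual connection over each LSTM layer is an assumption about where
    residual learning is applied; the thesis's exact topology may differ.
    """
    def __init__(self, feat_dim, hidden_dim, num_layers, output_dims):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, hidden_dim)
        # Shared stack of LSTM layers (one nn.LSTM per layer so residuals can be added).
        self.lstm_layers = nn.ModuleList(
            [nn.LSTM(hidden_dim, hidden_dim, batch_first=True) for _ in range(num_layers)]
        )
        # One language-specific output layer per language (e.g. the six CALLHOME languages).
        self.output_layers = nn.ModuleList(
            [nn.Linear(hidden_dim, d) for d in output_dims]
        )

    def forward(self, feats, lang_id):
        x = self.input_proj(feats)               # (batch, time, hidden_dim)
        for lstm in self.lstm_layers:
            out, _ = lstm(x)
            x = x + out                           # residual (skip) connection over each layer
        return self.output_layers[lang_id](x)     # per-frame logits for the chosen language

# Example: 40-dim features, 5 shared layers, 6 languages with 3000 output states each.
model = SHLMultilingualLSTM(40, 512, 5, [3000] * 6)
logits = model(torch.randn(8, 200, 40), lang_id=2)    # shape (8, 200, 3000)
```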
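For contribution 2, here is a minimal sketch of an encoder-decoder ASR Transformer built with PyTorch's nn.Transformer, whose output vocabulary is whatever modeling unit is chosen (CI-phonemes, toned syllables, words, sub-words, or characters). Positional encodings and the convolutional front-end are omitted for brevity; all dimensions and names are illustrative, not the thesis's exact model.

```python
import torch
import torch.nn as nn

class ASRTransformer(nn.Module):
    """Encoder-decoder Transformer over acoustic features.

    The output vocabulary is the chosen modeling unit; with characters or BPE
    sub-words the model is lexicon-free, since no pronunciation dictionary is
    needed to map words to output symbols.
    """
    def __init__(self, feat_dim, vocab_size, d_model=256, nhead=4,
                 num_encoder_layers=6, num_decoder_layers=6):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, d_model)        # acoustic features -> model dim
        self.token_emb = nn.Embedding(vocab_size, d_model)   # target-unit embedding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            batch_first=True)
        self.output_proj = nn.Linear(d_model, vocab_size)    # logits over the chosen units

    def forward(self, feats, tokens):
        src = self.feat_proj(feats)     # (batch, frames, d_model)
        tgt = self.token_emb(tokens)    # (batch, target_len, d_model)
        # Causal mask so each output position only attends to earlier target units.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.output_proj(out)

# Example: character-level Mandarin model with a 5000-character vocabulary.
model = ASRTransformer(feat_dim=80, vocab_size=5000)
logits = model(torch.randn(4, 300, 80), torch.randint(0, 5000, (4, 20)))  # (4, 20, 5000)
```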
 

English Abstract

Current speech recognition systems are limited to the most widely used languages in the world. For resource-limited languages, it is still very challenging to build an automatic speech recognition (ASR) system. The purpose of this thesis is to improve the speech recognition performance of resource-limited languages using multilingual speech recognition technology. The main contributions are as follows:

1. We propose a shared-hidden-layer multilingual model built on Long Short-Term Memory (LSTM) recurrent neural networks (SHL-MLSTM) for multilingual low-resource speech recognition, motivated by the fact that LSTMs have outperformed DNNs as acoustic models. Experimental results on the CALLHOME datasets demonstrate that SHL-MLSTM trained on six languages relatively reduces the word error rate (WER) by 2.1-6.8% over the shared-hidden-layer multilingual DNN (SHL-MDNN) and by 2.6-7.3% over monolingual LSTMs trained on language-specific data. An additional relative WER reduction of about 2% over SHL-MLSTM is obtained through residual learning, which demonstrates that residual learning is useful for SHL-MLSTM in multilingual low-resource ASR.

2. We study modeling units for Mandarin Chinese ASR using a sequence-to-sequence attention-based model, the ASR Transformer. Five modeling units are explored: context-independent phonemes (CI-phonemes), syllables, words, sub-words, and characters. The ASR Transformer performs very well on Chinese and English datasets, which verifies that it is well suited to ASR tasks. Moreover, experiments on the HKUST dataset demonstrate that lexicon-free modeling units can outperform lexicon-dependent modeling units in terms of character error rate, which lays the foundation for multilingual speech recognition with lexicon-free modeling units.

3. We propose a multilingual ASR Transformer (Multi-Transformer) for multilingual speech recognition tasks. First, sub-words encoded by byte pair encoding (BPE) are employed as the multilingual modeling unit to remove the dependency on a pronunciation lexicon and a common phone set, which is crucial for low-resource languages without a pronunciation dictionary (a BPE sub-word sketch follows this list). Second, the Multi-Transformer is well suited to multilingual ASR tasks because it encapsulates the acoustic, pronunciation, and language models jointly in a single network and eliminates pre-processing steps such as language identification and language segmentation. Last, it eliminates the GMM alignment and decision-tree clustering of the traditional hybrid ASR framework, which greatly simplifies multilingual speech recognition. A comparison with SHL-MLSTM with residual learning is carried out on the CALLHOME datasets with six languages. Experimental results reveal that a single Multi-Transformer with a pre-trained model has a significant advantage over SHL-MLSTM with residual learning. Moreover, we investigate English-Mandarin bilingual speech recognition with the Multi-Transformer. Experimental results reveal that it performs very well on both intra-sentential and inter-sentential code-switching.

4. We propose to train high-resource and low-resource languages together with a Multi-Transformer, which alleviates the insufficient acoustic-model training of low-resource languages. The cold fusion method is employed to integrate an external language model, and a GMM-HMM system is used to generate pseudo-labeled data, which compensates for the insufficient language-model training of low-resource languages (a cold-fusion sketch follows this list). Experimental results demonstrate that both methods improve recognition performance for low-resource languages to varying degrees.
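As a companion to contribution 3, the sketch below shows one way to build a shared multilingual BPE sub-word inventory and prefix each target sequence with a language tag so that a single model can perform language identification and recognition jointly. The sentencepiece toolkit is used here only as one common BPE implementation; the file names, vocabulary size, and tagging scheme are assumptions, not the thesis's exact setup.

```python
# Requires: pip install sentencepiece
import sentencepiece as spm

# Train one BPE model on the pooled transcripts of all languages, so every language
# shares a single sub-word vocabulary and no pronunciation lexicon or universal
# phone set is needed. "all_languages_transcripts.txt" is a hypothetical file.
spm.SentencePieceTrainer.train(
    input="all_languages_transcripts.txt",
    model_prefix="multilingual_bpe",
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="multilingual_bpe.model")

def make_target(lang: str, transcript: str) -> list:
    """Prepend a language tag so one model does language ID and ASR jointly."""
    return [f"<{lang}>"] + sp.encode(transcript, out_type=str)

print(make_target("spanish", "hola como estas"))
# e.g. ['<spanish>', '▁hola', '▁como', '▁est', 'as']  (actual pieces depend on the data)
```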
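For contribution 4, the following is a minimal sketch of the cold-fusion gating idea: the state of a pre-trained external language model is gated into the decoder state before the output projection. Dimensions, names, and the choice of LM hidden state versus LM logits are illustrative assumptions; the thesis's precise variant may differ.

```python
import torch
import torch.nn as nn

class ColdFusionLayer(nn.Module):
    """Gate a pre-trained language model's state into the ASR decoder output.

    A fine-grained sigmoid gate decides, per dimension, how much LM information
    to mix in before the fused output projection.
    """
    def __init__(self, dec_dim, lm_dim, vocab_size):
        super().__init__()
        self.lm_proj = nn.Linear(lm_dim, dec_dim)               # project LM state to decoder dim
        self.gate = nn.Linear(dec_dim + dec_dim, dec_dim)       # gate computed from [decoder; LM]
        self.output = nn.Linear(dec_dim + dec_dim, vocab_size)  # fused output layer

    def forward(self, dec_state, lm_state):
        h_lm = self.lm_proj(lm_state)
        g = torch.sigmoid(self.gate(torch.cat([dec_state, h_lm], dim=-1)))
        fused = torch.cat([dec_state, g * h_lm], dim=-1)
        return self.output(torch.relu(fused))

# Example: 256-dim decoder state, 512-dim LM state, 8000 BPE sub-word outputs.
fusion = ColdFusionLayer(dec_dim=256, lm_dim=512, vocab_size=8000)
logits = fusion(torch.randn(4, 20, 256), torch.randn(4, 20, 512))  # shape (4, 20, 8000)
```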

Keywords: speech recognition (ASR); multilingual; low-resource; cross-language; end-to-end; sequence-to-sequence; multilingual speech recognition; English-Mandarin bilingual speech recognition
Subject Area: Engineering :: Computer Science and Technology (degrees conferrable in engineering or science)
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/22394
Collection: Graduates - Doctoral Dissertations
Corresponding Author: 周世玉
Recommended Citation (GB/T 7714):
周世玉. 低资源语言的多语言语音识别建模方法研究[D]. 北京: 中国科学院研究生院, 2018.
Files in This Item:
File Name/Size: 低资源语言的多语言语音识别建模方法研究_ (2353 KB); Document Type: Thesis; Access: Restricted; License: CC BY-NC-SA
 
