Research on End-to-End Speech Recognition Methods Based on Linguistic Knowledge Transfer (基于语言知识迁移的端到端语音识别方法研究)
Author: 白烨 (Ye Bai)
Date: 2021-05
Pages: 116
Degree type: Doctoral
Abstract (Chinese)

Large-scale unlabeled text corpora contain rich linguistic knowledge. Distilling the knowledge in unlabeled text has proven to be an effective way to improve language processing tasks such as classification, matching, and sequence labeling. However, for text generation models built with end-to-end neural networks, such as speech recognition and machine translation, the advantages of unlabeled text have not been fully realized. This is because practical text generation models are usually conditional (e.g., generating text from speech or images) and require paired data for training, so they cannot directly exploit unlabeled text-only data. Existing approaches either add extra models at the recognition stage, which increases the computational cost, or cannot reuse already-trained language models, which makes them inflexible. How to enable end-to-end neural text generation models to effectively exploit the linguistic knowledge in large-scale unlabeled text corpora, while avoiding both the cost and the inflexibility problems, has not yet been studied in depth.

Starting from the concrete practical problem of how to use text-only data to improve end-to-end speech recognition, this thesis takes transfer learning as its central methodology and transfers knowledge from large-scale unlabeled text corpora to end-to-end speech recognition models. It makes four contributions at three progressive levels: transferring knowledge of the preceding (left) context, transferring knowledge of the global (whole-sentence) context, and cross-modal transfer of global contextual linguistic knowledge.

1. A method for exploiting knowledge in text-only data. To address the problems that existing approaches add extra models at the recognition stage (high cost) or cannot reuse already-trained language models (inflexibility), this thesis proposes LST, a teacher-student learning based method that exploits the linguistic knowledge in large-scale unlabeled text to improve end-to-end speech recognition: a language model is first used to represent the linguistic knowledge in the large-scale text, and teacher-student learning then transfers this knowledge into the end-to-end speech recognition system (a sketch of the teacher-student objective follows this list). Compared with other methods, LST adds no computation at the prediction stage and is therefore efficient; moreover, it can reuse openly available pre-trained language models without training them from scratch, which makes it convenient and flexible. This thesis also analyzes and compares LST with shallow fusion, another typical way of using text knowledge, and finds that smoothing the score space estimated by the model is an important property through which both methods improve recognition performance. The method applies not only to speech recognition but also to all other conditional text generation tasks.
    
2. A global-context language model. To address the problem that end-to-end encoder-decoder models do not exploit the following (right) context in text, this thesis proposes a global-context language model, the causal cloze completer, and uses the LST method to transfer this global contextual linguistic knowledge into the end-to-end encoder-decoder model, so that the encoder-decoder model can also benefit from whole-sentence linguistic knowledge. Compared with other methods that exploit bidirectional linguistic knowledge, this method adds no complexity at the recognition stage and can flexibly exploit unlabeled text-only data.
    
3. Confirmation that the linguistic knowledge contained in speech can support effective recognition without explicit language modeling. To address the high decoding cost of autoregressive beam search in existing end-to-end encoder-decoder models, and based on the observed isomorphism between the linguistic knowledge in speech and in text, this thesis proposes LASO, an end-to-end non-autoregressive speech recognition model. Because the model performs no explicit autoregressive language modeling, it can predict all tokens simultaneously in parallel (see the factorization sketched after this list). Experiments show that the proposed model achieves performance comparable to autoregressive models on two public Chinese speech datasets of different scales, while processing speech nearly 50 times faster. These results show that efficient speech recognition is possible without explicit autoregressive language modeling, by exploiting the linguistic knowledge contained in the speech itself.
    
4. A cross-modal global linguistic knowledge transfer method that effectively improves a unimodal speech recognition model. Based on the isomorphism between the linguistic knowledge in text and in speech, this thesis proposes to transfer the linguistic knowledge in a large-scale pre-trained language model cross-modally into the non-autoregressive speech recognition model LASO (one plausible form of such a transfer loss is sketched after this list). Experiments show that the proposed method improves an end-to-end speech recognition model that models only the speech modality. The results further indicate that knowledge transfer based on the isomorphism of linguistic knowledge across modalities can effectively improve models of another modality.
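To make the teacher-student transfer in contribution 1 concrete, the following is a minimal sketch of one plausible form of the LST objective, assuming a frozen language-model teacher that provides soft token distributions over the ground-truth transcript and an interpolation weight lam; the variable names and the exact weighting scheme are illustrative assumptions rather than the thesis's exact formulation. For contrast, shallow fusion combines scores only at decoding time, roughly as log P_asr(y|x) + beta * log P_lm(y) inside beam search.

    import torch
    import torch.nn.functional as F

    def lst_loss(asr_logits, lm_logits, targets, lam=0.5, pad_id=0):
        """Illustrative LST-style teacher-student objective.

        asr_logits: (B, T, V) outputs of the end-to-end ASR student.
        lm_logits:  (B, T, V) outputs of a frozen language-model teacher
                    run over the ground-truth transcript.
        targets:    (B, T) token ids of the transcript.
        lam:        assumed interpolation weight between hard and soft targets.
        """
        # Ordinary cross entropy against the ground-truth transcript.
        ce = F.cross_entropy(asr_logits.transpose(1, 2), targets,
                             ignore_index=pad_id)
        # KL divergence from the teacher's soft token distribution to the
        # student's, i.e. knowledge distillation from the language model.
        log_p_student = F.log_softmax(asr_logits, dim=-1)
        p_teacher = F.softmax(lm_logits, dim=-1).detach()
        kd = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
        return (1.0 - lam) * ce + lam * kd

Because the teacher is consulted only during training, nothing is added to the decoding procedure, which is where the claimed inference-time efficiency comes from.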
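The decoding speed-up in contribution 3 comes from changing the factorization of the output sequence probability. Writing x for the speech and y_1..y_T for the token sequence, a conventional attention-based encoder-decoder decodes autoregressively, while a non-autoregressive model in the spirit of LASO predicts every position from the speech alone, so all positions can be computed in parallel:

    \begin{aligned}
    \text{autoregressive:} \quad & P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t},\, x) \\
    \text{non-autoregressive:} \quad & P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid x)
    \end{aligned}

Dropping the dependence on y_{<t} removes the token-by-token loop of beam search, which is where the reported near-50x speed-up comes from; the trade-off is that linguistic constraints must be captured implicitly from the speech representation or injected during training.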
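Contribution 4 injects such linguistic constraints by transferring knowledge from a pre-trained text model into the non-autoregressive speech model. The snippet below is only one plausible form of a cross-modal transfer loss, assuming the transfer is done by aligning the speech model's decoder states with the hidden states of a frozen text encoder such as BERT; the projection layer, the MSE criterion, and the dimension values are assumptions for illustration, not the thesis's exact method.

    import torch
    import torch.nn as nn

    class CrossModalAlignLoss(nn.Module):
        """Illustrative cross-modal transfer loss: pull the hidden states of a
        non-autoregressive ASR decoder toward the hidden states of a frozen
        pre-trained text encoder (e.g. BERT) computed on the same transcript."""

        def __init__(self, asr_dim=512, text_dim=768):
            super().__init__()
            # Project speech-side states into the text model's representation space.
            self.proj = nn.Linear(asr_dim, text_dim)
            self.mse = nn.MSELoss()

        def forward(self, asr_states, text_states):
            # asr_states:  (B, T, asr_dim) from the speech model (student).
            # text_states: (B, T, text_dim) from the frozen text model (teacher).
            return self.mse(self.proj(asr_states), text_states.detach())

The text-side teacher sees the whole sentence at once, so the transferred representations carry the global context that a purely parallel speech model would otherwise lack.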

Abstract (English)

Large-scale text-only data contains rich linguistic knowledge. Distilling the knowledge in text-only data has been confirmed to improve performance on many natural language processing tasks, such as classification, matching, and sequence labeling. However, these advantages have not yet been realized in deep-learning-based end-to-end text generation models. Because practical text generation is conditional, such models require paired data for training, and it is non-trivial to train them directly on text-only data. Previous approaches either add extra modules at the recognition stage or cannot reuse pre-trained language models. It is therefore worth investigating methods that use text-only data to improve end-to-end text generation models while remaining flexible enough to reuse pre-trained language models and adding no extra computation during inference.

This thesis focuses on a practical problem: how to use text-only data to improve end-to-end speech recognition. Taking transfer learning as the central methodology, we transfer knowledge from text-only data to end-to-end speech recognition models. Three progressive aspects are discussed: transferring knowledge of the left context, transferring knowledge of the whole-sentence context, and cross-modal knowledge transfer. The four contributions are as follows.

1. Propose a method for using knowledge in text-only data. We propose a teacher-student learning based method called LST, which uses knowledge in text-only data to improve end-to-end speech recognition. It first uses a language model to represent the knowledge in the text-only data; the knowledge is then transferred to the speech recognition model. Compared with other methods, LST is more efficient because it adds no computation at the test stage, and it is flexible because it can reuse language models pre-trained by others. This thesis also analyzes and compares LST with shallow fusion, another typical method, and finds that smoothing the score space estimated by a model is an important factor in improving performance. LST can be applied not only to speech recognition but also to all other conditional text generation tasks.
    
2. Propose a whole-sentence language model. The encoder-decoder model does not use the "future" context during text generation. To address this issue, this thesis proposes a whole-sentence language model called the causal cloze completer, and uses the proposed LST to transfer the whole-sentence knowledge to end-to-end speech recognition. Compared with other methods that use bidirectional information in a sentence, the proposed method adds no extra computation at the test stage and can use text-only data flexibly.
    
3. Confirm that the linguistic knowledge in speech can be used for speech recognition without explicit language modeling. Based on the observed isomorphism between speech and text, this thesis proposes an end-to-end non-autoregressive speech recognition model called LASO. Because LASO is non-autoregressive, it can generate all tokens in parallel. The experiments show that LASO achieves performance comparable to an autoregressive baseline on two public Chinese speech datasets, with a processing speedup of about 50 times. These results show that speech recognition is feasible without explicit language modeling.
    
4. Propose a cross-modal knowledge transfer method that improves the performance of a unimodal speech recognition model. Based on the isomorphism between speech and text, this thesis transfers knowledge from a large-scale pre-trained language model to the proposed non-autoregressive model. The experiments show that the method improves the performance of the unimodal end-to-end speech recognition model, indicating that exploiting the isomorphism between speech and text to transfer knowledge from a text-based model can effectively improve a speech model.
 

Keywords: end-to-end speech recognition; transfer learning; knowledge distillation; teacher-student learning; BERT; non-autoregressive speech recognition
Language: Chinese
Sub-direction classification: Speech Recognition and Synthesis
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/44391
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems / Intelligent Interaction
Recommended citation
GB/T 7714
白烨. 基于语言知识迁移的端到端语音识别方法研究[D]. 中国科学院自动化研究所, 2021.
Files in This Item
白烨博士论文-打印-0601.pdf (7085 KB): doctoral dissertation, open access, license CC BY-NC-SA