基于编解码框架的端到端语音识别技术研究 (Research on End-to-End Speech Recognition Techniques Based on the Encoder-Decoder Framework)
董林昊 (Dong Linhao)
2020-06
Pages: 120
Degree type: Doctoral
Chinese Abstract

Since the early 2010s, neural network techniques empowered by deep learning have, owing to their outstanding modeling capacity, gradually become the mainstream of speech recognition technology. Over the same period, in order to better carry and exploit the modeling advantages of neural networks, speech recognition system frameworks have been continuously renewed, giving rise to representative frameworks such as the context-dependent deep neural network hidden Markov model (CD-DNN-HMM), connectionist temporal classification (CTC), and the encoder-decoder framework. Among them, the encoder-decoder framework, which relies entirely on neural networks for end-to-end modeling, offers greater performance potential and application advantages thanks to its simplicity of construction and the integrity of its optimization. However, being at an early stage of development, models built on the encoder-decoder framework (encoder-decoder models) still suffer from poor computational parallelism, insufficient recognition performance, and limited coverage of application scenarios, so they are rarely deployed in practical speech recognition systems and their potential urgently needs further exploration. In view of this, this thesis focuses on the design and improvement of encoder-decoder models for speech recognition, and carries out research along three lines: exploring new model structures, designing and optimizing the encoder and decoder, and designing and optimizing the alignment mechanism. The main innovations are as follows:

1. The transformer model is introduced into the field of speech recognition for the first time, together with effective convolutional downsampling, front-end modules, and related training strategies, enabling the transformer to reach recognition performance comparable to the attention-based encoder-decoder model (the attention model) at a very small training cost, thereby indirectly alleviating the "training bottleneck" that the attention model suffers from due to its poor computational parallelism. In addition, different hyper-parameter combinations of the transformer on speech recognition tasks are compared; the best-performing combination and the corresponding model structure verified in the experiments have been cited by a number of later papers, which to some extent promoted the development of the transformer, a highly parallel encoder-decoder model, in speech recognition.

2. The recurrent neural aligner (RNA), an encoder-decoder model that supports online recognition, is applied to Mandarin Chinese speech recognition, and the encoder and decoder of the RNA model are structurally redesigned according to the characteristics of Mandarin. Specifically, based on the pronunciation characteristics of Mandarin, namely its low temporal entropy density and its tones, the best downsampling rate and structural combination are explored, and a gated convolutional layer is introduced to capture acoustic details such as tone. Based on the linguistic characteristic that the large number of Chinese homophones easily leads to typos, a confidence penalty algorithm is introduced to encourage a fuller search over alternatives, and a method for jointly training the RNA model with a language model is proposed. With these extensions, the RNA model achieves outstanding online recognition performance on the Chinese benchmark dataset, verifying the effectiveness of encoder-decoder models for online Mandarin speech recognition.

3. An encoder-decoder model, the self-attention aligner (SAA), is proposed, in which self-attention networks (SAN) completely replace the long short-term memory (LSTM) units of the RNA model. According to the modeling characteristics of SAN, the encoder and decoder of the SAA model are further designed and optimized, so that the model not only achieves the best end-to-end recognition performance on the Chinese benchmark dataset at the time, but also supports online recognition. In addition, SAN and LSTM are compared within the encoder-decoder model in terms of recognition performance, training speed, and inference speed, confirming the modeling advantages of SAN for speech recognition.

4. A sequence alignment mechanism with low computational complexity and monotonic consistency, continuous integrate-and-fire (CIF), is proposed to address the inability of mainstream attention models to support online recognition or to locate acoustic boundaries, as well as their high computational complexity. Several supporting strategies are also proposed to further refine the recognition performance of the CIF-based encoder-decoder model, which achieves outstanding results on multiple datasets covering different languages and speech types. Because the CIF-based encoder-decoder model can locate acoustic boundaries, which are among the most important cues in speech cognition, it provides new means and paths for integrating various knowledge models into speech recognition, effectively broadening the potential application scenarios of encoder-decoder models.

English Abstract

Since the early 2010s, neural network techniques energized by deep learning have gradually developed into the mainstream of automatic speech recognition owing to their outstanding modeling capacity. In the same period, in order to better carry and leverage the modeling advantages of neural networks, the frameworks of speech recognition systems have also been continuously innovated, and representative frameworks such as the context-dependent deep neural network hidden Markov model (CD-DNN-HMM), connectionist temporal classification (CTC), and the encoder-decoder framework have emerged. Among them, the encoder-decoder framework, which relies entirely on neural networks for end-to-end modeling, offers better performance potential and application advantages due to its simplicity of construction and the integrity of its optimization. However, being at an early stage of development, models based on the encoder-decoder framework (encoder-decoder models) suffer from poor computational parallelism, insufficient recognition performance, and limited application scenarios, so they are rarely used in practical speech recognition systems and their potential needs to be further explored. In this regard, this thesis focuses on the design and improvement of encoder-decoder models for speech recognition, and conducts research along three lines: the exploration of new model structures, the design and optimization of the encoder and the decoder, and the design and optimization of the alignment mechanism. The main innovations are as follows:

        1. This thesis introduces the transformer model into the field of speech recognition for the first time, and designs effective convolutional downsampling, front-end modules, and corresponding training strategies to make the transformer suitable for speech recognition. With these designs, the transformer achieves performance comparable to the attention-based encoder-decoder model (the attention model) at a much smaller training cost, thereby indirectly alleviating the "training bottleneck" of the attention model caused by its poor parallelism. In addition, different hyper-parameter combinations of the transformer model are compared, and the best-performing combination and the corresponding model structure verified in the experiments have been adopted by many subsequent papers, which to some extent has promoted the development of the transformer, a highly parallel encoder-decoder model, in speech recognition.
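To make the convolutional downsampling front end concrete, the following is a minimal PyTorch sketch of a 2-D convolutional module that reduces the frame rate of the input features by a factor of four before they enter the transformer encoder. The channel count, kernel sizes, and the exact 4x rate are illustrative assumptions rather than the configuration used in the thesis.

import torch
import torch.nn as nn

class ConvSubsampling(nn.Module):
    """Illustrative 2-D convolutional front end: two stride-2 convolutions
    shorten the input spectrogram by roughly 4x in time before self-attention.
    Channel counts and kernel sizes are assumptions, not the thesis values."""
    def __init__(self, input_dim: int, d_model: int, channels: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Project the flattened (channel x reduced-frequency) features to the model width.
        self.proj = nn.Linear(channels * ((input_dim + 3) // 4), d_model)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, freq) log-mel filterbank features
        x = self.conv(feats.unsqueeze(1))            # (batch, channels, time/4, freq/4)
        b, c, t, f = x.size()
        x = x.transpose(1, 2).reshape(b, t, c * f)   # (batch, time/4, channels * freq/4)
        return self.proj(x)                          # (batch, time/4, d_model)

Shortening the sequence in this way both cuts the cost of self-attention over long acoustic frame sequences and injects local spectro-temporal context before global attention is applied.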

        2. This thesis applies the recurrent neural aligner (RNA), an encoder-decoder model that supports online speech recognition, to Mandarin Chinese, and designs the encoder and decoder structures of the RNA model according to the characteristics of Mandarin. Specifically, according to the pronunciation characteristics of Mandarin, namely its low temporal entropy density and its tones, this thesis explores the best downsampling rate and the corresponding implementation structure, and introduces a gated convolutional layer to capture acoustic details (such as tone). According to the linguistic characteristic of Chinese that the large number of homophones easily causes typos, this thesis introduces a confidence penalty regularization to encourage the search over more sensible alternatives, and proposes a method for jointly training the RNA model with a language model. The RNA model extended with the above designs achieves outstanding online recognition performance on the Chinese benchmark dataset, verifying the effectiveness of encoder-decoder models for online Mandarin speech recognition.
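As an illustration of the two extensions mentioned above, the sketch below shows a GLU-style gated convolutional layer and a cross-entropy loss with a confidence (negative-entropy) penalty, written in PyTorch under the assumption of the standard formulations of these techniques; the layer sizes, the penalty weight beta, and the helper names are hypothetical and not taken from the thesis.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv1d(nn.Module):
    """GLU-style gated 1-D convolution over time: one branch carries content,
    the other a sigmoid gate, so fine-grained acoustic cues such as tone can be
    passed or suppressed frame by frame. Hyper-parameters are illustrative."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Produce 2*dim channels, then split them into content and gate halves.
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        h = self.conv(x.transpose(1, 2))             # (batch, 2*dim, time)
        content, gate = h.chunk(2, dim=1)
        return (content * torch.sigmoid(gate)).transpose(1, 2)

def loss_with_confidence_penalty(logits, targets, beta=0.1, pad_id=0):
    """Cross entropy minus beta * entropy: penalizing over-confident output
    distributions keeps sensible alternatives alive during beam search.
    For simplicity the entropy term is averaged over all positions, while
    padding is ignored only in the cross-entropy term."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_id)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(-1).mean()
    return ce - beta * entropy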

        3. This thesis proposes an encoder-decoder model, the self-attention aligner (SAA), in which self-attention networks (SAN) entirely replace the long short-term memory (LSTM) layers of the RNA model. In addition, this thesis designs and optimizes the encoder and the decoder of the SAA model according to the modeling characteristics of SAN, which not only enables it to achieve the best recognition performance on the Chinese benchmark dataset at that time, but also allows it to support online speech recognition. Meanwhile, the recognition performance, training speed, and inference speed of the SAN-based and LSTM-based encoder-decoder models are compared, and the results confirm the advantages of SAN in modeling the speech recognition task.
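For reference, the following is a minimal sketch of one self-attention (SAN) block of the kind that can replace an LSTM layer in such an encoder: multi-head self-attention plus a position-wise feed-forward network, each with a residual connection and layer normalization. The dimensions and the post-norm layout are assumptions, not the SAA configuration reported in the thesis.

import torch
import torch.nn as nn

class SANBlock(nn.Module):
    """One self-attention encoder block: multi-head self-attention followed by a
    position-wise feed-forward network, each wrapped with a residual connection
    and layer normalization. Sizes below are illustrative defaults."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # x: (batch, time, d_model); a causal or chunk-wise attn_mask restricts
        # the receptive field when online (streaming) recognition is required.
        h, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        x = self.norm1(x + self.drop(h))
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x

Unlike an LSTM, such a block processes all frames of a segment in parallel, which is the main source of the training-speed advantage referred to in the paragraph above.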

        4. This thesis proposes a monotonic sequence alignment mechanism with low computational complexity, continuous integrate-and-fire (CIF), to deal with the problems of the mainstream attention model: it cannot support online speech recognition, cannot locate acoustic boundaries, and has high computational complexity. In addition, this thesis presents several supporting strategies to further refine the recognition performance of the CIF-based encoder-decoder model, which obtains outstanding recognition results on multiple datasets covering different languages and speech types. Since the CIF-based encoder-decoder model can locate acoustic boundaries, which are regarded as among the most important information in speech cognition, it provides new means and paths for integrating various knowledge models into speech recognition, effectively broadening the potential application scenarios of the encoder-decoder model.
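To clarify how an integrate-and-fire alignment works, here is a minimal, unbatched sketch of the forward pass of a CIF-style mechanism written from the general description above; the variable names, the threshold of 1.0, and the handling of leftover weight at the end of an utterance are simplifying assumptions rather than the exact formulation in the thesis.

import torch

def continuous_integrate_and_fire(hidden, alpha, threshold=1.0):
    """Unbatched CIF sketch. Per-frame weights `alpha` (in [0, 1]) are
    accumulated left to right; every time the accumulator crosses `threshold`,
    the frame states integrated so far are emitted as one label-level vector,
    i.e. a "firing" at a located acoustic boundary.
    hidden: (time, dim) encoder states, alpha: (time,) weights."""
    integrated = torch.zeros(hidden.size(1))
    accumulated, fired = 0.0, []
    for h_t, a_t in zip(hidden, alpha.tolist()):
        if accumulated + a_t < threshold:
            # Not enough weight yet: keep integrating the current frame.
            accumulated += a_t
            integrated = integrated + a_t * h_t
        else:
            # Boundary reached: spend just enough weight to hit the threshold,
            # fire the integrated vector, and start a new one with the remainder.
            needed = threshold - accumulated
            fired.append(integrated + needed * h_t)
            remainder = a_t - needed
            accumulated = remainder
            integrated = remainder * h_t
    # Any residual weight below the threshold is simply dropped in this sketch.
    return torch.stack(fired) if fired else torch.zeros(0, hidden.size(1))

Each firing yields one label-level representation for the decoder, so the number of firings tracks the number of output tokens, and the firing positions correspond to the located acoustic boundaries that the paragraph above describes as a handle for integrating external knowledge models.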

Keywords: speech recognition technology, neural networks, encoder-decoder framework, end-to-end modeling
Discipline: Engineering
Language: Chinese
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/39273
Collection: 复杂系统认知与决策实验室_听觉模型与认知计算
Recommended citation (GB/T 7714):
董林昊. 基于编解码框架的端到端语音识别技术研究[D]. 中国科学院自动化研究所. 中国科学院大学, 2020.
Files in this item: 董林昊_博士学位论文.pdf (5860 KB), doctoral dissertation, open access, license CC BY-NC-SA.