With the development of artificial intelligence (AI) technology, the demand for human-computer interaction has grown explosively. As an important gateway for human-computer interaction, automatic speech recognition (ASR) has gained wide attention from both academia and industry. Streaming speech recognition aims to enable a machine to recognize while listening, which significantly reduces the latency of human-computer interaction and improves the interaction experience. Streaming speech recognition has been successfully applied in scenarios such as voice input methods, phone assistants, smart speakers, and robotic customer service. Compared with traditional hybrid speech recognition methods, end-to-end models greatly improve recognition accuracy and simplify the construction of ASR systems, but many problems remain to be solved in streaming scenarios.
Facing the core requirements of streaming speech recognition, this thesis builds on the representative streaming end-to-end transducer models and focuses on three specific problems: poor recognition performance caused by the loss of future acoustic context and insufficient sequence modeling capability; the inefficient frame-by-frame decoding strategy, which seriously limits the inference speed of the model; and the incompatibility between streaming and non-streaming speech recognition models. The thesis accomplishes the following four innovative works:
A self-attention transducer (SA-T) model and a path-aware regularization (PAR) optimization method are proposed. Because the streaming recurrent neural network transducer (RNN-T) cannot exploit future acoustic context to improve recognition and has insufficient long-range sequence modeling capability, its recognition accuracy is often poor. This thesis proposes a self-attention transducer model, which entirely replaces the recurrent neural networks with the self-attention mechanism, whose sequence modeling is stronger and more efficient. In addition, to reduce the training difficulty of the model and accelerate convergence, this thesis also proposes a model optimization method named path-aware regularization. Experiments show that the SA-T model achieves better recognition performance than the original RNN-T in both streaming and non-streaming scenarios, and that introducing path-aware regularization accelerates convergence and further improves recognition performance considerably.
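The frame-synchronous decoding that the later works accelerate can be illustrated with a minimal greedy search over a transducer. This is a hedged sketch, not the thesis implementation: the `joint` scorer, `step_predictor`, and the toy state representation are assumptions introduced purely for illustration.

```python
# Minimal sketch (hypothetical) of greedy, frame-by-frame transducer decoding.
# `joint(frame, state)` returns a score list over tokens (index 0 = blank);
# `step_predictor(state, token)` advances the prediction-network state.

BLANK = 0

def greedy_transducer_decode(enc_frames, joint, step_predictor, init_state):
    """Frame-synchronous greedy search: at each acoustic frame, emit tokens
    until the joint network outputs blank, then advance to the next frame."""
    hyp, state = [], init_state
    for frame in enc_frames:
        while True:
            scores = joint(frame, state)
            token = max(range(len(scores)), key=scores.__getitem__)
            if token == BLANK:
                break  # blank means: move on to the next acoustic frame
            hyp.append(token)
            state = step_predictor(state, token)
    return hyp
```

The inner loop makes the cost proportional to the number of frames plus emitted tokens, which is the per-frame inefficiency the later chapters target.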
A fixed-length sliding-window mechanism is proposed to accelerate the inference of transducer models. Transducer-based models decode frame by frame during inference, which is computationally inefficient and seriously limits inference speed. To address this problem, this thesis deeply integrates the self-attention transducer (SA-T) with the speech-transformer model and proposes a fixed-length sliding-window mechanism to accelerate transducer inference. The mechanism slices the acoustic representation sequence generated by the acoustic encoder into multiple equal-length, consecutive acoustic blocks, which the self-attention decoder decodes block by block. To enable the model to learn the alignment between text token sequences and the acoustic blocks, a forward-backward algorithm is introduced to optimize over all feasible alignment paths. Experiments show that the fixed-length sliding-window mechanism greatly improves the decoding efficiency of the transducer model and also has a positive effect on recognition performance.
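The slicing step described above can be sketched as follows. This is a minimal illustration under assumed conventions (the function name and the handling of a shorter final block are not taken from the thesis):

```python
# Hypothetical sketch: slice an acoustic encoder's output sequence into
# fixed-length, consecutive blocks for block-wise decoding. The final block
# may be shorter when the sequence length is not a multiple of block_len.

def slice_into_blocks(enc_seq, block_len):
    """Split a sequence of encoded frames into consecutive blocks of
    length block_len (last block possibly shorter)."""
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    return [enc_seq[i:i + block_len] for i in range(0, len(enc_seq), block_len)]
```

The decoder can then attend within one block at a time instead of stepping through individual frames, which is where the efficiency gain comes from.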
A fast-skip mechanism is proposed to accelerate the inference of transducer models. The frame-by-frame inference strategy of transducer-based models is very inefficient and seriously limits inference speed. This thesis proposes a fast-skip mechanism that improves decoding efficiency from another perspective. The fast-skip mechanism enables the SA-T model to learn the positional information of the tokens predicted by a CTC model. During inference, the model first predicts the positions of blank tokens with the CTC decoder; the SA-T decoder then skips the non-key frames (those with predicted blank tokens) and decodes only the key frames (those with predicted non-blank tokens). Experiments show that the proposed fast-skip mechanism achieves a nearly 3.5-times decoding speedup with minimal performance degradation, greatly improving inference efficiency.
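The frame-selection idea behind fast-skip can be sketched in a few lines. This is a hedged illustration, not the thesis code: `ctc_argmax` stands in for a per-frame top-label sequence from a CTC decoder, with 0 denoting blank.

```python
# Hypothetical sketch of the fast-skip selection step: use frame-level CTC
# predictions to mark blank frames as non-key, so the transducer decoder
# only visits key (non-blank) frames.

BLANK = 0

def select_key_frames(enc_frames, ctc_argmax):
    """Keep only the encoder frames whose CTC top prediction is non-blank."""
    return [f for f, lab in zip(enc_frames, ctc_argmax) if lab != BLANK]
```

Since blank frames typically dominate a CTC alignment, discarding them shrinks the number of decoding steps substantially, consistent with the reported speedup.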
A one-model, dual-mode method compatible with both streaming and non-streaming speech recognition is proposed. Although streaming speech recognition has improved in accuracy and decoding efficiency, a gap remains compared with non-streaming models, so it cannot replace existing methods in non-streaming scenarios. Non-streaming methods represented by attention-based encoder-decoder (AED) models rely on global acoustic information and therefore cannot be directly adapted to streaming recognition. This incompatibility between streaming and non-streaming models wastes considerable human and computing resources. To solve it, this thesis proposes a hybrid streaming and non-streaming model that deeply fuses a streaming CTC model with a non-streaming AED model, realizing a single model with both streaming and non-streaming decoding modes through joint training and reconfigurable decoding strategies. Experiments show that the proposed model is compatible with both streaming and non-streaming recognition tasks and improves both recognition accuracy and decoding efficiency.
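One common way a single encoder can serve both modes is by switching attention masks: full context for non-streaming, chunk-limited context for streaming. The sketch below is an assumption-laden illustration of that general technique, not the thesis's actual configuration; `mask[i][j]` being `True` means frame `i` may attend to frame `j`.

```python
# Hypothetical sketch of dual-mode attention masking: the same encoder runs in
# non-streaming mode (full mask) or streaming mode (each frame sees only up to
# the end of its own chunk, bounding latency).

def attention_mask(num_frames, chunk_size=None):
    """Full mask when chunk_size is None (non-streaming); otherwise frame i
    attends only to frames j within or before its chunk (streaming)."""
    if chunk_size is None:
        return [[True] * num_frames for _ in range(num_frames)]
    return [[j < (i // chunk_size + 1) * chunk_size for j in range(num_frames)]
            for i in range(num_frames)]
```

Because the mask is the only thing that changes between modes, the same trained weights can be reused for both streaming and non-streaming decoding.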
田正坤 (Tian Zhengkun). Research on Streaming End-to-End Speech Recognition Methods Based on the Self-Attention Mechanism [D]. Beijing, China: Institute of Automation, Chinese Academy of Sciences, 2022.
Similar articles in Google Scholar
Similar articles in Baidu Academic
Similar articles in Bing Scholar