With the development of artificial intelligence (AI) technology, the demand for human-computer interaction has grown explosively. As an important gateway for human-computer interaction, automatic speech recognition (ASR) has gained wide attention from both academia and industry. Streaming speech recognition aims to enable a machine to recognize while listening, which significantly reduces the latency of human-computer interaction and improves the interaction experience. Streaming speech recognition has been successfully applied in scenarios such as voice input methods, phone assistants, smart speakers, and robotic customer service. Compared with traditional hybrid speech recognition methods, end-to-end models greatly improve recognition accuracy and simplify the construction of ASR systems, but many problems remain to be solved in streaming scenarios.
Facing the core requirements of streaming speech recognition, this thesis builds on the representative streaming end-to-end transducer models and focuses on three specific problems: poor recognition performance caused by the loss of future acoustic context and insufficient sequence modeling capability; the inefficient frame-by-frame decoding strategy, which seriously limits the inference speed of the model; and the incompatibility between streaming and non-streaming speech recognition models. The thesis accomplishes the following four innovative works:
A self-attention transducer (SA-T) model and a path-aware regularization (PAR) optimization method are proposed. Because the streaming recurrent neural network transducer (RNN-T) cannot exploit future acoustic context to improve recognition and has insufficient long-range sequence modeling capability, its recognition accuracy is often poor. This thesis proposes a self-attention transducer model, which entirely replaces the recurrent neural networks with the self-attention mechanism, whose sequence modeling is stronger and more efficient. In addition, to reduce the training difficulty of the model and accelerate convergence, this thesis also proposes a model optimization method named path-aware regularization. Experiments show that the SA-T model achieves better recognition performance than the original RNN-T in both streaming and non-streaming scenarios, and that introducing path-aware regularization accelerates convergence and further improves recognition performance considerably.
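The frame-synchronous decoding that the later works accelerate can be illustrated with a minimal greedy search over a transducer. This is a hedged sketch, not the thesis implementation: the `joint` scorer, `step_predictor`, and the toy state representation are assumptions introduced purely for illustration.

```python
# Minimal sketch (hypothetical) of greedy, frame-by-frame transducer decoding.
# `joint(frame, state)` returns a score list over tokens (index 0 = blank);
# `step_predictor(state, token)` advances the prediction-network state.

BLANK = 0

def greedy_transducer_decode(enc_frames, joint, step_predictor, init_state):
    """Frame-synchronous greedy search: at each acoustic frame, emit tokens
    until the joint network outputs blank, then advance to the next frame."""
    hyp, state = [], init_state
    for frame in enc_frames:
        while True:
            scores = joint(frame, state)
            token = max(range(len(scores)), key=scores.__getitem__)
            if token == BLANK:
                break  # blank means: move on to the next acoustic frame
            hyp.append(token)
            state = step_predictor(state, token)
    return hyp
```

The inner loop makes the cost proportional to the number of frames plus emitted tokens, which is the per-frame inefficiency the later chapters target.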
A fixed-length sliding-window mechanism is proposed to accelerate the inference of transducer models. Transducer-based models decode frame by frame during inference, which is computationally inefficient and seriously limits inference speed. To address this problem, this thesis deeply integrates the self-attention transducer (SA-T) with the speech-transformer model and proposes a fixed-length sliding-window mechanism to accelerate transducer inference. The mechanism slices the acoustic representation sequence generated by the acoustic encoder into multiple equal-length, consecutive acoustic blocks, which the self-attention decoder decodes block by block. To enable the model to learn the alignment between text token sequences and the acoustic blocks, a forward-backward algorithm is introduced to optimize over all feasible alignment paths. Experiments show that the fixed-length sliding-window mechanism greatly improves the decoding efficiency of the transducer model and also has a positive effect on recognition performance.
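The slicing step described above can be sketched as follows. This is a minimal illustration under assumed conventions (the function name and the handling of a shorter final block are not taken from the thesis):

```python
# Hypothetical sketch: slice an acoustic encoder's output sequence into
# fixed-length, consecutive blocks for block-wise decoding. The final block
# may be shorter when the sequence length is not a multiple of block_len.

def slice_into_blocks(enc_seq, block_len):
    """Split a sequence of encoded frames into consecutive blocks of
    length block_len (last block possibly shorter)."""
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    return [enc_seq[i:i + block_len] for i in range(0, len(enc_seq), block_len)]
```

The decoder can then attend within one block at a time instead of stepping through individual frames, which is where the efficiency gain comes from.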
A fast-skip mechanism is proposed to accelerate the inference of transducer models. The frame-by-frame inference strategy of transducer-based models is very inefficient and seriously limits inference speed. This thesis proposes a fast-skip mechanism that improves decoding efficiency from another perspective. The fast-skip mechanism enables the SA-T model to learn the positional information of the tokens predicted by a CTC model. During inference, the model first predicts the positions of blank tokens with the CTC decoder; the SA-T decoder then skips the non-key frames (those with predicted blank tokens) and decodes only the key frames (those with predicted non-blank tokens). Experiments show that the proposed fast-skip mechanism achieves a nearly 3.5-times decoding speedup with minimal performance degradation, greatly improving inference efficiency.
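The frame-selection idea behind fast-skip can be sketched in a few lines. This is a hedged illustration, not the thesis code: `ctc_argmax` stands in for a per-frame top-label sequence from a CTC decoder, with 0 denoting blank.

```python
# Hypothetical sketch of the fast-skip selection step: use frame-level CTC
# predictions to mark blank frames as non-key, so the transducer decoder
# only visits key (non-blank) frames.

BLANK = 0

def select_key_frames(enc_frames, ctc_argmax):
    """Keep only the encoder frames whose CTC top prediction is non-blank."""
    return [f for f, lab in zip(enc_frames, ctc_argmax) if lab != BLANK]
```

Since blank frames typically dominate a CTC alignment, discarding them shrinks the number of decoding steps substantially, consistent with the reported speedup.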
A one-model, dual-mode method compatible with both streaming and non-streaming speech recognition is proposed. Although streaming speech recognition has improved in accuracy and decoding efficiency, a gap remains compared with non-streaming models, so it cannot replace existing methods in non-streaming scenarios. Non-streaming methods represented by attention-based encoder-decoder (AED) models rely on global acoustic information and therefore cannot be directly adapted to streaming recognition. This incompatibility between streaming and non-streaming models wastes considerable human and computing resources. To solve it, this thesis proposes a hybrid streaming and non-streaming model that deeply fuses a streaming CTC model with a non-streaming AED model, realizing a single model with both streaming and non-streaming decoding modes through joint training and reconfigurable decoding strategies. Experiments show that the proposed model is compatible with both streaming and non-streaming recognition tasks and improves both recognition accuracy and decoding efficiency.
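One common way a single encoder can serve both modes is by switching attention masks: full context for non-streaming, chunk-limited context for streaming. The sketch below is an assumption-laden illustration of that general technique, not the thesis's actual configuration; `mask[i][j]` being `True` means frame `i` may attend to frame `j`.

```python
# Hypothetical sketch of dual-mode attention masking: the same encoder runs in
# non-streaming mode (full mask) or streaming mode (each frame sees only up to
# the end of its own chunk, bounding latency).

def attention_mask(num_frames, chunk_size=None):
    """Full mask when chunk_size is None (non-streaming); otherwise frame i
    attends only to frames j within or before its chunk (streaming)."""
    if chunk_size is None:
        return [[True] * num_frames for _ in range(num_frames)]
    return [[j < (i // chunk_size + 1) * chunk_size for j in range(num_frames)]
            for i in range(num_frames)]
```

Because the mask is the only thing that changes between modes, the same trained weights can be reused for both streaming and non-streaming decoding.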
田正坤 (Tian Zhengkun). Research on Streaming End-to-End Speech Recognition Methods Based on the Self-Attention Mechanism [D]. Beijing, China: Institute of Automation, Chinese Academy of Sciences, 2022.
Similar articles in Google Scholar
Similar articles in Baidu Academic
Similar articles in Bing Scholar