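|Place of Conferral||Beijing|
|Keyword||acoustic modeling; recurrent neural network; multi-dimensional residual learning; word-level shuffling; improved low frame rate model; multi-scenario Chinese speech recognition|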
Voice-based human-computer interaction is becoming increasingly popular because of its convenience and high efficiency. As one of its most critical technologies, speech recognition has long attracted a large number of researchers. In recent years, speech recognition based on gated recurrent neural networks (RNNs) has gradually become mainstream owing to its excellent performance. However, the actual performance of different types of gated RNNs varies. At the same time, the degradation problem of deep RNN-based acoustic models is more serious, because such models are deep in both time and space. Furthermore, excessive modeling of intra-word and inter-word dependencies greatly impairs the performance of RNN-based acoustic models. In addition, modeling each scenario independently seriously hampers the application and development of actual products.
This thesis focuses on RNN-based acoustic modeling for speech recognition. The main contributions are as follows:
1. We propose that the main function of the projection matrix in LSTMP is to recombine and select sparse information, and to enhance generalization through weight sharing. The differences between kinds of gated recurrent neural networks are studied and analyzed, especially how the GRU, the LSTM, and the projection layer in LSTMP process historical information. In addition, we elaborate on the implicit assumptions of RNN-based acoustic modeling and analyze in detail the problems encountered in practical applications.
2. A training algorithm based on multi-dimensional residual learning is proposed to solve the degradation problem of deep RNNs in the spatial and temporal dimensions. Introducing identity mappings in the spatial dimension ensures the smooth flow of information. In the temporal dimension, the short-time stationarity of speech is exploited to relieve blocked information transmission by adjusting the time granularity. In addition, a row convolution is placed on the top layer to integrate the information from multiple parallel sequences and prepare for classification. On both phoneme recognition and large-vocabulary continuous speech recognition tasks, the approach achieves a relative performance improvement of more than 10%.
3. A word-level permutation algorithm and an improved low frame rate model are proposed to relieve the over-modeling problem in RNN-based acoustic modeling. Word-level shuffling largely overcomes the problem of modeling inter-word dependencies, so that the model does not depend excessively on the training data. It also substantially improves generalization: the model can be applied to new domains simply by combining it with the corresponding language model, without retraining or adapting the acoustic model. The improved low frame rate model makes full use of all training data, avoiding the loss of training data that occurs with low frame rate models. It increases the robustness of the model and reduces the computational cost and latency during decoding. Combining word-level shuffling with the improved low frame rate model achieves a relative CER reduction of 7% or more.
4. A CTC acoustic model based on context-independent syllables is proposed for multi-scenario Chinese speech recognition. This model overcomes a disadvantage of context-dependent modeling, which naturally learns scene information. Furthermore, longer syllable units can effectively model co-articulation and offer good generalization and robustness. In addition, a VGG network is used to extract low-level features from data with different sampling rates, and a layer normalization algorithm is introduced. When narrow-band telephone data and wide-band phone data are combined, the proposed method achieves performance improvements of 7% and 15% relative to the scenario-dependent modeling approach.
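The abstract does not spell out how word-level shuffling or the improved low frame rate model is implemented, so the following is only an illustrative sketch under stated assumptions. It assumes that a word-level alignment of the acoustic frames is already available, that shuffling means permuting whole word segments within an utterance, and that "making full use of all training data" can be read as generating one subsampled view per starting offset so that no frame is discarded. The function names `shuffle_words` and `low_frame_rate_views` are hypothetical and do not come from the thesis.

```python
import random


def shuffle_words(word_segments, seed=None):
    """Permute whole word segments of one utterance (hypothetical helper).

    word_segments: list of (word, frames) pairs, where frames is the list of
    feature vectors aligned to that word. The alignment is assumed to be given.
    Frames inside a word keep their original order; only the word order changes,
    which removes the inter-word context an RNN could otherwise memorize.
    """
    rng = random.Random(seed)
    order = list(range(len(word_segments)))
    rng.shuffle(order)
    words = [word_segments[i][0] for i in order]
    frames = [frame for i in order for frame in word_segments[i][1]]
    return words, frames


def low_frame_rate_views(frames, skip=3):
    """Return `skip` subsampled views of one utterance (assumed reading).

    A plain low frame rate model would keep only frames[0::skip] and drop the
    rest. Here every offset 0..skip-1 yields its own training view, so each
    original frame appears in exactly one view and none is lost.
    """
    return [frames[offset::skip] for offset in range(skip)]


if __name__ == "__main__":
    segments = [
        ("ni", [[0.1], [0.2]]),
        ("hao", [[0.3]]),
        ("ma", [[0.4], [0.5], [0.6]]),
    ]
    words, frames = shuffle_words(segments, seed=0)
    print(words, len(frames))
    for view in low_frame_rate_views(frames, skip=3):
        print(view)
```

At decoding time only a single view would be needed, which is where the reduction in computation and latency mentioned above would come from under this reading.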
|Affiliation||Institute of Automation, Chinese Academy of Sciences|
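|Zhao Yuanyuan. Research on Acoustic Modeling Methods Based on Recurrent Neural Networks [D]. Beijing: University of Chinese Academy of Sciences, 2018.|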
|Files in This Item:|
|Research on Acoustic Modeling Methods Based on Recurrent Neural Networks + Zhao Yuan (4227KB)||Dissertation||Not yet open access||CC BY-NC-SA||Application Full Text|