Research on Acoustic Modeling Methods Based on Recurrent Neural Networks

	Research on Acoustic Modeling Methods Based on Recurrent Neural Networks
	Zhao Yuanyuan (赵媛媛)
	2018
Degree type	Doctor of Engineering
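Chinese abstract (translated)	Speech-based human-computer interaction is increasingly favored for its convenience and efficiency. As one of its key technologies, speech recognition has long attracted a large number of researchers. In recent years, speech recognition based on gated recurrent neural networks (RNNs) has gradually become mainstream owing to its excellent performance. However, different types of gated RNNs perform differently in practice, and problems such as the multi-dimensional degradation of deep RNNs and their over-modeling of inter-word and intra-word dependencies greatly harm model performance. In addition, modeling each scenario independently severely restricts the application and development of practical products. This thesis focuses on the application of RNNs to acoustic modeling for speech recognition. The main contributions are as follows: 1. It is proposed that the main role of the projection matrix in LSTMP is to recombine and select sparse information, while improving generalization through sharing. The differences among types of gated RNNs are studied and analyzed, with emphasis on how the projection layer, GRU, and LSTM handle historical information. The implicit assumptions made when applying RNNs to speech recognition are also pointed out, and the problems these assumptions encounter in practice are analyzed. 2. A training algorithm based on multi-dimensional residual learning is proposed to address the degradation of deep RNNs in the spatial and temporal dimensions. In the spatial dimension, identity mappings are introduced to keep information flowing smoothly. In the temporal dimension, the short-time stationarity of speech is exploited, and the time granularity is adjusted to resolve poor information flow. Row convolution is placed at the top layer to integrate information from multiple parallel sequences. Relative performance improvements of more than 10% are obtained on both phoneme recognition and large vocabulary continuous speech recognition. 3. A word shuffling algorithm and an improved low frame rate model are proposed to address the over-modeling problem of RNNs. Word shuffling largely overcomes the acoustic model's modeling of inter-word dependencies, so that the model does not depend excessively on the training data and generalizes much better; combined with a suitable language model, it can be applied to new domains. The improved low frame rate model makes full use of all training data, avoids the data loss of low frame rate models, increases robustness, and reduces the computational cost and latency of decoding. A relative error rate reduction of more than 7% is obtained on the HKUST dataset. 4. A CTC-based multi-scenario Chinese speech recognition method using context-independent syllables is proposed. It overcomes the drawback that context-dependent modeling inherently learns scenario information, and enables mixed modeling of data from different scenarios. Syllables, being longer units, can effectively model co-articulation while offering good generalization and robustness. For fusing data with different sampling rates, VGG-based low-level feature extraction is further proposed and layer normalization is introduced. On narrowband telephone data and wideband mobile phone data, relative improvements of 7% and 15%, respectively, are obtained over modeling each scenario separately, achieving the goal of a single model serving multiple scenarios.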
English abstract	Voice-based human-computer interaction is becoming more and more popular because of its convenience and high efficiency. As one of the most critical technologies, speech recognition has attracted a large number of researchers for a long time. In recent years, speech recognition based on gated recurrent neural networks has gradually become mainstream due to its excellent performance. However, the actual performance of different types of gated recurrent neural networks varies. At the same time, the degradation problem of acoustic models based on deep recurrent neural networks (RNNs) is more serious, since such models are deep in both time and space. Further, excessive modeling of intra-word and inter-word dependencies greatly impairs the performance of RNN-based acoustic models. In addition, modeling each scenario independently seriously hampers the application and development of actual products. This thesis focuses on RNN-based acoustic modeling for speech recognition, and the main contributions are as follows: 1. We propose that the main function of the projection matrix in LSTMP is to recombine and select sparse information, and to enhance generalization through weight sharing. The differences between kinds of gated recurrent neural networks are studied and analyzed, especially the processing of historical information by the GRU, the LSTM, and the projection layer in LSTMP. In addition, we set out the implicit assumptions of RNN-based acoustic modeling and analyze in detail the problems they encounter in practical applications. 2. A training algorithm based on multi-dimensional residual learning is proposed to solve the degradation problem of deep RNNs in the spatial and temporal dimensions. Introducing identity mappings in the spatial dimension ensures the smooth flow of information. In the temporal dimension, the short-time stationarity of speech is used to resolve blocked information flow by adjusting the time granularity. In addition, row convolution is placed at the top layer to integrate the information from multiple parallel sequences and prepare it for classification. On both phoneme recognition and large vocabulary continuous speech recognition, the method achieves a relative performance improvement of more than 10%. 3. A word-level shuffling algorithm and an improved low frame rate model are proposed to relieve the over-modeling problem in RNN-based acoustic modeling. Word-level shuffling largely overcomes the problem of modeling inter-word dependencies, so that the model does not depend excessively on the training data. Generalization is thereby greatly improved: combined with a corresponding language model, the acoustic model can be applied to new domains without retraining or adjustment. The improved low frame rate model makes full use of all training data, avoiding the loss of training data that occurs with low frame rate models. It increases the robustness of the model and reduces the computational cost and latency of decoding. Combining word-level shuffling with the improved low frame rate model achieves a relative CER reduction of 7% or more. 4. A CTC acoustic model based on context-independent syllables is proposed for multi-scenario Chinese speech recognition. This model overcomes the disadvantage that context-dependent modeling inherently learns scenario information. Furthermore, syllables, being longer units, can effectively model co-articulation and offer good generalization and robustness. In addition, a VGG network is used to extract low-level features from data with different sampling rates, and layer normalization is introduced. When narrowband telephone data and wideband mobile phone data are combined, the proposed method achieves relative performance improvements of 7% and 15%, respectively, over modeling each scenario separately.
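Keywords	acoustic modeling; recurrent neural networks; multi-dimensional residual learning; word shuffling; improved low frame rate model; multi-scenario Chinese speech recognition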
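Language	Chinese
Document type	Thesis
Identifier	http://ir.ia.ac.cn/handle/173211/21190
Collection	Graduates_Doctoral Dissertations
Author affiliation	Institute of Automation, Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	Zhao Yuanyuan. Research on Acoustic Modeling Methods Based on Recurrent Neural Networks [D]. Beijing: University of Chinese Academy of Sciences, 2018.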

Files in this item
File name/size	Document type	Version type	Access type	License
基于循环神经网络的声学建模方法研究+赵媛 (4227 KB)	Thesis		Restricted access	CC BY-NC-SA