Title: 基于WFST的语音识别解码优化研究 (Research on Decoding Optimization for WFST-Based Speech Recognition)
Author: 倪浩
Degree type: Master of Engineering
Supervisor: 陶建华
Date: 2017-05
Degree-granting institution: Graduate School of the Chinese Academy of Sciences
Degree conferred in: Beijing
Keywords: speech recognition; WFST; decoding; RNN; adaptation
Abstract
In recent years, speech recognition technology has advanced rapidly. In 2011, the introduction of deep learning opened a new chapter for the field. In the years that followed, modeling techniques such as convolutional neural networks (CNN), long short-term memory networks (LSTM), and combined CNN-LSTM architectures appeared one after another in industrial speech recognition products and kept improving their quality. Although speech recognition is now in widespread use, its performance is still not fully satisfactory. As the core of a speech recognition system, the decoder directly determines both the applicability of the system and the user experience, and improving decoder performance and speed has long been an important research topic in the field. Rapid progress in computer hardware and software has made it practical to deploy static-network decoders based on weighted finite state transducers (WFST) in real systems. However, current WFSTs still have notable problems. The size of a WFST grows almost linearly with the amount of text used to build the language model; since a WFST is a directed graph, loading a huge network into memory consumes enormous memory resources, and different loading schemes differ markedly in memory footprint. Moreover, the speed of current WFST decoders can be further improved by exploiting the characteristics of the acoustic model, and decoding accuracy can be further improved by incorporating the currently popular recurrent neural network (RNN) language models. The goals of this thesis are to reduce the size of the WFST, to optimize its in-memory storage structure so that it can be used in practical systems, and to increase the decoder's speed and accuracy. The main contents and contributions are as follows:
To address the excessive memory footprint of the decoding network, the size of the WFST is reduced without a noticeable loss of accuracy, and its in-memory footprint is reduced as well, using three methods. 1. Prune the N-gram language model with a relative-entropy-based method, so that the relative entropy between the models before and after pruning is as small as possible and the pruned model approximates the original as closely as possible. Different pruning strengths yield different WFST sizes: if pruning is too aggressive, the decoding network becomes very small but accuracy drops; if pruning is too light, the network size barely changes. Pruning therefore requires finding a balance point that reduces the network size as much as possible without noticeably hurting accuracy. 2. Optimize the structure of the WFST. The WFST built for a connectionist temporal classification (CTC) acoustic model is composed of three parts: the language model, the pronunciation lexicon, and the phoneme topology, each of which can be expressed as a WFST. The language model part is already shrunk significantly by the pruning of method 1, and the lexicon is fixed, leaving little room for optimization there; by redesigning the phoneme WFST and removing its redundant parts, the final decoding network shrinks by 30%-40%, and because the transformation is equivalence-preserving, decoder accuracy is unaffected. 3. Change the in-memory storage structure of the WFST. Networks are commonly loaded into memory as linked-list structures; converting the linked lists into contiguous arrays reduces memory usage by more than 50%.
To address slow decoding, two acceleration methods are proposed: frame skipping and pruning. Both exploit the CTC loss function of the acoustic model: when computing posterior probabilities, label "spikes" appear only at a single frame or a few consecutive frames, while all other frames are assigned to an additionally introduced blank label, and about 80% of speech frames are labeled blank by a CTC acoustic model. The frame-skipping strategy exploits this as follows: blank frames participate only in the acoustic posterior computation, not in decoding, so only about 20% of the frames enter the search, which speeds up decoding significantly. The pruning strategy instead lets every frame participate in decoding, but on blank frames only a small number of tokens are admitted into the decoder while the rest are pruned away; other frames are handled unchanged, so the overall search space shrinks greatly. Experiments show that both methods accelerate decoding markedly without noticeably reducing accuracy, and the first method even improves accuracy slightly.
To address decoding accuracy, the N-gram model is combined with an RNN language model in a second decoding pass to improve recognition accuracy. An N-gram model describes long-distance dependencies poorly and suffers from data sparsity, while an RNN language model alleviates both problems, so combining the N-gram and the RNN for second-pass decoding is a natural choice. Because RNN training is slow, especially with large vocabularies and large text corpora, where it becomes a clear bottleneck, this thesis explores several RNN optimization and acceleration techniques that substantially speed up training. Furthermore, since an RNN is computationally slower than an N-gram model and cannot be applied directly in first-pass decoding, the N-gram model is first used in a first pass to narrow the search space, and the RNN is then combined with the N-gram model in a second pass. Experiments show that second-pass decoding improves accuracy. In addition, since an RNN trained on general data can be adapted to a specific domain, the thesis also explores RNN adaptation to further improve recognition performance.
Other Abstract (English)
In recent years, speech recognition technology has developed rapidly. In 2011, the introduction of deep learning opened a new chapter for speech recognition. In the following years, modeling techniques such as convolutional neural networks (CNN), long short-term memory networks (LSTM), and combined CNN-LSTM architectures kept emerging in industrial speech recognition products and continued to improve their quality. Although speech recognition has become popular, its performance is still not satisfactory. As the core component of a speech recognition system, the decoder directly determines both the system's applicability and the user experience. How to improve the performance and speed of the decoder has long been a major challenge in this field. With the development of computer hardware and software, it has become practical to apply weighted finite state transducer (WFST) based static-network decoders in real systems. However, WFSTs still have many problems: the size of a WFST grows almost linearly with the amount of text used to build the language model, and since a WFST is a directed graph, loading a huge network consumes a large amount of memory, with different loading schemes differing significantly in memory usage. On the other hand, the speed of current decoders can be further improved by exploiting the characteristics of the acoustic model, and decoder accuracy can be further improved by incorporating recurrent neural network (RNN) language models. The goals of this paper are to reduce the size of the WFST, to optimize its storage structure for practical use, and to increase the decoder's speed and accuracy. The main contents and innovations of the paper are as follows:
To address the excessive memory footprint of the decoding network, the size of the WFST is reduced without significantly reducing accuracy, and the memory used when the decoding network is loaded is reduced as well. Three methods are used to reduce the WFST size. 1. The N-gram language model is pruned with a relative-entropy-based method, whose goal is to keep the relative entropy between the models before and after pruning as small as possible. Different pruning strengths yield different WFST sizes: if pruning is too aggressive, the decoding network becomes very small but accuracy drops; if it is too light, the network size barely changes, so a balance point must be found. 2. The structure of the WFST is optimized. The WFST built for a CTC acoustic model is composed of three parts: the language model, the pronunciation lexicon, and the phoneme topology, each of which can be expressed as a WFST. The language model part is already shrunk significantly by the pruning of method 1, and the lexicon is fixed, leaving little room for optimization; by redesigning the phoneme WFST and removing its redundant parts, the final decoding network is reduced by 30%-40% with no effect on accuracy, since the transformation is equivalence-preserving. 3. The in-memory storage structure of the WFST is changed. The network is commonly loaded as a linked-list structure; converting the linked lists into a contiguous layout reduces memory usage by more than 50%.
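The storage-layout change in method 3 can be illustrated with a small sketch: the same set of arcs is stored first as per-state linked lists of arc objects and then flattened into contiguous parallel arrays indexed by a per-state offset table. This is an illustrative analogue of the idea (similar in spirit to OpenFst's constant-layout FSTs), not the thesis's exact implementation; all names here are assumptions.

```python
# Toy illustration: flatten per-state linked lists of WFST arcs into
# contiguous parallel arrays plus an offset table (not the thesis's code).
from dataclasses import dataclass
from typing import Optional

@dataclass
class ArcNode:               # linked-list node: one arc plus a next pointer
    ilabel: int
    olabel: int
    weight: float
    nextstate: int
    next: Optional["ArcNode"] = None

def flatten(heads):
    """Convert per-state arc lists into an offset table + parallel arrays."""
    offsets, ilabels, olabels, weights, nexts = [0], [], [], [], []
    for head in heads:                    # heads[s] = first arc of state s
        node = head
        while node is not None:
            ilabels.append(node.ilabel)
            olabels.append(node.olabel)
            weights.append(node.weight)
            nexts.append(node.nextstate)
            node = node.next
        offsets.append(len(ilabels))      # arcs of state s occupy
    return offsets, ilabels, olabels, weights, nexts   # [offsets[s], offsets[s+1])

def arcs_of(state, offsets, ilabels, olabels, weights, nexts):
    """Iterate a state's arcs from the contiguous layout."""
    lo, hi = offsets[state], offsets[state + 1]
    return list(zip(ilabels[lo:hi], olabels[lo:hi], weights[lo:hi], nexts[lo:hi]))

# Toy 2-state graph: state 0 has two arcs, state 1 has one.
s0 = ArcNode(1, 1, 0.5, 1, ArcNode(2, 2, 1.0, 1))
s1 = ArcNode(3, 3, 0.2, 0)
layout = flatten([s0, s1])
```

The contiguous layout removes per-node pointers and allocator overhead, which is where the memory savings of this kind come from.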
To speed up the decoder, two methods are proposed: frame skipping and pruning. Both exploit the characteristics of CTC: when computing posterior probabilities, label "spikes" appear only at a single frame or a few consecutive frames, while the other frames are assigned to an additionally introduced blank label, and about 80% of speech frames are marked as blank by a CTC acoustic model. The frame-skipping strategy exploits this as follows: blank frames do not participate in decoding and are used only in the acoustic posterior computation, so only about 20% of the speech frames enter the search, which significantly speeds up decoding. The other acceleration method, pruning, lets all frames participate in decoding, but on blank frames only a small number of tokens are admitted into the decoder while the others are pruned; the remaining speech frames are handled unchanged, so the entire search space is greatly reduced. Experiments show that both methods significantly accelerate decoding without significantly reducing accuracy, and the first method even improves accuracy slightly.
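The frame-skipping idea above can be sketched in a few lines: frames whose CTC posterior is dominated by the blank label are excluded from the search, so only the "spike" frames drive token expansion. The posteriors, threshold value, and blank index below are illustrative assumptions, not values from the thesis.

```python
# Minimal sketch of CTC blank-frame skipping (illustrative values only).
BLANK = 0               # assumed index of the blank label in each posterior row
BLANK_THRESHOLD = 0.95  # skip a frame when P(blank) exceeds this (assumed)

def frames_to_decode(posteriors, blank=BLANK, threshold=BLANK_THRESHOLD):
    """Return the indices of frames that should enter the decoder search."""
    return [t for t, frame in enumerate(posteriors)
            if frame[blank] < threshold]

# Toy posteriors over 5 frames and 3 symbols (blank, 'a', 'b'):
posteriors = [
    [0.98, 0.01, 0.01],    # blank-dominated -> skipped
    [0.10, 0.85, 0.05],    # spike on 'a'    -> decoded
    [0.97, 0.02, 0.01],    # skipped
    [0.99, 0.005, 0.005],  # skipped
    [0.20, 0.10, 0.70],    # spike on 'b'    -> decoded
]
active = frames_to_decode(posteriors)  # -> [1, 4]
```

In a real decoder the skipped frames would still be evaluated by the acoustic model (to detect the blank), but no tokens would be expanded on them, which is what shrinks the search.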
Aiming at the problem of decoding accuracy, the N-gram model is combined with an RNN language model in a second decoding pass to improve recognition accuracy. An N-gram model describes long-distance dependencies poorly and suffers from data sparsity, while an RNN language model alleviates both problems, so combining the N-gram and the RNN for second-pass decoding is a natural choice. However, RNN training is slow, especially when the vocabulary and text corpus are large, where it becomes an obvious bottleneck. This paper explores different RNN optimization and acceleration methods that greatly improve training speed. Furthermore, since an RNN is computationally slower than an N-gram model, it cannot be applied directly in first-pass decoding; therefore, the N-gram model is used in a first pass to reduce the search space, and the RNN is then combined with the N-gram model for second-pass decoding. Experiments show that second-pass decoding improves decoding accuracy. In addition, since an RNN trained on general data can be adapted to a specific domain, this paper also explores RNN adaptation to improve performance in that domain.
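The two-pass scheme above can be sketched as an N-best rescoring step: the first pass produces hypotheses with acoustic and N-gram LM scores, and the second pass re-ranks them with an interpolation of the N-gram and RNN LM probabilities. The hypotheses, scores, and interpolation weight below are illustrative assumptions, not results from the thesis.

```python
# Sketch of second-pass N-best rescoring with an interpolated
# N-gram + RNN language model (all scores are made-up log-probabilities).
import math

def rescore(nbest, rnn_lm_scores, lam=0.5):
    """Re-rank hypotheses by acoustic score + interpolated LM score.

    nbest:          list of (hypothesis, acoustic_logprob, ngram_logprob)
    rnn_lm_scores:  dict mapping hypothesis -> rnn_logprob
    lam:            interpolation weight on the RNN LM (0 -> pure N-gram)
    """
    ranked = []
    for hyp, ac, ng in nbest:
        # Interpolate the two LM probabilities in the probability domain.
        lm = math.log(lam * math.exp(rnn_lm_scores[hyp])
                      + (1.0 - lam) * math.exp(ng))
        ranked.append((ac + lm, hyp))
    ranked.sort(reverse=True)             # best (highest) total score first
    return [hyp for _, hyp in ranked]

nbest = [("speech can be recognized", -10.0, -6.0),
         ("speech can bee recognized", -9.5, -8.0)]
rnn = {"speech can be recognized": -4.0,
       "speech can bee recognized": -9.0}
best = rescore(nbest, rnn)[0]
```

Here the RNN LM strongly prefers the first hypothesis, so the interpolated second pass overturns the first pass's acoustic-score advantage of the second one; with `lam=0.0` the function reduces to pure N-gram rescoring.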
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/14712
Collection: Graduates: Master's theses
Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
倪浩. 基于WFST的语音识别解码优化研究[D]. 北京: 中国科学院研究生院, 2017.
Files in this item:
File name / size: 基于WFST的语音识别解码优化研究.pdf (3436 KB); document type: thesis; access: restricted; license: CC BY-NC-SA (full text available on request)