Research on Key Technologies of Robust Speech Recognition


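Title	Research on Key Technologies of Robust Speech Recognition (语音识别的鲁棒性关键技术研究)
Alternative title	Research on Key Technologies of Robust Speech Recognition
Author	Tan Yingwei (谭应伟)
	2015-05-30
Degree type	Doctor of Engineering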
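Chinese abstract	With the spread of smartphones, wearable devices, smart homes, and in-vehicle devices, human-computer interaction based on intelligent speech is drawing growing attention from IT academia and industry and has become one of the hottest topics in the mobile Internet field. Besides foreign products such as Apple’s Siri, Google Now, and Microsoft Bing voice search, products such as iFlyVoice (讯飞语点), Baidu Voice Assistant, and Sogou Voice Assistant have appeared in China over the past one or two years. Because speech is the most natural way for humans to communicate, enabling machines to understand human speech has become a pressing need. Speech recognition is a key technology in intelligent speech products. For these products, speech recognition can usually meet practical requirements when the surrounding environment is relatively clean. When noise is present, however, its performance is unsatisfactory. In addition, speech contains varying tone, speaking rate, prosody, and genuine emotion, as well as fairly severe coarticulation, all of which cause many phoneme-level insertions, deletions, and substitutions. The robustness of speech recognition systems has therefore received wide attention from researchers. Building on a careful review of previous work, this dissertation investigates the robustness of speech recognition, analyzes in detail the feature extraction techniques related to speech recognition systems, proposes new robust feature extraction algorithms, and, from several perspectives, uses model combination to propose new system frameworks and models for specific tasks. The main work is as follows: (1) The noise robustness of voice activity detection (VAD) algorithms is studied. For feature extraction, a VAD algorithm based on the fusion of short-term and long-term spectral features is implemented. It combines the strong robustness of the short-term spectral peak feature with the ability of the long-term spectral divergence estimation feature to incorporate speech context. For classification modeling, a VAD algorithm based on the fusion of a support vector machine (SVM) and a hidden Markov model (HMM) is implemented. It exploits the strong discriminative power and nonlinearity of the SVM together with the HMM's ability to model contextual dependencies. For combining features and models, a VAD algorithm based on a two-layer discriminative weight training framework that fuses short-term and long-term harmonic peaks is implemented. It combines the strengths of short-term and long-term harmonic peaks and, within a single discriminative framework, jointly addresses the weighting of observations and of frequency bins. All three algorithms improve VAD performance under background noise. (2) A normalized energy feature extraction algorithm based on speech segmentation is implemented. The algorithm divides speech into voiced, unvoiced, and silent segments and processes each type differently to improve recognition accuracy. Conventional normalized energy feature extraction assumes stationary noise. With non-stationary noise that assumption does not hold, and the advantages of normalized energy extraction cannot be fully realized. This study therefore proposes applying a weighted harmonic-plus-noise model to compensate for this shortcoming. The algorithm also uses VAD to exclude silence, which makes it possible to analyze how VAD affects recognition performance under background noise. In general, better VAD performance leads to better recognition results. (3) A modeling algorithm oriented toward articulatory knowledge, based on the fusion of a deep neural network (DNN) and an HMM, is implemented. It exploits the feature learning capability of the DNN together with the HMM's ability to model contextual dependencies. The resulting articulatory model can, in phoneme lattice rescoring ...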
English abstract	With the spread of smartphones, wearable devices, smart homes, and in-vehicle equipment, human-computer interaction technologies based on intelligent speech have attracted growing attention from IT academia and industry and have become one of the focal points of the mobile Internet field. Abroad, there are Apple’s Siri, Google Now, Microsoft’s Bing Voice Search, and others. In China, iFlyVoice, Baidu Voice Assistant, Sogou Voice Assistant, and similar products have appeared over the past one or two years. Because speech is the most natural mode of human communication, enabling machines to understand human speech has become an urgent need. Speech recognition is a key technology of intelligent speech products. For these products, when the surrounding environment is relatively clean, speech recognition can often meet the standard required for practical applications. However, when noise interference is present, recognition results are unsatisfactory. In addition, speech contains variable tone, tempo, rhythm, and genuine emotion, as well as severe coarticulation, which leads to a large number of phoneme-level insertions, deletions, and substitutions. Hence, the robustness of speech recognition systems has attracted wide attention from researchers. Building on a summary of previous research findings, this dissertation addresses the robustness of speech recognition technology. We analyze and compare in detail various feature extraction algorithms related to speech recognition systems, present new robust feature extraction algorithms, and propose several new frameworks for combining different models. The main research work focuses on the following four aspects: (1) Noise robustness of voice activity detection (VAD) algorithms. For feature extraction, we present a VAD algorithm based on the combination of short-term and long-term spectral patterns. The algorithm not only combines the advantages of the feature based on short-term spectral peaks but also exploits the strengths of long-term spectral divergence estimation, which can incorporate speech context information. For classification modeling, we present a VAD algorithm based on a hybrid architecture of a support vector machine (SVM) and a hidden Markov model (HMM). The algorithm retains the discriminative and nonlinear properties of the SVM and models the inter-frame correlation powerfu...
Keywords	Voice Activity Detection; Speech Recognition; Robustness; Feature Extraction; Model Combination
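Language	Chinese
Document type	Thesis
Identifier	http://ir.ia.ac.cn/handle/173211/6737
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Tan Yingwei. Research on Key Technologies of Robust Speech Recognition[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2015.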

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20111801462908 (3437KB)			Not currently open	CC BY-NC-SA