Research on Deep Learning Based Mandarin Speech Recognition With Fusing Multiple Features (融合多种特征的基于深度学习技术的汉语语音识别研究)


Title	融合多种特征的基于深度学习技术的汉语语音识别研究
Alternative title	Research on Deep Learning Based Mandarin Speech Recognition With Fusing Multiple Features
Author	陈明明 (Chen Mingming)
Date	2015-05-30
Degree type	Doctor of Engineering
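Chinese abstract	This thesis focuses on applying deep learning to large-vocabulary continuous Mandarin speech recognition. Owing to its hierarchical feature learning and data modeling capabilities, deep learning has outperformed earlier shallow learning techniques on many tasks, and it has been applied widely and successfully in speech processing, including speech recognition. Chinese, the language with the most speakers in the world, has the following characteristics: 1. Chinese is a tonal language. The same phoneme pronounced with different tones can express different meanings, so tone information helps distinguish the meanings of words. Accurate modeling of tone information can therefore improve Mandarin speech recognition systems. 2. Because Chinese has so many speakers spread over such a wide area, many regional dialects have formed, and speakers of these dialects often speak Putonghua (Mandarin) with an accent. Accent is a key factor affecting Mandarin speech recognition systems and often degrades their performance; better modeling of accent information can improve the recognition of accented Mandarin. Compared with traditional shallow models, the models used in deep learning can be viewed both as deep statistical models and as feature learners. How best to combine these two properties in a Mandarin speech recognition system to improve its performance is an important question. The thesis makes three main contributions: 1. A deep learning based Mandarin tone recognition model is proposed. Compared with shallow models, deep models fuse different types of input features better and have stronger feature learning ability, and therefore achieve better tone recognition performance. Building on this, spectral features and fundamental frequency (F0) features are used together as the input of the deep neural network acoustic model, which improves the Mandarin speech recognition system. 2. For the accent classification task, a deep learning based accent classification model is proposed. Unlike the traditional Gaussian mixture model, a deep neural network is a discriminative model, and for this task it learns increasingly discriminative features layer by layer, which raises accent classification accuracy. In addition, contextual information is found to help improve accent identification accuracy. 3. For accented Mandarin speech recognition, an algorithm that combines I-vectors feature fusion with model adaptation is proposed to improve recognition performance. Feature fusion means jointly using acoustic spectral features and speaker features that carry accent information, so that the accent information in the input is represented explicitly; model adaptation means adapting the acoustic model to a specific accent using training data of that accent. Because a deep neural network combines feature learning and statistical modeling, the thesis proposes a method that joins feature fusion and model adaptation: spectral features and speaker-related features are fused at the input layer of the deep neural network, and model adaptation to different accents is performed at the output layer. This combines the two methods simply and effectively and significantly improves accented Mandarin speech recognition.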
English abstract	In this thesis, we focus on the application of deep learning techniques in the domain of speech processing. Deep learning techniques, which have recently been revived, have become the state-of-the-art methods for acoustic modelling in speech recognition. Compared with shallow models, deep models such as deep neural networks and deep convolutional neural networks extract high-level robust features through multiple hidden layers; they can be seen as more powerful feature extractors and can fuse different kinds of features at the input layer. Chinese is spoken by more people than any other language in the world, and it has two characteristics relevant to speech recognition: 1. Chinese is a tonal language. In a tonal language, pitch information is used to discriminate the meanings of different words: a word spoken with different tones corresponds to different meanings. Therefore, tone-related information can play an important role in Chinese speech recognition systems. 2. Chinese has many dialects. China is a vast country with a long history, and as the language developed, people from different regions gradually came to speak different dialects. Mandarin, or Putonghua, is the official language of modern China, so most Chinese people can speak it. However, Mandarin spoken in different regions is affected by the local dialects, resulting in different kinds of native-accented Mandarin. Accent is one of the key factors that degrade the performance of speech recognition systems. Therefore, accurate accent identification and accent adaptation methods can improve the performance of Mandarin speech recognition systems. The main contributions and novelties of this thesis are as follows: 1. We propose a tone modeling method based on deep neural networks. Our focus is on the capacity of deep models to extract high-level robust features and to fuse different kinds of serially concatenated features.
Furthermore, maxout networks have been proposed to integrate dropout naturally and achieve state-of-the-art results. Therefore, we investigate the advantage of deep maxout networks (DMNs) when the training data is limited and imbalanced. Our experiments on the ASCCD corpus show that, compared with shallow models such as a one-hidden-layer multilayer perceptron (MLP) and a support vector machine (SVM), deep models improve Mandarin tone recognition significantly. Among the deep models, DMNs achieve better performance compared with other deep neural net...
Keywords	Mandarin Speech Recognition; Accented Speech Recognition; Deep Learning; Tone Recognition; Tone Classification; Accent Identification; I-vectors; Deep Neural Networks
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6738
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	陈明明. 融合多种特征的基于深度学习技术的汉语语音识别研究[D]. 中国科学院自动化研究所. 中国科学院大学,2015.

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20111801462802 (2467KB)			Not yet open	CC BY-NC-SA
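
Illustrative sketch

The third contribution in the abstract fuses spectral and speaker-related features at the input layer and adapts the model to each accent at the output layer. The sketch below shows one way such a network could be wired; it is not the thesis implementation. The class name `AccentAdaptedDNN`, the ReLU activation, the random untrained weights, the accent labels, and every dimension (40 spectral values, 10 i-vector values, 16 hidden units, 5 output states) are assumptions made only for illustration.

```python
import math
import random


def fuse(spectral_frame, ivector):
    """Input-layer fusion: concatenate a spectral frame with the i-vector."""
    return list(spectral_frame) + list(ivector)


def make_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, illustrative only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(layer, x):
    """Compute W x + b for one fully connected layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


class AccentAdaptedDNN:
    """Hypothetical network: shared hidden layers, one output layer per accent."""

    def __init__(self, n_spec, n_ivec, n_hidden, n_states, accents, seed=0):
        rng = random.Random(seed)
        # Hidden layers are shared by all accents and see the fused input.
        self.hidden = [
            make_layer(n_spec + n_ivec, n_hidden, rng),
            make_layer(n_hidden, n_hidden, rng),
        ]
        # Each accent gets its own output layer, which is where
        # accent-specific adaptation would take place.
        self.outputs = {a: make_layer(n_hidden, n_states, rng) for a in accents}

    def posteriors(self, spectral_frame, ivector, accent):
        h = fuse(spectral_frame, ivector)
        for layer in self.hidden:
            h = [max(0.0, v) for v in affine(layer, h)]  # ReLU is an assumption
        return softmax(affine(self.outputs[accent], h))


# Hypothetical sizes and accent labels, chosen only for the example.
model = AccentAdaptedDNN(n_spec=40, n_ivec=10, n_hidden=16, n_states=5,
                         accents=["accent_a", "accent_b"])
frame = [0.1] * 40   # stand-in for one frame of spectral features
ivec = [0.5] * 10    # stand-in for the speaker's i-vector
post = model.posteriors(frame, ivec, "accent_a")
```

In this arrangement the i-vector is repeated for every frame of an utterance, so the shared layers always see the accent cue, while only the small output layer chosen by `accent` would need retraining on data from that accent.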