Research on Deep Learning Acoustic Modeling Methods Incorporating Articulatory Knowledge

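	Research on Deep Learning Acoustic Modeling Methods Incorporating Articulatory Knowledge
	郑昊 (Zheng Hao)
	2016-05-29
Degree type	Doctor of Engineering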
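Chinese abstract (translated)	Articulatory knowledge is an expert-designed description of speech characteristics from an acoustic perspective. It carries rich information about content, timbre, emotion, environment, and more, and has been shown to help improve the performance of speech recognition systems. Traditional research, however, has applied articulatory knowledge mainly through evidence mergers and lattice rescoring. With the rise of deep learning, the framework of speech recognition systems has changed fundamentally. This thesis focuses on how to model the acoustic model within a deep learning framework that integrates articulatory knowledge more effectively, so as to improve overall system performance. The main contributions and innovations are as follows:
1. A framework for extracting articulatory knowledge based on multi-task learning is proposed. Exploiting the structural flexibility of neural networks, the framework uses articulatory attribute classification as a secondary task to improve the primary task of recognizing the acoustic modeling units. Because a phoneme can belong to several articulatory attributes at once, a block-softmax output structure is adopted; this also keeps the gradients of the primary and secondary tasks at the same order of magnitude, which eases parameter tuning. On top of this framework, a merger is used for further classification. The experiments cover different data sets, tasks, and training parameters, as well as different amounts of training data and mismatched training data. The results show that multi-task learning acts as a regularizer that suppresses over-fitting, and that the secondary task supplies information that makes the hidden-layer nodes more discriminative, which helps the model converge. The framework is especially effective when training data is insufficient or mismatched.
2. Articulatory features based on deep transformations are proposed. Since multi-task learning brings only limited gains when training data is relatively plentiful, the thesis proposes deep-transformation articulatory features, comprising deep tandem articulatory features and deep bottleneck articulatory features. The articulatory attribute information is passed through the multi-layer non-linear transformations of a deep neural network, which extract the discriminative information useful for recognition and reduce its dimension to yield the deep-transformation features. Unlike the multi-task learning framework, which works in the model domain, these features integrate articulatory knowledge in the feature domain to strengthen the acoustic model. The experiments examine the relative merits of the two structures, the choice of hidden layer, and the choice of feature dimension. Combined with the multi-task articulatory knowledge extraction framework, the features give results clearly better than both the baseline system and either improvement alone.
3. An articulatory feature based on statistics modeling is proposed. We observed that articulatory features extracted frame by frame do improve system performance, but the gain is small and unstable. Our analysis attributes this mainly to the excessive correlation between frame-level articulatory features and the target senones, and to unstable estimates caused by insufficient context. We also found that deep neural network acoustic models that compute posteriors frame by frame have an inherent defect: they cannot respond to utterance-level statistics, which is also why utterance-level and speaker-level normalization consistently help. We therefore propose an articulatory feature based on statistics modeling. The feature contains the pseudo Baum-Welch statistics of the articulatory knowledge over the whole utterance, which are modeled with a total variability matrix to obtain the statistic feature of that utterance. It differs from the traditional i-vector in that its input is not conventional mel filterbank features or perceptual linear prediction coefficients but the basic features corresponding to articulatory knowledge. The feature compensates for the weakness of deep neural networks in utterance-level and speaker-level modeling and is stabilized by whole-utterance statistics, giving a clear performance gain. We also combine the feature with the multi-task learning framework, which lowers the recognition error rate both when data is sufficient and when it is insufficient.
4. A statistics modeling method computed from the outputs of a recurrent neural network is proposed. In the traditional i-vector extraction framework, a universal background model is used to model the acoustic background, and a total variability space is used to obtain a low-dimensional subspace. For acoustic modeling, Gaussian mixture models perform far worse than deep learning models under comparable conditions, and among deep learning models, recurrent neural networks based on long short-term memory often model better than traditional fully connected networks. We therefore assume that each senone posterior output of the neural network can be approximated by a single Gaussian, replace the traditional Gaussian-mixture universal background model with the posteriors of a long short-term memory recurrent neural network, extract Baum-Welch statistics, and use the total variability space model to obtain their low-dimensional subspace, which yields the statistics model.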
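The block-softmax output structure mentioned in contribution 1 can be illustrated with a short sketch. This is a minimal NumPy example under stated assumptions, not the thesis implementation: the abstract does not give the block layout or task weights, so the block sizes, the weights, and the names `block_softmax` and `multitask_loss` are hypothetical. The idea shown is that each task (senones, then each attribute group) gets its own normalized block of output units, so a frame can carry one label per block.

```python
import numpy as np


def block_softmax(logits, block_sizes):
    """Apply an independent softmax to each contiguous block of output units.

    logits: array of shape (frames, sum(block_sizes)).
    block_sizes: hypothetical layout, e.g. [n_senones, n_manner, n_place].
    """
    assert logits.shape[-1] == sum(block_sizes)
    probs = np.empty_like(logits, dtype=float)
    start = 0
    for size in block_sizes:
        block = logits[..., start:start + size]
        # Subtract the per-frame maximum for numerical stability.
        block = block - block.max(axis=-1, keepdims=True)
        exp = np.exp(block)
        probs[..., start:start + size] = exp / exp.sum(axis=-1, keepdims=True)
        start += size
    return probs


def multitask_loss(logits, targets, block_sizes, weights):
    """Weighted sum of per-block cross-entropies (primary + secondary tasks).

    targets: int array of shape (frames, n_blocks); column b holds the class
    index inside block b. weights: one (assumed) scalar weight per block.
    """
    probs = block_softmax(logits, block_sizes)
    frames = np.arange(logits.shape[0])
    total = 0.0
    start = 0
    for b, size in enumerate(block_sizes):
        picked = probs[frames, start + targets[:, b]]
        total += weights[b] * -np.log(picked + 1e-12).mean()
        start += size
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sizes = [4, 3, 2]  # toy sizes: primary task block, two attribute blocks
    logits = rng.normal(size=(5, sum(sizes)))
    targets = np.stack([rng.integers(0, s, size=5) for s in sizes], axis=1)
    print(multitask_loss(logits, targets, sizes, weights=[1.0, 0.5, 0.5]))
```

Because every block is normalized separately, the error signal of each task is a difference of probabilities in the same range, which is one plausible reading of the abstract's remark that the gradients of the tasks stay at the same order of magnitude.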
English abstract	The attributes of speech can be understood as a collection of information carried by fundamental speech sounds. This information covers linguistic interpretations, the speaking environment, and a speaker profile encompassing gender, accent, emotional state, and so on. It has been demonstrated that speech recognition performance can be improved by adding extra articulatory information, so how to use such information effectively becomes a challenging problem. Traditional work on knowledge integration is restricted to detect-and-merge architectures, such as evidence mergers and lattice rescoring. With the resurgence of deep learning in recent years, deep neural networks have become the mainstream acoustic model in ASR systems, which has led to a complete revolution in the speech recognition field. By combining the deep learning framework with knowledge integration, this thesis focuses on improving the acoustic model of an ASR system to achieve better performance. The main contributions and novelties of this thesis are as follows:
1. We propose a knowledge integration method based on multi-task learning (MTL), realized by modeling and learning acoustic and articulatory cues simultaneously in a unified framework. Attribute classification is used as the secondary task to improve a multi-task learning deep neural network (MTL-DNN) for speech recognition acoustic modeling by raising its discriminative ability on pronunciation. Unlike in conventional classification tasks, a phoneme can have more than one attribute, which makes the traditional classifier framework inappropriate here. To solve this problem, we apply a block-softmax layer so that each phoneme can carry multi-class labels; this also ensures that the gradients of the tasks have the same order of magnitude. An evidence merger is also applied for post-classification on the outputs of the MTL-DNN to improve performance. The experimental conditions cover not only different data sets, tasks, and training parameters, but also different amounts of training data and mismatched conditions. The results show that the multi-task learning framework acts as a regularizer that mitigates over-fitting, and that attribute classification, as the secondary task, improves the discriminative ability of the hidden-layer nodes by providing articulatory knowledge, which benefits convergence. The multi-task learning architecture produces the desired improvement especially when training data is limited.
2. We propose deep articulatory features to further improve knowledge integration. Although the multi-task learning architecture produces the desired improvement when training data is limited, it yields only a minor improvement when training data is sufficient. To improve the knowledge integration system further, we propose deep articulatory features, including a deep tandem feature and a deep bottleneck feature. The deep architecture and multi-layer non-linear transformation of a deep neural network are well suited to extracting useful information from complex raw speech features. Unlike the multi-task learning architecture, the deep articulatory features are extracted from a deep network and provide discriminative information to the acoustic model in the feature domain. The experiments explore the characteristics of the two kinds of deep features with different numbers of hidden layers, different dimensions, and different data sets. The multi-task learning architecture is also applied jointly to obtain further improvement. Both the multi-task learning architecture and the deep articulatory features outperform the baseline system, and their combination performs better than either modification alone.
3. We propose statistic articulatory features to improve large-vocabulary speech recognition systems. The frame-level articulatory feature (the deep articulatory feature) can improve acoustic modeling, but the improvement is not significant and is sometimes sensitive to the parameters, which makes it unstable for a practical system. We found that this is because the frame-level articulatory feature, which is trained with senones as labels, is strongly and directly correlated with the senones; this interferes with training the parameters for the original speech features, in which the senone information is deeply hidden. The output of the acoustic DNN depends so strongly on the frame-level articulatory feature that it fluctuates considerably when that feature is not accurately estimated. In addition, the DNN has an inherent defect: it fails to learn utterance-level or speaker-level information, which is why speaker-level CMVN almost always improves DNN acoustic models. We propose statistic articulatory features (utterance-level articulatory features) to solve these problems. They are extracted in three steps: mapping the attributes to a high-dimensional space with the universal background model; calculating the Baum-Welch statistics by accumulating the high-dimensional vectors; and reducing the dimension of the high-dimensional vectors with a total variability space model. Unlike the traditional i-vector, which is modeled on original speech features such as MFCC and PLP, the statistic articulatory feature is modeled on speech attributes produced by an attribute extractor. The feature is modeled at the utterance level and so covers the DNN shortcoming described above; moreover, the statistics stabilize inaccurate estimates and weaken the dependency between the articulatory features and the senone labels. We also improve the statistic articulatory feature with the multi-task learning architecture. The experimental results show that the system with the proposed feature achieves a significant improvement over the baseline system, and that the multi-task learning architecture further improves the proposed feature.
4. We propose a statistic modeling method based on an LSTM-RNN trained on the ASR task. One step in modeling statistic articulatory features is mapping the attributes to a high-dimensional space with the universal background model. The speech signal is sequential, but the traditional universal background model is a GMM, which fails to model the sequential information of speech. In speech recognition tasks we found that the LSTM-RNN outperforms the GMM in acoustic modeling. As a discriminative model, the LSTM-RNN also produces outputs that are much more discriminative than those of the GMM, which is a generative model, and a discriminative high-dimensional mapping is likely to lead to better classification results. We assume that each output of the LSTM-RNN can be expressed by a single Gaussian component, and we therefore replace the GMM with the LSTM-RNN for acoustic modeling to obtain RNN statistic features. Although the computational cost is high, the experimental results show that the RNN statistic feature outperforms the traditional GMM statistic feature.
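The three extraction steps listed in contribution 3 (posterior mapping, Baum-Welch accumulation, total variability projection) can be sketched as below. The abstract gives no formulas, so this sketch assumes the standard i-vector point estimate w = (I + TᵀΣ⁻¹NT)⁻¹ TᵀΣ⁻¹F̃; the function names, array shapes, and toy values are hypothetical. The per-frame posteriors are taken as given, so they could come either from a GMM universal background model or, as in contribution 4, from LSTM-RNN senone outputs with one Gaussian assumed per output.

```python
import numpy as np


def baum_welch_stats(posteriors, feats, means):
    """Accumulate utterance-level zeroth- and centered first-order statistics.

    posteriors: (frames, C) component occupancies per frame.
    feats:      (frames, D) per-frame attribute features.
    means:      (C, D) component means of the background model.
    """
    n = posteriors.sum(axis=0)             # zeroth order, shape (C,)
    f = posteriors.T @ feats               # first order, shape (C, D)
    f_centered = f - n[:, None] * means    # center around component means
    return n, f_centered


def total_variability_projection(n, f_centered, t_matrix, variances):
    """Project the statistics onto a low-dimensional subspace.

    t_matrix:  (C * D, R) total variability matrix (assumed already trained).
    variances: (C, D) diagonal covariances of the background model.
    Returns the R-dimensional utterance-level vector.
    """
    c, d = f_centered.shape
    r = t_matrix.shape[1]
    inv_var = (1.0 / variances).reshape(c * d)   # diagonal of Sigma^-1
    n_expanded = np.repeat(n, d)                 # diagonal of N
    t_scaled = t_matrix * inv_var[:, None]       # Sigma^-1 T
    precision = np.eye(r) + t_matrix.T @ (n_expanded[:, None] * t_scaled)
    rhs = t_scaled.T @ f_centered.reshape(c * d)
    return np.linalg.solve(precision, rhs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, C, D, R = 50, 8, 6, 4                # toy sizes
    post = rng.random((frames, C))
    post /= post.sum(axis=1, keepdims=True)      # rows sum to one
    feats = rng.normal(size=(frames, D))
    means = rng.normal(size=(C, D))
    variances = np.ones((C, D))
    t_matrix = rng.normal(size=(C * D, R))
    n, f = baum_welch_stats(post, feats, means)
    print(total_variability_projection(n, f, t_matrix, variances))
```

The resulting vector is constant over the utterance, so in a setup like the one the abstract describes it would be appended to every frame's input, giving the DNN access to utterance-level information it cannot compute frame by frame.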
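Keywords	deep learning; speech recognition; articulatory knowledge; statistic modeling; recurrent neural network
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/11778
Collection	Graduates_Doctoral dissertations
Author affiliation	National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	郑昊. 结合发音知识的声学模型深度学习建模方法研究[D]. 北京. 中国科学院研究生院,2016.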

Files in this item
File name/size	Document type	Version type	Access type	License
thesis_hzheng.pdf (1904KB)	Dissertation		Restricted access	CC BY-NC-SA