|Keywords||deep learning, speech recognition, articulatory knowledge, statistics modeling, recurrent neural network|
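|Abstract||Articulatory knowledge is an expert-designed way of describing the characteristics of speech from an acoustic perspective. It carries rich information about content, timbre, emotion, environment, and more, and has been shown to help improve the performance of speech recognition systems. However, traditional work on applying articulatory knowledge has been largely limited to the evidence merger and lattice rescoring. With the rise of deep learning, the framework of speech recognition systems has changed fundamentally. This thesis focuses on how to build the acoustic model within a deep learning framework by better integrating articulatory knowledge, so as to improve overall system performance. The main work and contributions of this thesis are as follows:|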
|Other abstract||The attributes of speech can be understood as a collection of information derived from fundamental speech sounds. This information covers speaker characteristics and the speaking environment, including linguistic interpretation and a speaker profile encompassing gender, accent, emotional state, etc. It has been demonstrated that speech recognition performance can be improved by adding extra articulatory information; how to use such information effectively is therefore a challenging problem. Traditional work on knowledge integration is restricted to detect-and-merge architectures such as the evidence merger and lattice rescoring. With the resurgence of deep learning in recent years, deep neural networks have become the mainstream acoustic model in ASR systems, which has revolutionized the speech recognition field. By combining the deep learning framework with knowledge integration, this thesis focuses on improving the acoustic model of the ASR system to achieve better performance.|
The main contributions and novelties of this thesis are as follows:
1. We propose a knowledge integration method based on multi-task learning (MTL), realized by modeling and learning acoustic and articulatory cues simultaneously in a unified framework. Attribute classification is used as the secondary task to improve a multi-task learning deep neural network (MTL-DNN) used for speech recognition acoustic modeling by strengthening its ability to discriminate pronunciations. Unlike in conventional classification tasks, a phoneme can carry more than one attribute, which makes the traditional classifier framework inappropriate here. To solve this problem, we apply a block-softmax layer so that each phoneme can have multi-class labels, which also ensures that the gradients of the tasks have the same order of magnitude. An evidence merger is also applied for post-classification on the outputs of the MTL-DNN to further improve performance. The experiments cover not only different data sets, tasks, and training parameters, but also different amounts of training data and mismatched conditions. The results show that the multi-task learning framework acts as a regularizer that mitigates over-fitting, and that attribute classification, as the secondary task, improves the discriminative ability of the hidden-layer nodes by providing articulatory knowledge, which benefits convergence. The multi-task learning architecture produces the desired improvement especially when training data is limited.
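The block-softmax idea in contribution 1 can be illustrated with a minimal sketch. It is not the thesis implementation: the block layout (one senone block plus two attribute blocks named "manner" and "voicing"), the toy logits, and the task weights are all assumptions made for illustration. The sketch only shows that each block is normalized independently, so one frame can carry one label per block, and that the training loss is a weighted sum of per-block cross-entropies.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def block_softmax(logits, block_sizes):
    """Apply an independent softmax to each consecutive block of logits."""
    if sum(block_sizes) != len(logits):
        raise ValueError("block sizes must cover the logit vector exactly")
    blocks, start = [], 0
    for size in block_sizes:
        blocks.append(softmax(logits[start:start + size]))
        start += size
    return blocks

def multitask_loss(blocks, targets, weights):
    """Weighted sum of per-block cross-entropies (one target index per block)."""
    return sum(-w * math.log(block[t])
               for block, t, w in zip(blocks, targets, weights))

# Hypothetical layout: block 0 = 4 senone classes (primary task),
# block 1 = 3 "manner" classes, block 2 = 2 "voicing" classes (secondary tasks).
BLOCK_SIZES = [4, 3, 2]
logits = [2.0, 0.5, -1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 3.0]
blocks = block_softmax(logits, BLOCK_SIZES)

# Task weights are an assumption; the source does not state how tasks are weighted.
loss = multitask_loss(blocks, targets=[0, 1, 1], weights=[1.0, 0.5, 0.5])
```

Because every block sums to one on its own, adding an attribute group does not change the posteriors of the senone block.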
2. We propose deep articulatory features to further improve knowledge integration. The multi-task learning architecture produces the desired improvement when training data is limited; however, it yields only a minor improvement when training data is sufficient. To further improve the knowledge integration system, we propose deep articulatory features, including the deep tandem feature and the deep bottleneck feature. The deep architecture and multi-layer non-linear transformations of a deep neural network give it a strong ability to extract useful information from complex raw speech features. Unlike the multi-task learning architecture, the deep articulatory features are extracted from a deep network to provide discriminative information to the acoustic model in the feature domain. The experiments explore the characteristics of the two kinds of deep features with different numbers of hidden layers, different dimensions, and different data sets. The multi-task learning architecture is also applied jointly to obtain further improvement. Both the multi-task learning architecture and the deep articulatory features outperform the baseline system, and their combination achieves better performance than either individual modification.
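A rough sketch of how the two kinds of deep features in contribution 2 might be read off an attribute network follows. It assumes the common definitions: the bottleneck feature is the activation of a narrow hidden layer, and the tandem feature is the log-posterior vector of the output layer. The network shape, the toy weights from the hypothetical `toy_layer` helper, and the choice to append the features to the acoustic frame are assumptions; any decorrelation step that usually follows tandem extraction is omitted.

```python
import math

def dense(x, weights, bias):
    """Affine layer: weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def log_softmax(values):
    peak = max(values)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in values))
    return [v - log_total for v in values]

def deep_features(frame, layers, bottleneck_index):
    """Run one frame through the attribute network.

    Returns (bottleneck activations, log-posteriors of the output layer).
    """
    hidden, bottleneck = frame, None
    for i, (weights, bias) in enumerate(layers[:-1]):
        hidden = sigmoid(dense(hidden, weights, bias))
        if i == bottleneck_index:
            bottleneck = hidden
    out_weights, out_bias = layers[-1]
    return bottleneck, log_softmax(dense(hidden, out_weights, out_bias))

def toy_layer(n_out, n_in, scale=0.5):
    """Hypothetical fixed weights in {-scale, 0, scale}; stands in for a trained layer."""
    weights = [[scale * ((r + c) % 3 - 1) for c in range(n_in)]
               for r in range(n_out)]
    return weights, [0.0] * n_out

# 3-dim acoustic frame -> 4 hidden units -> 2-unit bottleneck -> 3 attribute classes
layers = [toy_layer(4, 3), toy_layer(2, 4), toy_layer(3, 2)]
frame = [0.2, -0.1, 0.4]
bottleneck, log_post = deep_features(frame, layers, bottleneck_index=1)

# Feature-domain integration (assumed): append the deep feature to the acoustic frame.
bottleneck_input = frame + bottleneck
tandem_input = frame + log_post
```

The bottleneck width sets the feature dimension directly, while the tandem dimension is fixed by the number of attribute classes.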
3. We propose statistic articulatory features to improve large-vocabulary speech recognition systems. The frame-level articulatory feature (the deep articulatory feature) can improve acoustic modeling; however, the improvement is not significant and is sometimes sensitive to the parameters, which makes it unstable for a practical system. We found that this is because the frame-level articulatory feature, which is trained with senones as labels, has a strong direct correlation with the senones, and this interferes with training the parameters for the original speech features, in which the senone information is deeply hidden. The output of the acoustic DNN depends so strongly on the frame-level articulatory feature that it fluctuates considerably when that feature is not accurately estimated. In addition, the DNN has an inherent weakness: it fails to learn utterance-level or speaker-level information, which is why a speaker-level CMVN strategy almost always improves DNN acoustic models. We propose statistic articulatory features (utterance-level articulatory features) to solve these problems. The statistic articulatory features are extracted in three steps: mapping the attributes to a high-dimensional space with the universal background model; calculating the Baum-Welch statistics by accumulating the high-dimensional vectors; and reducing the dimension of the high-dimensional vectors with the total variability space model. Whereas the traditional i-vector feature is modeled on original speech features such as MFCC and PLP, the statistic articulatory feature is modeled on speech attributes extracted by an attribute extractor. The statistic articulatory feature is modeled at the utterance level and therefore compensates for the DNN weakness described above; moreover, the statistics make inaccurate estimates more stable and weaken the dependency between the articulatory features and the senone labels.
We also improve the statistic articulatory feature with the multi-task learning architecture. The experimental results show that the system with the proposed feature achieves a significant improvement over the baseline system, and that the multi-task learning architecture further improves the proposed feature.
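The first two extraction steps in contribution 3 can be sketched as follows, assuming the usual i-vector formulation: a diagonal-covariance GMM serves as the universal background model, and the per-utterance zeroth-order and centered first-order Baum-Welch statistics are accumulated from its posteriors. The UBM parameters and the attribute vectors below are toy values, not data from the thesis, and the third step (projection with the total variability space model) is omitted.

```python
import math

def gmm_posteriors(x, weights, means, variances):
    """Posterior of each diagonal-covariance Gaussian given one vector x."""
    log_like = []
    for w, mean, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mean, var):
            ll -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_like.append(ll)
    peak = max(log_like)
    exps = [math.exp(v - peak) for v in log_like]
    total = sum(exps)
    return [e / total for e in exps]

def baum_welch_stats(frames, weights, means, variances):
    """Zeroth-order (N_c) and centered first-order (F_c) statistics of one utterance."""
    n_comp, dim = len(weights), len(means[0])
    zeroth = [0.0] * n_comp
    first = [[0.0] * dim for _ in range(n_comp)]
    for x in frames:
        post = gmm_posteriors(x, weights, means, variances)
        for c, gamma in enumerate(post):
            zeroth[c] += gamma
            for d in range(dim):
                first[c][d] += gamma * (x[d] - means[c][d])
    return zeroth, first

# Toy two-component UBM over 2-dim attribute vectors (all values hypothetical).
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0, 0.0], [1.0, 1.0]]
ubm_variances = [[1.0, 1.0], [1.0, 1.0]]

# One utterance: three frames of attribute-extractor output.
utterance = [[0.1, 0.2], [0.9, 1.1], [0.5, 0.4]]
zeroth, first = baum_welch_stats(utterance, ubm_weights, ubm_means, ubm_variances)
```

The statistics have a fixed size regardless of utterance length, which is what makes the resulting feature utterance-level.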
4. We propose a statistics modeling method with an LSTM-RNN trained on the ASR task. One step in modeling statistic articulatory features is to map the attributes to a high-dimensional space with the universal background model. The speech signal is sequential; however, the traditional universal background model is always a GMM, which fails to model the sequential information of speech. In speech recognition tasks we found that the LSTM-RNN outperforms the GMM in acoustic modeling. Moreover, as a discriminative model, the LSTM-RNN produces outputs that are much more discriminative than those of the GMM, which is a generative model. A discriminative high-dimensional mapping is likely to lead to better classification results. We assume that each output of the LSTM-RNN can be expressed by a single Gaussian component, and accordingly we replace the GMM with the LSTM-RNN for acoustic modeling to obtain RNN statistic features. Although the computational cost is high, the experimental results show that the RNN statistic feature outperforms the traditional GMM statistic feature.
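One way to read the single-Gaussian assumption in contribution 4 is sketched below. The per-frame posteriors are taken as given (here hard-coded toy values standing in for the softmax outputs of a trained LSTM-RNN, which is not implemented), one diagonal Gaussian is fitted per network output from those soft assignments, and the same posteriors then replace the GMM posteriors when accumulating statistics. The function names and the variance floor are assumptions, not details from the thesis.

```python
def class_gaussians(frames, posteriors, var_floor=1e-3):
    """Fit one diagonal Gaussian per network output from soft frame assignments."""
    n_class, dim = len(posteriors[0]), len(frames[0])
    means, variances = [], []
    for c in range(n_class):
        # Assumes every class has non-zero total occupancy, as softmax outputs do.
        occupancy = sum(post[c] for post in posteriors)
        mean = [sum(post[c] * x[d] for post, x in zip(posteriors, frames)) / occupancy
                for d in range(dim)]
        var = [max(var_floor,
                   sum(post[c] * (x[d] - mean[d]) ** 2
                       for post, x in zip(posteriors, frames)) / occupancy)
               for d in range(dim)]
        means.append(mean)
        variances.append(var)
    return means, variances

def rnn_stats(frames, posteriors, means):
    """Zeroth-order and centered first-order statistics using network posteriors."""
    n_class, dim = len(means), len(means[0])
    zeroth = [0.0] * n_class
    first = [[0.0] * dim for _ in range(n_class)]
    for x, post in zip(frames, posteriors):
        for c, gamma in enumerate(post):
            zeroth[c] += gamma
            for d in range(dim):
                first[c][d] += gamma * (x[d] - means[c][d])
    return zeroth, first

# Toy data: three frames and hypothetical two-class posteriors per frame.
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 3.0]]
posteriors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

means, variances = class_gaussians(frames, posteriors)
zeroth, first = rnn_stats(frames, posteriors, means)
```

In practice the class Gaussians would be estimated on background training data and the statistics on each new utterance; here both use the same toy frames, so the centered first-order statistics come out at zero.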
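|Zheng Hao (郑昊). Research on deep learning acoustic modeling methods incorporating articulatory knowledge (结合发音知识的声学模型深度学习建模方法研究) [D]. Beijing: Graduate University of Chinese Academy of Sciences (中国科学院研究生院), 2016.|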