|Keywords||language identification; neural network model; total variability space modeling; attention model; end-to-end|
In the past few years, Language Identification (LID) technology based on Deep Neural Networks (DNNs) has developed very rapidly. Along with the development of deep learning algorithms, DNN-based LID has undergone a transition from a generative framework to a discriminative framework, which has dramatically improved LID systems. Although great advances have been made in LID technology, some problems remain: traditional acoustic features are not robust enough to noise, back-end language modeling lacks discriminative power, the LID framework is too complex, and LID performance degrades significantly on short speech segments. Based on deep learning, this work conducts research in both the feature domain and the model domain, focusing on diverse model structures and LID frameworks. The main contributions are as follows:
1. A phoneme-dependent Deep Bottleneck Feature (DBF) extraction and fusion method based on DNNs is proposed. The DBF is extracted from a specially structured DNN (Bottleneck-DNN, BN-DNN) that contains one bottleneck hidden layer. Acoustic features from multiple adjacent frames are fed into the BN-DNN, and the DBF is obtained through a multi-layer nonlinear discriminative transformation. The DBF is regarded as a high-level feature representation that is robust to language-independent interference such as speaker variability, channel variability, and ambient noise. In addition, DBF-based LID systems are fused to construct multilingual parallel LID systems. Compared with an LID system based on traditional acoustic features, the DBF-based LID method obtains improvements of 28.43%, 43.75%, and 61.22% on the 3 s, 10 s, and 30 s test sets, respectively.
2. A new Total Variability (TV) modeling method based on the DBF and a Posterior DNN (PDNN) is proposed. In the model domain, the Gaussian Mixture Model-Universal Background Model (GMM-UBM) is replaced by the phoneme-dependent PDNN. The PDNN builds a direct mapping between the acoustic features and well-defined phoneme units, and it provides more accurate phoneme posteriors for sufficient statistics extraction. Through the PDNN, sparse sufficient statistics rich in phoneme information are obtained, and the TV modeling is substantially improved. In particular, combining the DBF with the PDNN brings rich phoneme information into the TV modeling. This substantially boosts the LID system while leaving the classical identity vector (iVector) extraction unchanged.
3. A Gating Recurrent Enhanced Memory Network (GREMN) model is proposed to construct a discriminative frame-level language identification framework. This discriminative LID system uses the sequential modeling capability of gating recurrent neural networks (GRNNs) to build a direct mapping between low-level acoustic features and the language class. It covers feature extraction, feature transformation, and classification in a single model, and it performs language classification at the frame level. Combined with a modified model optimization method, the GREMN-based LID system obtains the best performance under the short-duration test condition, with a 39.97% relative EER reduction compared with the GMM-iVector system.
4. A dedicated attention model is proposed to construct an utterance-level end-to-end language identification system. The attention model consists of an encoder module, a classifier module, and an attentional selection and utterance vector generation module. The GRNN is the base model of the proposed attention model, and it encodes the acoustic features into a sequence of high-level feature vectors. The attention mechanism selects the key frames from this high-level feature sequence and compresses the sequence into a fixed-dimension vector that serves as the utterance vector representation. This utterance vector is fed into the classifier to perform utterance-level language recognition. This attention model is the first to build a direct mapping between the utterance vector and the language class in order to construct an end-to-end LID framework. The end-to-end LID system achieves good performance under the short-duration condition and greatly reduces the complexity of building an LID system.
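The attentional selection step in contribution 4 can be illustrated with a minimal sketch. It is not the model from the thesis: the dot-product scoring, the `query` vector, and the function names are assumptions chosen for illustration, and the frame vectors stand in for the GRNN encoder outputs. The sketch only shows how per-frame weights compress a variable-length sequence into a fixed-dimension utterance vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Compress T frame vectors into one utterance vector.

    frames: list of T encoded frame vectors (each a list of D floats),
            standing in for the encoder outputs.
    query:  hypothetical learned scoring vector of length D (an assumption,
            not taken from the thesis).
    Returns (utterance_vector, attention_weights).
    """
    # Score each frame by its dot product with the query vector.
    scores = [sum(h * q for h, q in zip(frame, query)) for frame in frames]
    # Normalize the scores so the weights are positive and sum to 1.
    weights = softmax(scores)
    dim = len(frames[0])
    # The weighted sum has dimension D regardless of the number of frames T.
    utterance = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return utterance, weights
```

Because the output dimension depends only on the frame dimension, utterances of any duration map to vectors of the same size, which is what allows a single classifier to operate at the utterance level.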
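|Geng Wang (耿旺). Research on Deep Neural Network Modeling Methods for Language Identification (面向语种识别的深度神经网络建模方法研究) [D]. Beijing: Graduate University of Chinese Academy of Sciences, 2017.|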