面向语种识别的深度神经网络建模方法研究 (Research on Deep Neural Network Modeling Methods for Language Identification)
Author: Geng Wang (耿旺)
Degree type: Doctor of Engineering
Supervisor: Xu Bo (徐波)
Date: 2017-05-27
Degree-granting institution: Graduate University of Chinese Academy of Sciences
Place of degree conferral: Beijing
Keywords: language identification; neural network model; total variability space modeling; attention model; end-to-end
Abstract

In recent years, language identification (LID) technology based on deep neural networks has developed rapidly. With advances in deep learning theory, DNN-based LID has shifted from generative frameworks to discriminative frameworks, greatly improving system performance. This thesis addresses several open problems in LID — low-level acoustic features are not robust, back-end language modeling lacks discriminative power, system frameworks are cumbersome, and performance degrades markedly on short speech segments. Taking deep learning as its theoretical foundation, it focuses on language modeling methods under different neural network structures and system frameworks, and conducts research in both the feature domain and the model domain. The main contributions are as follows:

1. A method for extracting and fusing phoneme-dependent Deep Bottleneck Features (DBF) with a deep neural network is proposed. A DNN containing a bottleneck layer (Bottleneck-Deep Neural Network, BN-DNN) extracts the DBF: multiple frames of low-level acoustic features pass through the BN-DNN's multi-layer nonlinear transformations to yield a high-level abstract feature that effectively suppresses language-independent interference such as speaker variability, channel variability, and environmental noise, improving feature robustness. DBF-based iVector LID systems are further fused in the feature and score domains to realize a multilingual parallel DBF LID method. Compared with the iVector system built on low-level acoustic features, recognition performance improves by 28.43%, 43.75%, and 61.22% under the 3s, 10s, and 30s test conditions, respectively.

2. A Total Variability (TV) modeling method that fuses DBFs with a phone-posterior DNN (Posterior DNN, PDNN) is proposed. In the model domain, the phoneme-dependent discriminative PDNN replaces the generative Gaussian Mixture Model-Universal Background Model (GMM-UBM), linking low-level acoustic features to well-defined phoneme units and providing more accurate phone-class posterior probabilities for sufficient-statistics extraction. Using the posteriors supplied by the PDNN, sparse accumulated sufficient statistics rich in phonetic information are extracted, improving back-end TV modeling. The thesis further proposes an iVector LID method that combines DBFs and the PDNN, keeping the standard iVector back-end extraction unchanged while significantly improving system performance.

3. A gated recurrent memory-enhanced network model is proposed, realizing a discriminative frame-level language classification method in which feature extraction, feature transformation, and the classifier are optimized jointly. The method exploits the strong temporal modeling capability of recurrent neural networks to build a direct mapping between feature frames and language classes, enabling language classification at the acoustic-frame level and greatly improving recognition performance under short-duration test conditions. Building on the recurrent model and a sequence memory-enhancement module, the representational power and language discriminability of acoustic frames are strengthened; combined with the improved model-optimization method, EER drops by 39.97% relative to the generative GMM-iVector method under the 3s test condition.

4. An attention mechanism is applied to the selection of key frames for LID, a task-specific attention model is designed, and an utterance-level end-to-end discriminative LID system is built. The attention model consists of an encoder module, an attention-selection and utterance-vector-generation module, and a classifier module. With a gated recurrent neural network as its base model, it exploits the temporal modeling capability of recurrent networks to generate high-level abstract representations of the acoustic frames, selects key frames from the feature sequence through the attention mechanism, and compresses the sequence into a fixed-dimension utterance vector inside the model to perform utterance-level language classification. This is the first method to generate the utterance vector inside the neural network and map it directly to the language class, establishing an end-to-end LID framework that achieves good recognition performance on short test utterances while greatly reducing the complexity of building an LID system.

Abstract (English)
In the past few years, Deep Neural Network (DNN) based Language Identification (LID) technology has developed rapidly. With the progress of deep learning algorithms, DNN-based LID has undergone a transition from generative frameworks to discriminative frameworks, which has dramatically improved LID system performance. Although great advances have been made in LID technology, several problems remain: traditional acoustic features are not robust to noise, back-end language modeling lacks discriminative power, the LID framework is overly complex, and performance degrades significantly on short speech segments. Grounded in deep learning, this work conducts research in both the feature domain and the model domain, focusing on diverse model structures and LID frameworks. The main contributions are as follows:
1. A phoneme-dependent Deep Bottleneck Feature (DBF) extraction and fusion method based on a DNN is proposed. The DBF is extracted from a specially structured DNN (Bottleneck-DNN, BN-DNN) that contains one bottleneck hidden layer. Multiple adjacent acoustic feature frames are fed into the BN-DNN, and the DBF is obtained through multi-layer nonlinear discriminative transformation. The DBF is a high-level feature representation that is robust to language-independent interference such as speaker variability, channel variability, and surrounding noise. The DBF-based LID systems are further fused to construct multilingual parallel LID systems. Compared with the traditional acoustic-feature-based LID system, the DBF-based method obtains relative improvements of 28.43%, 43.75%, and 61.22% on the 3s, 10s, and 30s test sets, respectively.
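The DBF extraction described above can be sketched as follows. This is a minimal illustration only: the context width, layer sizes, tanh activations, and random weights are assumptions for demonstration, whereas the thesis's BN-DNN is trained discriminatively on phoneme targets.

```python
import numpy as np

def stack_frames(feats, context=5):
    """Splice each frame with +/- context neighbours (edges padded by repetition),
    giving the multi-frame input window the BN-DNN consumes."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])

def extract_dbf(feats, weights, bottleneck_index):
    """Forward the spliced features through the DNN layer by layer and return
    the activations of the bottleneck layer as the DBF."""
    h = stack_frames(feats)
    for i, (W, b) in enumerate(weights):
        h = np.tanh(h @ W + b)
        if i == bottleneck_index:
            return h          # high-level, low-dimensional DBF
    return h

# Toy usage: 39-dim features, 429 -> 512 -> 40 (bottleneck) network
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 39))
weights = [(rng.standard_normal((429, 512)) * 0.01, np.zeros(512)),
           (rng.standard_normal((512, 40)) * 0.01, np.zeros(40))]
dbf = extract_dbf(feats, weights, bottleneck_index=1)   # shape (100, 40)
```

In a real system the bottleneck layer is deliberately narrow so that the network is forced to compress phonetically relevant information while discarding speaker and channel variability.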
2. A new Total Variability (TV) modeling method based on the DBF and a Posterior DNN (PDNN) is proposed. In the model domain, the Gaussian Mixture Model-Universal Background Model (GMM-UBM) is replaced by the phoneme-dependent PDNN. The PDNN builds a direct mapping between the acoustic features and well-defined phoneme units, providing more accurate phoneme posteriors for sufficient-statistics extraction. Through the PDNN, sparse sufficient statistics rich in phonetic information are obtained and TV modeling is greatly improved. Specifically, combining the DBF and the PDNN injects rich phoneme information into TV modeling, which substantially boosts the LID system while leaving the classical identity vector (iVector) extraction unchanged.
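The core of the DNN/iVector hybrid is that the frame alignments for the Baum-Welch sufficient statistics come from the DNN's posteriors rather than from GMM-UBM occupancies. A minimal sketch of that statistics-accumulation step (shapes and the softmax-normalised toy posteriors are illustrative assumptions):

```python
import numpy as np

def dnn_sufficient_stats(posteriors, feats):
    """Zeroth- and first-order Baum-Welch statistics with class posteriors
    gamma_t(c) supplied by the phonetic DNN instead of a GMM-UBM.

    posteriors : (T, C) frame-level phone-class posteriors, rows sum to 1
    feats      : (T, D) acoustic features (e.g. DBFs)
    returns N  : (C,)   zeroth-order stats, N_c = sum_t gamma_t(c)
            F  : (C, D) first-order stats,  F_c = sum_t gamma_t(c) * x_t
    """
    N = posteriors.sum(axis=0)
    F = posteriors.T @ feats
    return N, F

# Toy usage: 50 frames, 8 phone classes, 40-dim DBF features
rng = np.random.default_rng(1)
logits = rng.standard_normal((50, 8))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
feats = rng.standard_normal((50, 40))
N, F = dnn_sufficient_stats(posteriors, feats)
```

Because a trained phonetic DNN produces sharply peaked posteriors, N and F are effectively sparse across classes, which is the "sparse sufficient statistics" property the abstract refers to; the downstream TV/iVector extractor consumes N and F exactly as it would GMM-based statistics.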
3. A Gated Recurrent Enhanced Memory Network (GREMN) model is proposed to construct a discriminative frame-level language identification framework. This discriminative LID system exploits the sequential modeling capability of gated recurrent neural networks (GRNN) to build a direct mapping between low-level acoustic features and the language class. It covers feature extraction, feature transformation, and classification jointly, and performs language classification at the frame level. Combined with the modified model-optimization method, the GREMN-based LID system achieves the best performance under short-duration test conditions, obtaining a 39.97% relative EER reduction over the GMM-iVector system on the 3s condition.
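Frame-level discriminative classification with a gated recurrent model can be sketched with a plain GRU cell that emits a language posterior at every frame. This is a generic GRU, not the thesis's GREMN (its memory-enhancement module and exact gating are not specified in the abstract); the parameter shapes and the mean-pooled utterance score are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gru_frame_classifier(feats, params):
    """Run a single-layer GRU over the feature frames and emit a language
    posterior per frame; the utterance score is the mean of frame posteriors."""
    Wz, Uz, Wr, Ur, Wh, Uh, Wo = params
    h = np.zeros(Uz.shape[0])
    frame_posts = []
    for x in feats:
        z = sigmoid(x @ Wz + h @ Uz)              # update gate
        r = sigmoid(x @ Wr + h @ Ur)              # reset gate
        h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
        h = (1 - z) * h + z * h_tilde
        frame_posts.append(softmax(h @ Wo))       # per-frame language posterior
    frame_posts = np.array(frame_posts)
    return frame_posts, frame_posts.mean(axis=0)

# Toy usage: 20-dim features, 16 hidden units, 4 languages, 30 frames
rng = np.random.default_rng(2)
D, H, L, T = 20, 16, 4, 30
params = (rng.standard_normal((D, H)) * 0.1, rng.standard_normal((H, H)) * 0.1,
          rng.standard_normal((D, H)) * 0.1, rng.standard_normal((H, H)) * 0.1,
          rng.standard_normal((D, H)) * 0.1, rng.standard_normal((H, H)) * 0.1,
          rng.standard_normal((H, L)) * 0.1)
frame_posts, utt_post = gru_frame_classifier(rng.standard_normal((T, D)), params)
```

Because every frame yields a decision, the system degrades gracefully as utterances shorten — the key reason frame-level discriminative models help on the 3s condition.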
4. A task-specific attention model is proposed to construct an utterance-level end-to-end language identification system. The attention model consists of an encoder module, an attentional-selection and utterance-vector-generation module, and a classifier module. A GRNN serves as the base model, encoding the acoustic features into high-level feature vectors. The attention mechanism selects the key frames from the high-level feature sequence and compresses the sequence into a fixed-dimension vector as the utterance representation. This utterance vector is fed into the classifier to perform utterance-level language recognition. The model is the first to build a direct mapping between an internally generated utterance vector and the language class, yielding an end-to-end LID framework that achieves good performance under short-duration conditions and greatly reduces the complexity of building an LID system.
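The attention-selection and utterance-vector-generation step can be sketched as a weighted pooling over the encoder states: each frame is scored, the scores are softmax-normalised into attention weights, and the weighted sum is the fixed-dimension utterance vector. The single scoring vector `w` is an illustrative assumption; the thesis's actual scoring function may differ.

```python
import numpy as np

def attention_pool(H, w):
    """Collapse a variable-length sequence of encoder states H (T, D) into one
    fixed-dimension utterance vector using a learned scoring vector w (D,)."""
    scores = H @ w                                  # relevance of each frame
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # attention weights, sum to 1
    return alpha, alpha @ H                         # weights (T,), utterance vector (D,)

# Toy usage: 25 encoder states of dimension 32
rng = np.random.default_rng(3)
states = rng.standard_normal((25, 32))
w = rng.standard_normal(32)
alpha, utt_vec = attention_pool(states, w)
```

Because the output dimension is independent of the sequence length T, utterances of any duration map to the same fixed-size vector, so a single classifier layer on top of it closes the end-to-end pipeline.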
Document type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/14852
Collection: Graduates_Doctoral theses
Author affiliation: 80146
Recommended citation (GB/T 7714):
耿旺. 面向语种识别的深度神经网络建模方法研究[D]. 北京. 中国科学院研究生院, 2017.
Files in this item:
File name/size | Document type | Access | License
面向语种识别的深度神经网络建模方法研究.(5822KB) | Thesis | Restricted (request full text) | CC BY-NC-SA

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.