CASIA OpenIR > Graduates > Doctoral Dissertations
基于深度神经网络的说话人识别建模方法研究 (Research on Speaker Recognition Modeling Methods Based on Deep Neural Networks)
Author: 张姗姗
Degree Type: Doctor of Engineering
Supervisor: 徐波
Year: 2016
Degree-Granting Institution: 中国科学院研究生院 (Graduate University of Chinese Academy of Sciences)
Place of Degree Conferral: Beijing
Keywords: speaker recognition; deep neural network; i-vector; Baum-Welch statistics; recurrent neural network; asynchronous stochastic gradient descent
Abstract: In recent years, deep neural networks have replaced traditional Gaussian mixture models and achieved great success in continuous speech recognition, while speaker recognition modeling is still dominated by generative models. Unlike speech recognition, where the phone set can be determined in advance, speaker recognition involves inherently uncertain classes, which makes it difficult to apply deep neural networks, a powerful discriminative model, directly to speaker classification. This thesis introduces deep neural networks into the speaker recognition modeling framework, exploring both the i-vector modeling level and the statistics extraction level. In addition, given the huge number of parameters in deep neural networks and the massive amounts of speech data in practical applications, this thesis also investigates accelerating DNN training. The main contributions and innovations are:
1. At the i-vector modeling level, a deep neural network with a bottleneck layer is trained with speaker labels, yielding an i-vector extraction system based on a pre-trained neural network. Compared with the traditional total variability model (TVM), this system injects speaker-discriminative information into the modeling process so as to extract more effective speaker features. Moreover, because speaker data are limited, pre-training is especially important for model training. Experiments show that the i-vectors extracted by this system perform comparably to those extracted by the traditional TVM system and are complementary to them: fusing the two still yields a 10% performance improvement.
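The data flow of bottleneck-feature extraction can be sketched as follows. This is a minimal illustration, not the thesis's exact network: the dimensions, the random weights, and the frame-averaging step are all hypothetical stand-ins for a speaker-label-trained DNN whose narrow middle layer provides frame-level speaker features.

```python
import numpy as np

# Hypothetical bottleneck network: input -> wide hidden layer -> narrow
# bottleneck layer. In the thesis the network is trained with speaker
# labels; here random weights merely illustrate the data flow.
rng = np.random.default_rng(0)
D, H, B = 39, 256, 64                       # feature, hidden, bottleneck dims
W1 = rng.standard_normal((D, H)) * 0.1
W2 = rng.standard_normal((H, B)) * 0.1

def bottleneck_features(frames):
    """Map (T, D) frame features to (T, B) bottleneck activations."""
    h = np.maximum(frames @ W1, 0.0)        # hidden layer, ReLU
    return np.tanh(h @ W2)                  # narrow bottleneck layer

utt = rng.standard_normal((200, D))         # one utterance: 200 frames
bnf = bottleneck_features(utt)              # frame-level speaker features
utt_vec = bnf.mean(axis=0)                  # fixed-length utterance vector
```

The bottleneck forces the network to compress whatever the training labels make it discriminate (here, speakers) into a low-dimensional representation, which is why its activations can serve as speaker features.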
2. At the statistics extraction level, an LSTM RNN based framework for extracting speaker statistics is proposed, and the robustness of DNN/RNN statistics extraction under different channel conditions is investigated. The method takes the network outputs as frame-level posterior probabilities and combines them with the corresponding frames' speaker features to form the utterance's statistics. Experiments on text-independent speaker recognition show that, compared with the unsupervised GMM-UBM, the posteriors produced by deep neural networks are more accurate under channel-mismatch conditions, leading to more accurate speaker recognition results than GMM-UBM. Furthermore, the LSTM RNN, which achieves higher frame accuracy in speech recognition, obtains better speaker recognition results than both the DNN and the GMM.
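In this framework the computation reduces to accumulating zeroth- and first-order Baum-Welch statistics, with the network's frame posteriors taking the place of GMM-UBM component occupancies. A minimal sketch (all names and dimensions hypothetical):

```python
import numpy as np

def baum_welch_stats(posteriors, features):
    """Zeroth- and first-order statistics from frame posteriors.

    posteriors: (T, C) frame-level posteriors, e.g. softmax outputs of a
                DNN/LSTM trained for ASR; column c plays the role of the
                occupancy of GMM-UBM component c.
    features:   (T, D) per-frame speaker features for the same frames.
    """
    N = posteriors.sum(axis=0)       # (C,)   soft frame counts per component
    F = posteriors.T @ features      # (C, D) posterior-weighted feature sums
    return N, F

T, C, D = 100, 8, 4
rng = np.random.default_rng(0)
post = rng.random((T, C))
post /= post.sum(axis=1, keepdims=True)     # rows sum to 1: valid posteriors
feats = rng.standard_normal((T, D))
N, F = baum_welch_stats(post, feats)        # N sums to T by construction
```

Because only the posteriors change, the downstream i-vector or scoring machinery stays untouched; swapping a GMM-UBM for a DNN or LSTM is a drop-in replacement at this step.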
3. The DNN/RNN statistics extraction framework is applied, for the first time, to text-dependent speaker recognition, and DNN/RNN systems trained on different data are evaluated on three different text-dependent test conditions. Since text-dependent speaker recognition must verify both the text content and the speaker identity, deep neural networks trained with speech recognition criteria judge the text content more accurately than GMM systems. Experimental results show that DNN/RNN systems have a clear advantage under text-mismatched test conditions.
4. To address the difficulty of parallelizing conventional stochastic gradient descent given the large number of DNN parameters, an asynchronous stochastic gradient descent (ASGD) algorithm for multi-GPU training platforms is proposed. Each GPU works independently as a client, exchanging data and parameters with the server-side CPU, so that multi-GPU computation proceeds in parallel. Experiments show that ASGD achieves a substantial speedup while preserving recognition performance.
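A minimal single-process sketch of the asynchronous update scheme, with threads standing in for the GPU clients and a shared array for the server-held parameters. The toy linear model and all names are hypothetical, not the thesis's implementation:

```python
import threading
import numpy as np

w = np.zeros(2)                  # shared "server-side" model parameters
lock = threading.Lock()          # serializes the parameter update only

def worker(xs, ys, lr=0.1, steps=200):
    """One client: compute a gradient on its own data shard, then apply
    it to the shared parameters without waiting for the other clients."""
    global w
    for _ in range(steps):
        grad = xs.T @ (xs @ w - ys) / len(ys)   # squared-error gradient
        with lock:
            w -= lr * grad       # asynchronous update; grad may be stale

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 2))
y = X @ np.array([1.5, -2.0])               # noiseless linear targets
shards = np.array_split(np.arange(400), 4)  # one data shard per client
threads = [threading.Thread(target=worker, args=(X[i], y[i]))
           for i in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The key property is the absence of a global barrier: clients compute gradients concurrently and each applies its update as soon as it is ready, even if the parameters have moved in the meantime, which is what makes the scheme an approximation of sequential SGD.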
Abstract (English): In recent years, the deep neural network (DNN) has replaced the conventional GMM as the state-of-the-art architecture for acoustic modeling in automatic speech recognition (ASR), while in speaker recognition, generative models have been the dominant approach for over ten years. Different from speech recognition, where the phone classes can be pre-determined, it is difficult to apply DNN models to speaker recognition because of the uncertainty of the speaker classes. In this thesis we introduce DNN models into the speaker modeling framework, both for i-vector modeling and for Baum-Welch statistics extraction. We also propose an effective approach to speed up DNN training to cope with the enormous number of DNN parameters and the large scale of speech data. The contributions are summarized as follows:
1. For i-vector modeling, we propose a new approach to i-vector extraction in speaker recognition tasks. DNN models with a bottleneck layer, trained with speaker labels, are used for the proposed i-vector extractor. Pre-training is useful in network training because of the limited training corpora available for speaker recognition. Experiments show that the extractor is capable of extracting speaker-dependent information from the speech signal and yields results comparable to the state-of-the-art TVM system. A further 10% reduction in equal error rate is achieved by combining the proposed extraction system with the TVM system, which indicates that the i-vectors they extract are complementary.
2. For Baum-Welch statistics extraction, we introduce the LSTM RNN in place of the conventional GMM-UBM for sufficient-statistics extraction in speaker recognition. In this framework, the network is trained for automatic speech recognition (ASR) and each output unit corresponds to a component of the GMM-UBM. The network outputs are then combined with acoustic features to calculate sufficient statistics for speaker recognition. Experiments on a text-independent speaker recognition task show that this approach has a significant advantage over the conventional GMM-UBM under data-mismatched conditions. In particular, we find that the LSTM RNN implemented in this work achieves a further performance improvement over traditional DNNs.
3. We also apply the proposed Baum-Welch statistics extraction framework to text-dependent speaker verification, and evaluate and analyze the performance of DNN/RNN systems with different configurations and training corpora under the three test conditions of the text-dependent speaker verification task. Since both the text content and the speaker identity must be verified in text-dependent speaker recognition, DNNs/RNNs trained for speech recognition verify the text content better than a GMM system. Experimental results on text-dependent speaker recognition also indicate that DNN/RNN systems outperform the GMM system under content-mismatched conditions.
4. An asynchronous SGD (ASGD) approach is proposed to speed up DNN training for speech recognition. Since true parallelization of back-propagation is prohibitive due to the sequential nature of SGD and the enormous number of DNN parameters, we address this issue by applying ASGD as an approximation. With multiple GPUs on a single server, this approach lets the GPUs work asynchronously: each GPU calculates gradients and updates the global model parameters independently. Experimental results show that ASGD speeds up DNN training effectively, without any performance loss.

Language: Chinese
Document Type: Doctoral Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/11800
Collection: Graduates, Doctoral Dissertations
Affiliation: Digital Content Technology and Services Research Center, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所数字内容技术与服务中心)
Recommended Citation (GB/T 7714): 张姗姗. 基于深度神经网络的说话人识别建模方法研究[D]. 北京: 中国科学院研究生院, 2016.
Files in This Item: 基于深度神经网络的说话人识别建模方法研究 (2498 KB), full text not openly available (request required).