Research on Speaker Recognition Modeling Methods Based on Deep Neural Networks


	Research on Speaker Recognition Modeling Methods Based on Deep Neural Networks
	Zhang Shanshan (张姗姗)
	2016
Degree type	Doctor of Engineering
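Chinese abstract (translated)	In recent years, deep neural networks have replaced the conventional Gaussian mixture model and achieved great success in continuous speech recognition, whereas conventional speaker recognition modeling is still dominated by generative models. Unlike speech recognition, where the phone set can be fixed in advance, speaker recognition has a set of classes that is not known in advance, which makes it difficult to apply deep neural networks, a powerful discriminative model, directly to speaker classification. This thesis introduces deep neural networks into the speaker recognition modeling framework and explores them at both the i-vector modeling level and the statistics extraction level. In addition, given the very large number of parameters in deep neural networks and the massive amounts of speech data in practical applications, it also investigates how to accelerate deep neural network training. The main contributions and innovations are: 1. At the i-vector modeling level, a deep neural network with a bottleneck layer is trained with speaker labels, and an i-vector extraction system based on a pretrained neural network is proposed. Compared with the conventional TVM model, this system incorporates speaker-discriminative information during modeling in order to extract more effective speaker features. Moreover, because speaker data is limited, pretraining is especially important in model training. Experiments show that the i-vectors extracted by this system perform comparably to those from the conventional TVM system and are complementary to them to some extent; fusing the two yields a further 10% performance improvement. 2. At the statistics extraction level, a speaker statistics extraction framework based on LSTM RNN is proposed, and the robustness of DNN/RNN speaker statistics extraction frameworks under different channel conditions is examined. The method takes the network outputs as frame-level posterior probabilities and combines them with the speaker features of the corresponding frames to form the statistics of the utterance. Experiments on text-independent speaker recognition show that, compared with the unsupervised GMM-UBM model, the posteriors produced by deep neural networks are more accurate under channel mismatch and therefore give more accurate speaker recognition results than GMM-UBM. Furthermore, the LSTM RNN model, which has higher frame accuracy in speech recognition, achieves better speaker recognition results than both DNN and GMM. 3. The DNN/RNN speaker statistics extraction framework is applied for the first time to text-dependent speaker recognition, and DNN/RNN systems trained on different data are evaluated on three different text-dependent test tasks. Because text-dependent speaker recognition must verify both the text and the speaker, deep neural networks trained with a speech recognition criterion judge the text more accurately than the GMM system; the experimental results show that DNN/RNN systems have a clear advantage when the text content is mismatched. 4. To address the large number of parameters in deep neural networks and the difficulty of parallelizing conventional stochastic gradient descent, an asynchronous stochastic gradient descent algorithm for multi-GPU training platforms is proposed. Each GPU works independently as a client and exchanges data and parameters with the server-side CPU, which parallelizes computation across GPUs. Experiments show that asynchronous stochastic gradient descent achieves a good speedup while preserving the recognition performance of the model.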
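The statistics extraction step in contribution 2 (frame-level posteriors combined with the features of the same frames) can be illustrated with a minimal sketch. It assumes the usual zeroth-order and first-order Baum-Welch statistics, N_c = sum_t gamma_t(c) and F_c = sum_t gamma_t(c) x_t; the function name `baum_welch_stats` and the plain-list data layout are hypothetical and are not taken from the thesis.

```python
def baum_welch_stats(posteriors, features):
    """Accumulate zeroth- and first-order statistics for one utterance.

    posteriors: T rows of C frame-level posteriors (e.g. network outputs).
    features:   T rows of D-dimensional acoustic features, frame-aligned.
    Both layouts are assumptions made for this illustration.
    """
    if not posteriors or len(posteriors) != len(features):
        raise ValueError("posteriors and features must cover the same frames")

    num_classes = len(posteriors[0])
    dim = len(features[0])
    zeroth = [0.0] * num_classes                        # N_c
    first = [[0.0] * dim for _ in range(num_classes)]   # F_c

    for gamma_t, x_t in zip(posteriors, features):
        for c, gamma in enumerate(gamma_t):
            zeroth[c] += gamma                 # soft count for class c
            for d, value in enumerate(x_t):
                first[c][d] += gamma * value   # posterior-weighted feature sum
    return zeroth, first
```

In this sketch only the source of the posteriors changes between systems: a GMM-UBM, a DNN, or an LSTM RNN would each supply the `posteriors` argument, while the accumulation itself stays the same.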
English abstract	In recent years, deep neural networks (DNNs) have replaced the conventional GMM as the state-of-the-art architecture for acoustic modeling in automatic speech recognition (ASR). In speaker recognition, however, generative models have been the dominant approach for over ten years. Unlike speech recognition, where the phone classes can be predetermined, speaker recognition has an uncertain set of speaker classes, which makes it difficult to apply DNN models directly. In this thesis we introduce DNN models into the speaker modeling framework, both for i-vector modeling and for Baum-Welch statistics extraction. We also propose an effective approach to speeding up DNN training, to cope with the enormous number of DNN parameters and the large scale of speech data. The contributions are summarized as follows: 1. For i-vector modeling, we propose a new approach to i-vector extraction in speaker recognition tasks. DNN models with a bottleneck layer, trained with speaker labels, are used to build the proposed i-vector extractor. Pretraining is useful in network training because the training corpus for speaker recognition is limited. Experiments show that the extractor is capable of extracting features that convey speaker-dependent information from the speech signal features, and that it yields results comparable to the state-of-the-art TVM system. Combining the proposed extraction system with the TVM system achieves a further 10% reduction in equal error rate, which indicates that the i-vectors they extract are complementary. 2. For Baum-Welch statistics extraction, we introduce the LSTM RNN in place of the conventional GMM-UBM for extracting Baum-Welch sufficient statistics in speaker recognition. In this framework, the network is trained for automatic speech recognition (ASR) and each output unit corresponds to a component of the GMM-UBM. The network outputs are then combined with acoustic features to calculate sufficient statistics for speaker recognition. Experiments on a text-independent speaker recognition task show that this approach is significantly better than the conventional GMM-UBM under data-mismatched conditions. In particular, we find that the LSTM RNN implemented in this work achieves a further improvement in performance over traditional DNNs. 3. We also apply the proposed Baum-Welch statistics extraction framework to text-dependent speaker verification, and evaluate and analyze the performance of DNN/RNN systems with different configurations and training corpora under the three test conditions of the text-dependent speaker verification task. Because text-dependent speaker recognition must verify both the text content and the speaker, DNN/RNN systems trained for speech recognition work better than the GMM system at verifying text content. Experimental results on text-dependent speaker recognition tasks also indicate that DNN/RNN systems outperform the GMM system under content-mismatched conditions. 4. An asynchronous SGD (ASGD) approach is proposed to speed up DNN training for speech recognition. Since true parallelization of BP is prohibitive due to the sequential nature of SGD and the enormous number of DNN parameters, we address this issue by applying ASGD as an approximation of BP. For multiple GPUs on a single server, this approach lets the GPUs work asynchronously: each GPU calculates gradients and updates the global model parameters independently. Experimental results show that the ASGD approach speeds up DNN training effectively, without any performance loss.
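The client/server pattern in contribution 4 can be sketched on a toy problem. The sketch below is an assumption-laden illustration, not the thesis implementation: Python threads stand in for GPU clients, a lock-protected object stands in for the server-side parameter store, and the model is a single weight fitted to y = 2x. The names `ParameterServer`, `worker`, and `train_asgd` are hypothetical.

```python
import threading


class ParameterServer:
    """Holds the global parameter; stands in for the server-side CPU."""

    def __init__(self, weight, learning_rate):
        self.weight = weight
        self.learning_rate = learning_rate
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:
            return self.weight

    def push(self, gradient):
        # Apply a client's gradient as soon as it arrives, without
        # waiting for the other clients (the asynchronous part).
        with self._lock:
            self.weight -= self.learning_rate * gradient


def worker(server, shard, epochs):
    """One client: pull parameters, compute a gradient, push it back."""
    for _ in range(epochs):
        for x, y in shard:
            w = server.pull()                 # may be stale by the time it is used
            grad = 2.0 * (w * x - y) * x      # d/dw of (w*x - y)**2
            server.push(grad)


def train_asgd(data, num_workers=4, epochs=200, learning_rate=0.01):
    server = ParameterServer(weight=0.0, learning_rate=learning_rate)
    shards = [data[i::num_workers] for i in range(num_workers)]
    threads = [
        threading.Thread(target=worker, args=(server, shard, epochs))
        for shard in shards
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server.weight


if __name__ == "__main__":
    toy_data = [(k / 10.0, 2.0 * k / 10.0) for k in range(1, 21)]
    print(train_asgd(toy_data))
```

Because each client may compute its gradient from parameters that other clients have already changed, the updates only approximate sequential SGD; with a small learning rate the toy model still converges to a weight near 2.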
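Keywords	speaker recognition; deep neural network; i-vector; Baum-Welch statistics; recurrent neural network; asynchronous stochastic gradient descent
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/11800
Collection	Graduates_Doctoral Dissertations
Author affiliation	Digital Content Technology and Services Center, Institute of Automation, Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	Zhang Shanshan. Research on Speaker Recognition Modeling Methods Based on Deep Neural Networks[D]. Beijing. Graduate School of the Chinese Academy of Sciences, 2016.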

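Files in this item
File name/size	Document type	Version type	Access type	License
Research on Speaker Recognition Modeling Methods Based on Deep Neural Networks (2498KB)			Restricted access	--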