Research on Speaker Diarization and Recognition over Telephone Channels (电话信道下说话人分离及识别研究)


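Title	Research on Speaker Diarization and Recognition over Telephone Channels (电话信道下说话人分离及识别研究)
Alternative title	Speaker Diarization and Recognition of Telephone Conversations
Author	Zhang Ce (张策)
Date	2013-05-29
Degree type	Doctor of Engineering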
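Abstract (translated from Chinese)	Over telephone channels, the central problem for speaker verification and identification is the channel variability introduced by summed-channel speech, together with the mutual interference between the two parties' signals. This interference is a severe challenge for both speaker enrollment and testing. This thesis studies the robustness of speaker recognition for two-party conversational speech. Its main contributions are as follows:
1. Within the joint factor analysis framework, several confidence (scoring) measures are studied and compared, and a symmetric scoring form is proposed based on a first-order Taylor expansion. This measure overcomes the asymmetry between enrollment and test utterances in conventional scoring, so that the speaker-level similarity of any two given utterances is consistent and independent of their order.
2. Building on this, the meaning of score normalization in inner-product form is analyzed in depth and extended to the kernel functions of support vector machines. An implicit normalization criterion is introduced directly into the kernel, which avoids score-normalization post-processing in the system back end.
3. Mainstream speaker recognition algorithms are currently all based on a universal background Gaussian mixture model, and extracting the GMM sufficient statistics has long been the bottleneck for system speed. A data-driven Gaussian selection method is therefore proposed: the acoustic space is partitioned using data, and Gaussian lists are bound in advance using posterior probabilities, giving fast and efficient statistics extraction. Experiments show that the statistics extraction module runs about 10 times faster with almost no loss in performance.
4. For speaker diarization, the iVector technique, already mature in speaker recognition, is combined with the variational Bayesian method, so that during clustering each segment belongs to a speaker with some probability (a soft decision), and this posterior is iteratively refined with the EM algorithm. On the NIST-SRE2008 summed-channel test data, the diarization error rate drops from 13.8% to 6.88%, and further to 5.34% after re-segmentation.
5. For enrollment involving multiple summed-channel utterances, a PLDA model is proposed for extracting the common speaker, and several objective functions are formally described for choosing among the different combinations. On the 3summed-summed task of the NIST-SRE2008 evaluation, the equal error rate is reduced from the best result officially published by NIST (about 8%) to 4.05%.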
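The order-independence claimed in contribution 1 can be illustrated with a minimal sketch. It uses plain cosine similarity between two fixed-length utterance vectors as a stand-in for an inner-product score. This is an assumption made for illustration only and is not the JFA-based scoring formula derived in the thesis; the function name and the example vectors are hypothetical.

```python
import math


def symmetric_score(x, y):
    """Cosine similarity between two utterance-level vectors.

    Illustrative stand-in for an inner-product score; not the thesis's
    actual scoring formula.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical utterance vectors (e.g. low-dimensional speaker factors).
utt_a = [0.2, -1.0, 0.5]
utt_b = [0.1, -0.8, 0.9]

# Swapping the enrollment and test roles leaves the score unchanged.
forward = symmetric_score(utt_a, utt_b)
backward = symmetric_score(utt_b, utt_a)
```

Because the score depends only on the inner product and the two norms, neither utterance is privileged as "enrollment" or "test", which is the property the abstract describes.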
English abstract	The most challenging aspect of speaker recognition in telephone conversations is the intra-session variability in the summed channel. This thesis focuses on robust speaker diarization and recognition for two-speaker scenarios. The contributions are as follows:
1. We compare several confidence measures in the framework of joint factor analysis and derive a symmetric scoring method based on the first-order Taylor series approximation of the full likelihood calculation. This completely symmetrizes the problem, so that it no longer matters which utterance in a trial is used for enrollment and which for test.
2. Based on the symmetric scoring, we investigate various normalization methods and extend the implicit normalization formula to any confidence measure defined in the form of an inner product. Following the general form of symmetric normalization, we also modify the KL kernel to incorporate some forms of normalization in the kernel space.
3. Because GMMs dominate speaker-related fields and sufficient statistics extraction becomes a bottleneck, especially when the number of components grows to thousands, we propose a data-driven Gaussian component selection algorithm based on multi-layer acoustic space partitioning. It makes Baum-Welch statistics extraction about 10 times faster with almost no performance loss.
4. We apply variational Bayes in the context of the iVector representation for fuzzy clustering in speaker diarization, which proves more effective than traditional hierarchical agglomerative clustering. The diarization error rate decreases from 13.8% to 6.88%, and further to 5.34% after Viterbi re-segmentation.
5. Finally, we introduce the PLDA model into target speaker selection for enrollment from multiple summed-channel excerpts. We also propose and evaluate several objective functions to measure the purity of the selected segments, obtaining a much better equal error rate (4.05%) than the best system of NIST-SRE 2008 on the 3summed-summed test condition (∼8%).
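For readers unfamiliar with the equal error rate quoted above, the following is a minimal sketch of how an EER can be approximated from a set of trial scores. It is not the NIST scoring tool and makes simplifying assumptions: it sweeps only the observed scores as thresholds and reports the operating point where the miss rate and false-alarm rate are closest. The function name and the score lists are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER from target and non-target trial scores.

    A trial is accepted when its score is >= the threshold. The EER is
    taken at the threshold where miss and false-alarm rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # Miss: a target trial scored below the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial scored at or above the threshold.
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if gap < best_gap:
            best_gap, eer = gap, (miss + fa) / 2
    return eer


# Hypothetical scores: one target trial scores below some non-target trials.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
```

With these example scores, the two error rates meet at a threshold of 0.6, where one of four target trials is missed and one of four non-target trials is accepted.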
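Keywords	Speaker Recognition; Speaker Diarization; Factor Analysis; Gaussian Mixture Models; Bayesian Analysis
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6536
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Zhang Ce. Research on Speaker Diarization and Recognition over Telephone Channels (电话信道下说话人分离及识别研究)[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2013.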

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20101801462807 (1685KB)			Not yet open	CC BY-NC-SA