基于深度度量学习的说话人识别方法及应用研究 (Research on Speaker Recognition Methods and Applications Based on Deep Metric Learning)
吉瑞芳
Subtype: 硕士 (Master's)
Thesis Advisor: 徐波
Date: 2019-05-23
Degree Grantor: 中国科学院研究生院
Place of Conferral: Beijing (北京)
Degree Discipline: Pattern Recognition and Intelligent Systems (模式识别与智能系统)
Keyword: speaker identification, speaker verification, speaker identity subspace model, maximum marginal cosine loss
Abstract

In recent years, speaker recognition technology has gradually moved from the laboratory to the market. Faced with rapidly growing volumes of speech data and utterances with very short effective duration, traditional generative systems have begun to fall short. Meanwhile, deep neural networks have risen rapidly, demonstrating powerful computational and modeling capabilities in image recognition, video recognition, speech recognition, and other pattern recognition fields. Focusing on the massive-corpus and short-utterance problems encountered in applying speaker recognition technology, this thesis introduces deep neural networks from the perspectives of model construction and model training and conducts a series of explorations. The main research results are as follows:

1. Within the iVector system framework, a nonlinear metric learning method is proposed to address the low accuracy of speaker identification on massive corpora. A deep Independent Subspace Analysis (ISA) network replaces the traditional Linear Discriminant Analysis (LDA) method in supervising the extraction of speaker feature vectors, making the extracted features more discriminative. The network also effectively suppresses the influence of environmental noise, channel differences, and other factors, improving feature robustness. When the target corpus reaches 400k, the identification system achieves a Top-50 accuracy above 95%.

2. Two end-to-end deep neural network speaker verification systems are built to extract features effectively from short utterances. Since the iVector system cannot effectively extract features from utterances shorter than 15 s, this thesis builds two speaker verification systems, one based on a deep residual neural network and one based on an enhanced gated recurrent neural network. Exploiting the strong fitting and representation capabilities of deep neural networks, features are extracted under a variety of short-utterance duration conditions. Experiments show that under multi-language and multi-duration conditions both systems effectively represent speaker characteristics, achieving an equal error rate below 8% even at a duration of 2 s.

3. Two metric learning methods are proposed to supervise the training of the end-to-end short-utterance speaker verification systems. Borrowing from the contrastive loss the idea of computing the similarity between a sample and the class-center feature vector, this thesis proposes a speaker identity subspace model loss, which learns each speaker's identity feature vector during model training and avoids the training difficulties of the contrastive loss. To address the lack of an inter-class distance constraint in the softmax loss, angular-margin and class-margin constraints are added, yielding a metric learning method based on a maximum marginal cosine loss. The recognition performance of the two loss functions is tested under multi-language and multi-duration conditions on both deep neural network systems. Results show that, under identical conditions, systems trained with these metric learning methods clearly outperform systems trained with the softmax loss, with performance improvements of around 20%.

Other Abstract

In recent years, speaker recognition technology has gradually moved from the laboratory to the market. Faced with massively increasing voice data and the very short effective duration of short utterances, traditional generative systems have become inadequate. At the same time, deep neural networks have risen rapidly, showing strong computing and modeling capabilities in image recognition, video recognition, speech recognition, and other fields of pattern recognition. Based on deep learning, this work focuses on the massive-corpus and short-utterance problems encountered in the application of speaker recognition technology, conducting research on both model building and metric learning. The main research results are as follows:

1. In the framework of the iVector system, a nonlinear metric learning method is proposed to solve the problem of low speaker identification accuracy on a massive corpus. A Deep Independent Subspace Analysis (DISA) network is adopted, replacing the traditional Linear Discriminant Analysis (LDA) method in supervising speaker feature extraction. Experimental results show that with this nonlinear metric learning method the extracted features become more discriminative, and the system becomes more robust to environmental noise, channel differences, and other factors. When the target corpus reaches 400k, the Top-50 accuracy of the identification system exceeds 95%.
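The Top-50 identification step described above amounts to ranking enrolled speakers by similarity to a test embedding. A minimal sketch of that scoring stage, assuming cosine similarity over already-extracted embeddings (the function name and arrays are illustrative, not code from the thesis):

```python
import numpy as np

def cosine_top_k(test_emb, enrolled_embs, k=50):
    """Return indices of the k enrolled speakers most similar to the
    test embedding, best first. Hypothetical helper for illustration."""
    # Length-normalise so the dot product equals cosine similarity.
    e = enrolled_embs / np.linalg.norm(enrolled_embs, axis=1, keepdims=True)
    t = test_emb / np.linalg.norm(test_emb)
    sims = e @ t
    return np.argsort(-sims)[:k]
```

A Top-50 identification is then counted correct when the true speaker's index appears anywhere in the returned list.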

2. Two end-to-end deep neural network short-utterance speaker verification systems are built. To address the problem that the iVector system cannot effectively extract features from utterances shorter than 15 s, this paper establishes two speaker verification systems: one based on a deep residual neural network and one based on an enhanced gated recurrent neural network. By virtue of their strong fitting and representation abilities, the two deep neural networks extract features under multi-language and multi-duration conditions. Experiments show that in all cases both systems effectively represent speaker characteristics; even at a duration of 2 s, the equal error rate of both systems is below 8%.
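The equal error rate (EER) quoted above is the operating point where the false-accept and false-reject rates coincide. A generic sketch of how it can be estimated from verification scores, assuming higher scores mean "same speaker" (this is standard evaluation logic, not the thesis's own code):

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Estimate the EER by sweeping a decision threshold over all
    observed scores and locating where FAR and FRR cross."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    best_gap, best_eer = None, None
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)   # false-accept rate at threshold t
        frr = np.mean(genuine < t)     # false-reject rate at threshold t
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

With well-separated genuine and impostor score distributions the estimate goes to zero; heavy overlap pushes it toward 50%.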

3. Two metric learning methods are proposed to supervise the model training of the end-to-end short-utterance speaker verification systems. Borrowing from the contrastive loss the idea of computing the similarity between samples and class-center feature vectors, this paper proposes a new loss named the Speaker Identity Subspace Model Loss (SISML). This loss learns the identity vectors of speakers during model training, avoiding the training difficulties of the contrastive loss. Moreover, to address the lack of an inter-class distance constraint in the softmax loss, this work adds an angular-margin constraint and a class-margin constraint, yielding another metric learning method, the Maximum Marginal Cosine Loss (MMCL). The recognition performance of the two loss functions is tested under multi-language and multi-duration conditions on both deep neural network systems. Results show that, under the same conditions, systems based on the above metric learning methods significantly outperform the system based on the softmax loss, with performance generally improved by about 20%.
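A softmax loss with an additive cosine margin, as the MMCL description suggests, is known in the literature as the large-margin cosine loss. A minimal single-sample sketch under that assumption (the scale `s` and margin `m` values are illustrative defaults, not hyperparameters from the thesis):

```python
import numpy as np

def margin_cosine_loss(x, W, label, s=30.0, m=0.35):
    """Cross-entropy over cosine logits with an additive margin on the
    target class. x: (d,) embedding; W: (num_classes, d) class weights."""
    # L2-normalise so logits are pure cosine similarities scaled by s.
    xn = x / np.linalg.norm(x)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    cos = Wn @ xn
    logits = s * cos
    # Subtracting m forces the target cosine to beat the others by
    # at least m before the sample is classified confidently.
    logits[label] = s * (cos[label] - m)
    logits = logits - logits.max()            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[label])
```

Setting `m=0` recovers a plain normalised-softmax loss, so the margin's effect can be checked directly: for the same sample, the margined loss is strictly larger, which is what pushes embeddings of the same speaker into a tighter cone during training.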

Pages: 92
Language: Chinese (中文)
Document Type: Thesis (学位论文)
Identifier: http://ir.ia.ac.cn/handle/173211/23794
Collection: 数字内容技术与服务研究中心_听觉模型与认知计算
Recommended Citation
GB/T 7714
吉瑞芳. 基于深度度量学习的说话人识别方法及应用研究[D]. 北京: 中国科学院研究生院, 2019.
Files in This Item:
File Name/Size | DocType | Access | License
论文_吉瑞芳.pdf (7048KB) | 学位论文 (Thesis) | 开放获取 (Open Access) | CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.