Voiceprint Recognition System Based on Deep Features
Fang Shuo (方硕)
Degree type: Master of Engineering
Advisor: Tao Jianhua (陶建华)
2016-05
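Degree-granting institution: Graduate School of the Chinese Academy of Sciences
Place of degree conferral: Beijing
Keywords: voiceprint recognition; identity vector (I-Vector); total variability space model; deep neural network; deep bottleneck feature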
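Abstract: Voiceprint recognition is a biometric technology that automatically distinguishes speakers by their speech in order to identify and verify speaker identity. Text-independent voiceprint recognition in particular has become a research focus in both academia and industry because of the flexibility of its application settings. This thesis concentrates on text-independent voiceprint recognition and aims to build a complete, well-performing voiceprint recognition system. Four modeling methods are studied and implemented as systems, covering: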
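1. The GMM-UBM voiceprint recognition system is introduced and, building on it, a TVM-I-Vector voiceprint recognition system is studied and implemented. The Gaussian Mixture Model-Universal Background Model (GMM-UBM) projects acoustic features into a high-dimensional space, yielding a high-dimensional mean supervector. The UBM is trained on a large multi-speaker corpus and can therefore be used to describe the characteristics that speakers have in common. Then, with the UBM as the initial model, Maximum A Posteriori (MAP) adaptation on the target speaker's data yields the target speaker's Gaussian mixture model (GMM). Speakers are scored by computing likelihoods, and the recognition decision follows from the score. The identity vector (I-Vector) model builds on GMM-UBM and rests on the assumption that all speaker information is contained in the high-dimensional mean supervector. Its basic idea is to project the mean supervector into a low-dimensional space for modeling: utterances of varying length are mapped, through the Total Variability space Model (TVM), to low-dimensional vectors of fixed length that serve as speaker models. This low-dimensional vector is the I-Vector. Because I-Vector modeling does not separate speaker information from channel information in the speech, this thesis applies Linear Discriminant Analysis (LDA) and Probabilistic Linear Discriminant Analysis (PLDA) to the I-Vectors for channel compensation, reducing the influence of the channel and improving recognition performance.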
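2. An I-Vector voiceprint recognition system based on DNN statistics extraction is studied and implemented. In the TVM-I-Vector system, the UBM computes the posterior probabilities from which the relevant statistics are extracted to estimate the total variability space. Because the purely data-driven way in which the UBM is built may lead to large errors in the posterior probabilities and thus hurt recognition performance, this thesis replaces the UBM with a Deep Neural Network (DNN) trained for the Automatic Speech Recognition (ASR) task as the model that computes the posteriors, so that more accurate statistics are available for model estimation and recognition performance improves.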
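3. An I-Vector voiceprint recognition system based on Deep Bottleneck Features (DBF) is studied and implemented. The successful use of the deep bottleneck layer in speech recognition has demonstrated the advantages of DBF as a feature representation. The fact that the low-dimensional DBF can carry the mapping from high-dimensional input to output within the network indicates that it is an abstract, compact and more discriminative feature. Compared with the Mel Frequency Cepstral Coefficients (MFCC) commonly used in voiceprint recognition, DBF also shows some advantage on this task. In this thesis, I-Vector model estimation based on DBF is implemented. In addition, since MFCC and DBF are complementary, I-Vector modeling on fused DBF and MFCC features is carried out to further improve the performance of the voiceprint recognition system.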
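The MAP adaptation step described in item 1 can be illustrated with a minimal sketch. It assumes the common mean-only form of adaptation with a relevance factor, which the abstract does not specify; the function name `map_adapt_means`, the default relevance value and the list-based data layout are hypothetical choices made for illustration, not the thesis's implementation.

```python
def map_adapt_means(frames, posteriors, ubm_means, relevance=16.0):
    """Return MAP-adapted component means (mean-only adaptation).

    frames:     list of T feature vectors (each a list of D floats)
    posteriors: list of T rows, each holding C component posteriors
    ubm_means:  list of C UBM mean vectors (each a list of D floats)
    relevance:  positive relevance factor (hypothetical default,
                not taken from the thesis)
    """
    num_comp = len(ubm_means)
    dim = len(ubm_means[0])

    # Zeroth-order statistics: soft frame count per component.
    counts = [0.0] * num_comp
    # First-order statistics: posterior-weighted sum of frames.
    sums = [[0.0] * dim for _ in range(num_comp)]
    for x, gamma in zip(frames, posteriors):
        for c in range(num_comp):
            counts[c] += gamma[c]
            for d in range(dim):
                sums[c][d] += gamma[c] * x[d]

    adapted = []
    for c in range(num_comp):
        # alpha grows toward 1 as more target data is assigned to c.
        alpha = counts[c] / (counts[c] + relevance)
        if counts[c] > 0.0:
            data_mean = [s / counts[c] for s in sums[c]]
        else:
            # No data for this component: keep the UBM mean.
            data_mean = list(ubm_means[c])
        adapted.append([alpha * e + (1.0 - alpha) * m
                        for e, m in zip(data_mean, ubm_means[c])])
    return adapted
```

Components that receive little target-speaker data stay close to the UBM, while well-observed components move toward the speaker's own data. Stacking the adapted means of all components gives the mean supervector mentioned in the abstract.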
Alternative abstract: Voiceprint identification is an important branch of biometric identification that distinguishes the identity of a speaker according to the characteristics of speech. Text-independent voiceprint identification has become a research hotspot because of its flexibility in application. This thesis focuses on text-independent voiceprint identification and aims to realize a well-performing system. Four modeling methods are studied as listed below, and the corresponding systems are constructed.
1. TVM-I-Vector systems are studied and implemented, and voiceprint identification systems based on GMM-UBM are introduced as their foundation. The Gaussian Mixture Model-Universal Background Model (GMM-UBM) projects the acoustic features onto a high-dimensional space, yielding mean supervectors. First, a large multi-speaker training corpus is used to train a universal background model (UBM) that describes the common characteristics of speakers. Then, taking the UBM as the initial model, Maximum A Posteriori (MAP) adaptation is performed on the target speaker's adaptation data to obtain the target speaker's Gaussian mixture model (GMM). Scores are obtained by calculating the likelihood, and the identity decision is made from them. The identity vector (I-Vector) model, which builds on GMM-UBM, projects the mean supervector to a low-dimensional vector. That is, it maps voice files of varying length to a fixed-length vector that serves as the speaker model, using the Total Variability space Model (TVM). The low-dimensional vector is called the I-Vector. Since I-Vector modeling does not distinguish speaker information from channel information, Linear Discriminant Analysis (LDA) and Probabilistic Linear Discriminant Analysis (PLDA) are used for I-Vector channel compensation to reduce the effect of channel variations and improve recognition performance.
2. An I-Vector voiceprint recognition system is studied and implemented in which a Deep Neural Network (DNN) from Automatic Speech Recognition (ASR) is used to extract the sufficient statistics. In voiceprint recognition systems based on TVM-I-Vector, the posterior probability is calculated by a UBM in order to extract the sufficient statistics. The data-driven training approach of the UBM may lead to errors in the posterior probability calculation and thus degrade recognition performance. We therefore propose to obtain the posterior probability from the DNN of an ASR system, which gives more accurate statistics for model estimation and improves recognition performance.
3. I-Vector voiceprint recognition systems based on Deep Bottleneck Features (DBF) are studied and implemented. The deep bottleneck layer has been applied successfully in speech recognition, which demonstrates the advantage of DBF as a feature representation. DBF is the output of a hidden layer whose number of nodes is set smaller than that of the other hidden layers. It is highly compressed, abstract and discriminative. Compared with Mel Frequency Cepstral Coefficients (MFCC), DBF performs better in voiceprint identification systems. Taking into account the complementary roles of MFCC and DBF, we carried out feature fusion to further improve the performance of the voiceprint identification system.
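To make the bottleneck idea in item 3 concrete, here is a small sketch of reading DBF from a narrow hidden layer and fusing it with MFCC. It rests on several assumptions that the abstract does not state: sigmoid hidden layers, taking the bottleneck layer's activations (rather than its linear pre-activations) as the feature, and fusion by frame-level concatenation. All function names and the layer layout are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def layer_forward(x, weights, bias):
    """One fully connected sigmoid layer.

    weights holds one row of input weights per output unit.
    """
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def bottleneck_features(x, layers, bottleneck_index):
    """Run x through layers[0..bottleneck_index] and return the
    activations of the bottleneck layer; later layers are ignored."""
    h = x
    for weights, bias in layers[: bottleneck_index + 1]:
        h = layer_forward(h, weights, bias)
    return h


def fuse(mfcc_frame, dbf_frame):
    """Frame-level fusion by concatenation (an assumed fusion scheme)."""
    return list(mfcc_frame) + list(dbf_frame)


# Toy network: 2 inputs -> 3 hidden units -> 1 bottleneck unit -> 2 units.
toy_layers = [
    ([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),
    ([[0.0, 0.0, 0.0]], [0.0]),
    ([[0.0], [0.0]], [0.0, 0.0]),
]
dbf = bottleneck_features([1.0, 2.0], toy_layers, bottleneck_index=1)
fused = fuse([1.0, 2.0], dbf)
```

In a real system the network would be trained on an ASR task first and the layers after the bottleneck discarded at extraction time. The fused frames would then replace plain MFCC as input to the I-Vector pipeline.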
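Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/11772
Collection: Graduates_Master's Theses
Author affiliation: Institute of Automation, Chinese Academy of Sciences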
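Recommended citation
GB/T 7714
Fang Shuo. Voiceprint Recognition System Based on Deep Features[D]. Beijing: Graduate School of the Chinese Academy of Sciences, 2016.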
Files in this item
File name/size: 基于深度特征的声纹识别系统.pdf (2656KB); document type: thesis; access: not yet open; license: CC BY-NC-SA