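Research on Speech Separation Based on Computational Auditory Scene Analysis and Speaker Models (基于计算听觉场景分析和语者模型的语音分离研究)
Alternative title: Speech Segregation Based on Computational Auditory Scene Analysis and Speaker Model
Guan Yong (关勇)
Degree type: Doctor of Engineering
Supervisor: Liu Wenju (刘文举)
2008-05-24
Degree-granting institution: Graduate School of the Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Degree discipline: Pattern Recognition and Intelligent Systems
Keywords: Speech Separation; Computational Auditory Scene Analysis; Multi-pitch Tracking; Harmonics Modeling; Speaker Recognition; Real Mask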
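Abstract: An important problem in speech signal processing is how to separate the speech of interest from a mixed speech signal. In practical applications such as speech recognition, audio retrieval, and hearing aids, several speakers' voices are often present at once in a single channel. Because only one channel signal is available while multiple highly non-stationary speech signals coexist, many CASA systems fail to achieve satisfactory separation performance on multi-speaker mixtures. Separating such mixed speech is therefore a challenging task.

This thesis combines computational auditory scene analysis (CASA) with speaker acoustic models to explore algorithms for separating single-channel mixed speech that contains multiple simultaneous speakers. The main work and contributions are as follows.

Multi-pitch tracking and its use in mixed-speech separation. When several speakers are present, the mixture may contain several pitches, so accurately extracting each speaker's pitch and using it to group that speaker's components should improve separation. Following this idea, the thesis studies a multi-pitch tracking algorithm that detects the presence of pitch from the piecewise continuity, across frequency bands, of the multi-channel time-frequency decomposition, and tracks the multiple pitch contours by morphological filtering. The multiple pitches estimated by the tracker are then incorporated together as separation cues into the CASA system, which improves the separation performance of the whole system.

Fine harmonic structure modeling and its applications. Time-frequency decomposition provides rich mid-level representation information, but existing studies have not made full use of it, so the thesis investigates the fine harmonic structure of each frequency band in depth. Pitch is reflected as harmonics in the response of each filtered frequency channel. The thesis studies how harmonics are distributed in the low-frequency channels and, from this, builds a distribution template relating frequency channel, first peak, and pitch, together with a harmonic template. The harmonic template is applied to multi-pitch detection and to re-synthesizing speech by harmonic reconstruction, which improves the corresponding multi-pitch detection and speech separation algorithms.

Multi-speaker recognition. For mixed speech with several simultaneous speakers, if the speakers present in the mixture can be detected, the corresponding speaker models can be used to bring high-level speaker information into the separation system. The thesis therefore studies a two-stage multi-speaker recognition algorithm for detecting multiple speakers in mixed speech. The first stage introduces a likelihood score limit parameter and a gain compensation parameter and uses confidence scores to produce a candidate speaker list. The second stage uses a composite model algorithm, within the conventional speaker recognition framework, to find the best speaker combination, and a corresponding fast algorithm is developed. Experimental results show that the two-stage algorithm accurately detects the speakers present in mixed speech and provides reliable candidate models for the subsequent speech separation.

Application of speaker models to speech separation. Guiding separation with high-level speech knowledge is a schema-driven CASA approach, and the thesis studies in depth the use of speaker acoustic models in speech separation. The multi-speaker recognition results are used to select the corresponding speaker models; within the CASA framework, the speaker models are used to infer the mask and re-synthesize speech, which improves separation performance. To address the missing spectral components in speech re-synthesized with binary masks, the thesis uses speaker model information to estimate real-valued masks, and subsequent speech recognition experiments confirm the effectiveness of real-valued masks relative to binary masks.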
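As a rough illustration of the link between pitch and autocorrelation peaks that the abstract relies on, the sketch below estimates a single pitch from a synthetic harmonic tone. It is not the thesis's multi-pitch tracker: the function names, the 80 to 400 Hz search range, the sampling rate, and the test signal are all assumptions chosen for illustration, and the method shown is a generic single-channel autocorrelation estimator.

```python
import math

def autocorrelation(x, max_lag):
    """Return the unnormalized autocorrelation r[0..max_lag] of sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def estimate_pitch(x, fs, fmin=80.0, fmax=400.0):
    """Estimate one pitch in Hz from the largest autocorrelation value
    whose lag corresponds to a frequency between fmin and fmax.
    The search range is an assumed, typical range for voiced speech."""
    min_lag = int(fs / fmax)
    max_lag = int(fs / fmin)
    r = autocorrelation(x, max_lag)
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: r[lag])
    return fs / best_lag

# Hypothetical test signal: a 200 Hz fundamental plus two weaker harmonics.
fs = 8000
f0 = 200.0
signal = [sum(math.sin(2 * math.pi * k * f0 * t / fs) / k for k in (1, 2, 3))
          for t in range(800)]

pitch = estimate_pitch(signal, fs)
print(pitch)  # 200.0 for this synthetic tone
```

With two simultaneous speakers a single global maximum is no longer enough, which is presumably why the thesis works per frequency channel and tracks several contours over time.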
Other abstract: In speech signal processing, an important problem is separating the target speech from mixed speech. In applications such as speech recognition, audio retrieval, and hearing aids, there may be multiple speech sources in one channel, so multiple highly non-stationary signals exist simultaneously. How to separate the target speech from such mixtures is therefore a challenging problem. In this thesis, we integrate computational auditory scene analysis (CASA) and speaker acoustic models to separate target speech from mixed speech with multiple human sound sources.

Many CASA systems cannot separate multi-speaker mixtures well. We propose a multi-pitch tracking algorithm that first uses the piecewise continuity in the multi-channel time-frequency decomposition to detect pitches, and then tracks the multiple pitch contours by mathematical morphology filtering. Combining the pitch contours of the target speech and of the interference, as obtained by multi-pitch tracking, improves the separation of multi-speaker mixtures.

Time-frequency decomposition carries rich mid-level representation information, but earlier studies did not use it sufficiently, so we explore the fine structure of harmonics in depth. We study the distribution of harmonics in the low-frequency channels and build a distribution template relating channel, first peak position in the autocorrelation, and pitch, together with a harmonics template. We then apply the harmonics template to multi-pitch detection and harmonics reconstruction, improving the performance of both.

For mixed speech with multiple speakers, if the speaker identities in the mixture can be detected, the corresponding speaker acoustic models can be applied in the speech separation system. We study two-stage multi-speaker recognition from mixed speech. In the first stage, we use a confidence measure score algorithm, which introduces a likelihood score limit parameter and a gain compensation parameter, to obtain a candidate speaker list. In the second stage, we use a composite speaker model to search for the best speaker combination, and we also develop a fast algorithm. Experimental results show that the proposed two-stage algorithm detects the speaker identities in mixed speech accurately and supplies reliable candidate models for the subsequent speech separation.

We study thoroughly the application of speaker acoustic models in speech separation. We use the results of multi-speaker recognition to select speaker models and, within the CASA framework, apply the speaker models to infer masks and re-synthesize speech signals. Because speech re-synthesized with binary masks has missing spectral components, we use speaker model information to estimate real masks, and experiments show that the real masks outperform conventional binary masks.
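To make the contrast between binary and real masks concrete, here is a minimal sketch over a tiny grid of time-frequency units. It assumes the target and interference energies are known per unit and that they add in the mixture, whereas the thesis infers its masks from speaker models. The ratio form used for the real mask, the function names, and the numbers are illustrative assumptions, not the thesis's formulation.

```python
def binary_mask(target, interference):
    """1.0 where the target energy exceeds the interference energy, else 0.0."""
    return [[1.0 if t > n else 0.0 for t, n in zip(t_row, n_row)]
            for t_row, n_row in zip(target, interference)]

def ratio_mask(target, interference, eps=1e-12):
    """Real-valued mask in [0, 1]: the target's share of the energy in each unit.
    This ratio form is an assumed stand-in for a model-estimated real mask."""
    return [[t / (t + n + eps) for t, n in zip(t_row, n_row)]
            for t_row, n_row in zip(target, interference)]

def apply_mask(mixture, mask):
    """Weight each time-frequency unit of the mixture by its mask value."""
    return [[x * m for x, m in zip(x_row, m_row)]
            for x_row, m_row in zip(mixture, mask)]

# Hypothetical energies: rows are frequency channels, columns are time frames.
target = [[4.0, 1.0], [0.5, 3.0]]
interference = [[1.0, 3.0], [0.5, 1.0]]
mixture = [[t + n for t, n in zip(t_row, n_row)]
           for t_row, n_row in zip(target, interference)]

b_mask = binary_mask(target, interference)
r_mask = ratio_mask(target, interference)
b_out = apply_mask(mixture, b_mask)
r_out = apply_mask(mixture, r_mask)
```

In this toy case the binary mask zeroes every unit where the interference is at least as strong as the target, so the target energy in those units is lost, which corresponds to the missing spectral components described in the abstract. The real-valued mask keeps a scaled share of each unit instead.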
Holdings number: XWLW1243
Other identifier: 200418014628050
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/6072
Collection: Graduates_Doctoral Dissertations
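Recommended citation
GB/T 7714
Guan Yong (关勇). Speech Segregation Based on Computational Auditory Scene Analysis and Speaker Model (基于计算听觉场景分析和语者模型的语音分离研究) [D]. Institute of Automation, Chinese Academy of Sciences. Graduate School of the Chinese Academy of Sciences, 2008.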
Files in this item
File name/size: CASIA_20041801462805 (773KB)
Access type: Not yet open
License: CC BY-NC-SA