CASIA OpenIR > Graduates > Doctoral Dissertations
Research on Monaural Mixed Speech Separation Based on Computational Auditory Scene Analysis
Alternative Title: Monaural Speech Separation Based on Computational Auditory Scene Analysis
Author: Li Peng (李鹏)
Degree Type: Doctor of Engineering
Supervisor: Xu Bo
Date: 2007-10-19
Degree-Granting Institution: Graduate University of Chinese Academy of Sciences
Degree-Granting Location: Institute of Automation, Chinese Academy of Sciences
Major: Pattern Recognition and Intelligent Systems
Keywords: Monaural Speech Separation; Computational Auditory Scene Analysis; Binary Mask; Objective Quality Assessment of Speech; Multi-pitch Tracking; Factorial-max Vector Quantization
Abstract: In real environments, a speech signal is usually accompanied by other noise by the time it reaches the auditory system. A separation system that can effectively extract the target speech from interfering sources is of great importance for applications such as automatic speech recognition, speaker identification, audio retrieval, and digital content management. Current research on mixed-speech separation is largely concentrated in two areas: blind source separation and computational auditory scene analysis (CASA). This thesis investigates monaural mixed-speech separation algorithms in depth from the CASA perspective. The main contributions and novelties are as follows:

1. An exploratory study of perceptual quality as an optimization criterion for speech separation, with initial results confirming that perceptual quality can indeed serve as such a criterion. On this basis, the thesis examines how speech perceptual quality, as high-level knowledge, can be combined with CASA, and proposes a monaural mixed-speech separation system based on CASA and objective quality assessment of speech (OQAS). The system takes objective perceptual quality as the optimization target and criterion for auditory scene analysis and separation, successfully applying OQAS mechanisms within a separation system and improving the quality of the separated speech.

2. To address the usage restrictions and heavy computational cost of the ITU-T P.563 objective quality standard, a mixed-speech separation system is proposed that replaces the P.563 algorithm with an objective quality assessment method based on temporal-envelope information. The system greatly reduces the running time and resource consumption while leaving separation performance almost unchanged.

3. Because many CASA systems handle multi-speaker mixtures poorly, a monaural mixed-speech separation system based on multi-pitch tracking is proposed. It makes full use of recent advances in multi-pitch tracking: by incorporating the pitch contours of the target and interfering speech obtained from multi-pitch tracking into the separation system, it effectively improves separation of multi-speaker mixtures.

4. Because most CASA systems cannot separate unvoiced speech, a mixed-speech separation method based on CASA and factorial-max vector quantization (FMVQ) is proposed that separates voiced and unvoiced speech simultaneously. The method uses machine learning to learn grouping cues from isolated clean speech of individual speakers, and uses an FMVQ model to infer the mask signals needed in the CASA resynthesis stage, thereby separating the target and interfering speakers. Experiments show that the method effectively separates unvoiced speech and also performs well on two-speaker mixtures. In addition, the system can serve as a robust front end for automatic speech recognition, improving recognition performance.
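The binary time-frequency mask named in the keywords is the standard CASA output representation: keep the cells of the mixture spectrogram where the target dominates the interference. A minimal NumPy sketch of an ideal binary mask, not taken from the thesis; the function names and the 0 dB local criterion are illustrative:

```python
import numpy as np

def ideal_binary_mask(target_spec, interf_spec, lc_db=0.0):
    """Return a 0/1 mask: 1 where target power exceeds interference by lc_db."""
    snr_db = 10 * np.log10((np.abs(target_spec) ** 2 + 1e-12)
                           / (np.abs(interf_spec) ** 2 + 1e-12))
    return (snr_db > lc_db).astype(float)

def apply_mask(mixture_spec, mask):
    """Zero out the interference-dominated cells of the mixture spectrogram."""
    return mixture_spec * mask

# Toy 2x2 "spectrograms": target dominates on the diagonal.
target = np.array([[3.0, 0.1], [0.2, 4.0]])
interf = np.array([[0.5, 2.0], [3.0, 0.3]])
mask = ideal_binary_mask(target, interf)
separated = apply_mask(target + interf, mask)
```

In a full CASA system the masked spectrogram would then be resynthesized back to a waveform; the sketch stops at the masking step.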
Other Abstract: In natural environments, a speech signal is frequently accompanied by other sound sources by the time it reaches the auditory system. It is valuable to give computers the human ability to segregate a target source from interfering sources. An effective separation system can greatly facilitate many applications, including automatic speech recognition, speaker identification, audio retrieval, and digital content management. Research on speech separation currently focuses on blind source separation and computational auditory scene analysis (CASA). This thesis studies monaural speech separation with CASA in depth. The main contributions and novelties include:

1. The possibility of using the perceptual quality of speech as the optimization criterion for separation is explored in depth, and preliminary results confirm its validity. On this basis, we discuss how to combine speech perceptual quality with CASA, and propose a monaural speech separation system based on CASA and objective quality assessment of speech (OQAS). The proposed system uses OQAS as the objective and criterion in CASA for speech separation; it combines OQAS with CASA successfully, improving not only the SNR but also the perceptual quality of the separated speech.

2. Considering the application restrictions and the heavy time and computational load of the ITU-T P.563 algorithm, we construct a system that combines CASA with a temporal-envelope-based OQAS algorithm in place of P.563. The new system markedly reduces time and load consumption while leaving separation performance almost unchanged.

3. Many CASA systems cannot separate multi-speaker mixtures well. To address this, a CASA system based on multi-pitch tracking is proposed. The system incorporates recent advances in multi-pitch tracking into CASA: by combining the pitch contours of the target and interfering speech obtained by multi-pitch tracking, the separation of multi-speaker mixtures is improved effectively.

4. Segregation of unvoiced speech remains one of the most difficult problems in CASA. This thesis proposes a new method based on CASA and factorial-max vector quantization that processes voiced and unvoiced speech simultaneously. Machine learning is used to train grouping cues on isolated clean data from individual speakers. A factorial-max vector quantization model infers the mask signals, which are then passed to the resynthesis module to accomplish separation. Experiments on a standard corpus show that the method separates not only unvoiced speech but also two-speaker mixtures very well. Moreover, it can serve as a robust front end to improve ASR performance.
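The factorial-max vector quantization step in contribution 4 can be illustrated under the common max approximation for log-spectra: the mixture is modeled as the elementwise maximum of one codeword per speaker, and the inferred mask keeps the cells where the target's codeword dominates. A hedged brute-force sketch, with hypothetical codebooks, sizes, and function names (a real system would use learned codebooks and a more efficient search):

```python
import numpy as np

def fmvq_infer_mask(mix_logspec, codebook_a, codebook_b):
    """Search all codeword pairs under the max model and return a 0/1 mask
    that keeps the frequency bins where speaker A's codeword dominates."""
    best_err, best_pair = np.inf, None
    for ca in codebook_a:
        for cb in codebook_b:
            # max model: mixture log-spectrum ~ elementwise max of the pair
            err = np.sum((np.maximum(ca, cb) - mix_logspec) ** 2)
            if err < best_err:
                best_err, best_pair = err, (ca, cb)
    ca, cb = best_pair
    return (ca > cb).astype(float)

# Toy example: 3-bin log-spectra, two 2-codeword codebooks.
cb_a = np.array([[0.0, 5.0, 1.0], [4.0, 0.0, 0.0]])
cb_b = np.array([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]])
mix = np.array([5.0, 5.0, 1.0])  # close to max(cb_a[0], cb_b[0])
mask = fmvq_infer_mask(mix, cb_a, cb_b)
```

The returned mask plays the same role as the binary mask in the CASA resynthesis stage: it marks the time-frequency cells attributed to the target speaker.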
Call Number: XWLW1155
Other Identifier: 200318014603011
Language: Chinese
Document Type: Dissertation
Item Identifier: http://ir.ia.ac.cn/handle/173211/6037
Collection: Graduates_Doctoral Dissertations
Recommended Citation:
GB/T 7714
Li Peng. Research on Monaural Mixed Speech Separation Based on Computational Auditory Scene Analysis [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2007.
Files in This Item:
File Name/Size | Document Type | Version Type | Access Type | License
CASIA_20031801460301 (1867KB) | Restricted | CC BY-NC-SA | Request full text
Unless otherwise stated, all content in this system is protected by copyright, and all rights are reserved.