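Research on Optimization Modeling Methods for Multi-Channel Speech Enhancement (多通道语音增强优化建模方法研究)
Li Guanjun (李冠君)
2021-05-31
Pages: 132
Degree type: Doctoral
Chinese Abstract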

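  In human-machine voice interaction, the speech signal received at the microphones is often disturbed by ambient noise, which degrades the performance of automatic speech recognition systems. Multi-channel speech enhancement aims to use multiple microphones to remove ambient noise while leaving speech quality as unaffected as possible. As noise scenarios grow more complex, multi-channel speech enhancement has become an important front-end processing module for human-machine voice interaction. Research on optimization modeling for multi-channel speech enhancement therefore has both theoretical significance and practical value.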

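  At present, multi-channel speech enhancement algorithms still have deficiencies in signal representation and problem constraints in certain scenarios. Building on a careful review of existing multi-channel speech enhancement algorithms, this thesis explores optimization modeling methods for multi-channel speech enhancement across different noise scenarios. The main work and contributions are summarized as follows: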

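  (1) For non-directional noise scenarios, existing dereverberation algorithms usually need a long filter to suppress reverberation. In many practical cases, however, filters of insufficient length are used because of computational cost. Such a filter cannot fully model the reverberant signal, which degrades performance. To address this, we propose an adaptive dereverberation algorithm for deficient filter lengths. The proposed algorithm tracks in real time the modeling error caused by the deficient filter length and compensates for that error in the signal model. Experiments show that it achieves better dereverberation performance without increasing computational complexity.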

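  (2) For directional noise scenarios, SpeakerBeam is a commonly used algorithm for extracting a target speaker's signal from a mixture of several speakers. Existing multi-channel SpeakerBeam algorithms focus on the spectral cues of speech and make little use of its spatial diversity. To address this, this thesis proposes a direction-aware target speaker extraction algorithm that integrates SpeakerBeam with beamforming. It uses both spectral and spatial cues to sense the direction of the target signal and attends more effectively to sound arriving from the target direction, thereby enhancing the target speech better. Experiments show that the algorithm significantly improves target speaker extraction and speech enhancement, especially when the interfering speaker is of the same gender as the target.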

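  (3) For complex noise scenarios, existing multi-channel post-filter algorithms usually depend on estimates of the number and directions of the directional noise sources, and these estimates are usually hard to obtain in real environments. To address this, this thesis proposes a post-filter algorithm that does not depend on the number or directions of the directional noise sources. It uses a probabilistic model to characterize quantitatively the spatial correlation matrix of the directional noise in complex noise scenarios. Experiments show that when the number and directions of the directional noise sources cannot be estimated accurately, the proposed post-filter outperforms the compared algorithms, indicating greater potential in practical applications.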

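  (4) For complex noise scenarios in which automatic speech recognition (ASR) is the goal, the conventional generalized sidelobe canceller (GSC), which is optimized with signal-level criteria, cannot guarantee optimal ASR results. To address this, this thesis proposes a deep-neural-network speech enhancement method based on the GSC that can be optimized directly with the ASR objective as its criterion, and so reduces the ASR error rate more effectively than existing methods. Because the network is built on knowledge of the conventional GSC, it is both interpretable and learnable. Systematic experiments verify the advantages of the proposed method and its effectiveness in improving ASR performance.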

 

English Abstract

  In human-machine voice interaction, the speech signal received at the microphones is often corrupted by ambient noise, which degrades the performance of automatic speech recognition. Multi-channel speech enhancement aims to use multiple microphones to reduce ambient noise while keeping the desired speech signal as unaffected as possible. With the increasing complexity of noise scenarios, multi-channel speech enhancement has become a crucial front-end processing module for human-machine voice interaction. Studying optimization modeling for multi-channel speech enhancement therefore has important theoretical significance and application value.

  At present, multi-channel speech enhancement algorithms still have shortcomings in signal representation and problem constraints in some specific scenarios. In this thesis, building on a comprehensive review of state-of-the-art multi-channel speech enhancement algorithms, we explore optimization modeling for multi-channel speech enhancement across different noise scenarios. The main contributions and novelties of this thesis are summarized as follows:

  (1) For non-directional noise scenarios, existing dereverberation algorithms require a long filter to suppress reverberation. However, in many practical situations, a deficient-length filter, whose length is less than the reverberation time, is employed to limit computational cost. A deficient-length filter fails to fully model the reverberation, which degrades performance. To address this, we propose a new adaptive dereverberation algorithm that improves dereverberation performance when a deficient-length filter is used. The proposed algorithm tracks and compensates for the modeling error caused by the deficient-length filter in real time. The experiments demonstrate that the proposed algorithm achieves better dereverberation performance without increasing the computational complexity.

  (2) For directional noise scenarios, SpeakerBeam is a common algorithm for extracting the target speaker's signal in a multi-speaker environment. The existing multi-channel SpeakerBeam uses the spectral features of the signals but neglects the spatial discriminability offered by multi-channel processing. In this work, we tightly integrate spectral and spatial information to propose a direction-aware SpeakerBeam, which works together with a beamforming algorithm to enhance the signal from the target direction. The experimental results show that the proposed algorithm significantly improves the performance of target speaker extraction, especially in same-gender scenarios.

  (3) For complex noise scenarios, the existing multi-channel post-filter algorithms rely on accurate estimation of the number and directions of the directional noise sources, which is difficult in practical situations. This motivates us to propose a new post-filter that is independent of the number and directions of the directional noise sources; it uses a probabilistic model to describe the spatial covariance matrix of the directional noise. The experiments show that the proposed post-filter has greater practical potential in scenarios where the number and directions of the directional noise sources cannot be accurately estimated.

  (4) For complex noise scenarios with automatic speech recognition (ASR) as the target, the traditional generalized sidelobe canceller (GSC) uses signal-level criteria and fails to guarantee optimal ASR results. For this reason, we propose a deep-neural-network-based GSC, which is optimized with an ASR criterion to achieve a lower character error rate than the traditional GSC. At the same time, the proposed GSC is interpretable and learnable. Extensive experiments verify the advantages of the proposed GSC and its effectiveness in improving ASR results.
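For readers unfamiliar with the GSC structure mentioned in contribution (4), the following is a minimal sketch of a conventional, signal-level GSC. It is not the neural-network method proposed in the thesis. It assumes two microphones, a target that arrives time-aligned at both, and a plain LMS update; the function names `gsc_two_mic` and `demo`, the step size, the number of taps, and the toy signals are all illustrative assumptions rather than anything taken from the thesis.

```python
import math
import random


def gsc_two_mic(x1, x2, taps=4, mu=0.01):
    """Conventional two-microphone GSC (assumes a time-aligned target)."""
    w = [0.0] * taps        # adaptive noise canceller weights
    u_hist = [0.0] * taps   # most recent blocking-branch samples
    out = []
    for a, b in zip(x1, x2):
        d = 0.5 * (a + b)   # fixed beamformer: passes the aligned target
        u = a - b           # blocking matrix: cancels the aligned target
        u_hist = [u] + u_hist[:-1]
        # Subtract the noise estimate formed from the blocking branch.
        y = d - sum(wi * ui for wi, ui in zip(w, u_hist))
        # LMS update: reduces output power, i.e. a signal-level criterion.
        w = [wi + mu * y * ui for wi, ui in zip(w, u_hist)]
        out.append(y)
    return out


def tail_mse(z, ref, start):
    """Mean squared error between z and ref from index start onwards."""
    n = len(ref) - start
    return sum((z[t] - ref[t]) ** 2 for t in range(start, len(ref))) / n


def demo(n=4000, seed=0):
    """Toy comparison of the fixed beamformer alone against the full GSC."""
    rng = random.Random(seed)
    s = [math.sin(2 * math.pi * 0.01 * t) for t in range(n)]  # toy target
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]               # toy noise
    # Same target on both channels, different noise gain on each channel.
    x1 = [si + vi for si, vi in zip(s, v)]
    x2 = [si + 0.5 * vi for si, vi in zip(s, v)]
    fbf = [0.5 * (a + b) for a, b in zip(x1, x2)]
    y = gsc_two_mic(x1, x2)
    # Compare errors after the adaptive filter has had time to converge.
    return tail_mse(fbf, s, n - 1000), tail_mse(y, s, n - 1000)
```

Calling `demo()` returns the residual error of the fixed beamformer alone and of the full GSC on the toy signals. The update rule minimizes output power, which is the kind of signal-level criterion that the abstract contrasts with optimizing directly for ASR.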

 

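Keywords: multi-channel speech enhancement, non-directional noise scenarios, directional noise scenarios, complex noise scenarios, automatic speech recognition
Language: Chinese
Seven major directions, sub-direction classification: Speech recognition and synthesis
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/44556
Collection: Graduates_Doctoral Theses
Recommended citation
GB/T 7714
Li Guanjun. Research on Optimization Modeling Methods for Multi-Channel Speech Enhancement [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2021.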
Files in this item
File name/size: 李冠君-博士论文.pdf (5732 KB)
Document type: Thesis
Access type: Restricted
License: CC BY-NC-SA