Research on Speech Front-End Enhancement and Separation Algorithms in Complex Scenarios
李晨星
2020-06
Pages: 136
Degree Type: Doctoral
Chinese Abstract

Speech is one of the most natural ways for humans to interact with machines: through speech processing technology, human intent can be conveyed directly to a machine. Near-field speech recognition and speaker recognition have already achieved very good performance, but in far-field environments the speech signal is inevitably corrupted by noise, reverberation, and interfering speakers, so its intelligibility and perceptual quality degrade severely, which in turn harms downstream speech processing, whereas speech front-end enhancement and separation can markedly restore the purity of the signal. Speech front-end enhancement and separation aims, in complex acoustic scenes, to remove the effects of noise and reverberation and to separate mixtures of speakers while preserving speech quality as much as possible. It is of great practical value for speech recognition, speaker recognition, and speech communication, and is one of the most critical core technologies and important research topics in speech signal processing.

In recent years, speech enhancement and separation methods based on deep neural networks have gradually become mainstream thanks to their excellent performance, but they still suffer from problems such as phase mismatch, poor model generalization, and the gap between simulated and real data. Building on the fundamental theory and state-of-the-art methods of the speech front-end field, this thesis takes deep learning as its main tool, uses the inherent acoustic characteristics of speech and of noisy scenes as its theoretical basis, and conducts in-depth research on speech dereverberation, denoising, separation, and far-field speech recognition. The main contributions are as follows:
1. This thesis proposes a single-channel speech dereverberation algorithm based on generative adversarial training that can effectively remove reverberation in complex environments. The algorithm adopts a fine-tuned CBLDNN structure, combining convolutional, recurrent, and feed-forward networks to mine speech features in depth; during training, generative adversarial training is added so that the dereverberated speech approaches clean speech, further improving speech quality. Experiments show that the proposed model clearly outperforms baselines such as weighted prediction error (WPE) and exhibits good robustness and generalization. In addition, the offline dereverberation model is extended to online dereverberation, widening its applicability; the online model achieves performance close to that of the offline model.
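As a rough illustration of the CBLDNN layout and adversarial objective described in item 1, the sketch below stacks convolutional, bidirectional LSTM, and feed-forward layers and adds a discriminator loss on the enhanced spectrum. All layer sizes, the feature dimension, and the loss weighting are illustrative assumptions, not the thesis configuration.

```python
# Minimal sketch (not the author's code) of a CBLDNN-style dereverberation
# network trained with an auxiliary adversarial (GAN) loss.
import torch
import torch.nn as nn

FEAT_DIM = 257  # assumed spectral feature dimension (e.g. STFT bins)

class CBLDNN(nn.Module):
    def __init__(self, feat_dim=FEAT_DIM, hidden=512):
        super().__init__()
        # CNN: local time-frequency pattern extraction
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.ReLU(),
        )
        # BLSTM: long-term temporal context
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # DNN: per-frame mapping to the enhanced spectrum
        self.dnn = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        c = self.conv(x.unsqueeze(1)).squeeze(1)
        h, _ = self.blstm(c)
        return self.dnn(h)                 # enhanced spectrum estimate

class Discriminator(nn.Module):
    """Judges whether a spectrogram looks like clean (anechoic) speech."""
    def __init__(self, feat_dim=FEAT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, x):                  # per-frame real/fake logits
        return self.net(x)

# One simplified generator training step: reconstruction loss + adversarial loss.
g, d = CBLDNN(), Discriminator()
reverb = torch.randn(4, 100, FEAT_DIM)     # toy reverberant features
clean = torch.randn(4, 100, FEAT_DIM)      # toy clean targets
enhanced = g(reverb)
scores = d(enhanced)
adv = torch.nn.functional.binary_cross_entropy_with_logits(
    scores, torch.ones_like(scores))       # push enhanced speech toward "clean"
loss_g = torch.nn.functional.mse_loss(enhanced, clean) + 0.01 * adv
```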
2. This thesis first proposes a time-frequency domain speech denoising model based on a two-dimensional self-attention mechanism that effectively improves the intelligibility of noisy speech. The two-dimensional self-attention mechanism selects the time-frequency feature vectors best suited to the current time-frequency bin for encoding, while fusing the features extracted along the time and frequency dimensions; through it, the network captures long-term dependencies of the speech sequence. Because models trained with the minimum mean square error criterion tend to produce blurred spectra, a spectral edge enhancement network is proposed to model and restore spectral texture details and sharpen the spectrum. To address the phase mismatch problem of time-frequency domain denoising, this thesis further proposes a time-domain multi-scale speech denoising model based on a fully convolutional neural network. End-to-end training in the time domain avoids the phase mismatch problem; a gating mechanism selects important features and suppresses irrelevant information; multi-scale feature extraction learns representations at different scales, and multi-scale feature fusion combines features from different layers. Experiments show that the model effectively removes noise from speech signals.
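The two-dimensional self-attention idea in item 2 can be illustrated roughly as follows: standard multi-head attention is applied along the time axis and along the frequency axis of a spectrogram-like tensor, and the two views are fused. The block structure, channel width, and fusion layer below are assumptions for illustration only.

```python
# Minimal sketch of a two-dimensional (time + frequency) self-attention block.
import torch
import torch.nn as nn

class TwoDimSelfAttention(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.fuse = nn.Linear(2 * channels, channels)

    def forward(self, x):
        # x: (batch, time, freq, channels)
        b, t, f, c = x.shape
        # self-attention over time, independently for each frequency bin
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, c)
        at, _ = self.time_attn(xt, xt, xt)
        at = at.reshape(b, f, t, c).permute(0, 2, 1, 3)
        # self-attention over frequency, independently for each frame
        xf = x.reshape(b * t, f, c)
        af, _ = self.freq_attn(xf, xf, xf)
        af = af.reshape(b, t, f, c)
        # fuse the two views and add a residual connection
        return x + self.fuse(torch.cat([at, af], dim=-1))

block = TwoDimSelfAttention()
out = block(torch.randn(2, 50, 80, 64))   # (batch, time, freq, channels)
```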
3. This thesis proposes a time-frequency domain speech separation method based on adversarial training that separates mixed speech with better speech quality. After investigating how different model structures and components affect performance, the proposed network extracts acoustic features more effectively; multi-task training incorporates auditory speech features into network training, so that the network attends to and learns the auditory characteristics of speech; adding adversarial training drives the separated speech closer to clean speech in a high-level feature space. Time-frequency domain separation suffers from phase mismatch, and methods based on the PIT training criterion cannot handle scenes with an unknown number of mixed sources. We therefore propose a time-domain dual-channel speech separation network that first infers, in sequence, all speakers in the mixture together with their directions, and then converts them into source masks to separate the mixture. Separating in the time domain avoids phase mismatch and effectively improves separation performance; the separated outputs carry the speaker and direction information predicted by the network, which can be used in subsequent processing. Experiments show that the method separates mixed speech effectively and solves problems that conventional separation models cannot: an unknown number of sources, an indeterminate output order, and the difficulty of selecting separated outputs.
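For context on the PIT criterion whose limitation item 3 addresses, the sketch below shows a toy utterance-level permutation invariant MSE loss: the loss is taken over the best pairing between estimated and reference sources. This is a generic illustration of the standard technique, not the thesis implementation.

```python
# Minimal sketch of utterance-level permutation invariant training (PIT) loss.
import itertools
import torch

def pit_mse_loss(estimates, references):
    """estimates, references: (batch, num_sources, samples)."""
    n_src = estimates.shape[1]
    losses = []
    for perm in itertools.permutations(range(n_src)):
        perm_est = estimates[:, list(perm), :]           # reorder estimated sources
        losses.append(((perm_est - references) ** 2).mean(dim=(1, 2)))
    # pick the best speaker-to-output assignment for each utterance
    return torch.stack(losses, dim=0).min(dim=0).values.mean()

loss = pit_mse_loss(torch.randn(4, 2, 16000), torch.randn(4, 2, 16000))
```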

4. This thesis proposes an omnidirectional beamforming algorithm and designs and implements a far-field speech recognition system for real-world scenarios. For the front-end of the far-field recognition system, we compare several commonly used beamforming methods and propose a method based on an omnidirectional minimum variance distortionless response (MVDR) beamformer and weighted prediction error (WPE) to remedy the shortcomings of existing approaches. For the back-end, we design acoustic and language models with several different structures, and further improve recognition performance by optimizing the model structures and their combination order. Compared with the baseline, the proposed method achieves clear recognition gains in both single-microphone-array and multi-microphone-array scenarios.
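As a rough illustration of the MVDR component mentioned in item 4, the sketch below computes per-frequency MVDR weights w(f) = R_n(f)^{-1} d(f) / (d(f)^H R_n(f)^{-1} d(f)) from toy inputs; the omnidirectional extension and the WPE dereverberation stage are not shown, and all quantities are placeholders.

```python
# Minimal sketch of per-frequency MVDR beamformer weights (illustrative only).
import numpy as np

def mvdr_weights(noise_cov, steering):
    """noise_cov: (freq, mics, mics) Hermitian; steering: (freq, mics)."""
    weights = np.zeros_like(steering)
    for f in range(steering.shape[0]):
        rinv_d = np.linalg.solve(noise_cov[f], steering[f])      # R_n^{-1} d
        weights[f] = rinv_d / (steering[f].conj() @ rinv_d)      # normalize
    return weights  # (freq, mics); apply as y(f, t) = w(f)^H x(f, t)

# toy example: 4 microphones, 257 frequency bins, 8 noise snapshots
mics, bins = 4, 257
noise = np.random.randn(bins, mics, 8) + 1j * np.random.randn(bins, mics, 8)
noise_cov = noise @ noise.conj().transpose(0, 2, 1) / 8 + 1e-3 * np.eye(mics)
steering = np.exp(1j * np.random.randn(bins, mics))
w = mvdr_weights(noise_cov, steering)
```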

English Abstract

Speech is one of the most natural ways for humans and machines to interact: through speech processing technology, human intent can be conveyed directly to the machine. At present, near-field speech recognition and speaker recognition have achieved very good performance. In a real environment, however, the speech signal is inevitably corrupted by noise, reverberation, and other speakers, which damages speech intelligibility and quality and degrades the performance of subsequent speech processing technologies. Speech front-end technology is designed to eliminate noise and reverberation in the acoustic environment and focus on the target signal while preserving speech quality as much as possible. It is one of the most critical core technologies and important research topics in the field of speech signal processing.
In recent years, neural network-based speech front-end methods have gradually become mainstream due to their excellent performance. However, problems such as phase mismatch, poor model generalization, and the mismatch between simulated and real data still need to be solved effectively. Based on the fundamental theories and cutting-edge methods of the speech front-end field, this thesis takes deep learning as its main method and the inherent acoustic characteristics of speech and of noisy/reverberant scenes as its theoretical basis, and conducts in-depth research on front-end models. The main work and innovations are as follows:
1. This work proposes a single-channel speech dereverberation system with generative adversarial training, which can effectively enhance reverberant speech. A fine-tuned CBLDNN structure is adopted, integrating CNN, BLSTM, and DNN components to improve performance. Adversarial training is applied to make the dereverberated speech indistinguishable from clean samples in a high-dimensional space. Experimental results show that the proposed model outperforms several baseline systems, handles a wide range of reverberation conditions, and adapts well to varied environments. In addition, the offline system is extended to an online system, which obtains performance comparable to the offline system.
2. This work first investigates a two-dimensional self-attention-based speech enhancement system, which can effectively improve the intelligibility of noisy speech. The attention mechanism attends to the time and frequency dimensions simultaneously, so spectral features from the two dimensions can be learned and fused jointly, and the network captures global dependencies without recurrence. An edge enhancement network is proposed to model and restore spectral texture details and sharpen the spectrum. Time-frequency domain methods suffer from phase mismatch, which limits their performance, so this work next proposes a multi-scale speech enhancement model in the time domain. The proposed method performs enhancement in the time domain and avoids the phase mismatch problem through end-to-end training. A gated mechanism is used to select dominant features and suppress irrelevant information. Multi-scale feature extraction learns feature representations at different scales, and multi-scale feature fusion fuses features from different layers. Experimental results show that the model effectively removes noise from speech signals.
3. This work proposes a speaker-independent multi-speaker speech separation method via generative adversarial training, which aims at better speech quality rather than only minimizing a mean square error. The model structure is first investigated in depth to better extract acoustic features. In the initial phase, log-mel filterbank and pitch features are used to warm up the CBLDNN in a multi-task manner, so that information which helps separate speech and improve speech quality is integrated into the model. Generative adversarial training is applied throughout training, making the separated speech indistinguishable from real speech. Time-frequency domain methods suffer from phase mismatch, and PIT-based methods cannot deal with an unknown number of outputs. We next propose a time-domain dual-channel speech separation network, which first infers all competing speakers together with their directions in a sequential manner and then transforms them into source masks to separate the mixture. The speech is separated in the time domain to avoid the phase mismatch problem, and the speaker and direction information is appended to the output, where it can be used by subsequent tasks. Experimental results show that the network successfully separates mixtures and deals with the unknown number of sources, the permutation problem, and the selection of outputs.
4. A cascaded far-field speech recognition system is proposed. For the front-end, this work compares several popular beamforming methods and proposes an omnidirectional minimum variance distortionless response (MVDR) beamformer followed by weighted prediction error (WPE) to remedy the shortcomings of existing methods. For the back-end, several acoustic models and language models with different architectures are investigated in depth. Compared with the baseline system, the proposed method achieves significant performance improvements in both single-array and multi-array scenarios.
 

Keywords: Speech Dereverberation, Speech Enhancement, Speech Separation, Far-field Speech Recognition
Language: Chinese
Document Type: Doctoral Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/39846
Collection: 数字内容技术与服务研究中心_智能技术与系统工程
Recommended Citation (GB/T 7714):
李晨星. 复杂场景语音前端增强与分离算法研究[D]. 北京: 中国科学院自动化研究所, 2020.