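Research on Speech Enhancement Methods Based on Deep Perception of Spectral Characteristics and Noise Acoustic Environments (语谱特性和噪声声学环境深度感知的语音增强方法研究)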


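	Research on Speech Enhancement Methods Based on Deep Perception of Spectral Characteristics and Noise Acoustic Environments
	Nie Shuai (聂帅)1,2
	2018-05-28
Degree type	Doctor of Engineering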
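Abstract (translated from Chinese)	Speech is one of the most natural modes of interaction between humans and machines and is widely regarded as the most likely entry point to the next generation of information and services. Auditory information processing is an important component of perception in artificial intelligence and is currently the research direction closest to a practical breakthrough. In real environments, however, speech signals are inevitably corrupted by noise and reverberation, which severely degrades their intelligibility and perceptual quality. Speech enhancement aims to remove noise and reverberation while keeping speech quality as unaffected as possible. It is of great value to practical applications such as speech recognition and speech communication, and it is one of the key core technologies and important research topics in speech signal processing. Owing to the mechanism of speech production, speech signals exhibit distinct spectral characteristics, including temporal correlation, autoregression, spectro-temporal structure, and basic pronunciation patterns. In addition, speech signals in real scenes carry rich information about the noise acoustic environment. These characteristics and this information provide many valuable acoustic cues for designing speech enhancement algorithms and play an important role in improving enhancement performance. Deep learning has powerful perceptual capabilities and has achieved great success in speech, image, and other fields in recent years. Building on the basic theory and state-of-the-art methods of speech enhancement, this thesis takes deep learning as its starting point, explores the deep perception of the inherent spectral characteristics of speech and of the noise acoustic environment, and studies in depth knowledge-driven (also called pattern-driven) single-channel speech enhancement algorithms. The main work and contributions are as follows:
1. Speech and noise can be separated by deciding whether each time-frequency (T-F) unit is dominated by speech or by noise, so single-channel speech separation can naturally be formulated as a binary classification problem. Because speech signals carry clear temporal information, the probabilities that temporally adjacent T-F units are dominated by speech or by noise are strongly correlated; the probability of speech dominance at the previous time step can therefore serve as prior information for the next. A deep stacking network is built by stacking several basic network modules, with the output of each module fed to the next as prior information; because it receives more information, the later module usually performs better. We exploit this distinctive structure by feeding the noisy speech frames, in temporal order, into the stacked basic modules, and propose a deep stacking network with time series (Deep Stack Network, DSN) that effectively models the temporal correlation in speech signals and significantly improves speech separation performance.
2. Because of its short-time continuity, a speech signal can be expressed as a typical autoregressive process. With an N-th order autoregressive model, the current speech signal can be predicted from a finite history of the speech signal. In noisy environments, however, speech is inevitably corrupted by noise, so it is difficult to predict the current clean speech accurately from the noisy history. The previously separated speech, by contrast, has had most of the noise removed and the speech effectively restored, so it can be used to predict the current clean speech. On this basis, we propose an autoregressive speech separation network built on a recurrent network structure, which jointly models and optimizes speech autoregression and separation, fully exploits the autoregressive property of speech, and improves separation performance.
3. Ideal T-F masks and the target speech spectrum are the most commonly used targets in speech separation. On the one hand, they are closely related, showing significant correlation and strong complementarity. On the other hand, owing to the mechanism of speech production, speech signals have clear time-frequency correlation, so both the speech spectrum and the T-F masks have distinct spectro-temporal structures; moreover, because speech is sparse in the T-F domain, these structures remain relatively stable across acoustic environments. To exploit these properties, we propose a two-stage multi-target joint learning method for speech separation. First, autoencoders are used to learn, in a self-taught unsupervised manner, the spectro-temporal structures of the speech auditory features and of the separation targets, respectively. The two trained autoencoders are then connected through a linear mapping to build a multi-target speech separation model based on a deep neural network (DNN). Finally, the resulting model is trained with multi-target joint learning. The proposed method both exploits the spectro-temporal structures of the input features and the separation targets and makes full use of the correlation and complementarity between ideal T-F masks and the target speech spectrum, improving separation performance.
4. Speech is produced from a set of basic pronunciation patterns, so speech signals implicitly contain basic spectral structure patterns, and mining these patterns is important for improving speech separation. Nonnegative matrix factorization (NMF) is a well-known representation learning technique that can effectively uncover perceptually meaningful basic spectro-temporal patterns in speech, while a DNN has strong modeling capacity and can perceive the speech and noise components in a mixture. DNN-based speech separation methods usually learn a mapping directly from noisy features to the separation targets and ignore the basic spectral structure patterns of speech. Combining DNN and NMF may therefore be a better strategy. We integrate the reconstructive, generative formulation of NMF into DNN-based supervised speech separation and propose a framework in which DNN and NMF work jointly. NMF learns the basic spectral patterns of speech and noise, which are then incorporated into the DNN-based supervised separation to reconstruct the magnitude spectra of the target speech and the noise directly. In addition, to further reduce residual noise and speech distortion, we explore a discriminative training objective with a sparsity constraint and an NMF reconstruction constraint. The joint framework exploits the excellent ability of NMF to represent the spectro-temporal structure of speech and the strong mapping ability of the DNN, while avoiding the accumulated error that arises in similar work that learns only the NMF activation coefficients. Theoretical analysis and experimental results both confirm that this approach significantly outperforms previous DNN-based speech separation methods.
5. In real environments, the noise acoustic environment of speech is usually complex and changeable; for example, the stationarity of the noise varies over time. Speech enhancement based on conventional signal processing usually ignores this uncertainty, assumes that the noise acoustic environment is stable, and solves the enhancement problem with a deterministic statistical signal model. Deep learning models have powerful perceptual capabilities and can perceive the speech and the noise acoustic environment in complex conditions. We integrate this capability into the signal-processing-based enhancement framework and propose a speech enhancement scheme that fuses signal processing and deep learning. In this scheme, deep learning perceives the speech presence probability in the mixture and changes in the noise acoustic environment, while the power spectral density update module and the Wiener filter module of the conventional framework enhance the final desired signal. The deep learning and signal processing modules are jointly optimized with a spectrum approximation objective, and systematic experiments show that the proposed method achieves good enhancement performance under both noise-matched and noise-unmatched conditions.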
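Contribution 1 above frames separation as deciding, for every T-F unit, whether speech or noise dominates. As a rough illustration of that classification target only (not the stacked network from the thesis), the sketch below computes an ideal binary mask and one common form of the ideal ratio mask from known speech and noise powers. The function names, the 0 dB local criterion, and the toy numbers are all assumptions made for this example.

```python
import math

EPS = 1e-12  # small floor to avoid log(0) and division by zero


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Return 1 for speech-dominated T-F units and 0 for noise-dominated ones.

    speech_power, noise_power: lists of frames, each a list of per-bin powers.
    lc_db: local SNR criterion in dB (hypothetical default of 0 dB).
    """
    mask = []
    for s_frame, n_frame in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_frame, n_frame):
            local_snr_db = 10.0 * math.log10((s + EPS) / (n + EPS))
            row.append(1 if local_snr_db > lc_db else 0)
        mask.append(row)
    return mask


def ideal_ratio_mask(speech_power, noise_power):
    """Soft counterpart: fraction of each unit's power that belongs to speech."""
    return [[s / (s + n + EPS) for s, n in zip(s_frame, n_frame)]
            for s_frame, n_frame in zip(speech_power, noise_power)]


# Toy 2-frame, 2-bin example (made-up numbers, not data from the thesis).
speech = [[4.0, 0.1], [0.5, 2.0]]
noise = [[1.0, 1.0], [1.0, 0.5]]
ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
print(ibm)  # [[1, 0], [0, 1]]
```

In a supervised setup, masks like these would be the labels a network is trained to predict from noisy features; multiplying the noisy spectrogram by the predicted mask then yields the separated speech.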
Abstract (English)	Speech is one of the most natural modes of human-computer interaction and is widely regarded as the next major portal for information and services. Auditory information processing is a crucial part of AI perception and is currently one of the research directions closest to practical use. In real-world environments, the acquired speech signals are inevitably corrupted by noise and reverberation, which significantly degrades speech intelligibility and quality. To cope with such acoustic environments, effective speech enhancement technologies are essential. Speech enhancement aims to suppress the noise and reverberation components of noisy speech while keeping the speech component undistorted. It is widely used in applications such as speech recognition and speech telecommunication systems, and it is one of the key technologies and most important research topics in speech signal processing. Owing to the mechanisms of speech production, speech signals have inherent spectral characteristics such as temporal correlation, autoregression, spectro-temporal structure, and basic pronunciation patterns. In addition, speech signals acquired in real-world environments contain rich information about the noise acoustic environment. These spectral characteristics and this acoustic environment information provide many valuable cues that are well worth exploiting for speech enhancement. Deep learning has powerful perceptual abilities and has achieved great success in speech and image processing. Building on the basic theory and recent research progress in speech enhancement, we focus on deep-learning-based perception of speech spectral characteristics and noise acoustic environments and devote our efforts to knowledge-driven single-channel speech enhancement. Our work and contributions are summarized as follows:
1. Speech and noise can be separated by deciding whether each noisy time-frequency (T-F) unit is dominated by speech or by noise, so single-channel speech separation can naturally be formulated as a binary classification problem. Owing to the mechanisms of speech production, speech signals have strong temporal correlation; in other words, whether neighboring T-F units are dominated by speech or by noise is strongly correlated. The probability that the previous T-F unit is dominated by speech can therefore be used as a prior probability for the next T-F unit. A deep stacking network (DSN) is formed by stacking several basic network modules, and the output of each module is fed into the next as prior information. The later module usually performs better because of this extra prior information. To exploit the temporal correlation of speech signals, we propose a DSN with time series (DSN-TS). It uses the distinctive structure of the DSN to model the joint probabilities of temporally neighboring T-F units effectively and improves speech separation performance.
2. Speech signals can be described as an autoregressive process. With an $N$-th order autoregressive (AR) model, the current frame of a speech signal can be predicted from a limited number of past frames. In noisy environments, however, speech signals are inevitably disturbed by noise and their autoregressive structure is severely corrupted, which makes it very difficult to predict the current clean speech from the noisy past speech. The separated speech, by contrast, is largely free of noise interference and effectively preserves the harmonic structure of speech. It is therefore possible to use the previously separated speech to predict the current clean signal through an AR model. In this thesis, we propose a novel autoregressive speech separation network that jointly models and optimizes the speech autoregression and separation processes. Systematic experiments show that the proposed model is competitive with the state-of-the-art method in singing-voice separation.
3. Ideal T-F masks and the magnitude spectra of the target speech are the main targets of speech separation. On the one hand, they are closely related and show significant correlation and strong complementarity. On the other hand, the T-F masks and spectral features present prominent spectro-temporal structures owing to the mechanisms of speech production. In addition, because speech is sparse in the T-F domain, these spectro-temporal structures remain relatively invariant across auditory environments, which is very important for robust speech separation. These characteristics are well worth exploiting for speech separation. In this thesis, we propose a two-stage multi-target joint learning method for speech separation. First, we use two denoising autoencoders (DAEs) to learn, by self-learning, the spectro-temporal structures of the speech auditory features and of the speech separation targets, respectively. The learned DAEs are then combined through a linear transformation to build a multi-target DNN for speech separation. Finally, the multiple speech separation targets are jointly learned. Systematic experiments show that the proposed approach not only exploits the spectro-temporal structures of speech auditory features and speech separation targets but also makes full use of the correlation and complementarity of ideal T-F masks and speech magnitude spectrograms.
4. Deep neural network (DNN)-based speech separation usually uses a supervised algorithm to learn a mapping function from noisy features to separation targets. These targets, whether ideal masks or magnitude spectrograms, have prominent spectro-temporal structures. Because speech is produced from a set of basic pronunciation patterns, these spectro-temporal structures contain basic structural patterns. Nonnegative matrix factorization (NMF) is a well-known representation learning technique capable of capturing basic spectro-temporal structures with physical or perceptual meaning, while a DNN has a powerful ability to perceive speech and noise. Combining DNN and NMF into an organic whole is therefore a sensible strategy. In this thesis, we propose a joint DNN-NMF scheme for speech separation. NMF is used to learn the basis spectra, which are then integrated into a DNN to reconstruct the magnitude spectrograms of speech and noise directly. Instead of predicting the activation coefficients inferred by NMF, the DNN directly optimizes an actual separation objective. Moreover, we explore a discriminative training objective with sparsity and reconstruction constraints to further suppress noise and preserve more speech components. The joint scheme combines the strengths of both DNN and NMF for speech separation. Systematic experiments show that the proposed models are competitive with previous methods.
5. In real-world environments, the noise acoustic environment is usually complex and varied; for example, the stationarity of the noise varies with time. Signal-processing-based speech enhancement usually assumes that the noise acoustic environment is stationary or slowly varying and uses a deterministic statistical signal model, ignoring the uncertainty of the noise acoustic environment in real-world conditions. Deep learning has a powerful ability to perceive speech and noise. To address the limitations of conventional signal processing methods, we propose a hybrid signal processing/deep learning scheme that incorporates the perceptual capabilities of deep learning into the conventional speech enhancement framework. Deep learning is used to perceive the speech presence probability and the noise acoustic environment, while the signal-processing-based power spectral density update module and Wiener filter are used to enhance the desired speech. The deep learning and signal processing modules are jointly optimized with a spectrum approximation objective. Systematic experiments demonstrate the effectiveness of the proposed approach for noise suppression in both noise-matched and noise-unmatched conditions.
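Contribution 5 above pairs a learned speech presence probability (SPP) with a conventional power spectral density (PSD) update and a Wiener filter. The sketch below shows one textbook-style way such signal processing stages can consume an SPP value per frequency bin. It is an illustration under stated assumptions, not the thesis's formulation: the SPP is passed in as a plain list rather than produced by a network, the recursion is a generic SPP-weighted smoothing, the a priori SNR uses a simple maximum-likelihood estimate, and the names `update_noise_psd`, `wiener_gain`, `alpha`, and `gain_floor` are hypothetical.

```python
EPS = 1e-12  # small floor to avoid division by zero


def update_noise_psd(noise_psd, noisy_power, spp, alpha=0.8):
    """SPP-weighted recursive update of the noise PSD for one frame.

    When speech is likely present (spp near 1) the old estimate is kept;
    when speech is likely absent (spp near 0) the estimate tracks the
    noisy power with smoothing factor alpha (an assumed value).
    """
    updated = []
    for n, y, p in zip(noise_psd, noisy_power, spp):
        a = alpha + (1.0 - alpha) * p
        updated.append(a * n + (1.0 - a) * y)
    return updated


def wiener_gain(noise_psd, noisy_power, gain_floor=0.05):
    """Per-bin Wiener gain xi / (1 + xi) with a lower bound on the gain."""
    gains = []
    for n, y in zip(noise_psd, noisy_power):
        # Maximum-likelihood a priori SNR estimate, clipped at zero.
        xi = max(y / (n + EPS) - 1.0, 0.0)
        gains.append(max(xi / (1.0 + xi), gain_floor))
    return gains


# Toy single frame with two bins (made-up numbers, not data from the thesis).
prev_noise = [1.0, 1.0]
noisy = [5.0, 2.0]
spp = [1.0, 0.0]  # bin 0: speech present, bin 1: speech absent
noise_psd = update_noise_psd(prev_noise, noisy, spp)
gains = wiener_gain(noise_psd, noisy)
print(noise_psd, gains)
```

Multiplying each bin of the noisy spectrum by its gain gives the enhanced spectrum. Because both functions are simple differentiable arithmetic, stages of this kind can in principle be trained jointly with the network that supplies the SPP, which is the idea the abstract describes.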
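Keywords	speech enhancement; speech separation; deep learning; nonnegative matrix factorization; spectral characteristics; acoustic environment
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/21028
Collection	Graduates_Doctoral Dissertations
Author affiliations	1. Institute of Automation, Chinese Academy of Sciences 2. University of Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	聂帅. 语谱特性和噪声声学环境深度感知的语音增强方法研究[D]. 北京. 中国科学院大学,2018.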

Files in this item
File name/size	Document type	Version type	Access type	License
聂帅-博士论文-最终版.pdf (5903 KB)	Thesis		Restricted access	CC BY-NC-SA