|Place of Conferral||Beijing|
|Keyword||speech enhancement; speech separation; deep learning; nonnegative matrix factorization; spectral characteristics; acoustic environment|
Speech interaction is one of the most natural ways of human-computer interaction and is widely regarded as the next major portal for information and services. Auditory information processing is a crucial part of AI perception and is currently one of the AI research directions closest to practical application. In real-world environments, the acquired speech signals are inevitably corrupted by various noises and reverberation, which significantly degrades speech intelligibility and quality. To cope with such acoustic environments, effective speech enhancement technologies are essential. Speech enhancement aims to suppress the noise and reverberation components in noisy speech while keeping the speech component undistorted. It has been widely used in applications such as speech recognition and speech telecommunication systems, and it is one of the key technologies and most important research topics in the field of speech signal processing.
Due to speech production mechanisms, speech signals have inherent spectral characteristics such as temporal correlation, auto-regression, spectro-temporal structure, and basic pronunciation patterns. In addition, speech signals acquired in real-world environments contain rich information about the noise acoustic environment. These spectral characteristics and this acoustic environment information provide valuable clues that are well worth exploiting for speech enhancement. Deep learning has a powerful perceptual ability and has achieved great success in speech and image processing. Building on the basic theories and recent research progress in speech enhancement, this thesis focuses on deep learning-based perception of speech spectral characteristics and noise acoustic environments, and devotes its efforts to knowledge-driven single-channel speech enhancement. Our works and contributions are summarized as follows:
1. The separation of speech and noise can be implemented by judging whether each noisy time-frequency (T-F) unit is dominated by speech or by noise; naturally, single-channel speech separation can thus be formulated as a binary classification problem. Due to speech production mechanisms, speech signals exhibit strong temporal correlation: whether neighboring T-F units are dominated by speech or by noise is strongly correlated in probability. Therefore, the probability that the previous T-F unit is speech-dominated can serve as a prior for the next T-F unit. A deep stacking network (DSN) is built by stacking several basic network modules: the output of each module is fed into the next as prior information, and the performance of each subsequent module usually improves thanks to this extra information. To exploit the temporal correlation of speech signals, we propose a DSN with time series (DSN-TS), which cleverly uses the unique structure of the DSN by feeding mixture frames in temporal order into the stacked modules. It effectively models the joint probabilities of temporally neighboring T-F units and significantly improves speech separation performance.
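The prior-passing idea can be illustrated with a toy sketch. This is not the actual DSN-TS: the scalar weights and single-feature modules below are hypothetical stand-ins for trained network modules, showing only how each module's speech-dominance probability becomes the next module's prior.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def basic_module(feature, prior, w_feat, w_prior, bias):
    """One basic module: estimate P(T-F unit is speech-dominated)
    from the current frame feature and the previous module's output."""
    return sigmoid(w_feat * feature + w_prior * prior + bias)

def dsn_ts(frame_features, w_feat=2.0, w_prior=1.5, bias=-1.0):
    """Feed frames in temporal order through stacked modules;
    each module's output serves as the prior for the next module."""
    prior = 0.5  # uninformative prior before the first frame
    probs = []
    for feat in frame_features:
        prior = basic_module(feat, prior, w_feat, w_prior, bias)
        probs.append(prior)
    return probs
```

With a positive `w_prior`, a frame judged speech-dominated raises the estimated probability for the following frame, which is the temporal-correlation effect the DSN-TS exploits.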
2. Speech signals can be described as an autoregressive process: through an $N$-th order autoregressive (AR) model, the current frame of a speech signal can be predicted from a limited number of historical frames. Unfortunately, in noisy environments speech signals are inevitably disturbed by various noises and their auto-regression is severely corrupted, which makes it very difficult to predict the current clean speech from noisy historical frames. However, separated speech largely avoids noise interference and effectively preserves the harmonic structure of speech. Therefore, it is possible to predict the current clean signal from historically separated speech through an AR model. In this thesis, we propose a novel autoregressive speech separation network that jointly models and optimizes the speech auto-regression and separation processes. Systematic experiments show that the proposed model is competitive with state-of-the-art methods in singing-voice separation.
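The AR prediction step itself is simple; a minimal sketch follows, with hypothetical fixed coefficients (in the thesis these are learned jointly with the separation network rather than given):

```python
def ar_predict(history, coeffs):
    """Predict the current value from the last N values via an
    N-th order AR model: x[t] = sum_k coeffs[k] * x[t-1-k]."""
    n = len(coeffs)
    assert len(history) >= n, "need at least N historical values"
    # coeffs[0] weights the most recent value, coeffs[1] the one before, etc.
    return sum(a * x for a, x in zip(coeffs, reversed(history[-n:])))
```

For example, with `coeffs=[0.5, 0.3]` and separated history `[1.0, 2.0, 3.0]`, the prediction is `0.5*3.0 + 0.3*2.0`. Using the *separated* rather than noisy history is the key point: it keeps the harmonic structure the AR model relies on.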
3. Ideal T-F masks and the magnitude spectrograms of target speech are the main targets of speech separation. On the one hand, they are closely related, exhibiting significant correlation and strong complementarity. On the other hand, due to speech production mechanisms, both T-F masks and spectral features present prominent spectro-temporal structures. In addition, due to the sparsity of speech in the T-F domain, these spectro-temporal structures remain relatively invariant across auditory environments, which is very important for robust speech separation. These characteristics are clearly worth exploiting. In this thesis, we propose a two-stage multi-target joint-learning speech separation method. First, we use two denoising autoencoders (DAEs) to exploit the spectro-temporal structures of speech auditory features and speech separation targets through self-learning. Then the learned DAEs are combined by a linear transformation to build a multi-target deep neural network (DNN) for speech separation. Finally, the multiple separation targets are jointly learned. Systematic experiments show that the proposed approach not only exploits the spectro-temporal structures of speech auditory features and separation targets but also makes full use of the correlation and complementarity of ideal T-F masks and speech magnitude spectrograms.
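A minimal sketch of the multi-target objective, assuming the ideal ratio mask as the mask target; the function names and the weighting `alpha` are illustrative choices, not the thesis's exact formulation:

```python
def ideal_ratio_mask(speech_pow, noise_pow):
    """Per-T-F-unit ideal ratio mask from clean speech and noise powers."""
    return [s / (s + n) for s, n in zip(speech_pow, noise_pow)]

def multi_target_loss(mask_pred, spec_pred, mask_ref, spec_ref, alpha=0.5):
    """Joint objective over both separation targets: a weighted sum of
    mask MSE and magnitude-spectrum MSE, so errors in either target
    drive the shared network parameters."""
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return alpha * mse(mask_pred, mask_ref) + (1 - alpha) * mse(spec_pred, spec_ref)
```

Training against both targets at once is what lets the model exploit their correlation and complementarity: the mask target emphasizes which units are speech-dominated, while the spectrum target constrains the actual magnitudes.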
4. Deep neural network (DNN)-based speech separation usually uses a supervised algorithm to learn a mapping function from noisy features to separation targets. These targets, whether ideal masks or magnitude spectrograms, have prominent spectro-temporal structures, and because speech is produced by a set of basic pronunciation patterns, these structures contain basic structural patterns. Nonnegative matrix factorization (NMF) is a well-known representation learning technique capable of capturing basic spectro-temporal structures with physical or perceptual meaning, while a DNN has a powerful perceptual ability for speech and noise. Therefore, combining DNN and NMF into an organic whole is a smart strategy. In this thesis, we propose a joint DNN-NMF scheme for speech separation: NMF is used to learn basis spectra, which are then integrated into a DNN to directly reconstruct the magnitude spectrograms of speech and noise. Instead of predicting the activation coefficients inferred by NMF, the DNN directly optimizes an actual separation objective, which avoids the accumulated errors of similar works that only learn NMF activation coefficients. Moreover, we explore a discriminative training objective with sparsity and NMF reconstruction constraints to further suppress residual noise and preserve more speech components. The joint scheme concentrates the strengths of both DNN and NMF for speech separation. Systematic experiments show that the proposed models are competitive with the previous methods.
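The NMF side of the scheme can be illustrated with one textbook multiplicative update of the activations under the Euclidean cost (a standard NMF step for $V \approx WH$, not the thesis's joint DNN-NMF training; matrices are plain nested lists, with `W` of shape frequencies x bases and `H` of shape bases x frames):

```python
def nmf_update_h(V, W, H, eps=1e-9):
    """One multiplicative update of activations H for V ~= W H
    (Euclidean cost): H <- H * (W^T V) / (W^T W H).
    Multiplicative form keeps H nonnegative if it starts nonnegative."""
    F, K, T = len(W), len(W[0]), len(H[0])
    WH = [[sum(W[f][k] * H[k][t] for k in range(K)) for t in range(T)]
          for f in range(F)]
    num = [[sum(W[f][k] * V[f][t] for f in range(F)) for t in range(T)]
           for k in range(K)]  # W^T V
    den = [[sum(W[f][k] * WH[f][t] for f in range(F)) + eps for t in range(T)]
           for k in range(K)]  # W^T W H
    return [[H[k][t] * num[k][t] / den[k][t] for t in range(T)] for k in range(K)]
```

At an exact factorization ($V = WH$) the numerator and denominator coincide and the update leaves `H` fixed, which is the fixed-point property of multiplicative updates. In the proposed scheme the learned basis spectra in `W` are kept, but the activations are replaced by DNN outputs trained on the separation objective itself.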
5. In real-world environments, the noise acoustic environment is usually complex and time-varying; for example, the stationarity of the noise changes over time. Signal processing-based speech enhancement usually assumes that the noise acoustic environment is stationary or slowly varying and relies on a deterministic statistical signal model, ignoring the uncertainty of real-world noise acoustic environments. Deep learning, in contrast, has a powerful perceptual ability for speech and noise. To address the limitations of conventional signal processing methods, in this thesis we propose a hybrid signal processing/deep learning scheme that incorporates the powerful perceptual capability of deep learning into the conventional speech enhancement framework. Deep learning is used to perceive the speech presence probability and the noise acoustic environment, while a signal processing-based power spectral density (PSD) update module and a Wiener filter enhance the desired speech. The deep learning and signal processing modules are jointly optimized with a spectrum approximation objective. Systematic experiments demonstrate the effectiveness of the proposed approach for noise suppression in both noise-matched and noise-unmatched conditions.
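The signal-processing half of the hybrid scheme can be sketched as follows. Here `spp` stands in for the deep-learning speech presence probability, and the smoothing constant `alpha` is an illustrative choice; both are assumptions, not the thesis's exact modules:

```python
def update_noise_psd(noise_psd, noisy_pow, spp, alpha=0.9):
    """SPP-gated recursive noise PSD update: where speech is present
    (spp -> 1) the old estimate is kept; where speech is absent the
    estimate is smoothed toward the current noisy periodogram."""
    return [p * n + (1 - p) * (alpha * n + (1 - alpha) * y)
            for n, y, p in zip(noise_psd, noisy_pow, spp)]

def wiener_gain(speech_psd, noise_psd):
    """Per-frequency Wiener gain G = S / (S + N) from estimated PSDs;
    applied to the noisy spectrum to enhance the desired speech."""
    return [s / (s + n) for s, n in zip(speech_psd, noise_psd)]
```

The division of labor matches the scheme's motivation: the network handles the uncertain, non-stationary part (is speech present? what does the noise environment look like?), while the deterministic PSD update and Wiener filter perform the actual enhancement.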
|First Author Affiliation||Institute of Automation, Chinese Academy of Sciences|
|Nie Shuai. Research on Speech Enhancement Methods with Deep Perception of Spectral Characteristics and Noise Acoustic Environments [D]. Beijing: University of Chinese Academy of Sciences, 2018.|
|Files in This Item:|
|聂帅-博士论文-最终版.pdf (5903KB)||Thesis||Restricted||CC BY-NC-SA||Application Full Text|