Separating a target speech signal from background noise is one of the key problems in speech processing, and computational auditory scene analysis is a promising approach to it. Because speech is non-stationary, most current speech separation systems first decompose the time-domain signal into the time-frequency (T-F) domain. The separation problem can then be recast as the task of estimating the ideal binary mask or the ideal ratio mask. In recent years, statistical classification models have been widely used for binary mask estimation. However, the correlation between adjacent T-F units has received little attention. In this thesis, we study this local correlation information and integrate it into existing systems to improve their performance. The main contributions are as follows.

First, we examine how closely the ideal binary mask (IBM) approximates the ideal ratio mask (IRM) in the signal-to-noise ratio (SNR) sense. The IBM and the IRM are two commonly used computational goals in speech separation. Since the choice of computational goal is a key decision in designing a complex system, we first analyze and compare the two mask models in terms of mean square error (MSE) and SNR. Using Parseval's identity, we derive T-F domain expressions for the MSE corresponding to the IBM and to the IRM. Under an approximate W-disjoint orthogonality assumption, we analyze the MSE of the IRM and then the difference between the two masks. We find that the upper bound on the SNR difference is approximately 3.01 dB. In practical separation, the difference is usually smaller than 1 dB.

Second, we integrate a local correlation model with Bayes classification for IBM estimation. We propose an adaptive prior model of the IBM based on time-frequency segmentation and a prior model of the noise energy based on local noise tracking.
The two prior models are then integrated with the original Bayes classifier. Because of the local correlation information, the posterior distribution is a high-dimensional function, so we use a Markov chain Monte Carlo algorithm to approximate the expectation of the IBM. Experiments show that the local correlation model improves both the accuracy of the IBM estimate and the SNR gain of the separated speech. Furthermore, since some outliers in the IBM estimate are suppressed, a smoother estimate is obtained.

Markov random field based sp...
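
The 3.01 dB figure quoted for the IBM/IRM gap can be checked numerically with a minimal sketch. The sketch rests on assumptions that are not taken from the thesis: the IRM is taken in its Wiener-like form S/(S+N) on per-unit energies, the IBM uses a 0 dB local criterion, speech and noise are treated as uncorrelated within each T-F unit, and the energies are synthetic random draws rather than real speech. All function and variable names are hypothetical. Under these assumptions the per-unit error is min(S, N) for the IBM and SN/(S+N) for the IRM, so their ratio never exceeds 2, that is, 10 log10(2) ≈ 3.01 dB.

```python
import numpy as np

def ideal_masks(speech_energy, noise_energy):
    """Return (IBM, IRM) for per-unit T-F energies.

    Assumptions (not from the source): IBM uses a 0 dB local criterion,
    IRM is the Wiener-like ratio S / (S + N).
    """
    ibm = (speech_energy > noise_energy).astype(float)
    irm = speech_energy / (speech_energy + noise_energy)
    return ibm, irm

def expected_mse(mask, speech_energy, noise_energy):
    """Expected error energy of mask * (speech + noise) against speech,
    assuming speech and noise are uncorrelated within each unit."""
    return np.sum((1.0 - mask) ** 2 * speech_energy + mask ** 2 * noise_energy)

def snr_db(speech_energy, mse):
    """Output SNR in dB: total speech energy over total error energy."""
    return 10.0 * np.log10(np.sum(speech_energy) / mse)

# Synthetic energies on a 64-channel, 200-frame grid (illustrative only).
rng = np.random.default_rng(0)
speech_energy = rng.exponential(1.0, size=(64, 200)) + 1e-12
noise_energy = rng.exponential(1.0, size=(64, 200)) + 1e-12

ibm, irm = ideal_masks(speech_energy, noise_energy)
snr_ibm = snr_db(speech_energy, expected_mse(ibm, speech_energy, noise_energy))
snr_irm = snr_db(speech_energy, expected_mse(irm, speech_energy, noise_energy))

gap_db = snr_irm - snr_ibm          # how much the IRM beats the IBM
bound_db = 10.0 * np.log10(2.0)     # theoretical ceiling under the assumptions

print(f"SNR gap: {gap_db:.2f} dB, bound: {bound_db:.2f} dB")
```

The gap approaches the bound only when speech and noise energies are nearly equal in every unit; when one source dominates most units, as W-disjoint orthogonality suggests, the gap shrinks, which is consistent with the smaller differences reported for practical separation.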