This dissertation focuses on separating speech from various intrusive sounds under co-channel conditions, using computational auditory scene analysis (CASA). With the rapid development of deep learning, CASA systems based on deep machine learning rather than mechanism modeling have drawn much attention from researchers in recent years. Nevertheless, how to select and design a proper network structure, and how to use the various acoustic cues effectively in a deep learning setting, needs further exploration. The main work and contributions of this dissertation include:
· Cooperative deep stacking network (DSN) for speech denoising. While designing deep-learning-based speech separation algorithms, we found that the DSN is well suited to this task and lends itself to parallelization. Building on the DSN, we propose a cooperative DSN framework that can integrate multiple acoustic cues to perform speech separation jointly.
· Cross-domain and multiscale cooperative DSN for speech denoising. Based on the proposed cooperative DSN framework, we first fused features from different time-frequency domains. Because different representations of the speech mixture provide complementary information, separation performance improved. Separation was then carried out at different time-frequency scales, which also improved performance.
· Speaker separation based on deep learning. Existing speaker separation algorithms fall into two categories. Some rely on pitch information and a local correlation model, while others typically use generative statistical models. For example, MAXVQ and Linear VQ are based on a GMM, while NMF and sparse coding methods are based on codebook generation. What they have in common is the use of an approximation or a linear model.
In this dissertation, speaker separation was first carried out with a discriminative DNN model, and the proposed cooperative DSN model was then applied to the same task. Quantitative experiments show the effectiveness of our method, with the cooperative DSN giving the better results.
· Multipitch tracking based on empirical mode decomposition (EMD) and HMM. Although pitch is an important acoustic cue in speech separation, tracking multiple pitches simultaneously is not easy. As the amplitude and peak position of the autocorrelation function are sometimes not reliabl...