Abstract

In speech signal processing, separating a target speech signal from mixed speech is an important problem. In applications such as speech recognition, audio retrieval, and hearing aids, a single channel may contain multiple speech sources, so several highly non-stationary signals coexist at the same time. Separating the target speech from such mixtures is therefore a challenging problem. In this thesis, we integrate computational auditory scene analysis (CASA) and speaker acoustic models to explore separating target speech from mixtures containing multiple human sound sources.

Many CASA systems do not separate multi-speaker mixtures well. We propose a multi-pitch tracking algorithm that first detects pitch candidates by exploiting the piecewise continuity of the multi-channel time-frequency decomposition, and then tracks the multiple pitch contours with mathematical morphology filtering. By combining the pitch contours of the target speech and the interference obtained by multi-pitch tracking, the performance of separating multi-speaker mixtures improves substantially.

The time-frequency decomposition carries rich information for the mid-level representation, but earlier work did not exploit it fully, so we explored the fine structure of the harmonics in depth. We studied the distribution of harmonics in low-frequency channels and built a template of the positions of the first autocorrelation peak in each channel relative to the pitch, i.e., a harmonics template. Applying this template to multi-pitch detection and harmonic reconstruction improved the performance of both.

For mixed speech with multiple speakers, if the speakers' identities can be detected, the corresponding speaker acoustic models can be applied in the separation system. We therefore studied two-stage multi-speaker recognition from mixed speech.
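The mathematical morphology filtering mentioned above for linking pitch contours can be illustrated with a minimal sketch. It is not the thesis's implementation; it only shows the generic closing-then-opening operation on a 1-D binary "pitch present" track: closing fills short dropouts inside a contour, and opening removes isolated spurious detections. All function names are illustrative.

```python
import numpy as np

def dilate(x, k):
    """Binary dilation with a flat structuring element of width k."""
    n, r = len(x), k // 2
    out = np.zeros(n, dtype=bool)
    for i in range(n):
        out[i] = x[max(0, i - r):min(n, i + r + 1)].any()
    return out

def erode(x, k):
    """Binary erosion with a flat structuring element of width k."""
    n, r = len(x), k // 2
    out = np.zeros(n, dtype=bool)
    for i in range(n):
        out[i] = x[max(0, i - r):min(n, i + r + 1)].all()
    return out

def smooth_pitch_track(track, k=3):
    # Closing (dilate then erode) bridges short gaps in a contour;
    # opening (erode then dilate) deletes short isolated detections.
    closed = erode(dilate(track, k), k)
    return dilate(erode(closed, k), k)
```

In a real multi-pitch tracker the same idea is applied in two dimensions (time and pitch lag), but the gap-filling / spur-removal behavior is the same.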
In the first stage, a confidence-score algorithm, into which a likelihood-score limit parameter and a gain-compensation parameter are introduced, produces a candidate list. In the second stage, composite speaker models are used to search for the best speaker combination, and we also develop a fast version of this search. Experimental results show that the proposed two-stage algorithm detects the speaker identities in mixed speech accurately and supplies reliable candidate models for the subsequent speech separation.

Finally, we studied the application of speaker acoustic models to speech separation in depth. The results of multi-speaker recognition are used to select the speaker models, and within the CASA framework these models infer the masks used to re-synthesize the speech signals. Because speech re-synthesized with binary masks suffers from missing frequency regions, we used speaker-model information to estimate real-valued masks, and experiments showed that the real-valued masks outperform conventional binary masks.
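The difference between the two mask types discussed above can be made concrete with a small sketch. This is a generic illustration, not the thesis's estimator: a binary mask keeps a time-frequency unit only when the target dominates it, while a real-valued (Wiener-like) mask keeps the fraction of energy attributed to the target, so attenuated target regions are not zeroed out entirely.

```python
import numpy as np

def binary_mask(target_energy, interference_energy):
    """1 where the target dominates a T-F unit, 0 elsewhere."""
    return (target_energy > interference_energy).astype(float)

def ratio_mask(target_energy, interference_energy, eps=1e-12):
    """Real-valued mask: fraction of each T-F unit's energy that is target."""
    return target_energy / (target_energy + interference_energy + eps)
```

With target energies `[4, 1]` against interference `[1, 4]`, the binary mask discards the second unit completely, while the real-valued mask retains 20% of it, which is why re-synthesis with real-valued masks avoids the spectral "holes" of binary masking.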