Recent years have seen great improvements in the performance of automatic speech recognition systems. These speech-to-text (STT) systems can now produce transcriptions of sufficient quality to enable important tasks such as information retrieval, and research is now focusing on making the transcripts more readable. However, real-world audio signals, such as broadcast news and telephone speech, cannot be fed directly into an ASR system. The problem of partitioning an audio stream into different audio types is therefore becoming increasingly significant, and audio segmentation and annotation techniques provide useful preprocessing for any audio indexing and recognition system. This dissertation presents research on content-based audio segmentation and annotation. Its main contributions are the following:

(1) A two-level segmentation method that effectively detects speaker or other acoustic changes in a continuous audio stream. The method divides acoustic change detection into two levels: the region level, which detects potential change regions containing candidate acoustic change points, and the boundary level, which searches for and refines the true change points. The region level searches for potential change regions over consecutive local windows, while the boundary level applies the T2 statistic and the Bayesian information criterion (BIC) to detect segment boundaries within those windows.

(2) A novel method, based on a regression tree, for choosing effective samples for SVM training in audio classification. The objective is to reduce SVM training time by selecting effective examples from the training set and to balance the number of training points between the two classes. Experiments on a dataset of about six hours of audio show that the computation time can be significantly reduced without a significant loss in prediction accuracy.
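The boundary-level BIC test named in contribution (1) can be illustrated with the standard ΔBIC change-point statistic. The sketch below is a minimal, generic version, not the dissertation's exact implementation; the function name and the penalty weight `lam` are illustrative:

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """Delta-BIC for a candidate change point t in a feature matrix X (N x d).

    Compares modeling X with one full-covariance Gaussian versus two
    Gaussians split at frame t.  A positive value favours the two-model
    hypothesis, i.e. an acoustic change at t.  lam is the usual BIC
    penalty weight.
    """
    N, d = X.shape

    def logdet_cov(Y):
        # Regularised sample covariance so the log-determinant stays finite.
        cov = np.cov(Y, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    # Model-complexity penalty: d mean parameters + d(d+1)/2 covariance
    # parameters for the extra Gaussian.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(N)
    return (0.5 * N * logdet_cov(X)
            - 0.5 * t * logdet_cov(X[:t])
            - 0.5 * (N - t) * logdet_cov(X[t:])
            - penalty)
```

In a two-pass scheme of this kind, the cheaper T2 statistic typically pre-selects candidate windows and ΔBIC is evaluated only inside them, which is what keeps the boundary level affordable.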
(3) An efficient, robust, and unsupervised algorithm for speaker clustering. Each utterance is modeled by a single Gaussian distribution, and a novel distance metric is proposed for determining the stopping criterion. The advantage of the proposed method is that it achieves comparable performance without requiring a tunable threshold. In addition, a framework of iterative segmentation and clustering is studied, in which speaker change detection and speaker clustering are performed concurrently in a single iterative process.

(4) A general framework for audio segmentation and annotation. The system consists of three main components: speech detection, speaker change detection, and speaker clustering. We also investigate practical issues of the system, including its constraint conditions and its use as a preprocessing step for an ASR system.
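The dissertation's distance metric in contribution (3) is novel and not detailed in this abstract. As a point of reference, a common baseline distance between two utterances that are each modeled by a single Gaussian is the symmetrised Kullback-Leibler divergence, sketched below (function names are illustrative, and this is a standard baseline, not the proposed metric):

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL divergence KL(N_p || N_q) between two full-covariance Gaussians."""
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(cov_q_inv @ cov_p)      # trace term
                  + diff @ cov_q_inv @ diff        # Mahalanobis term
                  - d                              # dimensionality offset
                  + np.linalg.slogdet(cov_q)[1]    # log |cov_q|
                  - np.linalg.slogdet(cov_p)[1])   # log |cov_p|

def symmetric_kl(mu1, cov1, mu2, cov2):
    """Symmetrised KL: a common utterance-to-utterance distance for
    agglomerative speaker clustering."""
    return (gaussian_kl(mu1, cov1, mu2, cov2)
            + gaussian_kl(mu2, cov2, mu1, cov1))
```

In agglomerative clustering, the two closest utterance models under such a distance are merged at each step; the point of a well-designed stopping metric, as claimed in contribution (3), is to decide when to stop merging without a hand-tuned threshold.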