This paper introduces a multilingual spoken term detection (STD) system and evaluates it on the CallHome and CallFriend multilingual databases published by the Linguistic Data Consortium. Seven languages, namely Arabic, English, German, Japanese, Korean, Mandarin Chinese, and Spanish, are considered in the multilingual acoustic modeling task. Much of the work focuses on comparing multilingual acoustic models estimated by two different methods: the conventional global phoneme set (GPS) based GMM method and the recently proposed subspace Gaussian mixture model (SGMM) method. The experimental results show that the resulting multilingual STD system supports all seven languages simultaneously without any adaptation, and substantial performance gains are observed for the multilingual system over the monolingual systems.

A language-independent STD method is also discussed for languages and dialects that severely lack training resources. The sparseness of training data in the target language is addressed by using the available training resources of existing languages to estimate a multilingual phoneme classifier, a multi-layer perceptron (MLP). Both keyword audio samples and test utterances are represented as phonetic posteriorgrams, i.e., time sequences of phonetic posterior vectors, generated by the multilingual MLP. A modified dynamic time warping (DTW) algorithm with a "sliding matching window" and a loose path extension constraint is applied to find the warping path between a keyword sample and a test utterance with minimum matching cost.

Currently, the majority of state-of-the-art speech recognition systems depend on large amounts of transcribed speech data to robustly estimate acoustic models. The acquisition of such large training resources, however, is a challenging task.
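The posteriorgram matching described above can be sketched as follows. This is a minimal illustration only: it uses a negative-log inner product as the distance between posterior vectors and a subsequence DTW whose free start and end points on the utterance axis stand in for the sliding matching window; the exact loose path extension constraint of the modified algorithm is not reproduced, and all names here are illustrative.

```python
import numpy as np

def posterior_distance(p, q):
    # negative log inner product, a common distance between posterior vectors
    return -np.log(max(float(np.dot(p, q)), 1e-12))

def subsequence_dtw(keyword, utterance):
    """Align the keyword posteriorgram (K x D) against any contiguous region
    of the utterance posteriorgram (T x D).  Free start and end points on the
    utterance axis approximate a sliding matching window."""
    K, T = len(keyword), len(utterance)
    D = np.full((K + 1, T + 1), np.inf)
    D[0, :] = 0.0  # the match may start at any utterance frame
    for i in range(1, K + 1):
        for j in range(1, T + 1):
            c = posterior_distance(keyword[i - 1], utterance[j - 1])
            D[i, j] = c + min(D[i - 1, j - 1],  # diagonal step
                              D[i - 1, j],      # advance keyword only
                              D[i, j - 1])      # advance utterance only
    end = int(np.argmin(D[K, 1:])) + 1          # best end frame (1-based)
    return D[K, end] / K, end                   # length-normalised cost
```

A keyword would then be declared detected wherever the length-normalised matching cost falls below a threshold tuned on development data.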
In particular, the transcription of audio data typically requires expensive manual labor by language experts and is very time-consuming. For an under-resourced language or dialect, the collection of large training data is a major bottleneck in developing an LVCSR system. Unsupervised learning has therefore been gaining popularity as a way to greatly reduce human effort. This is the first attempt to apply an unsupervised learning method to the subspace acoustic model. The key features presented in this paper include an augmented UBM obtained by bootstrapping and enhanced lattice-based utterance-level confide...
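Since the description of the bootstrapped UBM is truncated here, the sketch below is only an assumption-laden stand-in for the generic idea: a diagonal-covariance GMM trained with EM on a bootstrap resample of pooled feature frames. The function `train_ubm`, the farthest-point initialisation, and all parameter values are illustrative choices, not the paper's method.

```python
import numpy as np

def train_ubm(frames, n_comp=4, n_iter=20, seed=0):
    """Diagonal-covariance GMM trained with EM on a bootstrap resample of
    pooled feature frames, standing in for a UBM (illustrative only)."""
    rng = np.random.default_rng(seed)
    # bootstrap resample of the pooled frames (illustrative "augmentation")
    boot = frames[rng.integers(0, len(frames), size=len(frames))]
    N, D = boot.shape
    # farthest-point initialisation of the component means
    means = boot[[0]]
    for _ in range(n_comp - 1):
        d2 = ((boot[:, None, :] - means[None]) ** 2).sum(-1).min(axis=1)
        means = np.vstack([means, boot[np.argmax(d2)]])
    var = np.tile(boot.var(axis=0) + 1e-3, (n_comp, 1))
    w = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: responsibilities of each diagonal Gaussian for each frame
        logp = (-0.5 * (((boot[:, None, :] - means) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0) + 1e-10
        w = nk / N
        means = (r.T @ boot) / nk[:, None]
        var = (r.T @ boot ** 2) / nk[:, None] - means ** 2 + 1e-6
    return w, means, var
```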