English abstract | With the spread of smartphones, wearable devices, smart-home products and in-vehicle equipment, human-computer interaction technologies based on intelligent speech have attracted growing attention from IT academia and industry and have become one of the focal points of the mobile Internet field. Abroad, examples include Apple’s Siri, Google Now and Microsoft’s Bing Voice Search; in China, iFlyVoice, Baidu Voice Assistant and Sogou Voice Assistant have appeared within the last one or two years. Because speech is the most natural mode of human communication, enabling machines to understand human speech has become a pressing need. Speech recognition is a key technology in intelligent speech products. When the surrounding environment is relatively clean, speech recognition in these products can often meet the standard required for practical applications. When noise interference is present, however, the recognition results are unsatisfactory. In addition, speech carries variable mood, tempo, rhythm and genuine emotion, together with severe coarticulation, all of which lead to a large number of phoneme-level insertion, deletion and substitution errors. The robustness of speech recognition systems has therefore attracted wide attention from researchers. Building on a summary of previous research findings, this dissertation addresses the robustness of speech recognition technology. We analyze and compare in detail various feature extraction algorithms used in speech recognition systems, present new robust feature extraction algorithms, and propose several new frameworks for combining different models. The main research work focuses on the following four aspects: (1) Noise robustness of speech detection algorithms. For feature extraction, we present a voice activity detection (VAD) algorithm based on the combination of short-term and long-term spectral patterns. 
The algorithm not only combines the advantages of the feature based on short-time spectral peaks but also exploits the strengths of long-term spectral divergence estimation, which can incorporate speech context information. For classification modeling, we present a voice activity detection algorithm based on a hybrid architecture of a support vector machine (SVM) and a hidden Markov model (HMM). The algorithm retains the discriminative and nonlinear properties of the SVM and models the inter-frame correlation powerfu... |
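
The abstract mentions long-term spectral divergence (LTSD) as the component that brings speech context into voice activity detection. As a rough illustration only, the sketch below follows the commonly cited generic LTSD formulation: take the per-bin maximum of the magnitude spectrum over a window of neighbouring frames, then measure its average divergence in dB from a noise spectrum estimate. It is not the dissertation's combined short-term/long-term feature. The function names, frame length, hop size, window order and threshold are all assumptions chosen for the example.

```python
import numpy as np


def magnitude_spectra(signal, frame_len=256, hop=128):
    """Split a 1-D signal into Hann-windowed frames and return |FFT| per frame.

    frame_len and hop are illustrative defaults, not values from the source.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def long_term_spectral_divergence(spectra, noise_spectrum, order=6):
    """Return one LTSD value (in dB) per frame.

    For frame l, the long-term spectral envelope is the per-bin maximum of the
    magnitude spectra over frames l-order .. l+order (clipped at the edges).
    LTSD is 10*log10 of the mean, over bins, of envelope^2 / noise^2.
    """
    n_frames = spectra.shape[0]
    ltsd = np.empty(n_frames)
    for l in range(n_frames):
        lo, hi = max(0, l - order), min(n_frames, l + order + 1)
        envelope = spectra[lo:hi].max(axis=0)
        # Small constant guards against division by zero (an assumption).
        ratio = envelope**2 / (noise_spectrum**2 + 1e-12)
        ltsd[l] = 10.0 * np.log10(np.mean(ratio))
    return ltsd


def detect_speech(ltsd, threshold_db):
    """Flag frames whose LTSD exceeds a fixed, hypothetical threshold in dB."""
    return ltsd > threshold_db
```

In use, one would estimate `noise_spectrum` from frames assumed to contain no speech (for example, the mean magnitude spectrum of the first few frames) and pick `threshold_db` empirically. Because the envelope looks across neighbouring frames, the decision for each frame depends on its context rather than on that frame alone.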