Large-vocabulary, speaker-independent continuous speech recognition is one of the most difficult tasks in Automatic Speech Recognition (ASR). A breakthrough in this field would open up many speech-centric applications in information technology systems. This research aims to develop several key component technologies, particularly acoustic models and search algorithms, as well as the final integrated system. The thesis first analyzes syllable perplexity under various knowledge sources and concludes that the lexicon constraint is important at both the acoustic and language levels of Chinese speech recognition. In the acoustic modeling research, two kinds of modeling units are compared, and four categories of context-dependent acoustic models (inter-syllable, intra-syllable, tone, and endpoint) are explored individually. Two acoustic modeling schemes, based on the ideas of data sharing and parameter smoothing, are then proposed as a complete and uniform solution under the current HMM engine. One of them, which we call the knowledge- and data-driven hybrid decision tree method, is adopted as our final solution for Chinese acoustic modeling, since it can integrate all of our findings. For the recognition algorithm, a two-pass search algorithm based on the monosyllabic structure of Chinese speech is proposed. The algorithm not only reduces the search error to zero but also greatly accelerates the search. The thesis also points out another use of the algorithm: rejecting out-of-vocabulary input in many practical applications. The characteristics of two tree-copy methods for continuous speech recognition, namely the word-conditioned and time-conditioned methods, are discussed. Since Chinese has many homophones, these two tree-copy methods are combined into a new algorithm under word-bigram constraints. Taking the word lattice as the module interface, an A* algorithm is used for the final sentence search.
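The idea of an A* sentence search over a word lattice can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the toy lattice, node numbering, edge costs, and function name are all assumptions made for the example. Edge costs stand for negative log probabilities (lower is better), and an exact backward dynamic program over the lattice supplies an admissible heuristic.

```python
import heapq

# A toy word lattice: edges (start_node, end_node, word, cost), where cost is
# a negative log probability. Nodes 0..3; node 3 is the final node.
EDGES = [
    (0, 1, "zhong", 1.0), (0, 1, "chong", 2.5),
    (1, 2, "guo", 0.8), (1, 2, "gou", 2.0),
    (2, 3, "hua", 1.2), (0, 2, "zhonggou", 3.5),
]
FINAL = 3

def astar_lattice(edges, final):
    # Successor lists for forward expansion.
    succ = {}
    for s, e, w, c in edges:
        succ.setdefault(s, []).append((e, w, c))
    # Backward DP gives the exact best cost-to-goal h(node), an admissible
    # heuristic. Relax repeatedly; the lattice is a small DAG, so this converges.
    h = {final: 0.0}
    for _ in range(len(edges)):
        for s, e, w, c in edges:
            if e in h:
                h[s] = min(h.get(s, float("inf")), c + h[e])
    # A* search: priority f = g (cost so far) + h (estimated cost to go).
    heap = [(h.get(0, float("inf")), 0.0, 0, [])]
    while heap:
        f, g, node, words = heapq.heappop(heap)
        if node == final:
            return words, g
        for e, w, c in succ.get(node, []):
            heapq.heappush(heap, (g + c + h.get(e, float("inf")), g + c, e, words + [w]))
    return None, float("inf")
```

With an exact heuristic, the first final-node pop is guaranteed optimal; in a real decoder the heuristic would come from the earlier search pass rather than a separate backward sweep.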
Preliminary results show the effectiveness of the algorithm: character accuracy for speaker-independent continuous speech recognition reaches 70% on average. Finally, the component modules of an isolated-word dictation system are described, including front-end processing based on MFCC and CMN, CCBC-based adaptation, the building of the statistical language model and its post-processing, and the real-time issues of the system. Together, these technologies make it possible to reach 90% character accuracy on an unknown speech input channel.
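The CMN step of the front end can be sketched as below. This is a generic illustration of cepstral mean normalization, assuming the MFCC features have already been computed; the function name and array shapes are not from the thesis.

```python
import numpy as np

def cepstral_mean_normalization(mfcc):
    """Subtract the per-utterance mean of each cepstral coefficient.

    mfcc: array of shape (num_frames, num_coeffs), e.g. 13 MFCCs per frame.
    A stationary convolutional channel adds a roughly constant offset in the
    cepstral domain, so removing the long-term mean cancels much of the
    channel effect; this is why CMN helps on unknown input channels.
    """
    return mfcc - mfcc.mean(axis=0, keepdims=True)
```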