A language model captures and exploits the regular patterns of natural language, and plays an increasingly important role in speech recognition, machine translation, and other natural language processing applications. As part of the National High-Tech R&D Plan project Speaker-independent Chinese Continuous Speech Dictator, this research addresses the language model and its adaptation for Chinese speech recognition. Before constructing the N-gram (N = 1, 2, 3) statistical language model, an adaptive word segmentation algorithm based on interactive machine learning is designed, achieving a word segmentation accuracy of 99.89%. Trained on 80M of text from the People's Daily, this language model greatly improves the performance of our large-vocabulary continuous speech recognition system. For data smoothing, the backoff method with discounting parameters is analyzed in this thesis; it reduces the pinyin (transcription)-to-character conversion error rate by 16.95%. The role of the trigram in Chinese language modeling is also investigated, and a linguistic explanation is proposed on the basis of statistical analysis. After a detailed introduction to and comparison of different adaptive language models, supervised cache-based language model adaptation is introduced. By learning the content and style of test articles online, this method reduces the pinyin-to-character conversion error rate by 20-40%. Finally, domain-keyword-based lexicon and language model adaptation is proposed. By "forgetting" irrelevant words and automatically learning out-of-vocabulary domain keywords, it reduces the lexicon size by 60%, the number of probability parameters by 50%, and the running time by 50%. When combined with supervised cache-based language model adaptation, it reduces the pinyin-to-character conversion error rate by 37%.
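The abstract mentions backoff smoothing with discounting for the N-gram model. As a rough illustration of the general idea (not the thesis's exact formulation), the following sketch implements a bigram model with absolute-discounting backoff: each seen bigram count is discounted by a fixed amount d, and the reserved probability mass is redistributed through the unigram distribution. The toy corpus and the discount value d = 0.5 are hypothetical choices for demonstration only.

```python
from collections import Counter

# Hypothetical toy corpus and discount value, for illustration only.
corpus = "the cat sat on the mat the cat ate".split()
d = 0.5

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def p_unigram(w):
    # Maximum-likelihood unigram estimate.
    return unigrams[w] / total

def p_backoff(h, w):
    """Absolute-discounting backoff: discounted bigram estimate plus
    the reserved mass redistributed via the unigram model."""
    c_h = unigrams[h]
    if c_h == 0:
        # Unseen history: back off entirely to the unigram model.
        return p_unigram(w)
    c_hw = bigrams[(h, w)]
    # Number of distinct words observed to follow h.
    n_cont = sum(1 for (a, _) in bigrams if a == h)
    # Backoff weight: the total mass removed by discounting.
    lam = d * n_cont / c_h
    return max(c_hw - d, 0) / c_h + lam * p_unigram(w)
```

With this construction the probabilities for a seen history sum to one over the vocabulary, and frequent continuations (e.g. "cat" after "the" in the toy corpus) keep most of their mass while unseen continuations receive a small unigram-weighted share.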