A statistical language model simulates the natural language source by statistical methods. The traditional n-gram model has achieved great success in speech recognition. However, as speech recognition develops, there is a strong demand for language models with high performance and high robustness. The research in this paper aims to improve the performance of language models and to explore new modeling methods from the perspectives of robustness and prediction ability. After a careful study of the current progress and the major problems, we identified three directions in which the performance of language models can be improved substantially.

The first direction is solving the data sparseness problem. There are two ways to do so: smoothing, and building class-based models. Traditional smoothing methods use a less specific model to smooth events unseen in the current model. Because the less specific model has lower prediction ability, introducing too much of it degrades the prediction ability of the overall model. To solve this problem, we present a new smoothing method that uses a truncated SVD of a word relation matrix to compute the similarity between two words (a minimal sketch of this idea is given after this abstract). This method clearly decreases the proportion of the less specific model and unites the robustness and the prediction ability of the model.

Building class-based models is the other way to attack data sparseness. We present a series of algorithms ranging from word clustering to class-based model construction. First, we present a fast, high-performance word clustering method based on the similarity between word classes and on a hierarchical structure; this algorithm is more than ten times faster than the traditional greedy method based on training-data likelihood, and its clustering result is also much closer to the semantic system of Chinese. At the same time, we present a method to construct a vari-gram tree based on the entropy and the confidence of its nodes; the resulting class-based vari-gram model clearly outperforms the class-based n-gram model. Another new idea is to build mixtures of class-based and word-based models over a tree-structured lexicon. Data smoothing can then be carried out at an arbitrary level of the lexicon tree, with smoothing performance much better than that of traditional methods, and different smoothing strategies can be derived within the same framework (an illustrative class/word interpolation formula is given below).

The second aspect of our work focuses on improving model performance by using topic information. To introduce topic information into our models, we present an improved MAP method for language model adaptation. The traditional MAP method mixes the task-independent corpus and the task-dependent corpus with a fixed weight (see the adaptation formula sketched below). In the new method, a fuzzy controller is introduced into the adaptation process, and the many factors that influence adaptation are considered as the
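As an illustration of the SVD-based similarity idea mentioned above, the following is a minimal sketch. It assumes the "word relation matrix" is a simple word-word co-occurrence count matrix; the actual matrix construction and weighting used in the thesis are not specified here, so the numbers and helper names are purely hypothetical.

```python
import numpy as np

def truncated_svd_similarity(cooc, k=2):
    """Word-word cosine similarities from a truncated SVD of a
    word relation matrix (here: raw co-occurrence counts).

    cooc : (V, V) array of co-occurrence counts
    k    : number of singular dimensions to keep
    """
    # Truncated SVD: keep only the k largest singular values/vectors.
    U, s, Vt = np.linalg.svd(cooc, full_matrices=False)
    word_vecs = U[:, :k] * s[:k]                  # low-rank word representations
    # Cosine similarity between every pair of word vectors.
    norms = np.linalg.norm(word_vecs, axis=1, keepdims=True)
    unit = word_vecs / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Hypothetical toy example: 4 "words" with made-up co-occurrence counts.
cooc = np.array([[0, 3, 1, 0],
                 [3, 0, 2, 0],
                 [1, 2, 0, 4],
                 [0, 0, 4, 0]], dtype=float)
print(truncated_svd_similarity(cooc, k=2))
```

Similarities of this kind can then supply evidence for unseen events from similar seen words, which is the sense in which the method reduces how much weight must be given to the less specific back-off model.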
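For reference, the standard form of a class-based bigram and of a word/class mixture is sketched below. The mixture described in the thesis is defined over a tree-structured lexicon and allows smoothing at arbitrary levels of that tree, so it is more general than this two-component version; the notation here is illustrative only.

$$P_{\text{class}}(w_i \mid w_{i-1}) = P\bigl(w_i \mid c(w_i)\bigr)\, P\bigl(c(w_i) \mid c(w_{i-1})\bigr)$$

$$P_{\text{mix}}(w_i \mid w_{i-1}) = \lambda\, P_{\text{word}}(w_i \mid w_{i-1}) + (1-\lambda)\, P_{\text{class}}(w_i \mid w_{i-1})$$

where $c(w)$ is the class of word $w$ and $\lambda \in [0,1]$ is an interpolation weight.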
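The fixed-weight corpus mixing of traditional MAP adaptation can be written, in one common form, as below; the notation is assumed here rather than taken from the thesis. The improvement described above replaces the fixed weight $\tau$ with a value produced by a fuzzy controller from factors observed during adaptation.

$$P_{\text{MAP}}(w \mid h) = \frac{c_{\text{task}}(h, w) + \tau\, P_{\text{general}}(w \mid h)}{c_{\text{task}}(h) + \tau}$$

where $c_{\text{task}}$ denotes counts from the task-dependent corpus, $P_{\text{general}}$ is estimated from the task-independent corpus, and $\tau > 0$ is the fixed mixing weight.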