Community question answering systems have become an important venue for web users to find information and share knowledge. Systems such as Yahoo! Answers and Baidu Knows resolve tens of thousands of questions every day and have accumulated a huge number of question-answer pairs. The questions in these systems represent the information needs of web users, so making good use of them, both current and historical, is of great importance for improving the user experience. In this thesis, we investigate several key problems in community question answering systems on a large-scale real-world dataset: question classification, question retrieval, question label generation, and question routing. The main contributions of this thesis are as follows.

Large-Scale Question Classification by Leveraging Wikipedia Semantic Knowledge

Question classification in community question answering systems is a difficult task. Compared with the traditional classification setting, it poses two main difficulties: the large number of target categories and feature sparseness. In this thesis, we propose a two-stage approach to question classification that addresses both difficulties. In the first stage, we perform a search process that prunes the large set of categories, so that the classification effort focuses on a small candidate subset. In the second stage, we enrich the questions with Wikipedia semantic knowledge to counter data sparseness. The classification model is then trained on the enriched small subset. The experimental results show that the proposed method significantly outperforms traditional bag-of-words (BOW) methods by 10%-15%.
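The two-stage idea can be illustrated with a minimal sketch. It is not the thesis's implementation: the bag-of-words cosine search, the nearest-centroid classifier, the toy `LABELED` data, and the `CONCEPTS` dictionary (a hypothetical stand-in for Wikipedia semantic knowledge) are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict

# Toy labeled questions (assumed data, for illustration only).
LABELED = [
    ("how do i sort a list in python", "Programming"),
    ("what is the best java ide", "Programming"),
    ("how long should i boil pasta", "Cooking"),
    ("how to make creamy risotto", "Cooking"),
    ("which team won the football cup", "Sports"),
]

# Hypothetical stand-in for concepts derived from Wikipedia.
CONCEPTS = {
    "python": ["programming_language"],
    "java": ["programming_language"],
    "pasta": ["italian_cuisine"],
    "risotto": ["italian_cuisine"],
}


def tokenize(text):
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def enrich(tokens):
    """Append the concepts linked to each token (stage 2 enrichment)."""
    return tokens + [c for t in tokens for c in CONCEPTS.get(t, [])]


def candidate_categories(question, labeled, k=2):
    """Stage 1: keep only the categories of the k most similar questions."""
    q = Counter(tokenize(question))
    ranked = sorted(
        labeled,
        key=lambda item: cosine(q, Counter(tokenize(item[0]))),
        reverse=True,
    )
    return {category for _, category in ranked[:k]}


def classify(question, labeled, k=2):
    """Stage 2: classify among the candidates using enriched questions."""
    candidates = candidate_categories(question, labeled, k)
    centroids = defaultdict(Counter)
    for text, category in labeled:
        if category in candidates:
            centroids[category].update(enrich(tokenize(text)))
    q = Counter(enrich(tokenize(question)))
    # Sort category names so that ties are broken deterministically.
    return max(sorted(centroids), key=lambda c: cosine(q, centroids[c]))
```

Calling `classify("is risotto hard to cook", LABELED)` first narrows the three categories down to the two suggested by the nearest questions, then compares the enriched question only against those candidates. In a realistic setting the search stage would run over an index of historical questions, and the second stage would train a stronger classifier on the pruned subset.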
Learning the Latent Topics for Question Retrieval

After several years of development, community question answering systems have accumulated hundreds of millions of historical questions and answers. It is therefore of great importance to find historical questions that are semantically equivalent or relevant to a queried question. Compared with traditional text or web pages, the content of a question is very short, so the “lexical gap” problem is more severe for retrieval. In this thesis, we propose a topic model incorporated with the category informat...