Research on Key Technologies of Question Analysis and Processing for Community Question Answering

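Title	Research on Key Technologies of Question Analysis and Processing for Community Question Answering (面向社区问答的问题分析与处理关键技术研究)
Alternative title	Research on Question Analysis and Processing for Community Question Answering
Author	Cai Li (蔡黎)
Date	2012-05-24
Degree type	Doctor of Engineering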
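Chinese abstract	Community question answering (CQA) systems have become an important medium for acquiring information and sharing knowledge on the Internet. CQA sites such as Yahoo! Answers and Baidu Zhidao see tens of thousands of questions posted every day, and they have also accumulated hundreds of millions of existing question-answer pairs. Questions in a CQA system represent user needs; how to use these questions, both current and existing, is of significant research value for better satisfying users' information needs and improving the user experience. This thesis presents a systematic study of key technologies for question analysis in CQA systems, covering question classification, similar question retrieval, generation of new category labels for questions, and question routing. The main contributions are as follows:
A large-scale-category question classification method based on semantic knowledge expansion: Question classification in CQA systems is usually a difficult task. Compared with traditional text classification, it poses two main difficulties: a very large number of target categories and sparse classification features. To address this, the thesis proposes a two-stage method. In the first stage, retrieval is used to mine the category information of semantically similar questions in the existing question collection, which turns the large-scale-category classification problem into a conventional classification problem. In the second stage, structured semantic knowledge from Wikipedia is used to expand the classification features of a question, which reduces the negative effect of feature sparsity and effectively improves classification accuracy. Experimental results show that, compared with traditional bag-of-words classification methods, the proposed method improves classification performance by 10%-15%.
A similar question retrieval model that fuses category information and latent topics: After several years of development, CQA systems have accumulated hundreds of millions of existing questions and answers. Finding, among the large number of answered questions, those that are semantically similar to a query question can immediately satisfy the user's information need, avoid repeated questions, and make better use of existing resources. Questions are much shorter than traditional texts or web pages, so the "lexical gap" problem, a known difficulty in information retrieval, is even more pronounced in similar question retrieval. Unlike completely unstructured documents or web pages, existing questions carry category information, which this thesis proposes to incorporate into a latent topic model; the resulting retrieval model is then combined with a translation-based language model to form a new similar question retrieval model. Experiments were conducted on a Yahoo! Answers dataset. The results show that, compared with several baseline retrieval methods, the proposed method improves retrieval performance by 10%-20%.
A method for generating new category labels for questions that incorporates information from the existing category hierarchy: CQA systems usually use a category hierarchy to organize and manage the questions that users submit. Having editors maintain this hierarchy raises problems of accuracy and timeliness. Previous work did not consider the consistency between generated category labels and the existing category hierarchy. The main motivation of this work is to incorporate information from the existing hierarchy into the label generation process. The proposed method first maps a question to Wikipedia concepts, computes concept weights from the distribution of concepts over different domains and categories, and extracts the highly weighted concepts. It then uses the category graph provided by Wikipedia, mining the highly weighted concepts together with the category graph information to generate candidate labels. Finally, the candidate labels are filtered and re-ranked to make them more consistent with the existing category hierarchy. An evaluation procedure was designed that uses the existing category hierarchy to validate the method; the results show that, compared with the best baseline system, the proposed method achieves a performance improvement of about 10%.
A method incorporating answer quality for question...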
English abstract	Community question answering (CQA) systems have become an important venue for web users to obtain information and share knowledge. CQA sites such as Yahoo! Answers and Baidu Zhidao receive tens of thousands of new questions every day and have accumulated a huge number of question-answer pairs. Questions in a CQA system represent the information needs of web users, so making good use of these questions, both current and historical, is important for improving the user experience. In this thesis, we investigate several key problems in CQA systems using a large-scale real dataset: question classification, question retrieval, question label generation, and question routing. The main contributions of this thesis are as follows:
Large-Scale Question Classification by Leveraging Wikipedia Semantic Knowledge: Question classification in CQA systems is usually a difficult task. Compared with traditional text classification, there are two main difficulties: the large number of target categories and feature sparseness. We propose a two-stage approach that tackles both. In the first stage, we perform a search over existing questions to prune the large category set, so that classification effort is focused on a small subset of candidate categories. In the second stage, we enrich questions with Wikipedia semantic knowledge to counter data sparseness. The classification model is then trained on the enriched questions over this small subset. Experimental results show that the proposed method significantly outperforms traditional bag-of-words (BOW) methods by 10%-15%.
Learning the Latent Topics for Question Retrieval: After several years of development, CQA systems have accumulated hundreds of millions of historical questions and answers. It is of great importance to find historical questions that are semantically equivalent or relevant to a queried question. Compared with traditional texts or web pages, questions are very short, so the "lexical gap" problem known from information retrieval is more serious in question retrieval. In this thesis, we propose a topic model incorporated with the category informat...
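The first stage of the two-stage classification approach summarized in the abstract (pruning a large category set by retrieving similar archived questions) can be illustrated with a short Python sketch. This is a minimal sketch under stated assumptions, not the thesis's implementation: the abstract does not specify the retrieval model, so Jaccard token overlap is swapped in as a stand-in, and the names tokenize, jaccard, candidate_categories, top_k, and the toy ARCHIVE data are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for real preprocessing)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard overlap between two token sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_categories(query, archive, top_k=3):
    """Stage 1 (assumed form): retrieve the top_k archived questions most
    similar to the query and return their categories, most frequent first.

    archive is a list of (question_text, category) pairs.
    """
    q = tokenize(query)
    ranked = sorted(
        archive,
        key=lambda pair: jaccard(q, tokenize(pair[0])),
        reverse=True,
    )
    counts = Counter(category for _, category in ranked[:top_k])
    return [category for category, _ in counts.most_common()]


# Hypothetical toy archive of already-categorized questions.
ARCHIVE = [
    ("how do i train a puppy to sit", "Pets/Dogs"),
    ("best food for a puppy", "Pets/Dogs"),
    ("how do i install python on windows", "Computers/Programming"),
    ("how to cook rice", "Food/Cooking"),
]

if __name__ == "__main__":
    print(candidate_categories("how do i train my puppy", ARCHIVE, top_k=2))
```

A second-stage classifier would then choose only among the returned candidate categories, using question features enriched with external knowledge as the abstract describes; that stage is not sketched here.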
Keywords	Community Question Answering System; Question Classification; Question Retrieval; Label Generation; Question Routing
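Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6414
Collection	Graduates_Doctoral dissertations
Recommended citation (GB/T 7714)	Cai Li. Research on Key Technologies of Question Analysis and Processing for Community Question Answering[D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2012.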

Files in this item
File name/Size	Document type	Version type	Access type	License
CASIA_20081801462908 (5692KB)			Not yet open	CC BY-NC-SA