Research on Key Technologies of Question Understanding Based on Semantic Knowledge

	Research on Key Technologies of Question Understanding Based on Semantic Knowledge
	林子琦 (Lin Ziqi)
	2020-12
Pages	128
Degree type	Master's
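Abstract (translated from Chinese)	In the Internet era of explosive information growth, question answering (QA) systems accept questions posed in natural language and directly return accurate, clear answers. They have been applied successfully in many fields and, after search engines, have become a closely watched new way of acquiring information. However, Chinese differs from other languages in morphology and modes of expression, and Chinese QA systems lack annotated data for semantic representation and for question entity discovery and linking, so many semantic representation models and data-driven learning algorithms designed for English are hard to transfer directly. Meanwhile, in knowledge-base QA systems, most models focus on building scoring models for question-answer pairs and rely on methods such as information retrieval to obtain candidate answers; lacking a deep understanding of question semantics, they struggle to reason out the answers to complex questions. How to capture and exploit the linguistic features of Chinese questions and build QA systems suited to Chinese is therefore an important research direction for deepening machine understanding of Chinese questions, with significant research value and application prospects. To address the lack of deep understanding of Chinese questions in knowledge-base QA systems, this thesis analyzes and summarizes the linguistic features of basic Chinese phrases and builds a semantic representation model for Chinese questions on top of the concept knowledge tree knowledge representation model. It proposes a progressive joint framework for entity discovery and linking in Chinese questions based on question representations, which reduces error accumulation, lessens the impact of scarce annotated data, and connects questions to knowledge. Based on the question semantic representation model, it also designs an answer reasoning algorithm for complex questions over hierarchical semantic knowledge, improving the accuracy and generalization of semantic reasoning. The main research content and contributions are as follows. (1) To characterize in depth the semantic relationships between Chinese questions and knowledge, the thesis proposes a semantic representation model for Chinese questions based on the concept knowledge tree. The model first builds semantic representation models for 15 kinds of basic Chinese phrases; then, according to the compositional relations among the components of a question and the knowledge the question matches, it builds a question semantic representation model containing both linguistic features and semantic information, providing a semantic connection from question representations to hierarchical knowledge. The thesis also proposes a question representation generation algorithm based on dependency parsing, which builds the question representation model automatically from a question. The algorithm is validated on a manually constructed semantic representation dataset; the results show that it effectively identifies and analyzes the components of phrases and that the proposed model fully captures the semantic relationships and features of the different components. (2) To address error accumulation during entity discovery and linking in Chinese questions, and the poor performance of supervised models when annotated data are scarce, the thesis proposes a progressive joint framework for entity discovery and linking. The framework uses the mutual information between discovery and linking to improve entity name generation, filtering, merging, and entity linking, and to reduce error accumulation. Combined with the question representation model, it also provides a rule-based, domain-independent method for question entity discovery. On two real-world datasets, the method improves the entity discovery and linking performance of classical algorithms by 0.18%-8.98% and 0.02%-6.97%, and by 0.29%-13.07% and 0.26%-12.62%, respectively. The results show that the method effectively reduces the impact of scarce annotated data on supervised models, and that the joint framework can flexibly combine different types of entity discovery and linking methods, using their mutual dependencies to improve both tasks in question understanding. (3) Because scoring models based on question-answer pairs have difficulty with complex questions, the thesis proposes a knowledge-driven question semantic reasoning algorithm. Based on the question semantic representation model, it first designs a hierarchical, dynamic reasoning algorithm that repeatedly updates the question representation with intermediate reasoning results, so that complex Chinese questions are reasoned over recursively. It then builds a confidence-based reasoning algorithm for uncertain cases, extending question semantic reasoning from specific domains to the open domain and improving generalization. Experiments show that the proposed algorithm extracts valid answers in both domain-specific and open-domain QA, and that using the question semantic representation lets it capture the key information in a question and improves reasoning performance. Finally, the thesis applies these results to intelligent education for children. It designs and implements a Chinese QA system based on semantic knowledge, composed of a data layer, a model layer, an algorithm layer, a business layer, and an application layer, and builds an open-domain knowledge base semi-automatically. Learning through question answering frees knowledge acquisition from constraints of time and place and makes children's learning richer and more engaging. The system can switch freely among knowledge bases or methods according to the application scenario to support QA in different domains, and it provides users with a visual analysis tool that makes the knowledge bases easier to maintain and manage.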
English abstract	In the Internet era of exploding information, question answering (QA) systems accept questions in natural language and directly return accurate, clear answers, and they have been applied successfully in many fields. After the search engine, QA has become an emerging way of acquiring information that has received much attention. However, Chinese differs from other languages in morphology, expression, and other respects. In addition, Chinese QA systems lack annotated data for semantic representation and for question entity discovery and linking. As a result, most semantic representation methods and data-driven learning methods developed for English cannot be transferred directly. Moreover, most current methods focus on scoring question-answer pairs and depend strongly on information retrieval to obtain candidate answers. They fail to capture the deep semantics contained in questions and therefore can hardly reason out the answers to complex questions. How to capture and exploit Chinese linguistic characteristics and build QA systems suited to Chinese is thus an important research direction for deepening machine comprehension of Chinese questions, with significant research value and application prospects. To address the lack of deep question comprehension in Chinese knowledge-based QA systems, we summarize the linguistic characteristics of basic Chinese phrases and construct a Chinese question semantic representation model based on the concept knowledge tree (CKT). To connect questions with knowledge, we design a progressive joint framework for Chinese question entity discovery and linking that uses question representations and reduces both error accumulation and the impact of insufficient annotated data. Based on the question semantic representation model, we design a hierarchical knowledge-driven reasoning algorithm for complex questions, which improves the accuracy and generalization of semantic reasoning.
The main contributions are summarized as follows. (1) To describe in depth the semantic relationships between Chinese questions and knowledge, we propose a Chinese question semantic representation model based on CKT. We first design semantic representation models for 15 kinds of basic Chinese phrases. Then, according to the compositional relations among the components of a question and the knowledge the question matches, we build a question semantic representation model that contains both linguistic characteristics and semantic information and that connects question representations to hierarchical knowledge. We also propose a question representation generation algorithm based on dependency parsing, which automatically builds the semantic representation model from a question. Finally, we validate the algorithm on a manually constructed semantic representation dataset. The results show that it effectively recognizes and analyzes the components of phrases, and that the semantic representation model captures the semantic relationships and features of the different components. (2) To address error accumulation and the inferior performance of supervised models when annotated data are insufficient, we propose a progressive joint framework for entity discovery and linking. It leverages the mutual information between the two tasks to improve entity name generation, filtering, merging, and linking, which reduces error accumulation. Furthermore, we design a rule-based, domain-independent method for question entity discovery on top of the question representation model. On two real datasets, the framework improves the entity discovery and linking performance of classical algorithms by 0.18%-8.98% and 0.02%-6.97%, and by 0.29%-13.07% and 0.26%-12.62%, respectively. The experimental results show that the progressive joint framework effectively reduces the impact of insufficient annotated data on supervised models.
The framework can also flexibly integrate different types of entity discovery and linking methods, leveraging their mutual dependencies to improve the performance of both tasks. (3) Most existing scoring models for question-answer pairs can hardly deal with complex questions. Hence, we propose a knowledge-driven question semantic reasoning algorithm. We first design a hierarchical, dynamic question semantic reasoning algorithm based on the question semantic representation model. It uses the reasoning results to continually update the semantic representation and thereby reasons over complex questions recursively. To improve generalization, we further propose a confidence-based semantic reasoning algorithm for uncertain cases, which extends the knowledge-driven reasoning algorithm from specific domains to the open domain. The experiments indicate that our method extracts valid answers in both domain-specific and open-domain QA. Moreover, by using question semantic representations, the proposed algorithm effectively captures the critical information in questions, which improves the performance of question semantic reasoning. Finally, we apply this research to intelligent education for children and develop a Chinese QA system based on semantic knowledge. It consists of five layers: a data layer, a model layer, an algorithm layer, a business layer, and an application layer. We also construct an open-domain knowledge base semi-automatically. Through question answering, the system frees learning from the limitations of time and space and makes knowledge learning in children's education richer and more engaging. Depending on the application scenario, the system can switch freely among the corresponding knowledge bases or methods to support QA in different domains.
In addition, the system provides users with a visual analysis tool for knowledge bases, which makes them easier to maintain and manage.
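The recursive reasoning described in contribution (3) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the abstract gives no implementation details, so the toy knowledge base `KB`, the `Query` node type, and the `resolve` function are all hypothetical names introduced here. The sketch only shows the general idea of answering the innermost sub-question first and substituting its result into the enclosing question.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical toy knowledge base: (subject, relation) -> object.
KB = {
    ("Alice", "mother"): "Carol",
    ("Carol", "birthplace"): "Paris",
}


@dataclass
class Query:
    """One node of a toy hierarchical question representation (assumed structure)."""
    subject: Union[str, "Query"]  # an entity name or a nested sub-question
    relation: str


def resolve(node: Union[str, Query, None]) -> Optional[str]:
    """Answer the innermost sub-question first, then reuse its result one level up."""
    if node is None or isinstance(node, str):
        return node
    subject = resolve(node.subject)          # recursive step on the nested part
    return KB.get((subject, node.relation))  # None when the KB has no matching fact


# "Where was Alice's mother born?" as a nested representation.
question = Query(subject=Query(subject="Alice", relation="mother"), relation="birthplace")
answer = resolve(question)
```

A lookup with no matching fact returns `None` rather than raising, so a failed inner step simply propagates upward as an unanswered question.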
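Keywords	knowledge-base question answering; concept knowledge tree; question semantic representation; semantic reasoning; entity discovery and linking
Language	Chinese
Document type	Thesis
Identifier	http://ir.ia.ac.cn/handle/173211/41617
Collection	Graduates_Master's Theses; Complex Systems Cognition and Decision Laboratory_Intelligent Systems and Engineering
Recommended citation (GB/T 7714)	林子琦 (Lin Ziqi). Research on Key Technologies of Question Understanding Based on Semantic Knowledge[D]. Institute of Automation, Chinese Academy of Sciences, 2020.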

Files in this item
File name/size	Document type	Version type	Access type	License
基于语义知识的中文问题理解关键技术研究_（8144KB）	Thesis		Restricted access	CC BY-NC-SA