CASIA OpenIR > Graduates > Master's Theses
Research on Automatic Taxonomy Construction Methods Based on Pre-trained Language Models
王思懿
2024-05-16
Pages: 60
Degree Type: Master's
Chinese Abstract

A taxonomy describes hypernym-hyponym semantic relations between concepts and organizes a set of concepts into a hierarchical structure. As an important type of knowledge system, taxonomies are widely used in tasks such as information retrieval, personalized recommendation, and question answering. At present, most taxonomies are built manually: the linguistic taxonomy WordNet and the common-sense taxonomy Cyc were constructed entirely by domain experts, and the taxonomy in DBpedia was distilled by engineers who studied how concept categories/tags in Wikipedia are named and organized. Manual construction is not only time-consuming and labor-intensive; the resulting taxonomies also frequently omit large numbers of concepts, making knowledge bases costly to update and maintain. Efficient methods for automatic taxonomy construction are therefore needed. Existing methods fall roughly into two categories, pipeline and end-to-end: pipeline methods suffer from error propagation, while end-to-end methods lack guidance from global information. This thesis addresses these problems as follows:

1. An End-to-End Automatic Taxonomy Construction Method Based on Reinforcement Learning

To address the error propagation of pipeline methods and the lack of global guidance in end-to-end methods, this thesis proposes an end-to-end taxonomy construction method based on reinforcement learning. The method uses a pre-trained language model to extract word features and a visible matrix to convert the two-dimensional concept tree into a one-dimensional sequence. A pre-trained language model then extracts features again; these serve as the features of the concept tree and guide the selection of reinforcement-learning actions. The tree features include not only the semantic features of the nodes but also the structural information of the tree, so the model can take the full hierarchy into account when choosing where to place a node. Experiments on the public WordNet dataset show that the method improves F1 by 1.7% over the baseline model and by 3.9% over the best result at the time under the same conditions.

2. An Automatic Taxonomy Construction Method Based on Tree-Structure Awareness

To address end-to-end methods' weak handling of local information, this thesis proposes a tree-structure-aware automatic taxonomy construction method. The method combines long short-term memory networks with graph neural networks to extract concept-tree features bottom-up, so that all information converges at the root node; this style of feature extraction preserves the tree's structural information more effectively. A degree-aware evaluation function is designed that jointly accounts for how the tree's hierarchy and structure affect the importance of each edge, assigning each edge a different weight and thereby reflecting construction quality more comprehensively. In addition, to fully exploit the tree's local information, the method considers the rich relations between words rather than only hypernym-hyponym relations. Experiments on the public English WordNet dataset show a 9.4% improvement over the original method, a 1.3% improvement over the best result at the time, and a 7.5% improvement over the first method above.

3. An Exploration of Automatic Taxonomy Construction Based on Large Language Models

Since large language models remain under-explored for automatic taxonomy construction, this thesis conducts an experimental exploration of taxonomy construction with large models. To compare the effectiveness of pre-trained language models and large models on this task, a series of comparative experiments is run covering model inputs, fine-tuning methods, and model types; in particular, the thesis studies how input templates affect hypernymy judgments and tries different fine-tuning approaches and model types. The goal is a deeper understanding of how large language models behave in taxonomy construction. In these exploratory experiments, the best result reaches an F1 of 76.1%, improvements of 18.1% and 11.3% over the two methods above, respectively, which demonstrates the great potential of large models in this area.

English Abstract

Taxonomies describe hypernym-hyponym semantic relationships among concepts and organize concept sets into hierarchical structures. They are an important type of knowledge system, widely applied in tasks such as information retrieval, personalized recommendation, and question answering. Currently, most taxonomies are constructed manually. For example, the linguistic taxonomy WordNet and the common-sense taxonomy Cyc were built entirely by domain experts, while the taxonomy in DBpedia was derived by engineers who studied how concept categories/tags are named and organized in Wikipedia. This manual approach is not only time-consuming and labor-intensive but also prone to omitting many concepts, making knowledge bases costly to update and maintain. Efficient methods for automatic taxonomy construction are therefore needed. Existing methods can be roughly divided into two categories, pipeline and end-to-end: pipeline methods suffer from error propagation, while end-to-end methods lack guidance from global information. This thesis addresses these issues through the following research:

1. End-to-End Taxonomy Construction Using Reinforcement Learning

To address the error propagation of pipeline methods and the lack of global information guidance in end-to-end methods, this thesis proposes an end-to-end taxonomy construction method based on reinforcement learning. The method uses a pre-trained language model to extract word features and employs a visible matrix to convert the two-dimensional concept tree into a one-dimensional sequence. A pre-trained language model then extracts features again, and these are used as the features of the concept tree to guide the selection of reinforcement-learning actions. The concept-tree features include not only the semantic features of the nodes but also the structural information of the tree, enabling the model to consider the full hierarchy when choosing where to place a node. Experiments on the public WordNet dataset demonstrate that the method achieves a 1.7% improvement in F1 score over the baseline model and a 3.9% improvement over the best result at the time under the same conditions.
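As a rough illustration of the tree-to-sequence step, the sketch below (not the thesis implementation; the toy tree, function names, and the exact visibility rule are all assumptions) flattens a concept tree in depth-first order and builds a visibility mask in which each node can attend only to itself and its ancestors:

```python
# Hypothetical sketch: flatten a 2-D concept tree into a 1-D node
# sequence plus a visibility mask, so a sequence encoder can still
# respect tree structure. Illustrative only, not the thesis code.

def flatten_with_visibility(tree, root):
    """DFS-order the nodes and build a mask where entry (i, j) is 1
    only when node j is an ancestor of node i (or i itself)."""
    order = []
    ancestors = {}  # node -> set containing the node and its ancestors

    def dfs(node, path):
        ancestors[node] = set(path) | {node}
        order.append(node)
        for child in tree.get(node, []):
            dfs(child, path + [node])

    dfs(root, [])
    idx = {name: i for i, name in enumerate(order)}
    n = len(order)
    mask = [[0] * n for _ in range(n)]
    for node in order:
        for anc in ancestors[node]:
            mask[idx[node]][idx[anc]] = 1
    return order, mask

# Toy taxonomy: entity -> {animal -> {dog}, plant}
tree = {"entity": ["animal", "plant"], "animal": ["dog"]}
order, mask = flatten_with_visibility(tree, "entity")
```

In this toy example "dog" can see its ancestors "animal" and "entity", while the unrelated siblings "plant" and "dog" are mutually invisible, which is the property a visible matrix is meant to encode.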

2. Taxonomy Construction Based on Tree-Structure Awareness

To address end-to-end methods' lack of local information handling, this thesis proposes a tree-structure-aware method for automatic taxonomy construction. The method combines long short-term memory (LSTM) networks with graph neural networks (GNNs) to extract concept-tree features bottom-up, aggregating all information at the root node; this style of feature extraction preserves the tree's structural information more effectively. A degree-aware evaluation function is designed that jointly considers how the tree's hierarchy and structure affect the importance of each edge, assigning every edge a distinct weight so that construction quality is reflected more comprehensively. Furthermore, to fully exploit the tree's local information, the method considers the rich relationships between words, not limited to hypernym-hyponym relations. Experimental results on the publicly available English WordNet dataset show a 9.4% improvement over the original method, a 1.3% improvement over the best result at the time, and a 7.5% improvement over the first method above.
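The two core ideas here, bottom-up aggregation and degree-aware edge weighting, can be caricatured in a few lines. This is a toy sketch under invented assumptions (scalar node features, sum aggregation, a reciprocal-degree weight), not the thesis model, which would run LSTM/GNN layers over feature vectors:

```python
# Toy sketch of bottom-up feature aggregation over a concept tree:
# child features flow upward so the root summarizes the whole tree.

def aggregate_bottom_up(tree, feats, node):
    """Return the node's feature combined with all descendant features
    (here a plain sum; a real model would use learned LSTM/GNN layers)."""
    total = feats[node]
    for child in tree.get(node, []):
        total += aggregate_bottom_up(tree, feats, child)
    return total

def degree_aware_weight(tree, parent):
    """Toy degree-aware edge weight: edges leaving a high-degree parent
    are down-weighted so dense regions do not dominate the evaluation."""
    return 1.0 / (1 + len(tree.get(parent, [])))

# Toy tree: root -> {a -> {c}, b}, with made-up scalar features.
tree = {"root": ["a", "b"], "a": ["c"]}
feats = {"root": 1.0, "a": 2.0, "b": 3.0, "c": 4.0}
root_feat = aggregate_bottom_up(tree, feats, "root")   # information converges at the root
w = degree_aware_weight(tree, "root")                  # weight of an edge out of "root"
```

The point of the sketch is the direction of information flow: every node's feature reaches the root exactly once, and edge weights depend on local tree shape rather than being uniform.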

3. Exploration of Concept Taxonomy Construction Based on Large Models

Since large language models remain under-explored for automatic taxonomy construction, this thesis conducts an experimental exploration of taxonomy construction based on large models. To compare the effectiveness of pre-trained language models and large models on this task, a series of comparative experiments is carried out covering model inputs, fine-tuning methods, and model types; in particular, the thesis examines how input templates affect hypernymy judgments and tries different fine-tuning approaches and model types. The aim is a deeper understanding of how large language models behave in taxonomy construction. In these exploratory experiments, the best result reaches an F1 score of 76.1%, improvements of 18.1% and 11.3% over the preceding two methods, respectively, demonstrating the great potential of large models in this area.
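To make the "input template" variable concrete, here is a hypothetical illustration of how different templates might phrase the same hypernymy query for a language model. The template wordings below are invented for illustration; the actual templates used in the thesis are not specified here:

```python
# Hypothetical prompt templates for judging whether one concept is a
# hypernym of another. Varying the template while holding the word
# pair fixed is one way to probe template sensitivity.

TEMPLATES = [
    "Is '{hypo}' a kind of '{hyper}'? Answer yes or no.",
    "'{hypo}' is a type of '{hyper}'. True or false?",
]

def build_prompts(hypo, hyper):
    """Instantiate every template for one candidate (hyponym, hypernym) pair."""
    return [t.format(hypo=hypo, hyper=hyper) for t in TEMPLATES]

prompts = build_prompts("dog", "animal")
```

Each candidate edge of the taxonomy yields one prompt per template, so comparing per-template accuracy isolates the effect of input phrasing from the model and fine-tuning method.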

Keywords: Automatic Taxonomy Construction; Reinforcement Learning; Pre-trained Language Models
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/56646
Collection: Graduates_Master's Theses
Recommended Citation (GB/T 7714):
王思懿. 基于预训练语言模型的概念体系自动构建方法研究[D], 2024.
Files in This Item:
File Name/Size | Document Type | Access | License
王思懿-最终论文0528.pdf (2808KB) | Thesis | Restricted | CC BY-NC-SA