Research on Representation and Correlation Methods for Information Networks

	Research on Representation and Correlation Methods for Information Networks
	车飞虎
	2022-05-16
Pages	130
Degree type	Doctoral
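Abstract (translated from Chinese)	Information networks generally refer to directed graphs with specific types of nodes and edges, and they are ubiquitous in real life. Depending on the types of nodes and edges, information networks take three common forms: homogeneous information networks, heterogeneous information networks, and knowledge graphs. Learning representations of information networks and then using those representations to uncover latent associations has wide application in bioinformatics, recommender systems, information retrieval, and other fields. In recent years, driven by deep learning and graph neural networks, representation and correlation methods for information networks have achieved outstanding results, but several challenges remain: (1) methods for homogeneous networks depend heavily on labeled data or negative samples; (2) methods for heterogeneous networks struggle to capture the semantic similarity between subgraphs; (3) high-quality hard negative samples are sparse in knowledge graphs. To mitigate these problems, this thesis adopts self-supervised learning and takes negative samples as its entry point, studying three directions: learning without negative samples, designing new negative samples for contrastive learning, and generating high-quality hard negative samples from existing negative samples. The main work and contributions can be summarized in the following three parts.
(1) Unsupervised, negative-sample-free learning for homogeneous information networks. Existing learning methods for homogeneous information networks rely mainly on the guidance of supervision or on contrasting positive and negative samples, but in real scenarios supervision and effective negative samples are often hard to obtain. To handle the setting with neither supervision nor negative samples, this thesis proposes a homogeneous information network learning model based on a bootstrapping mechanism. The model has two parts, an online network and a target network, and its core idea is that the two networks learn from each other, which removes the dependence on supervision and negative samples. In addition, because the online network and the target network need similar but different inputs, this thesis uses two classes of graph data augmentation to generate two views of the homogeneous network. Four groups of experiments on three public datasets verify the effectiveness of the proposed model.
(2) Capturing similarity between subgraphs of heterogeneous information networks. Because heterogeneous information networks carry rich semantic information, existing methods generally use meta-paths to divide a heterogeneous information network into several homogeneous subgraphs. These subgraphs have semantically related topological structures and are therefore strongly similar in semantics, yet existing methods struggle to capture this similarity. To remedy this, this thesis proposes a contrastive learning model between subgraphs. The model treats two subgraphs with the same node features and semantically related topologies as the anchor sample and the positive sample, and pulls the representations that the encoders produce for them closer together in the vector space. As a contrast to this closeness, this thesis designs negative samples that have the same node features but no topological structure, and pushes the anchor sample away from them. To further highlight the importance of topology, the encoders for the positive and negative samples share parameters, so that the gap between the anchor-positive distance and the anchor-negative distance arises only because the positive and anchor samples have semantically related topologies. Six groups of experiments on four datasets demonstrate the superiority of the model from different angles.
(3) Mining high-quality hard negative samples for knowledge graphs. The core of knowledge graph representation learning is contrasting positive and negative triples. Because a knowledge graph contains only positive triples, negative triples are usually generated by replacing an entity of a positive triple with a randomly chosen entity. Current methods have two shortcomings: first, negative samples obtained by fixed sampling become progressively easier to distinguish as training proceeds, which leads to vanishing gradients; second, negative samples obtained by substituting existing entities are semantically narrow and cannot fuse the semantic information of different negative samples. To address these shortcomings, this thesis generates high-quality hard negative samples through a mixing operation. For dynamic negative sampling, it proposes two criteria for selecting existing hard negative samples, which let the model choose the negatives that are hard at each stage of training. The selected hard negatives are then mixed to produce hard negative samples built on virtual entities. These samples fuse the semantic information of different hard negatives and therefore provide more valuable gradient updates. Four groups of experiments on two datasets and four scoring functions show that, compared with previous negative sampling methods, the proposed model generates high-quality hard negative samples and achieves better performance.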
English abstract	Information networks generally refer to directed graphs with specific types of nodes and edges, and they are widespread in real life. According to the types of nodes and edges, information networks take three common forms: homogeneous information networks, heterogeneous information networks, and knowledge graphs. Learning representations of information networks and then obtaining potential correlations from them has a wide range of applications in bioinformatics, recommender systems, information retrieval, and other fields. In recent years, driven by deep learning and graph neural networks, representation learning and correlation methods for information networks have achieved outstanding results, but some challenges remain: (1) homogeneous networks rely heavily on labeled data or negative samples; (2) it is difficult for heterogeneous networks to capture the semantic similarity between subgraphs; (3) high-quality hard negative samples are relatively sparse in knowledge graphs. To alleviate these problems, this thesis uses self-supervised learning methods and takes negative samples as an entry point, conducting research in three directions: learning without negative samples, designing new negative samples for contrastive learning, and generating high-quality hard negative samples based on existing negative samples. The main work and innovations of this thesis can be summarized in the following three aspects.
(1) Research on homogeneous information networks without labeled data or negative samples. Existing learning methods for homogeneous information networks mainly rely on the guidance of supervision or on the comparison of positive and negative samples, but supervision and effective negative samples are difficult to obtain in real scenarios. To deal with the unsupervised scenario without negative samples, this thesis proposes a homogeneous information network learning model based on a bootstrapping mechanism. The model contains two parts: the online network and the target network. The core idea is to make the online network and the target network learn from each other, which frees the model from supervision and negative samples. Furthermore, considering that the online network and the target network require similar but different inputs, this thesis uses two classes of graph data augmentation methods to generate two views of the homogeneous network. Four groups of experiments on three public datasets validate the effectiveness of the model.
(2) Research on capturing the similarity between subgraphs of heterogeneous information networks. Since heterogeneous information networks contain rich semantic information, existing methods generally divide a heterogeneous network into several homogeneous subgraphs through meta-paths. These subgraphs have strong semantic similarity due to their semantically related topological structures, but existing methods struggle to capture this similarity. To make up for this deficiency, this thesis proposes a contrastive learning model between subgraphs. The model treats two subgraphs with the same node features and semantically related topologies as the anchor sample and the positive sample, respectively, and then pulls the representations of the two subgraphs closer in the vector space. To contrast with the closeness of these two subgraphs, this thesis designs negative samples with the same node features but no topology, and pushes the anchor samples farther away from the negative samples. To further highlight the importance of topology, this thesis makes the encoders of the positive and negative samples share the same parameters, so that the only reason the anchor-positive distances are much smaller than the anchor-negative distances is that the positive and anchor samples have semantically related topologies. This thesis conducts six groups of experiments on four datasets to demonstrate the superiority of the model from different aspects.
(3) Research on mining high-quality hard negative samples for knowledge graphs. The core of knowledge graph representation learning is to contrast positive and negative triples. Since there are only positive triples in knowledge graphs, negative triples are generally generated by randomly selecting other entities to replace the entities in positive triples. Present methods have two shortcomings: first, the negative samples obtained by fixed sampling become easier to distinguish as training proceeds, so that the gradients vanish; second, the negative samples obtained by substituting existing entities are semantically narrow, and the semantic information of different negative samples cannot be integrated. To cope with these deficiencies, this thesis generates high-quality hard negative samples through a mixing operation. To achieve dynamic negative sampling, this thesis proposes two criteria for selecting existing hard negatives, which allow the model to select the negatives that are hard at each stage of training. To fuse the semantic information of different negatives, this thesis mixes the selected hard negatives into hard negative samples built on virtual entities, which offer more valuable gradients. Four groups of experiments on two datasets and four scoring functions show that the proposed model generates higher-quality hard negatives and achieves performance that surpasses previous negative sampling methods.
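The mixing step described in part (3) can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not state the two selection criteria, the scoring functions, or the mixing rule. The sketch assumes a TransE-style score, a top-k-by-score selection rule, and a convex combination with a uniformly sampled coefficient; the names `transe_score`, `select_hard_negatives`, and `mix_negatives` are hypothetical.

```python
import random

def transe_score(h, r, t):
    """Assumed scoring function: negative L1 distance of h + r - t.
    A higher score means the model finds the triple more plausible."""
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

def select_hard_negatives(h, r, candidates, k):
    """Assumed criterion: keep the k candidate tails that the current
    embeddings score highest, i.e. the ones hardest to reject.
    Re-running this during training makes the selection dynamic."""
    ranked = sorted(candidates, key=lambda t: transe_score(h, r, t), reverse=True)
    return ranked[:k]

def mix_negatives(hard, n_mixed, rng):
    """Build virtual tail entities as convex combinations of two
    distinct hard negatives (an assumed, mixup-style rule)."""
    mixed = []
    for _ in range(n_mixed):
        a, b = rng.sample(hard, 2)
        lam = rng.random()  # mixing coefficient in [0, 1)
        mixed.append([lam * x + (1 - lam) * y for x, y in zip(a, b)])
    return mixed

# Toy embeddings, randomly initialised for illustration only.
rng = random.Random(0)
dim = 4
head = [rng.uniform(-1, 1) for _ in range(dim)]
rel = [rng.uniform(-1, 1) for _ in range(dim)]
candidates = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(20)]

hard = select_hard_negatives(head, rel, candidates, k=5)
virtual = mix_negatives(hard, n_mixed=3, rng=rng)
```

Each vector in `virtual` lies on the segment between two selected hard negatives, so it does not correspond to any existing entity while still blending the information of both.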
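Keywords	information networks; network representation learning; bootstrapping mechanism; contrastive learning; negative sampling
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/48820
Collection	Graduates_Doctoral dissertations
Recommended citation (GB/T 7714)	车飞虎. Research on Representation and Correlation Methods for Information Networks[D]. Institute of Automation, Chinese Academy of Sciences, 2022.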

Files in this item
File name/size	Document type	Version type	Access type	License
面向信息网络的表示与关联方法研究.pdf (8695 KB)	Thesis		Restricted access	CC BY-NC-SA