Knowledge Graphs (KGs) describe real-world entities and their relations in a structured form and are an important way to represent knowledge. Since Google formally proposed the concept in May 2012, KGs have been widely applied to query understanding, intelligent Q\&A, personalized recommendation and other fields. Although current KGs contain large amounts of structured information, they are far from complete, because the number of entities and relations in the real world is enormous and most KGs have been built either collaboratively or (partly) automatically. As a result, KGs have many missing relations, and this defect limits their usefulness. Research on relation discovery for KGs aims to solve this problem.
Relation discovery for KGs is the task of predicting and completing missing relations. It involves two aspects: (1) reasoning out missing relations from the structured facts already in a KG; (2) extracting relations between entity pairs from unstructured text and adding them to the KG. Much existing work has made progress on both aspects. For the first aspect, there are two main families of methods: traditional logic-rule methods and representation learning methods for KGs. Methods based on logic rules are vulnerable to data sparsity, and the computational complexity of rule generation and reasoning is very high, so they cannot meet the needs of large-scale KGs. Representation learning methods alleviate the data sparsity problem by learning dense vectors for entities and relations in a low-dimensional vector space. They are more computationally efficient than logic rules and therefore better suited to relation completion for large-scale KGs. Hence, for the first aspect, we mainly study representation learning methods. The second aspect includes unsupervised, supervised and distantly supervised relation extraction. Unsupervised methods extract the words between a given entity pair as relations, and these words are difficult to map into a specific KG. Supervised methods need manually annotated data, which limits their application domains and scale. The distant supervision strategy automatically labels a dataset with a KG, and its relation labels come directly from the KG (no mapping operation is needed), so it is well suited to extracting relations for KGs. Hence, for the second aspect, we focus on distant supervision.
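The distant supervision strategy described above can be sketched in a few lines. This is a minimal illustrative sketch, not the dissertation's implementation: the function name and data layout are assumptions. The core idea is that if a KG contains the triple (h, r, t), every sentence mentioning both h and t is labelled with relation r.

```python
def distant_label(sentences, kb_triples):
    """Distant-supervision labelling sketch (illustrative only).

    sentences: list of (head, tail, text) mention tuples;
    kb_triples: iterable of (head, relation, tail) KG facts.
    Returns (text, relation) training pairs.
    """
    rel_of = {(h, t): r for h, r, t in kb_triples}
    labelled = []
    for head, tail, text in sentences:
        rel = rel_of.get((head, tail))
        if rel is not None:
            # Distant-supervision assumption: this sentence expresses rel.
            labelled.append((text, rel))
    return labelled
```

Note that the labelling assumption is exactly what introduces the noisy instances that the attention mechanism in part 3 is designed to down-weight.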
In this dissertation, we study key methods of relation discovery for KGs. The main contents and contributions are as follows.
1. Previous representation learning methods do not consider the multiple types of entities and relations: entities and relations of different types share the same mapping matrices, which reduces the accuracy of relation prediction. To overcome this problem, we propose TransD, a representation learning method for reasoning based on dynamic mapping matrices. It defines two vectors for each entity and relation: the first represents the general meaning of the entity or relation, while the second is used to construct mapping matrices dynamically, providing a flexible way to project the relation-specific components of an entity vector into the relation's vector space, where the translation is performed. Every entity-relation pair therefore has a unique mapping matrix determined by the pair itself, which adapts to the multiple types of entities and relations. In addition, TransD is more efficient than previous work because its matrix-by-vector operations can be replaced by vector operations. We evaluate TransD on triplet classification and link prediction over WordNet and Freebase, and the experimental results show significant improvements over previous models.
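The dynamic projection and its vector-only shortcut can be sketched as follows. This is a simplified illustration assuming the entity and relation spaces share one dimension (TransD allows them to differ, which adds a padding/truncation step); function names are hypothetical.

```python
import numpy as np

def transd_project(e, e_p, r_p):
    """Project entity vector e with the dynamic matrix M = r_p e_p^T + I.

    Since M @ e = (e_p . e) * r_p + e, the matrix-by-vector product
    collapses to pure vector operations -- the source of TransD's
    efficiency advantage.
    """
    return (e_p @ e) * r_p + e

def transd_score(h, h_p, r, r_p, t, t_p):
    """Score a triple: translation h_perp + r ~ t_perp in relation space.

    Higher (less negative) scores indicate more plausible triples.
    """
    h_perp = transd_project(h, h_p, r_p)
    t_perp = transd_project(t, t_p, r_p)
    return -np.linalg.norm(h_perp + r - t_perp) ** 2
```

Because each projection depends on both the entity's and the relation's projection vectors, every entity-relation pair gets its own mapping matrix without storing a full matrix per pair.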
2. To alleviate the negative effects of heterogeneity (different relations link different numbers of entity pairs) and imbalance (within a relation, the numbers of distinct head and tail entities differ) in KGs, we propose TranSparse, a representation learning method based on adaptive sparse mapping matrices. In TranSparse, all mapping matrices are defined as sparse matrices, and the method contains two models, ``share" and ``separate". The ``share" model addresses heterogeneity: head and tail entities share the same mapping matrices, whose sparse degrees are determined by the number of entity pairs linked by each relation. The more entity pairs a relation links, the smaller the sparse degree of its mapping matrix, and vice versa. The ``separate" model builds on ``share" and addresses imbalance. It assigns two mapping matrices to each relation, one for head entities and one for tail entities, with sparse degrees determined by the number of entities the relation links on the head or tail side. Sparse matrices not only adapt to different KGs but also reduce computation, which makes the model easy to scale to large KGs. In experiments, TranSparse improves prediction results significantly.
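The adaptive sparse-degree rule can be sketched as a simple linear interpolation, a hedged reading of the description above: the relation with the largest link count gets the densest matrix (sparse degree at a minimum hyperparameter), and rarer relations get sparser matrices. The function name and the linear form are assumptions for illustration.

```python
def sparse_degree(n_r, n_max, theta_min=0.0):
    """Sparse degree of a relation's mapping matrix (sketch).

    n_r: number of entity pairs (``share" model) or of head/tail
         entities (``separate" model) linked by this relation;
    n_max: the maximum such count over all relations;
    theta_min: minimum sparse degree (a hyperparameter).
    Larger n_r -> smaller sparse degree -> denser matrix.
    """
    return 1.0 - (1.0 - theta_min) * n_r / n_max
```

Under this rule a frequent relation has enough training pairs to estimate a dense matrix, while a rare relation is forced toward a sparse matrix with fewer free parameters, which also cuts computation.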
3. To weaken the influence of noisy data and supply sufficient background knowledge about entities, we propose a distantly supervised relation extraction method with sentence-level attention and entity descriptions. For a multi-instance bag, the model uses the difference between the two given entities' vectors to represent relation features, extracts each sentence's feature vector with a Piecewise Convolutional Neural Network (PCNN) module, and computes the similarities (attention weights) between the relation features and each sentence's features through a hidden layer. Sentences with higher weights are valid instances; the others are noise. Finally, the weighted sum of all sentence feature vectors serves as the bag's features, which the model feeds into a softmax classifier. In addition, we extract entity descriptions from Freebase and Wikipedia. The descriptions not only provide more information for predicting relations but also yield better entity representations for the attention module. Experimental results show that the attention mechanism selectively focuses on relevant sentences by assigning them higher weights, and that the model obtains useful background knowledge from the entity descriptions. Our model outperforms all baseline systems in both held-out and manual evaluations.
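The attention step can be sketched as follows, assuming the sentence features have already been produced by the PCNN encoder. This is an illustrative simplification (the bilinear similarity form and names are assumptions, not the exact architecture): the entity-vector difference serves as the relation query, each sentence is scored against it, and the bag representation is the attention-weighted sum.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def bag_representation(S, h_vec, t_vec, W):
    """Sentence-level attention over one multi-instance bag (sketch).

    S: (num_sentences, d) PCNN feature vectors for the bag;
    h_vec, t_vec: embeddings of the head and tail entities (dim k);
    W: (d, k) hidden-layer matrix scoring sentence-relation similarity.
    Returns the weighted bag features and the attention weights.
    """
    rel = t_vec - h_vec          # relation features: entity-vector difference
    scores = S @ W @ rel         # similarity of each sentence to the relation
    alpha = softmax(scores)      # attention weights; noisy sentences get low alpha
    return alpha @ S, alpha      # weighted sum = bag features for the classifier
```

The bag features would then be passed to a softmax classifier over relation labels; down-weighting noisy sentences rather than discarding them keeps the model differentiable end-to-end.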
Of the above work, parts 1 and 2 address relation prediction based on representation learning, and part 3 addresses relation extraction from unstructured text. The two lines complement each other and together constitute the content of relation discovery.