Relation extraction aims to extract attribute relations and relations between entities from semi-structured and unstructured texts. It is one of the core tasks in information extraction. Because of its potential impact on large-scale knowledge base construction, question answering, semantic search, and related applications, it has attracted much attention from both academia and industry. At the same time, massive heterogeneous Web data and the diversity of Web language pose great challenges to relation extraction techniques. In this paper, we focus on extracting relations from semi-structured and unstructured texts. The main content is as follows:

1. To handle the problem of template inconsistency in weakly semi-structured texts, we propose a method that leverages site-level knowledge, consisting of templates and attributes, to extract attribute relations from weakly semi-structured texts. First, we use a graph-based random walk model to acquire templates and attributes with high confidence, which together constitute the site-level knowledge. Then we use this knowledge to identify weakly semi-structured texts in each page and extract attribute-value pairs, which yield attribute relations with the corresponding entities. The experiments show that, compared with a baseline method that does not use site-level knowledge, our method improves extraction performance significantly.

2. Distant supervision (DS) for relation extraction suffers from the problem of noisy labeling. Most solutions model the noisy instances in the form of multi-instance learning. However, the distant supervision assumption may not hold, which degrades performance. In this paper, we address this problem with a novel approach that explores distinctive features. First, we make use of all the training data (both the labeled part that satisfies the DS assumption and the part that does not). Then, we employ an unsupervised method based on a topic model to discover the feature-relation distribution. We use this distribution to compute the clarity of a feature, and we compute the distinctiveness of a feature by combining its clarity with its informativeness, which is measured by the length and frequency of the feature. Finally, we train the extractor using the distinctiveness as the value of each feature, so that distinctive features receive greater weight than noisy ones. Experimental results show that the approach significantly outperforms the baseline methods in both th...
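
The feature weighting described in item 2 can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration only: clarity is taken as one minus the normalized entropy of a feature's relation distribution, informativeness as a product of log-scaled length and frequency, and distinctiveness as the product of the two. The function names `clarity`, `informativeness`, and `distinctiveness` are hypothetical.

```python
import math


def clarity(relation_probs):
    """Assumed definition: 1 - normalized entropy of p(relation | feature).

    A feature concentrated on one relation scores near 1; a feature spread
    evenly over all relations scores near 0.
    """
    if len(relation_probs) < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in relation_probs if p > 0)
    return 1.0 - entropy / math.log(len(relation_probs))


def informativeness(length, frequency):
    """Assumed definition: log-scaled feature length times log-scaled frequency."""
    return math.log1p(length) * math.log1p(frequency)


def distinctiveness(relation_probs, length, frequency):
    """Assumed combination: clarity multiplied by informativeness."""
    return clarity(relation_probs) * informativeness(length, frequency)


# A feature that points almost exclusively at one relation...
peaked = distinctiveness([0.9, 0.05, 0.05], length=4, frequency=120)
# ...versus a feature that is equally likely under every relation.
uniform = distinctiveness([1 / 3, 1 / 3, 1 / 3], length=4, frequency=120)
```

Under these assumed definitions, the peaked feature receives a larger value than the uniform one, which matches the stated goal of giving distinctive features more weight than noisy ones during training.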