With the rapid development of the Internet, text data on the network is increasing dramatically. This massive text data contains a wealth of knowledge that can support a variety of intelligent applications, but it also brings a huge amount of redundant information, which makes it difficult for people to find the information they want. We need to discover knowledge from large-scale text data and convert that knowledge into a form that computers can understand. Knowledge extraction aims to solve this problem.
In this thesis, we focus on the problem of extracting knowledge from unstructured text given a predefined relation set. In order to define the relation set and preprocess the text, we first study the technology of topic extraction. Three different kinds of knowledge extraction methods are then proposed: a supervised relation extraction method, a distant supervised relation extraction method, and a joint knowledge extraction method. The main contributions of this thesis are as follows:
Firstly, a hybrid neural network model is proposed for topic extraction. The topic of a text carries its main semantic information, and extracting this topic information is helpful both for defining the relation set and for extracting domain-specific knowledge. We propose a neural model based on two-dimensional convolution and two-dimensional pooling to extract topic information; this model is able to capture the semantic information in the text. We conduct experiments on six public datasets, and the experimental results show that the proposed method outperforms most of the state-of-the-art methods.
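The two-dimensional convolution and pooling over a sentence's embedding matrix can be sketched as follows. This is a minimal illustration only; the input size, kernel size, and activation are assumptions for demonstration, not the exact configuration used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = rng.standard_normal((10, 8))   # 10 words, 8-dim embeddings (assumed sizes)

def conv2d_valid(x, kernel):
    """Valid 2D cross-correlation of a single-channel input with one kernel."""
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool2d(x, size):
    """Non-overlapping 2D max-pooling (input dims assumed divisible by size)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

kernel = rng.standard_normal((3, 3))      # one illustrative 3x3 filter
feat = np.tanh(conv2d_valid(sentence, kernel))  # (8, 6) feature map
pooled = maxpool2d(feat, 2)                     # (4, 3) after 2x2 pooling
```

Because the convolution and pooling slide over both the word axis and the embedding axis, the extracted features combine information across adjacent words and across embedding dimensions, rather than along the word axis alone.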
Secondly, a neural network based on an attention mechanism is proposed for supervised relation extraction. Relation extraction is the task of identifying the relationship between two given entities in a text, which is an important step of the pipelined knowledge extraction method. The main weakness of most existing methods is that their features are explicitly derived from Natural Language Processing (NLP) tools, so errors generated by these tools propagate through the methods, and features constructed for one domain cannot be utilized in another. Another weakness is that most existing methods treat all words in the text as equally important, ignoring the fact that keywords are more crucial to the relation than other words. Based on the above analysis, we propose a novel neural network based on an attention mechanism to extract relations without any handcrafted features. Experimental results on the SemEval-2010 relation classification task show that the proposed method, using only word embeddings, outperforms most existing methods.
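The word-level attention idea, weighting each word by a learned score so that keywords dominate the sentence representation, can be sketched in a few lines. The dimensions and the single attention query vector here are illustrative assumptions, not the thesis's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((7, 16))  # hidden states for 7 words, 16-dim each (assumed)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

w = rng.standard_normal(16)       # trainable attention query vector (assumption)
scores = np.tanh(H) @ w           # one unnormalized relevance score per word
alpha = softmax(scores)           # attention weights over words, summing to 1
sent = alpha @ H                  # weighted sum -> sentence representation (16,)
```

A relation classifier (e.g. a softmax layer) would then be applied to `sent`; during training, words that are predictive of the relation receive larger weights in `alpha`, while uninformative words are suppressed.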
Thirdly, a hierarchical selective attention mechanism based neural network is proposed for distant supervised relation extraction. In relation extraction, one challenge faced when building a machine learning system is the generation of training examples, as manually labelling text is time-consuming. Distant supervised relation extraction methods avoid manual labelling but suffer from the wrong label problem: a sentence that mentions two entities does not necessarily express their relation. To solve these problems, we propose a hierarchical selective attention mechanism based neural network, which does not rely on manually annotated text. We conduct experiments on a widely used dataset, and the experimental results demonstrate that the proposed method performs significantly better than most existing methods.
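The selective attention step against the wrong label problem can be sketched as follows: all sentences mentioning the same entity pair form a bag, and each sentence is weighted by how well it matches a relation query, so wrongly labelled sentences get small weights. The bag size, dimensions, and dot-product scoring are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
bag = rng.standard_normal((4, 16))  # 4 sentence vectors for one entity pair (assumed)
r = rng.standard_normal(16)         # query embedding of the candidate relation (assumed)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

alpha = softmax(bag @ r)   # sentences that match the relation get larger weights
bag_repr = alpha @ bag     # bag representation; noisy sentences are de-emphasized
```

The bag representation, rather than any single (possibly mislabelled) sentence, is then fed to the relation classifier, which softens the impact of sentences that mention the entities without expressing the relation.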
Finally, a hybrid neural network is proposed to jointly extract entities and their relations. Traditional pipelined knowledge extraction methods treat the task as two separate subtasks, i.e., named entity recognition and relation extraction, and thus neglect the relevance between them. Besides, most existing joint methods are feature-based structured systems, which need complicated feature engineering and rely heavily on supervised NLP tools. We propose a hybrid neural network to jointly extract entities and their relations without using any handcrafted features. Experimental results on the CoNLL04 dataset demonstrate that the proposed model, using only word embeddings as input, achieves state-of-the-art performance.
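The joint setup, where one shared encoder feeds both an entity-tagging head and a relation-classification head so the two subtasks inform each other, can be sketched as below. The layer sizes, tag set, relation set, and max-pooling readout are illustrative assumptions, not the thesis's exact design.

```python
import numpy as np

rng = np.random.default_rng(3)
E = rng.standard_normal((6, 8))        # 6 words, 8-dim word embeddings (only input)
W_shared = rng.standard_normal((8, 16))
H = np.tanh(E @ W_shared)              # shared representation used by both subtasks

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

W_ner = rng.standard_normal((16, 5))   # 5 entity tags, e.g. a BIO scheme (assumption)
W_rel = rng.standard_normal((16, 6))   # 6 relation types (assumption)

tag_probs = softmax(H @ W_ner)              # per-word entity tag distribution
rel_probs = softmax(H.max(axis=0) @ W_rel)  # sentence-level relation distribution
```

Joint training would sum the cross-entropy losses of both heads, so gradients from entity supervision and relation supervision both update the shared encoder, which is what lets the model exploit the relevance between the two subtasks that pipelined methods ignore.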