Entity relation extraction is one of the main tasks of information extraction. Its input is multi-structured data: structured data (infoboxes), semi-structured data (tables and lists), and unstructured data (free text). Its output is a set of fact triples extracted from that input. Relation triples can be extracted easily from structured and semi-structured data, so current research focuses mainly on extracting them from unstructured text. For example, given the sentence "Yao Ming was born in Shanghai" as input, a relation extraction algorithm should output the fact triple that relates Yao Ming to Shanghai. Such triples can be used to build a large, high-quality knowledge base, which benefits a wide range of NLP tasks such as question answering, the semantic web, and machine translation.

A massive amount of Chinese information now exists on the internet, which makes research on Chinese entity relation extraction important. However, current research concentrates on English resources, and far fewer studies have been conducted on Chinese corpora. Unlike English, Chinese requires word segmentation, and its proper nouns are not marked by capitalization, so Chinese entity relation extraction is more difficult and more challenging.

This thesis investigates approaches to Chinese entity relation extraction. Its main contributions are as follows:

1. A Chinese knowledge base is built. The structured part of the Chinese text from Baidu Encyclopedia and Hudong Encyclopedia is extracted, transformed into triples, and stored. A relation word is classified as a high-frequency relation word when its frequency in the knowledge base is larger than a threshold, and as a low-frequency relation word otherwise.

2. The extraction of high-frequency relation words is cast as a sequence labeling problem. Traversing the knowledge base yields enough triples containing a given high-frequency relation word, and these triples are used to tag training data automatically. The sentences used for extraction are located in the corresponding encyclopedia entry pages by a keyword matching strategy. A conditional random field (CRF) model is then trained to label the part of each sentence to be extracted, and the corresponding relation triples are generated from the labeled output.

3. The method based on some simple rules and knowledge base is...
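As an illustration of contributions 1 and 2, the following Python sketch shows one way the frequency split and the automatic tagging of training data could look. It is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the character-level `B-VAL`/`I-VAL`/`O` tag set, the sample triples, and the relation words `出生地` (birthplace) and `身高` (height) are all hypothetical, and keyword matching is reduced to plain substring search.

```python
from collections import Counter


def split_relations(triples, threshold):
    """Split relation words into high- and low-frequency sets.

    A relation word is high-frequency when its count in the knowledge
    base is strictly larger than `threshold` (assumed comparison).
    """
    counts = Counter(rel for _, rel, _ in triples)
    high = {rel for rel, n in counts.items() if n > threshold}
    low = set(counts) - high
    return high, low


def tag_sentence(sentence, subject, value):
    """Tag each character of `sentence` with a hypothetical BIO label.

    Returns None when keyword matching fails, i.e. when the subject or
    the value of the triple does not occur in the sentence.
    """
    if not value or subject not in sentence:
        return None
    start = sentence.find(value)
    if start < 0:
        return None
    tags = ["O"] * len(sentence)
    tags[start] = "B-VAL"
    for i in range(start + 1, start + len(value)):
        tags[i] = "I-VAL"
    return list(zip(sentence, tags))


def build_training_data(triples, sentences, relation):
    """Collect tagged sentences for one high-frequency relation word."""
    data = []
    for subject, rel, value in triples:
        if rel != relation:
            continue
        for sentence in sentences:
            tagged = tag_sentence(sentence, subject, value)
            if tagged is not None:
                data.append(tagged)
    return data


# Hypothetical knowledge-base triples: (entity, relation word, value).
kb = [
    ("姚明", "出生地", "上海"),
    ("刘翔", "出生地", "上海"),
    ("姚明", "身高", "2.26米"),
]
high, low = split_relations(kb, threshold=1)
training = build_training_data(kb, ["姚明出生于上海"], "出生地")
```

Each element of `training` is a list of (character, tag) pairs, which is the general shape of input that CRF toolkits expect once features are attached. Tagging at the character level sidesteps word segmentation in this sketch; a segmentation-based variant would tag words instead.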