Research on Key Technologies of Event Extraction from Unstructured Text

	Research on Key Technologies of Event Extraction from Unstructured Text
	Yubo Chen (陈玉博)
	2017-05-24
Degree type	Doctor of Engineering
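Abstract (translated from the Chinese)	With the development and spread of Internet technology, the network has become an indispensable part of daily life. The Internet holds a vast amount of unstructured electronic text. As web data keeps growing, it is increasingly important to help people understand this data, to discover knowledge quickly from massive unstructured text, and to represent that knowledge in a form that computers can readily "understand", thereby reducing the cost of human learning. Information extraction was proposed to address this problem. Event extraction is an important component of information extraction and one of its hardest problems. It aims to extract the event information that users care about from unstructured text and present it in structured form: who did what, when, and where. Event extraction not only helps manage and serve Internet information but also provides important support for text understanding, lifting text analysis from the language level to the content level. It has potential applications in large-scale knowledge base construction, question answering, semantic search, and public opinion monitoring. Event extraction has therefore attracted wide attention in academia and industry and has become an increasingly popular research topic.
In recent years, machine-learning-based event extraction has made some progress. Supervised methods dominate and have produced a series of results, yet their performance has remained relatively low. Existing methods face three main challenges. (1) Features: feature extraction relies heavily on existing natural language processing tools, which leads to error accumulation. (2) Corpora: manually annotated training corpora are time-consuming, laborious, and expensive to produce, and they are small in scale and cover few event types. (3) Extraction process: each candidate event argument is predicted independently, ignoring the relations and mutual influence among the arguments within an event. Addressing these challenges, this dissertation studies key technologies for event extraction from unstructured text. The main results are as follows:
1. To address the error accumulation caused by over-reliance on natural language processing tools during feature extraction, we propose an event extraction method based on dynamic multi-pooling convolutional neural networks. The method does not depend on existing natural language processing tools. It uses a dynamic multi-pooling convolutional neural network to learn features that represent event information automatically from raw text, and it specifically accounts for sentences that contain multiple events. Concretely, the input text is first represented as word embeddings. The vectors of the candidate event trigger and event argument are then taken as lexical-level features, while the dynamic multi-pooling convolutional neural network performs semantic composition to obtain sentence-level features. Finally, the two kinds of features are concatenated to form the final feature vector. Experiments show that, compared with baseline systems, the method significantly improves event extraction performance and alleviates the error accumulation of traditional feature extraction, and that dynamic multi-pooling improves system performance further.
2. To address the time, effort, and cost of manual annotation, we propose a method for large-scale automatic generation of event corpora based on world knowledge and linguistic knowledge. The method requires no manual annotation. First, world knowledge is used to find the key arguments and triggers of each event type. Linguistic knowledge is then used to expand and filter the triggers. Finally, we propose a distant-supervision labeling method for event extraction, which uses triggers and key arguments to label event corpora automatically. Evaluation shows that the automatically generated corpus reaches 85% accuracy and can effectively augment manually annotated corpora, which in turn improves the performance of event extraction models. In addition, to handle noise in the automatically generated data, we incorporate multi-instance learning into the event extraction method based on dynamic multi-pooling convolutional neural networks, reducing the impact of labeling noise on the results. Experiments show that the method outperforms baseline systems under both held-out evaluation and manual evaluation, effectively alleviating the labeling-noise problem.
3. To address the fact that traditional methods ignore the internal structure of an event and the mutual influence and semantic relations among candidate arguments, we propose an event extraction method based on a bidirectional long short-term memory tensor neural network. The method accounts for the mutual influence and semantic relations among the candidate arguments of an event and jointly predicts all of its arguments. Concretely, a bidirectional long short-term memory network first produces context-based word representations and sentence-level representations. A tensor layer then captures the mutual influence and semantic relations among candidate arguments, and all arguments are predicted jointly. Experiments show that the method captures these interactions well and achieves better results than baseline systems.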
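The dynamic multi-pooling step in contribution 1 can be illustrated with a minimal sketch. The abstract does not give the pooling rule, so this assumes that one convolutional feature map is split at the candidate trigger and candidate argument positions and that each segment is max-pooled separately. The function name, the `empty_value` parameter, and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def dynamic_multi_pooling(
    feature_map: Sequence[float],
    split_positions: Sequence[int],
    empty_value: float = 0.0,
) -> List[float]:
    """Max-pool one convolutional feature map segment by segment.

    Each split position closes a segment (inclusive); the final segment runs
    to the end of the sentence. `empty_value` fills a segment that has no
    tokens (an assumption made here so the output length stays fixed).
    """
    if any(p < 0 or p >= len(feature_map) for p in split_positions):
        raise ValueError("split positions must be valid token indices")
    pooled: List[float] = []
    start = 0
    for pos in sorted(split_positions):
        segment = feature_map[start:pos + 1]
        pooled.append(max(segment) if segment else empty_value)
        start = max(start, pos + 1)
    tail = feature_map[start:]
    pooled.append(max(tail) if tail else empty_value)
    return pooled


# Hypothetical feature map over a six-token sentence, with the candidate
# trigger at index 1 and the candidate argument at index 3.
feature_map = [0.1, 0.9, 0.3, 0.2, 0.7, 0.4]
print(dynamic_multi_pooling(feature_map, [1, 3]))  # [0.9, 0.3, 0.7]
print(max(feature_map))                            # 0.9 (ordinary max pooling)
```

Ordinary max pooling keeps a single value per feature map, whereas the segment-wise version keeps one value per region of the sentence, which is one plausible way to retain clues about a second event in the same sentence.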
English abstract	With the development and spread of the Internet, the network has become an essential part of everyday life. The Internet holds large amounts of unstructured text. Faced with ever-growing Web data, we need to discover knowledge quickly from large-scale unstructured text and convert that knowledge into a form that computers can understand. Information extraction aims to solve this problem. Event extraction is a critical part of information extraction and one of the most difficult tasks in the field. It aims to automatically recognize events in unstructured text and represent them as structured information, such as who did what, when, and where. Event extraction not only helps manage information and services on the Internet but also supports text comprehension. It can lift text analysis from the language level to the content level and has the potential to help large-scale knowledge base construction, question answering, semantic search, and public opinion monitoring. Event extraction has therefore received widespread attention in academia and industry and is becoming an increasingly popular research topic.
Recently, event extraction systems based on machine learning have made good progress, and systems based on the supervised paradigm hold the dominant position. However, these supervised systems face three challenges. (1) Features: these approaches rely primarily on elaborately designed features and complicated natural language processing tools, so they suffer from error propagation. (2) Training data: hand-labeled training data is expensive to produce, covers few event types, and is limited in size, which makes it hard for supervised methods to extract events at the scale needed for knowledge base population. (3) Approaches: nearly all traditional approaches extract each argument of an event separately, without considering the interaction between candidate arguments. This dissertation focuses on these challenges, and its main achievements are as follows:
1. To address error propagation in the feature extraction procedure of supervised event extraction approaches, we propose an event extraction approach based on Dynamic Multi-pooling Convolutional Neural Networks (DMCNNs). DMCNNs automatically learn features from raw text without depending on existing Natural Language Processing (NLP) tools. In addition, we devise a dynamic multi-pooling layer to capture more valuable clues in sentences that contain more than one event. Specifically, the input of the system is raw text. First, word tokens are transformed into vectors by looking up word embeddings. Then, the word embeddings of the candidate triggers and arguments are used directly as lexical-level features. Meanwhile, sentence-level features are learned using DMCNNs. Finally, the lexical-level and sentence-level features are concatenated to form the learned feature vector. The experimental results show that our method achieves the best performance among all compared methods and effectively overcomes the error propagation of traditional feature extraction procedures.
2. To solve the data labeling problem, we propose to automatically label training data for event extraction using world knowledge and linguistic knowledge. The approach does not depend on human-labeled data. First, we use world knowledge to identify the key arguments of an event and use them to automatically detect events and the corresponding trigger words. Then, we employ linguistic knowledge to filter noisy triggers and expand the trigger set. Finally, we propose a distant supervision method for event extraction, which uses key arguments and triggers to automatically label training data. The experimental results show that the quality of our large-scale automatically labeled data is competitive with elaborately human-annotated data. The accuracy of the automatically labeled data is 85%. The automatically labeled data can also augment traditional human-annotated data, which significantly improves extraction performance. To address the wrong-label problem in the training data, we treat distantly supervised event extraction as a multi-instance problem and incorporate multi-instance learning into the DMCNNs. The experimental results show that, compared with the baseline methods, our method achieves better results on both held-out and manual evaluation, which means that it effectively addresses the wrong-label problem.
3. To make use of the structure of an event and the interaction and semantic relations between candidate arguments, we propose to extract events with Bidirectional Long Short-Term Memory Tensor Neural Networks (BLSTM-TNN). First, we use a context-aware word representation model based on Bidirectional Long Short-Term Memory Networks (BLSTM) to capture the semantics of words from plain text. In addition, we devise a tensor layer to explore the interaction and semantic relations between candidate arguments and to predict all arguments simultaneously. The experimental results show that our approach significantly outperforms the state-of-the-art methods.
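The distant-supervision labeling step in contribution 2 can be sketched roughly as follows. The abstract says only that key arguments and triggers are used to label data automatically. The matching rule used here (a sentence is labeled when it contains at least one known trigger and every key argument, each matched as a single token) is an assumption, and the function name, role names, and sample sentences are hypothetical.

```python
from typing import Dict, List, Optional, Set


def label_sentence(
    tokens: List[str],
    event_type: str,
    triggers: Set[str],
    key_arguments: Dict[str, str],
) -> Optional[dict]:
    """Return an automatic event annotation for the sentence, or None.

    Assumed rule: the sentence must contain one of the (lowercase) triggers
    and all key arguments; otherwise it is left unlabeled.
    """
    lowered = [t.lower() for t in tokens]
    trigger_hits = [i for i, t in enumerate(lowered) if t in triggers]
    if not trigger_hits:
        return None
    argument_positions: Dict[str, int] = {}
    for role, value in key_arguments.items():
        value = value.lower()
        if value not in lowered:
            return None  # a key argument is missing, so skip the sentence
        argument_positions[role] = lowered.index(value)
    return {
        "event_type": event_type,
        "trigger_index": trigger_hits[0],
        "arguments": argument_positions,
    }


# Hypothetical knowledge-base entry and trigger list for one event type.
triggers = {"married", "wed"}
key_arguments = {"spouse_1": "Alice", "spouse_2": "Bob"}

print(label_sentence("Alice married Bob in 2010 .".split(),
                     "marriage", triggers, key_arguments))
print(label_sentence("Alice met Bob in 2010 .".split(),
                     "marriage", triggers, key_arguments))  # None
```

A rule of this kind inevitably labels some sentences wrongly, which is the noise that the multi-instance learning step in the abstract is meant to reduce.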
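Keywords	information extraction; event extraction; unstructured text; convolutional neural networks; automatic corpus generation
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/14647
Collection	Graduates_Doctoral Dissertations
Affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	Chen Yubo. Research on Key Technologies of Event Extraction from Unstructured Text [D]. Beijing: University of Chinese Academy of Sciences, 2017.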

Files in this item
File name/size	Document type	Version type	Access type	License
yubochen.pdf (4636KB)	Dissertation		Restricted access	CC BY-NC-SA