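Research on Key Technologies of Event Extraction from Unstructured Text (面向非结构化文本的事件抽取关键技术研究)
Yubo Chen (陈玉博)
Subtype: Doctor of Engineering
Thesis Advisor: Jun Zhao (赵军)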
2017-05-24
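Degree Grantor: University of Chinese Academy of Sciences
Place of Conferral: Beijing
Keyword: information extraction; event extraction; unstructured text; convolutional neural network; automatic corpus generation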
Abstract
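With the development and spread of Internet technology, the network has become an indispensable part of daily life. The Internet holds a vast amount of unstructured electronic text. Faced with ever-growing Web data, it is increasingly important to help people understand these data, to discover knowledge quickly from massive unstructured text, and to represent that knowledge in a form that computers can readily "understand", thereby reducing the cost of human learning. Information extraction was proposed to solve exactly this problem.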
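Event extraction is a key component of information extraction and one of the hard problems in the field. It aims to extract the event information that users care about from unstructured text and present it in structured form, such as who did what, when, and where. Event extraction not only helps manage and serve Internet information but also provides important support for understanding text content: it lifts text analysis from the language level to the content level and has potential applications in large-scale knowledge base construction, question answering, semantic search, and public opinion monitoring. As a result, event extraction has drawn wide attention from academia and industry and has become an increasingly active research topic. In recent years, machine-learning-based event extraction has made some progress. Supervised methods dominate and have produced a series of results, yet their performance has remained relatively low. Existing methods face three main challenges: (1) Features: feature extraction relies too heavily on existing natural language processing tools, which leads to error accumulation. (2) Corpora: manually annotating training corpora is time-consuming, laborious, and expensive, and the resulting corpora are small and cover few event types. (3) Extraction process: each candidate event argument is predicted independently, ignoring the relations and mutual influence among the elements within an event. This dissertation addresses these challenges by studying key technologies for event extraction from unstructured text. The main results are: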
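1. To address the error accumulation caused by over-reliance on natural language processing tools during feature extraction, we propose an event extraction method based on dynamic multi-pooling convolutional neural networks. The method does not depend on existing natural language processing tools. It uses a dynamic multi-pooling convolutional neural network to learn features that represent event information automatically from raw text, and it specifically handles sentences that contain more than one event. Concretely, the input text is first represented as word embeddings. The vectors of the candidate event trigger and event arguments are then taken as lexical-level features, while the dynamic multi-pooling convolutional neural network performs semantic composition to produce sentence-level features. Finally, the two kinds of features are concatenated to form the final feature vector. Experiments show that, compared with the baseline systems, the method significantly improves event extraction performance and mitigates the error accumulation of traditional feature extraction, and that dynamic multi-pooling brings a further gain.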
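2. To address the time, labor, and cost of manual corpus annotation, we propose a method that automatically generates large-scale event corpora from world knowledge and linguistic knowledge. The method requires no manual annotation. It first uses world knowledge to discover the key arguments and trigger words of each event type, then uses linguistic knowledge to expand and filter the trigger words, and finally applies a distant-supervision labeling method designed for event extraction, which uses the trigger words and key arguments to annotate event corpora automatically. Evaluation shows that the accuracy of the automatically generated corpus can reach 85% and that the corpus can effectively extend manually annotated corpora, which in turn improves event extraction models. In addition, to deal with noise in the automatically generated data, we integrate multi-instance learning into the event extraction method based on dynamic multi-pooling convolutional neural networks, reducing the impact of labeling noise on the results. Experiments show that the method outperforms the baseline systems under both held-out evaluation and manual evaluation and effectively alleviates the labeling noise problem.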
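3. To address the fact that traditional methods ignore the internal structure of an event and the mutual influence and semantic relations among candidate arguments, we propose an event extraction method based on a bidirectional long short-term memory tensor neural network. The method takes into account the mutual influence and semantic relations among the candidate arguments of an event and thus predicts all arguments of an event jointly. Concretely, a bidirectional long short-term memory network first produces context-based word representations and sentence-level semantic representations. A tensor layer then captures the mutual influence and semantic relations among the candidate event arguments, which enables joint prediction of all event arguments. Experiments show that the method captures these influences and relations well and achieves better results than the baseline systems.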
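
As an illustration of the dynamic multi-pooling idea in contribution 1, the following Python sketch pools a convolution feature map in segments rather than over the whole sentence. The abstract does not give the exact pooling rule. Splitting the feature map at the candidate trigger and argument positions and taking the maximum of each segment is an assumption made here, and the function name `dynamic_multi_pooling`, its parameters, and the sample values are hypothetical.

```python
def dynamic_multi_pooling(feature_map, split_positions, empty_value=0.0):
    """Max-pool each segment of a 1-D feature map separately.

    feature_map: list of floats, one convolution output per token.
    split_positions: token indices (assumed here to be the candidate
        trigger and candidate argument); each segment ends at, and
        includes, one of these positions.
    empty_value: value used when a segment contains no tokens.
    """
    bounds = sorted(split_positions)
    pooled, start = [], 0
    # The last bound closes the final segment at the end of the sentence.
    for end in bounds + [len(feature_map) - 1]:
        segment = feature_map[start:end + 1]
        pooled.append(max(segment) if segment else empty_value)
        start = max(start, end + 1)
    return pooled


# Hypothetical feature map for a six-token sentence,
# with the trigger at index 1 and the argument at index 4.
features = [0.1, 0.9, 0.3, 0.2, 0.7, 0.4]
print(dynamic_multi_pooling(features, [1, 4]))  # [0.9, 0.7, 0.4]
```

Ordinary max pooling would keep only 0.9 for this feature map. Per-segment pooling keeps one value for each part of the sentence, which is one plausible way to retain clues about several events in the same sentence.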

Other Abstract
With the development and popularization of the Internet, the network has become an essential part of everyday life. There are large amounts of unstructured text on the Internet. Faced with ever-growing Web data, we need to quickly discover knowledge from large-scale unstructured text and convert that knowledge into a form that computers can understand. Information extraction aims to solve this problem.
Event extraction is a critical part of information extraction and one of its most fundamental and difficult tasks. It aims to automatically recognize events in unstructured text and represent them as structured information, e.g., who, when, where, why, and so on. Event extraction not only helps manage information and services on the Internet but also supports text comprehension. It can lift text analysis from the language level to the content level and has the potential to help large-scale knowledge base construction, question answering, semantic search, and public opinion monitoring. Thus, event extraction has received widespread attention in academia and industry and is becoming an increasingly popular research topic. Recently, event extraction systems based on machine learning have made good progress, and systems based on the supervised paradigm occupy the dominant position. However, these supervised event extraction systems face three challenges. (1) Features: these approaches rely primarily on elaborately designed features and complicated natural language processing tools, and thus suffer from error propagation. (2) Training data: hand-labeled training data is expensive to produce, covers few event types, and is limited in size, which makes it hard for supervised methods to extract events at large scale for knowledge base population. (3) Approaches: nearly all traditional approaches extract each argument of an event separately, without considering the interaction between candidate arguments. This dissertation focuses on these challenges, and its main achievements are as follows:
1. To address error propagation in the feature extraction procedure of supervised event extraction approaches, we propose an event extraction approach based on Dynamic Multi-pooling Convolutional Neural Networks (DMCNNs). In this approach, DMCNNs automatically learn features from raw text, independently of existing Natural Language Processing (NLP) tools. In addition, we devise a dynamic multi-pooling layer to capture more valuable clues in sentences that contain more than one event. Specifically, the input of the system is raw text. First, word tokens are transformed into vectors by looking up word embeddings. Then, we directly take the word embeddings of the candidate triggers and arguments as lexical-level features. Meanwhile, sentence-level features are learned using DMCNNs. Finally, the lexical-level and sentence-level features are concatenated to form the learned feature vector. The experimental results show that our proposed method achieves the best performance among all compared methods and effectively overcomes the error propagation of traditional feature extraction procedures.
2. To solve the data labeling problem, we propose to automatically label training data for event extraction using world knowledge and linguistic knowledge. The proposed approach does not depend on human-labeled data. First, we propose an approach that identifies the key arguments of an event using world knowledge and uses them to automatically detect events and their trigger words. Then, we employ linguistic knowledge to filter noisy triggers and expand the set of triggers. Finally, we propose a distant supervision method for event extraction, which uses key arguments and triggers to automatically label training data. The experimental results show that the quality of our large-scale automatically labeled data is competitive with elaborately human-annotated data. The accuracy of the automatically labeled data is 85%. Moreover, our automatically labeled data can augment traditional human-annotated data, which could significantly improve extraction performance. To address the wrong-label problem in the training data, we regard distantly supervised event extraction as a multi-instance problem and incorporate multi-instance learning into the DMCNNs. The experimental results show that, compared with the baseline methods, our proposed method achieves better results under both held-out and manual evaluation, which means that our approach effectively addresses the wrong-label problem.
3. To make use of the structure of an event and the interaction and semantic relations between candidate arguments, we propose to extract events with a Bidirectional Long Short-Term Memory Tensor Neural Network (BLSTM-TNN). First, we exploit a context-aware word representation model based on Bidirectional Long Short-Term Memory networks (BLSTM) to capture the semantics of words from plain text. In addition, we devise a tensor layer to explore the interaction and semantic relations between candidate arguments and predict all arguments simultaneously. The experimental results show that our approach significantly outperforms the state-of-the-art methods.
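
The automatic labeling step in contribution 2 (using key arguments and triggers to label training data) can be pictured with a small Python sketch. It rests on a simplifying assumption that is not stated in the abstract: a sentence is labeled with an event type when it contains at least one known trigger word and every key argument of a known event instance. The function name `label_sentences`, the event type `"marriage"`, and the sample sentences are all hypothetical.

```python
def label_sentences(sentences, event_type, triggers, key_arguments):
    """Label sentences that mention a trigger and all key arguments.

    sentences: list of whitespace-tokenized sentence strings.
    event_type: label to assign (hypothetical type name).
    triggers: set of lowercase trigger words for this event type.
    key_arguments: argument strings of one known event instance.
    """
    labeled = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        text = " ".join(tokens)
        found = [tok for tok in tokens if tok in triggers]
        # Assumed labeling rule: a trigger plus all key arguments must appear.
        if found and all(arg.lower() in text for arg in key_arguments):
            labeled.append({
                "sentence": sentence,
                "event_type": event_type,
                "trigger": found[0],
                "arguments": list(key_arguments),
            })
    return labeled


sentences = [
    "Alice married Bob in Paris .",
    "Alice met Bob in Paris .",      # no trigger word
    "Alice married Carol .",         # key argument "Bob" is missing
]
for item in label_sentences(sentences, "marriage", {"married"}, ["Alice", "Bob"]):
    print(item["trigger"], "->", item["sentence"])
```

Only the first sentence is labeled. Sentences that mention the arguments without expressing the event would still slip through a rule like this, which is the kind of wrong-label noise the multi-instance learning step is meant to reduce.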
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/14647
Collection: Graduates_Doctoral Dissertations (毕业生_博士学位论文)
Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended Citation
GB/T 7714
陈玉博. 面向非结构化文本的事件抽取关键技术研究[D]. 北京. 中国科学院大学,2017.
Files in This Item:
File Name/Size: yubochen.pdf (4636KB)
DocType: Thesis
Access: Not yet open (暂不开放)
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.