Research on Event Extraction Methods in Low-Resource Scenarios (低资源场景下的事件抽取方法研究)
马文杰 (Ma Wenjie)
2022-05-16
Pages: 62
Degree type: Master's
Chinese abstract

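With the spread of the Internet and the rapid development of information technology, the volume of information online is growing exponentially. Information extraction is concerned with mining valuable information from massive amounts of data. Event extraction, an important part of information extraction, aims to extract event information of interest from massive unstructured text, and plays an important role in areas such as finance and public opinion monitoring. Event extraction research has advanced considerably in recent years, but most of it is based on deep learning methods, which depend on large-scale labeled data to train models. Existing event datasets, however, are small and unevenly distributed, which greatly limits the application of event extraction methods.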

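To address these problems, this thesis works to alleviate the difficulty of event extraction in low-resource scenarios from two directions: reformulating the event extraction task paradigm and augmenting event data. The work consists of the following two parts: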

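1. For the extraction of low-frequency events in real application scenarios, an event extraction method based on the reading comprehension question answering paradigm is proposed. The core of the method is to reformulate the event extraction task as reading comprehension question answering. The method first models event extraction in question answering form: it constructs reading comprehension questions that contain prior knowledge about events, retrieves the answers to those questions from the text to be processed, and finally combines the answers into the event extraction result. In addition, an entity span prediction network effectively strengthens the model's ability to extract multiple argument entities, and building prior knowledge into the questions improves the model's expressive power. Experiments show that the method significantly improves the performance of Chinese event extraction and handles low-frequency events effectively.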

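2. For the scarcity of labeled data in event extraction, event data augmentation methods based on sample transformation and on self-training are proposed. The sample transformation method expands the ACE2005 Chinese dataset by applying random entity replacement, random position swapping, and random synonym replacement to the existing data, which increases data diversity. The self-training method starts from a small amount of labeled data and uses self-training to exploit the event information in unlabeled data, which improves the model's generalization. Experiments show that both augmentation methods clearly improve model performance, further demonstrating their effectiveness. The self-training method can also use a small labeled corpus to annotate unlabeled data, which greatly improves the efficiency of data annotation.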
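As a rough illustration of the reading comprehension formulation in the first contribution, the following Python sketch builds one question per argument role and combines the answers into an event record. The schema, the question template, and `toy_answer_fn` are hypothetical stand-ins chosen for this example; the abstract does not give the thesis's actual questions or model, and a trained reading comprehension model would take the place of the keyword lookup.

```python
from typing import Callable, Dict, List

# Hypothetical schema (not from the thesis): event type -> argument roles,
# each with a short description that acts as prior knowledge in the question.
SCHEMA: Dict[str, Dict[str, str]] = {
    "Attack": {
        "Attacker": "the person or organization that carried out the attack",
        "Target": "the person or object that was attacked",
        "Place": "the location where the attack happened",
    },
}


def build_questions(event_type: str, trigger: str) -> Dict[str, str]:
    """Build one question per argument role, embedding the role description."""
    return {
        role: f"In the {event_type} event triggered by '{trigger}', "
              f"who or what is {desc}?"
        for role, desc in SCHEMA[event_type].items()
    }


def extract_event(
    text: str,
    event_type: str,
    trigger: str,
    answer_fn: Callable[[str, str], List[str]],
) -> dict:
    """Ask every role question and combine non-empty answers into one record."""
    arguments: Dict[str, List[str]] = {}
    for role, question in build_questions(event_type, trigger).items():
        answers = answer_fn(question, text)
        if answers:  # roles with no answer are left out of the event
            arguments[role] = answers
    return {"type": event_type, "trigger": trigger, "arguments": arguments}


def toy_answer_fn(question: str, text: str) -> List[str]:
    """Stand-in for a trained model: a keyword lookup, for illustration only."""
    candidates = {"carried out": ["rebels"], "was attacked": ["convoy"]}
    for cue, spans in candidates.items():
        if cue in question:
            return [span for span in spans if span in text]
    return []


event = extract_event(
    "The rebels attacked a convoy yesterday.", "Attack", "attacked", toy_answer_fn
)
print(event)
```

Because each role gets its own question, a role with few training examples can still draw on the description embedded in its question, which is one plausible reading of why this formulation helps with low-frequency events.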

English abstract

With the spread of the Internet and the rapid development of information technology, the scale of information on the Internet has grown exponentially. Information extraction focuses on mining valuable information from massive data. As an important part of information extraction, event extraction aims to extract event information of interest from massive unstructured text data, and plays an important role in fields such as finance and public opinion monitoring. In recent years, event extraction research has made great progress, but most of it is based on deep learning methods, which rely on large-scale labeled data to train models. However, existing event datasets are small and unevenly distributed, and these problems largely limit the application of event extraction methods.

To address these problems, this thesis alleviates the difficulty of event extraction in low-resource scenarios from two aspects: reformulating the event extraction task paradigm and augmenting event data. The work of this thesis consists of the following two parts:

1. For the extraction of low-frequency events in real application scenarios, an event extraction method based on the reading comprehension question answering paradigm is proposed. The core of the method is to reformulate the event extraction task as reading comprehension question answering. First, the proposed method models the event extraction task in question answering form. By constructing reading comprehension questions that contain prior knowledge about events, it retrieves the answers from the text to be processed, and finally combines the answers into the event extraction result. In addition, an entity span prediction network effectively enhances the model's ability to extract multiple argument entities, and constructing questions that include prior knowledge improves the model's expressive power. Experiments show that this method significantly improves the performance of Chinese event extraction and can effectively handle low-frequency event extraction.

2. For the scarcity of labeled data in the event extraction task, event data augmentation methods based on sample transformation and on self-training are proposed. The sample transformation method expands the ACE2005 Chinese dataset by applying random entity replacement, random position swapping, and random synonym replacement to the existing data, increasing data diversity. The self-training method starts from a small amount of labeled data and uses self-training to exploit the event information in unlabeled data, which enhances the generalization performance of the model. Experiments show that both augmentation methods clearly improve model performance, which further demonstrates their effectiveness. In addition, the self-training method can use a small labeled corpus to annotate unlabeled data, which greatly improves the efficiency of data labeling.
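The three sample transformations named in the second contribution can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the thesis's implementation: the token-list representation, the `protected` set (positions of the trigger and arguments, kept fixed so that the annotations stay aligned), the candidate entity list, and the synonym dictionary are all hypothetical choices made for this example.

```python
import random
from typing import Dict, List, Set


def replace_entity(
    tokens: List[str], entity_idx: int, same_type_entities: List[str],
    rng: random.Random,
) -> List[str]:
    """Random entity replacement: swap in another entity of the same type."""
    out = list(tokens)
    out[entity_idx] = rng.choice(same_type_entities)
    return out


def swap_tokens(
    tokens: List[str], protected: Set[int], rng: random.Random
) -> List[str]:
    """Random position swapping: exchange two tokens outside protected positions."""
    free = [i for i in range(len(tokens)) if i not in protected]
    out = list(tokens)
    if len(free) >= 2:
        i, j = rng.sample(free, 2)
        out[i], out[j] = out[j], out[i]
    return out


def replace_synonym(
    tokens: List[str], synonyms: Dict[str, List[str]], protected: Set[int],
    rng: random.Random,
) -> List[str]:
    """Random synonym replacement: replace one unprotected token that has synonyms."""
    candidates = [
        i for i, tok in enumerate(tokens)
        if i not in protected and tok in synonyms
    ]
    out = list(tokens)
    if candidates:
        i = rng.choice(candidates)
        out[i] = rng.choice(synonyms[tokens[i]])
    return out


rng = random.Random(0)  # fixed seed so the example is repeatable
tokens = ["troops", "attacked", "the", "city", "on", "Monday"]
protected = {0, 1, 3}  # assumed positions: argument, trigger, argument

entity_variant = replace_entity(tokens, 3, ["village", "harbor"], rng)
swap_variant = swap_tokens(tokens, protected, rng)
synonym_variant = replace_synonym(tokens, {"on": ["upon"]}, protected, rng)

for variant in (entity_variant, swap_variant, synonym_variant):
    print(" ".join(variant))
```

Leaving the trigger and argument positions untouched during swapping and synonym replacement means each augmented sentence can reuse the original labels unchanged; entity replacement changes the surface form of one argument while keeping its role.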

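Keywords: event extraction
Discipline: Engineering; Engineering::Control Science and Engineering
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/48697
Collection: Graduates_Master's theses
Recommended citation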
GB/T 7714
马文杰. 低资源场景下的事件抽取方法研究[D]. 中科院自动化研究所. 中科院自动化研究所,2022.
Files in this item
File name/size: 低资源场景下的事件抽取方法研究.pdf (3013KB)
Document type: Thesis
Access type: Restricted
License: CC BY-NC-SA