Research on Event Extraction Methods for Low-Resource Scenarios (面向低资源情境的事件抽取方法研究)

	Research on Event Extraction Methods for Low-Resource Scenarios
	刘健
	2020-05
Pages	120
Degree type	Doctoral
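Chinese abstract (translated)	With the rapid development of information technology, the volume of information on the Internet is growing explosively. Information extraction is concerned with mining useful information from text data. Event extraction is an important problem in information extraction that aims to extract event-related information from text. Event extraction research has made great progress in recent years. However, most current event extraction methods adopt the supervised learning paradigm and depend on large-scale manually annotated data and high-quality natural language processing tools. In real application scenarios, manually annotated data is scarce and natural language processing tools are imperfect, which greatly hinders the application of existing event extraction methods. To address these problems, this study focuses on event extraction for low-resource scenarios and aims to reduce the dependence of event extraction methods on manually annotated data and natural language processing tools. The study proceeds from three angles: cross-lingual data augmentation, feature imitation learning, and cross-task transfer learning. Specifically:
1. To address the scarcity of manually annotated data, a cross-lingual event detection method based on a small amount of parallel resources is proposed. Compared with previous cross-lingual event detection methods, it needs only a very small amount of parallel resources (5,000 word pairs) to achieve cross-lingual data augmentation. The method introduces a context-sensitive lexical translation model that uses word-embedding mapping to achieve dynamic multilingual word mapping. It also introduces a word-order-independent cross-lingual event detection model, which extracts order-independent features for training through a self-attention mechanism and graph convolutional networks, resolving the mismatch in word order between the source and target languages. Experiments on standard event detection datasets show that the method greatly reduces the dependence of target-language event detection on manually annotated data, is independent of the cross-lingual direction, and significantly improves event detection performance.
2. To address the lack of natural language processing tools for some languages, an end-to-end event detection method based on feature imitation learning is proposed. During learning, the feature imitation mechanism implicitly "distills" the feature-vector construction of traditional methods into the feature encoding stage, removing the dependence of preprocessing on external natural language processing tools. At the model level, an event detection model based on an adversarial imitation mechanism is proposed. The model first builds a teacher encoder and a student encoder, which learn a "standard feature vector" that integrates annotation information and a "non-standard feature vector" that does not, and then builds an adversarial discriminator to distinguish the two vectors. Through adversarial training, the student encoder "imitates" the teacher encoder under the guidance of the discriminator in order to detect events. Experiments on standard datasets show that, compared with previous methods, this method does not depend on any external natural language processing tool during event detection and is better suited to event detection in low-resource scenarios.
3. To address the extraction of low-frequency events, an event extraction method based on cross-task transfer learning is proposed. The method views event extraction from a new perspective and explicitly models it as a reading comprehension question answering task. This formulation lets the model also train on annotated data from machine reading comprehension tasks, improving its ability to adapt to low-frequency event types. At the model level, an event extraction framework based on an unsupervised question generation strategy and a machine reading question answering model is designed. The framework first uses the unsupervised question generation strategy to generate query questions for the event triggers and event arguments targeted by the event extraction task, then uses the machine reading question answering model to retrieve an answer for each query question, and finally combines all answers into the event extraction result. Experimental results show that, compared with previous methods, this method significantly improves event extraction performance. It can also effectively handle event types with only a limited number of training samples (few-shot) or no training samples at all (zero-shot).
The work, methods, and conclusions of this thesis offer important guidance for further exploring and building more efficient event extraction systems.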
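As an illustration of the word-embedding mapping step in the first contribution, the following is a minimal sketch and not the thesis's implementation. It assumes the mapping is learned by orthogonal Procrustes from the embeddings of seed-dictionary pairs, which is a common technique that the abstract does not confirm. The function names `learn_projection` and `nearest_target_word` are hypothetical, and the sketch covers only a static projection, not the context-sensitive word selection the abstract describes.

```python
import numpy as np


def learn_projection(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """Solve the orthogonal Procrustes problem (an assumed choice of mapping).

    Row i of src_vecs and row i of tgt_vecs hold the embeddings of the i-th
    seed-dictionary pair. Returns the orthogonal matrix W that minimises
    ||src_vecs @ W - tgt_vecs||_F.
    """
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt


def nearest_target_word(src_vec, W, tgt_matrix, tgt_words):
    """Project one source embedding and return the closest target word by cosine."""
    projected = src_vec @ W
    norms = np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(projected)
    sims = (tgt_matrix @ projected) / np.maximum(norms, 1e-12)
    return tgt_words[int(np.argmax(sims))]
```

With a seed dictionary of a few thousand pairs, `learn_projection` would be called once on the paired embedding matrices, and `nearest_target_word` would then map each source-language word into the target vocabulary to produce augmented training data.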
English abstract	With the rapid development of information technology, the amount of data on the Internet has increased exponentially. It has become urgent and necessary to study how to extract useful information from this mass of data. As an important task of information extraction (IE), event extraction (EE) aims specifically to extract event information. Despite many efforts on EE, most existing EE methods adopt the supervised learning paradigm and rely on large-scale human annotations and high-quality natural language processing (NLP) toolkits. In real-world scenarios, labelled data is usually scarce and many languages lack NLP toolkits, which greatly hinders the applicability of existing EE methods. In this thesis, we focus on low-resource EE, aiming to reduce the dependence of EE methods on labelled data and NLP toolkits. We address the problems faced by low-resource EE from three aspects: cross-lingual data augmentation, feature imitation learning, and cross-task transfer learning, as follows:
1. To reduce the dependence on manually labelled data, we propose a cross-lingual approach for event detection (ED), a core step of EE. Compared with previous cross-lingual methods, our approach relies on only a small seed dictionary to achieve cross-lingual data augmentation. Specifically, we design a context-sensitive translation model that uses word embedding projection to achieve dynamic word mapping. Moreover, to address the word order difference between the source and target languages, we propose an order-invariant cross-lingual ED model, which extracts order-invariant features for training via self-attention mechanisms and Graph Convolutional Networks (GCNs). The empirical results on standard ED datasets demonstrate that our approach greatly reduces the dependency on manually labelled data in the target language. Moreover, it works across different cross-lingual directions and achieves new state-of-the-art performance.
2. To reduce the dependence on NLP toolkits, we propose an end-to-end ED approach based on feature imitation learning. Our approach adopts an adversarial learning mechanism that "distills" the feature extraction procedure into the feature encoding stage, removing the need for external NLP toolkits in pre-processing. Specifically, a teacher feature encoder and a student feature encoder are designed to learn the "standard feature vector" and the "non-standard feature vector" respectively, depending on whether the annotation tags are used. An adversarial discriminator is then built to distinguish between these two feature vectors and thus tell the teacher and student encoders apart. In the training stage, the student encoder learns to "imitate" the teacher encoder, under the guidance of the adversarial discriminator, in order to detect events. The experimental results show that our approach greatly advances the state of the art. Moreover, as our approach does not require any NLP toolkit in the testing stage, it is well suited to low-resource ED tasks.
3. To address the problem of few-shot/zero-shot EE, we propose a new EE approach based on cross-task transfer learning. Our approach takes a new perspective on EE and frames it as a machine reading comprehension (MRC) problem. This enables us to train an EE model jointly on the training data of the EE task and the MRC task, which can largely improve the performance of our model on event types that have very limited or no event training data. Our approach integrates an unsupervised question generation module and a machine-reading-based question answering (QA) module. To perform EE, it first generates query questions for event triggers and arguments via the unsupervised question generation module, and then retrieves an answer for each query question using the machine-reading-based QA module. Lastly, our model combines all the answers as the final EE results. The experimental results demonstrate that our approach greatly advances the state of the art and shows promising results in tackling the few-shot/zero-shot EE problem.
The methods and conclusions in this thesis shed light on building more effective and robust EE systems that fit real-world scenarios.
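The three-step pipeline described in the third contribution (generate questions, answer each one, combine the answers) can be sketched as follows. This is a hedged illustration under stated assumptions, not the thesis's framework: the question templates, the `ROLE_WH` table, and the function names are all hypothetical, and the QA model is represented by any callable that takes a question and a context and returns an answer string or `None`.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical mapping from argument roles to question words (an assumption).
ROLE_WH = {"Attacker": "Who", "Victim": "Who", "Place": "Where", "Time": "When"}

# A QA model is modelled as: (question, context) -> answer span or None.
QAModel = Callable[[str, str], Optional[str]]


def trigger_question(event_type: str) -> str:
    """Build the query question for the event trigger (hypothetical template)."""
    return f"Which word signals the {event_type} event?"


def argument_question(role: str, trigger: str) -> str:
    """Build the query question for one argument role (hypothetical template)."""
    wh = ROLE_WH.get(role, "What")
    return f'{wh} is the {role.lower()} of the event signalled by "{trigger}"?'


def extract_event(
    sentence: str, event_type: str, roles: List[str], qa_model: QAModel
) -> Optional[Dict]:
    """Ask for the trigger first, then for each role, and combine the answers."""
    trigger = qa_model(trigger_question(event_type), sentence)
    if trigger is None:
        return None  # the QA model found no event of this type
    arguments = {}
    for role in roles:
        answer = qa_model(argument_question(role, trigger), sentence)
        if answer is not None:
            arguments[role] = answer
    return {"event_type": event_type, "trigger": trigger, "arguments": arguments}
```

Because every event type and role is expressed only through its question text, a QA model trained on MRC data can in principle be queried about event types it has seen few or no examples of, which is the motivation the abstract gives for this formulation.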
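Keywords	natural language processing, information extraction, event extraction, low-resource event extraction
Language	Chinese
Seven major directions: sub-direction classification	Natural language processing
Document type	Thesis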
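Item identifier	http://ir.ia.ac.cn/handle/173211/39193
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems_Natural Language Processing
Recommended citation (GB/T 7714)	刘健. 面向低资源情境的事件抽取方法研究[D]. 中国科学院自动化研究所. 中国科学院大学,2020.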

Files in this item
File name/size	Document type	Version type	Access type	License
Thesis-signed.pdf (4435KB)	Thesis		Open access	CC BY-NC-SA