CASIA OpenIR > Graduates > Master's Theses
基于对话的人物关系抽取技术研究 (Research on Person Relation Extraction Techniques Based on Dialogue)
张传强
2016-05
Degree type: Master of Engineering
Abstract
The rapid development of networks and information technology has made it much easier for people to exchange messages, for example through online chat and the short message service (SMS). Analyzing the relationships between people in social media is therefore important for understanding their social behavior.
Unlike previous entity relation extraction tasks, in relation extraction from dialogue the two entities whose relation is to be extracted no longer appear in the same sentence; they are the two parties of the dialogue. We therefore need to consider the semantic information of the whole dialogue, and the structural features of the dialogue itself must also be taken into account. In addition, dialogue texts in social media are poorly normalized: they often contain homophones, informal synonym substitutions, typos, and emoticons, and frequently lack punctuation, all of which make the task harder. Since relatively little work has addressed relation extraction from dialogue, this thesis investigates the problem along the following four lines:
1) We verify the effectiveness of a traditional text classification method based on the support vector machine (SVM) on the dialogue relation extraction task, which serves as the baseline of this thesis. Treating each dialogue as a text classification sample, we perform word segmentation and part-of-speech (POS) tagging, filter candidate feature words by POS, compute the information gain (IG) of each candidate, and select the N words with the highest IG to form the feature vector. In addition, we build a keyword lexicon and a rule set; for each sample, the number of matched keywords and the number of satisfied rules are mapped to additional features appended to the feature vector. Finally, an SVM classifier performs the relation classification. We build our dataset from SMS messages collected on the web and from labmates. Experimental results show that the keyword lexicon and rule set yield a substantial improvement.
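The IG-based feature selection step can be sketched as follows. This is a minimal illustration only: the toy token lists, labels, and `top_n` value are hypothetical, and the thesis's actual segmentation, POS filtering, and SVM stage are omitted.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    """IG(Y; word) = H(Y) - H(Y | word present/absent in the document)."""
    with_w = [y for d, y in zip(docs, labels) if word in d]
    without_w = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    cond = (len(with_w) / n) * entropy(with_w) \
         + (len(without_w) / n) * entropy(without_w)
    return entropy(labels) - cond

def select_features(docs, labels, top_n):
    """Pick the top_n candidate words with the highest information gain."""
    vocab = {w for d in docs for w in d}
    return sorted(vocab,
                  key=lambda w: information_gain(docs, labels, w),
                  reverse=True)[:top_n]

# Toy SMS-style corpus: token lists after segmentation and POS filtering
# (hypothetical data, two relation classes).
docs = [["honey", "dinner", "tonight"],
        ["honey", "miss", "you"],
        ["meeting", "report", "deadline"],
        ["boss", "meeting", "tonight"]]
labels = ["couple", "couple", "colleague", "colleague"]

# Prints the two highest-IG words, "honey" and "meeting" (a tie, order may vary).
print(select_features(docs, labels, 2))
```

In the pipeline described above, the selected words would form the base feature vector, to which the keyword-lexicon and rule-match counts are appended before SVM classification.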
2) We propose a neural network model based on a hierarchical long short-term memory (LSTM) architecture. To capture both the semantic information and the structural characteristics of a dialogue, the model feeds the dialogue into a hierarchical LSTM and generates a semantic representation of it; since the input of an LSTM is itself a sequence, the temporal order of the dialogue is also preserved. On top of this representation, a classifier then assigns a relation label to the two person entities. Experimental results show that the hierarchical LSTM method obtains results very close to the traditional method while greatly reducing manual effort.
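The hierarchical encoding can be sketched roughly as below: a word-level LSTM encodes each utterance into a vector, and an utterance-level LSTM encodes the sequence of utterance vectors into a dialogue representation. This is a forward pass with untrained random parameters and hypothetical dimensions, not the thesis's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(in_dim, h_dim):
    """Random LSTM parameters; the four gates (i, f, o, g) are stacked in rows."""
    return {"W": rng.normal(0, 0.1, (4 * h_dim, in_dim)),
            "U": rng.normal(0, 0.1, (4 * h_dim, h_dim)),
            "b": np.zeros(4 * h_dim)}

def lstm_encode(seq, p):
    """Run an LSTM over a sequence of vectors; return the final hidden state."""
    h_dim = p["b"].shape[0] // 4
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    for x in seq:
        z = p["W"] @ x + p["U"] @ h + p["b"]
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell state update
        h = sigmoid(o) * np.tanh(c)                    # hidden state
    return h

def encode_dialogue(dialogue, word_lstm, utt_lstm):
    """Hierarchical encoding: word-level LSTM per utterance, then an
    utterance-level LSTM over the resulting utterance vectors."""
    utt_vecs = [lstm_encode(utt, word_lstm) for utt in dialogue]
    return lstm_encode(utt_vecs, utt_lstm)

EMB, H1, H2, N_REL = 8, 16, 16, 4      # hypothetical dimensions / label count
word_lstm = init_lstm(EMB, H1)
utt_lstm = init_lstm(H1, H2)
W_cls = rng.normal(0, 0.1, (N_REL, H2))

# A dialogue is a list of utterances; each utterance is a list of word embeddings.
dialogue = [rng.normal(size=(5, EMB)), rng.normal(size=(3, EMB))]
rep = encode_dialogue(dialogue, word_lstm, utt_lstm)
logits = W_cls @ rep
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over relation labels
```

The two-level structure is what lets the model treat an utterance, rather than a word, as the basic unit of the dialogue sequence.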
3) We integrate an attention mechanism into the LSTM-based classification model of 2). A neural network learns a low-dimensional vector representation of the input text, but for the classification task we still need to emphasize certain local features to improve accuracy. The attention mechanism achieves exactly this by increasing the weights of features that are useful for classification while decreasing the weights of the others. Moreover, by focusing on the parts useful for relation classification, attention effectively simplifies the dependency path of the input text. Experimental results show that introducing the attention mechanism not only improves classification accuracy but also speeds up the convergence of the experiments.
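The attention weighting can be illustrated with additive attention, one common formulation: each hidden state is scored with a small feed-forward network, the scores are normalized with a softmax, and the states are pooled by their weights. The scoring function, dimensions, and parameters here are assumptions, not necessarily the thesis's exact design.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention_pool(H, W_a, v):
    """Additive attention: score each hidden state h_t with v . tanh(W_a h_t),
    softmax the scores into weights, and return the weighted sum of states."""
    scores = np.tanh(H @ W_a.T) @ v           # (T,) one score per time step
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights = weights / weights.sum()         # weights sum to 1
    return weights @ H, weights               # context vector and the weights

T, H_DIM, A_DIM = 6, 16, 8                    # hypothetical sizes
H = rng.normal(size=(T, H_DIM))               # e.g. utterance-level LSTM states
W_a = rng.normal(0, 0.1, (A_DIM, H_DIM))
v = rng.normal(0, 0.1, A_DIM)

context, weights = attention_pool(H, W_a, v)
```

Replacing the final LSTM state with this weighted `context` vector is what lets useful utterances contribute more to the relation decision than the rest.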
4) We explore two ways of combining an autoencoder with LSTM-based relation classification. The first jointly trains the classification task with a sequence autoencoder, in the hope that the intermediate representation produced by the autoencoder carries more comprehensive information. The second first trains the LSTM on an autoencoding task and then uses the trained parameters to initialize the LSTM of the classification task, which effectively brings in external data in order to improve classification. Experimental results show that using the autoencoder as pre-training further accelerates the convergence of the experiments.
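The second scheme, pretraining the LSTM as an autoencoder and transferring its weights, amounts to the following initialization pattern. The reconstruction training loop is stubbed out, and all names and dimensions are hypothetical; this only shows the weight transfer, not a working autoencoder.

```python
import copy
import numpy as np

rng = np.random.default_rng(2)

def init_lstm(in_dim, h_dim):
    """Random LSTM parameters (gates stacked along the first axis)."""
    return {"W": rng.normal(0, 0.1, (4 * h_dim, in_dim)),
            "U": rng.normal(0, 0.1, (4 * h_dim, h_dim)),
            "b": np.zeros(4 * h_dim)}

def pretrain_autoencoder(encoder, corpus):
    """Stub for the pretraining phase: train an encoder-decoder pair to
    reconstruct each sequence in `corpus`, then return the tuned encoder."""
    # ... reconstruction training loop omitted in this sketch ...
    return encoder

# 1) Pretraining phase on unlabeled external data (autoencoding task).
encoder = init_lstm(in_dim=8, h_dim=16)
encoder = pretrain_autoencoder(encoder, corpus=[])

# 2) Classification phase: initialize the classifier's LSTM with the
#    pretrained encoder weights instead of a fresh random draw.
clf_lstm = copy.deepcopy(encoder)
W_cls = rng.normal(0, 0.1, (4, 16))   # relation classifier head, trained from scratch

assert np.allclose(clf_lstm["W"], encoder["W"])   # weights were transferred
```

Only the recurrent weights are copied; the classifier head is still trained from random initialization on the labeled relation data.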
Keywords: semantic representation; long short-term memory network; autoencoder; attention mechanism
Language: Chinese
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/11720
Collection: Graduates_Master's Theses
Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
张传强. 基于对话的人物关系抽取技术研究[D]. 北京: 中国科学院研究生院, 2016.
Files in this item:
基于对话的人物关系抽取技术研究.pdf (2440 KB), thesis, restricted access, license CC BY-NC-SA