Research on Key Technologies of Machine Reading Comprehension (机器阅读理解关键技术研究)

	Research on Key Technologies of Machine Reading Comprehension (机器阅读理解关键技术研究)
	王炳宁
	2018-05-24
Degree type	Doctor of Engineering
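Abstract	In recent years, as technology has advanced, research in natural language processing has gradually shifted toward natural language understanding. Machine reading comprehension emerged against this background. It aims to enable machines to understand the meaning of text as humans do: not only to "perceive" text but also to "cognize" it. With the rapid development of deep learning, the dominant approach to machine reading comprehension today is representation learning based on deep neural networks. This approach uses deep neural networks to model documents and questions, automatically learning lexical, syntactic, and semantic information from text. However, such data-driven techniques depend on massive training resources, whereas existing labeled reading comprehension resources are all very small and a large amount of valuable data is unlabeled. Building on representation learning, this dissertation studies the problem from the perspective of mining and exploiting data resources, and explores the application of machine reading comprehension to open-domain question answering. The main results and contributions are as follows:
1. A machine reading comprehension method that relies on external resources. To address the problem that many existing machine reading comprehension datasets are too small, this dissertation proposes a method that exploits external resources. The method splits machine reading comprehension into two sub-parts, answer selection and answer generation, and then uses large external answer selection and answer generation resources for auxiliary training. To handle the domain mismatch that may arise when external resources are introduced, it uses knowledge distillation for transfer learning. Finally, it uses policy gradient to combine the two parts to generate answers. Experimental results show that the proposed method successfully applies deep learning to the small-scale machine reading comprehension dataset MCTest and outperforms traditional feature-based methods.
2. An unsupervised machine reading comprehension method based on generative adversarial networks. To address the difficulty existing methods have in using unlabeled data, this dissertation proposes an unsupervised method based on generative adversarial networks. The method first builds a generator that produces possible conclusions from the background document of a story, and then uses a discriminator to judge whether a conclusion can be inferred from the background document. The generator and discriminator are trained adversarially in alternation, and eventually learn contextual inference information from an unlabeled story corpus. On the commonsense machine reading comprehension task SCT, the method outperforms earlier methods that rely on linguistic features.
3. An unsupervised machine reading comprehension method based on an encoder-decoder. To address the difficulty existing methods have in using unlabeled data, this dissertation proposes an unsupervised encoder-decoder method for modeling large collections of story documents. An encoder maps the background document into a latent space, and a decoder then decodes the sentences that can be inferred from the background document; the whole model is trained with a loss readjustment strategy. At test time, a likelihood-based mutual information method is used to judge whether a target sentence can be inferred from the given background document. This unsupervised generative model outperforms previous methods on the commonsense machine reading comprehension task SCT.
4. An open-domain question answering method based on a gated document selection mechanism. Open-domain question answering is a very important application of machine reading comprehension. Earlier applications of machine reading comprehension to open-domain question answering suffer from low-quality weakly supervised data and biased answer probabilities. To address these problems, this dissertation proposes a bootstrapping-based method for generating weakly supervised data, which acquires training data dynamically. A document selection model based on convolutional neural networks judges the relevance between a document and a question, and this document selection process is embedded into the machine reading comprehension process to generate answers to open-domain questions. Experiments show that the weakly supervised data obtained by the proposed method is of significantly higher quality than data obtained by earlier heuristic methods, and that the method outperforms previous models on three open-domain question answering tasks.
English abstract	In recent years, as technology has developed, research in natural language processing has increasingly focused on language understanding. Machine comprehension, which aims to teach machines to read and comprehend, has thrived in this setting. In this task, the machine not only "perceives" the text but also "understands" it. With the rapid development of deep learning, most models for machine comprehension are based on representation learning. Such methods build on deep neural network architectures to represent documents and questions.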
However, deep learning is a data-driven approach that relies heavily on abundant labeled data, yet high-quality labeled data is currently limited and a very large amount of valuable data is unlabeled. In addition, previous methods struggle to apply machine comprehension to other applications, such as open-domain question answering. This dissertation builds on representation learning methods and studies the key problems of current machine comprehension systems. The main contributions are summarized as follows:
1. We propose a method that applies rich external knowledge to machine comprehension. One of the main deficiencies of many machine comprehension resources, such as MCTest or CLEF, is that the amount of labeled data is very small, which hinders the application of deep learning methods. To solve this problem, this thesis divides machine comprehension into two parts, namely answer selection and answer generation, and employs abundant external knowledge to solve these two problems separately. We develop a transfer learning method based on knowledge distillation to introduce the external knowledge. Finally, we use a policy gradient method to train the two parts jointly. Experimental results demonstrate that our approach can apply deep learning to small machine comprehension datasets, achieving a significant improvement over traditional feature engineering methods.
2. We propose an unsupervised machine comprehension method based on generative adversarial networks. Currently, most machine comprehension methods are based on deep learning; however, in many machine comprehension applications, a key problem is that no labeled data is available for training such models. To solve this problem, we propose an inference method based on generative adversarial networks.
In this method, a generator is optimized to generate the target sentence from its context document, and a discriminator is trained simultaneously to distinguish whether a sentence is generated or a real sample. Through the adversarial training of the generator and discriminator, our model learns some inference ability from unlabeled commonsense stories and achieves state-of-the-art results on commonsense machine comprehension.
3. We propose an unsupervised machine comprehension method based on an encoder-decoder.
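The abstract states that the encoder-decoder model is tested with a likelihood-based mutual information criterion, which judges whether a target sentence can be inferred from the background document. The record does not give the formula, so the following is only a minimal sketch under an assumption: that the score takes the pointwise mutual information form log p(s | d) - log p(s), where d is the document and s the candidate sentence. The function names and the per-token log-probabilities are hypothetical placeholders for decoder outputs, not values from the thesis.

```python
from typing import Dict, List


def sequence_logprob(token_logprobs: List[float]) -> float:
    """Sum per-token log-probabilities into a sentence log-likelihood."""
    return sum(token_logprobs)


def pmi_score(cond_logprobs: List[float], prior_logprobs: List[float]) -> float:
    """Assumed score: log p(s | d) - log p(s).

    cond_logprobs:  token log-probs of s when decoding from the document d.
    prior_logprobs: token log-probs of s with no document (a plain language model).
    """
    return sequence_logprob(cond_logprobs) - sequence_logprob(prior_logprobs)


def choose_ending(candidates: Dict[str, Dict[str, List[float]]]) -> str:
    """Return the candidate sentence whose score with the document is highest."""
    return max(
        candidates,
        key=lambda name: pmi_score(candidates[name]["cond"], candidates[name]["prior"]),
    )


# Hypothetical numbers: ending_b is a generic sentence that is likely with or
# without the document; ending_a becomes much more likely given the document.
candidates = {
    "ending_a": {"cond": [-1.0, -0.5, -0.7], "prior": [-2.0, -1.5, -1.8]},
    "ending_b": {"cond": [-0.9, -0.4, -0.6], "prior": [-1.0, -0.5, -0.6]},
}
best = choose_ending(candidates)
print(best)
```

In this toy input, ending_b has the higher conditional likelihood, but subtracting the prior likelihood favors ending_a, the sentence the document actually makes more probable. That correction for generic, high-frequency sentences is the usual motivation for a mutual-information score over raw likelihood.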
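Keywords	machine reading comprehension; question answering systems; deep learning; unsupervised learning
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/20986
Collection	Graduates_Doctoral dissertations
Author affiliation	Institute of Automation, Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	王炳宁. 机器阅读理解关键技术研究[D]. 北京. 中国科学院研究生院,2018.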

Files in this item
File name/size	Document type	Version type	Access type	License
机器阅读理解关键技术研究.pdf (7214 KB)	Dissertation		Restricted access	CC BY-NC-SA