Research on Key Technologies of Sentence Matching for Cross-Domain Scenarios (面向跨领域场景的句子匹配关键技术研究)
白桂荣
2022-05-25
Pages: 104
Degree type: Doctoral
Chinese Abstract

Sentence matching is an important research task in artificial intelligence and natural language processing and plays an important role in many applications such as information retrieval, question answering, and automatic summarization. It aims to let computers decide whether two pieces of text hold a particular semantic relationship, such as an entailment, paraphrase, or question-answer relation. With the rapid development of the Internet industry and the explosive growth of online data, people urgently need more fine-grained semantic matching technology to support the acquisition of accurate information. Therefore, sentence matching has received continuous attention from both academia and industry in recent years.

With the rapid development of deep learning, data-driven neural network methods have achieved good results on sentence matching and have become the common solution for the task. In real-world scenarios, a trained neural matching model often needs to be applied to a new domain, but the data distribution of the new domain usually differs greatly from the distribution the model has learned, which causes the domain shift problem. If the previously trained model is directly tested in the new domain, the matching performance inevitably drops sharply. Meanwhile, the new domain usually lacks annotated data, so annotation-dependent neural methods cannot be trained on it directly. How to alleviate the domain shift problem of sentence matching in cross-domain scenarios that lack annotated data is therefore an important challenge. According to the different perspectives on alleviating domain shift, this thesis studies key technologies of cross-domain sentence matching along three directions: annotated data acquisition, unsupervised training, and the introduction of external resources. The main contributions and innovations of this thesis are as follows:

1. A pre-trained language model based annotation data sampling strategy for sentence matching

To address the high cost of annotating sentence matching data in a new domain, this thesis proposes a new active learning method to guide data sampling and annotation. To reduce annotation cost, active learning aims to provide a sampling strategy that, under a limited annotation budget, gives priority to the examples that improve the model most. Traditional active learning methods usually take the model's uncertainty on an example as the single sampling criterion for measuring annotation priority. Such single-criterion methods ignore other potential sampling criteria and may perform poorly on data with particular biases. This thesis therefore proposes a pre-trained language model based active learning method, which mines linguistic regularities from pre-trained language models to enrich the sampling criteria and evaluates annotation priority from multiple aspects, including uncertainty, noise, coverage, and diversity. In addition, tailored to the characteristics of sentence matching, the method introduces edit distance to capture the difference between the two sentences of a pair and to enhance example representations. Experiments on sentence matching datasets from several domains show that the method outperforms baseline methods and can effectively reduce the cost of annotating data for new domains.

2. A self-supervised learning based domain adaptation method for sentence matching

To address the lack of target-domain annotated data in cross-domain sentence matching, this thesis proposes a self-supervised learning based domain adaptation method. The objectives of traditional unsupervised domain adaptation methods are usually difficult to optimize. This thesis therefore designs self-supervised tasks from three aspects, namely language modeling, domain characteristics, and the properties of the sentence matching task, and uses easier-to-optimize self-supervised objectives to narrow the distribution gap between the source and target domains. In addition, to handle the varying training difficulty of different data during optimization, the thesis proposes a curriculum learning based training framework, which first learns a basic model from the easy training data and then learns a better model from the difficult training data. Experiments show that the method alleviates the lack of annotated data in unseen new domains and improves the performance of sentence matching models in cross-domain scenarios.

3. A domain adaptation method for sentence matching using external knowledge

To address the lack of necessary supervision and the limited transfer ability of traditional unsupervised domain adaptation methods, this thesis proposes a knowledge-guided cross-domain sentence matching method. External knowledge contains the differences and connections between domains, and introducing it provides additional information that helps the model bridge the gap between domains. This thesis therefore studies how to exploit external knowledge to facilitate unsupervised domain adaptation. The proposed method incorporates external knowledge into the training process of unsupervised domain adaptation through graph neural networks, helping the model handle cross-domain differences and connections with the aid of external knowledge. In addition, to reduce training difficulty, the method builds on the original domain-adversarial network and separates each example representation into a sample-related base representation and a domain-related bias representation, improving the computation of domain-invariant representations. In multiple cross-domain sentence matching experiments, the method clearly outperforms baseline methods, showing that external knowledge can improve the performance of sentence matching models in cross-domain scenarios.

English Abstract

Sentence matching is an important task in artificial intelligence and natural language processing. It plays an important role in many applications such as information retrieval, question answering, and automatic summarization, and aims to decide whether two pieces of text hold a specific semantic relationship, such as an entailment, paraphrase, or question-answer relation. With the development of the Internet industry and the explosive growth of online data, people urgently need more sophisticated semantic matching technology to support the acquisition of accurate information. Therefore, sentence matching has recently attracted continuous attention from both academia and industry.

With the development of deep learning, data-driven neural networks have achieved strong performance on sentence matching and have become the prevailing solution for the task. In real scenarios, a trained sentence matching model is often applied to new domains, but the data distribution of a new domain typically differs greatly from the distribution the model has learned, which causes the domain shift problem. If a neural model trained on the source domain is directly applied to the new target domain for testing, the performance of sentence matching inevitably drops sharply. Meanwhile, the new domain typically lacks annotated data, so data-driven neural networks cannot be trained directly in the new domain. Alleviating the domain shift problem of sentence matching in cross-domain scenarios that lack annotated data is therefore an important challenge. According to the different perspectives on alleviating the domain shift problem, this thesis studies the key technologies of cross-domain sentence matching through annotated data acquisition, unsupervised training, and the introduction of external resources. The main research findings and innovations of this thesis are as follows:

1. Pre-trained language model based annotated data sampling for sentence matching

Aiming at the high cost of annotating sentence matching data for new domains, this thesis proposes a new active learning method to guide the sampling and annotation of data. To reduce annotation cost, active learning aims to provide a data sampling strategy that, under a limited annotation budget, gives priority to the examples that bring the largest improvement to the model. Traditional active learning methods typically use the model's uncertainty on an example to measure its annotation priority. Such single-criterion methods ignore other potential sampling criteria and may perform poorly on biased data. This thesis proposes a pre-trained language model based active learning method, which mines linguistic characteristics from pre-trained language models, enriches the criteria for measuring examples in active learning, and evaluates the annotation priority of unlabeled examples in terms of uncertainty, noise, coverage, and diversity. In addition, tailored to the characteristics of sentence matching, this thesis uses edit distance to capture the difference between the two sentences of a pair and to enhance example representations. Experiments on multiple sentence matching datasets from different domains show that the proposed method outperforms baseline methods and effectively reduces the cost of data annotation for new domains.
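
As a rough, hypothetical illustration of such a multi-criteria sampling strategy, the Python sketch below ranks unlabeled sentence pairs by a weighted combination of model uncertainty and an edit-distance-based criterion and returns the ones to annotate first. All function names and weights are assumptions for illustration, and the noise, coverage, and diversity criteria mined from the pre-trained language model are omitted here.

    # Hypothetical sketch: combine model uncertainty with an edit-distance-based
    # criterion to rank unlabeled sentence pairs for annotation.
    import torch
    from difflib import SequenceMatcher

    def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
        """Uncertainty of the matching classifier, one score per pair."""
        probs = torch.softmax(logits, dim=-1)
        return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

    def edit_dissimilarity(a: str, b: str) -> float:
        """Cheap proxy for the edit distance between the two sentences."""
        return 1.0 - SequenceMatcher(None, a, b).ratio()

    def select_for_annotation(pairs, logits, budget, w_unc=1.0, w_edit=0.5):
        """Return indices of the pairs to send to annotators first.

        pairs  : list of (sentence_a, sentence_b) strings
        logits : tensor of shape [num_pairs, num_labels] from the current model
        budget : number of pairs the annotation budget allows
        """
        unc = predictive_entropy(logits)
        edit = torch.tensor([edit_dissimilarity(a, b) for a, b in pairs],
                            device=logits.device)
        score = w_unc * unc + w_edit * edit          # combined sampling priority
        return torch.topk(score, k=min(budget, len(pairs))).indices.tolist()

In such a loop, a caller would score the current model's logits over the unlabeled pool, annotate the returned pairs, retrain, and repeat until the budget is exhausted.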

2. Self-supervised learning based domain adaptation for sentence matching

Aiming at the lack of target-domain annotated data in cross-domain sentence matching, this thesis proposes a self-supervised learning based unsupervised domain adaptation method. The objectives of traditional unsupervised domain adaptation methods are typically difficult to optimize. Therefore, this thesis designs self-supervised tasks from three aspects, namely language modeling, domain characteristics, and the properties of sentence matching, and uses self-supervised objectives that are much easier to optimize to bridge the gap between the source and target domains. In addition, to handle the varying training difficulty of different data during unsupervised domain adaptation, this thesis also proposes a curriculum learning based training framework. The framework first learns a basic model from the easy training data and then learns a better model from the difficult training data. Experiments show that the self-supervised learning based method effectively alleviates the lack of annotated data in unseen new domains and improves the performance of cross-domain sentence matching.
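
A minimal, hypothetical sketch of the easy-then-hard curriculum for unsupervised domain adaptation is shown below; the `supervised_loss`/`self_supervised_loss` interface on the model and the caller-supplied `difficulty_fn` are assumptions for illustration rather than the thesis's actual implementation.

    # Hypothetical sketch: curriculum-style unsupervised domain adaptation.
    # `model.supervised_loss(src_batch)` and `model.self_supervised_loss(tgt_batch)`
    # are an assumed interface; `difficulty_fn` scores a target batch (for example,
    # by its current self-supervised loss) and defines the easy-to-hard ordering.
    def curriculum_adapt(model, optimizer, source_batches, target_batches,
                         difficulty_fn, epochs_per_stage=1):
        ordered = sorted(target_batches, key=difficulty_fn)   # easy -> hard
        half = len(ordered) // 2
        for stage in (ordered[:half], ordered[half:]):        # stage 1: easy, stage 2: hard
            for _ in range(epochs_per_stage):
                for src_batch, tgt_batch in zip(source_batches, stage):
                    # Supervised matching loss on the labeled source domain plus
                    # a self-supervised loss on the unlabeled target domain.
                    loss = (model.supervised_loss(src_batch)
                            + model.self_supervised_loss(tgt_batch))
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()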

3. External knowledge based domain adaptation for sentence matching

Aiming at the lack of necessary supervision and the limited transfer ability of traditional unsupervised domain adaptation methods, this thesis proposes a knowledge-guided method for cross-domain sentence matching. External knowledge encodes the differences and connections between domains, so introducing it provides extra information that helps the model bridge the gap between domains. Therefore, this thesis studies how to exploit external knowledge for unsupervised domain adaptation. The proposed method incorporates external knowledge into the training process of unsupervised domain adaptation through graph neural networks, helping the model handle cross-domain differences and connections with the assistance of external knowledge. In addition, to reduce the difficulty of training, the method builds on the original domain-adversarial network and separates each sample representation into a sample-related base representation and a domain-related bias representation, which improves the computation of domain-invariant representations. In multiple cross-domain sentence matching experiments, the proposed method significantly outperforms baseline methods, demonstrating that external knowledge can improve the performance of sentence matching models in cross-domain scenarios.
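
The representation-separation idea on top of a domain-adversarial network can be sketched, hypothetically, as follows. The layer sizes, the subtraction-style decomposition, and the omission of the graph neural network knowledge module are simplifying assumptions for illustration, not the thesis's exact architecture.

    # Hypothetical sketch: separate each pooled sentence-pair representation into a
    # sample-related base part and a domain-related bias part, and apply the
    # domain-adversarial (gradient-reversal) objective to the base part only.
    import torch
    from torch import nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lamb):
            ctx.lamb = lamb
            return x.clone()

        @staticmethod
        def backward(ctx, grad_output):
            # Reverse (and scale) gradients flowing back to the encoder.
            return -ctx.lamb * grad_output, None

    class DecomposedDomainAdversarialHead(nn.Module):
        def __init__(self, hidden_size=768, num_domains=2, num_labels=2):
            super().__init__()
            self.bias_proj = nn.Linear(hidden_size, hidden_size)  # domain-related bias
            self.domain_clf = nn.Linear(hidden_size, num_domains)
            self.match_clf = nn.Linear(hidden_size, num_labels)

        def forward(self, pooled, lamb=1.0):
            bias = self.bias_proj(pooled)          # domain-related bias representation
            base = pooled - bias                   # sample-related base representation
            domain_logits = self.domain_clf(GradReverse.apply(base, lamb))
            match_logits = self.match_clf(base)    # match on the domain-invariant part
            return match_logits, domain_logits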

Keywords: natural language processing, sentence matching, cross-domain scenarios, domain adaptation
Language: Chinese
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/48662
Collection: 模式识别国家重点实验室_自然语言处理
Recommended citation
GB/T 7714
白桂荣. 面向跨领域场景的句子匹配关键技术研究[D]. 中国科学院自动化研究所, 2022.
Files in this item
_白桂荣_博士论文.pdf (3950 KB), thesis, open access, license: CC BY-NC-SA
