Research on Methods for Building Document-Based Dialogue Systems (基于文档的对话系统构建方法研究)


	阮星程
	2023-05
Pages	73
Degree type	Master's
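Chinese abstract (translated)	Document-based dialogue systems draw on associated document content to give users the information they need through sustained conversation. Compared with structured knowledge representations such as knowledge bases and knowledge graphs, unstructured documents are closer to the natural form in which information is conveyed, are easy to obtain, and are widely available. However, unstructured documents lack an effective organizational structure, which makes it hard for users to quickly find the information they care about. Document-based dialogue systems can make full use of this widely available content to supply key information to users in many domains, so research on them has significant theoretical and practical value. A document-based dialogue system usually contains a reference knowledge identification module and a response generation module. The identification module uses the dialogue history to identify, within the relevant documents, the reference knowledge a response requires; the generation module combines the dialogue history with that reference knowledge to produce the system response. Reference knowledge is the core basis for answering a user's question, and its form and content determine the quality of the generated response. Existing work nevertheless pays little attention to the specific form of reference knowledge, and it lacks both an evaluation of how the form and content of reference knowledge affect response generation and targeted solutions. Taking the form and content of reference knowledge as its starting point, this thesis studies reference knowledge identification and response generation in turn. The main contributions are as follows.
(1) A reference knowledge identification method based on ensemble self-distillation. The uncertainty of reference knowledge poses many challenges for data collection and model construction. Because of this uncertainty there is no settled way to delimit reference knowledge, so existing datasets cannot provide systematic and comprehensive reference knowledge annotations. On the modeling side, mainstream identification methods usually fit the single reference knowledge label supplied by the dataset and ignore both the uncertainty of reference knowledge and the quality of the annotation. This leads to poor robustness and limited generalization, and ultimately hurts identification performance. To address this, the thesis proposes a reference knowledge identification method based on ensemble self-distillation. The method obtains reference knowledge labels from the model itself, converts the single reference knowledge label into a soft label, and retrains the model following the idea of self-distillation. The thesis also optimizes the training procedure and objective of self-distillation so that model ensembling and self-distillation are combined effectively. Experiments show that the method mitigates the effect on training of reference knowledge that cannot be uniquely determined, and significantly improves identification performance.
(2) A response generation method based on a differentiated training strategy. Responses generated by document-based dialogue systems are usually closely tied to the reference knowledge. If a model is trained directly on input sequences that overlap heavily with the target, it falls into lazy training: it merely learns to rewrite the input sequence, so the generated responses are redundant and overlap excessively with the reference knowledge. Approaching the problem from the training method, the thesis proposes a response generation method based on a differentiated training strategy. During training the method introduces extended reference knowledge, which supplies the contextual logic that the reference knowledge alone is missing, and requires the model to extract the key information related to the dialogue history in order to generate the response. The thesis also adopts a dynamic temperature scheduling mechanism to balance the difference in reference knowledge length between training and testing that the differentiated strategy introduces. Experimental results show that the method effectively alleviates lazy training and achieves excellent response generation results on a public dataset.
Together, the two methods form the overall structure of a document-based dialogue system. Experiments show that combining them yields response generation results on a public document-based dialogue dataset that surpass the best existing methods.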
English abstract	Document-based dialogue systems use associated document content to provide users with the information they need in the form of a continuous conversation. Compared with structured knowledge representations such as knowledge bases and knowledge graphs, unstructured documents are closer to the natural form of information transmission, easy to obtain, and widely available. However, unstructured documents lack an effective information organization structure, making it difficult for users to quickly reach the information they are interested in. Document-based dialogue systems can make full use of widely available document content to provide users with key information in various fields. Research on document-based dialogue systems therefore has important theoretical and practical value. A typical document-based dialogue system consists of a reference knowledge identification module and a response generation module. The reference knowledge identification module uses the dialogue history to identify, in the relevant documents, the reference knowledge required for a response; the response generation module generates the system response by combining the dialogue history with the reference knowledge. Reference knowledge is the core basis for responding to user questions, and its form and content determine the quality of system-generated responses. However, existing research pays little attention to the specific form of reference knowledge, and it lacks both an evaluation of how the form and content of reference knowledge affect response generation and targeted solutions. This thesis focuses on the form and content of reference knowledge and carries out research on both reference knowledge identification and response generation. The main research work and innovations are summarized as follows.
(1) A reference knowledge identification method based on ensemble self-distillation. The uncertainty of reference knowledge brings many challenges to data collection and model construction. Because of this uncertainty there is no clear conclusion on how reference knowledge should be delineated, so existing datasets cannot provide systematic and comprehensive reference knowledge annotations. In terms of model construction, mainstream reference knowledge identification methods usually fit the single reference knowledge label provided by the dataset directly, ignoring the uncertainty and annotation quality of reference knowledge. This results in models with poor robustness and limited generalization, which ultimately affects reference knowledge identification performance. To address this issue, this thesis proposes a reference knowledge identification method based on ensemble self-distillation. The method obtains reference knowledge labels from the model itself, converts the single reference knowledge label into a soft label, and retrains the model following the idea of self-distillation. In addition, the thesis optimizes the training procedure and objective of the self-distillation method, effectively combining model ensembling with self-distillation. Experiments show that the proposed method effectively alleviates the impact on model training of reference knowledge that is difficult to determine uniquely, and significantly improves the model's reference knowledge identification performance.
(2) A response generation method based on a differentiated training strategy. Responses generated by document-based dialogue systems are usually closely related to the reference knowledge. If the model is trained directly on highly overlapping input sequences, it falls into lazy training and simply learns to rewrite the input sequence, which ultimately produces responses that are redundant and overlap excessively with the reference knowledge. Addressing the issue from the perspective of the training method, this thesis proposes a response generation method based on a differentiated training strategy. The method introduces extended reference knowledge during the training phase, filling in the contextual logic missing from the reference knowledge and requiring the model to extract the key information related to the dialogue history in order to generate responses. The thesis also adopts a dynamic temperature scheduling mechanism to balance the difference in reference knowledge length between the training and testing phases that the differentiated training strategy introduces. Experimental results show that the method effectively alleviates the problem of lazy training and achieves excellent response generation results on a public dataset.
These two methods together constitute the overall structure of the document-based dialogue system. Experiments show that combining the two methods achieves better response generation results on a public document-based dialogue dataset than the best existing methods.
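Research field	Natural language processing
Discipline category	Engineering::Computer Science and Technology (degrees in Engineering or Science may be conferred)
Language	Chinese
Seven major directions: sub-direction classification	Natural language processing
State Key Laboratory planned direction classification	Multi-scale information processing
Associated dataset requiring deposit	No
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/51884
Collection	Graduates_Master's Theses; State Key Laboratory of Multimodal Artificial Intelligence Systems
Recommended citation (GB/T 7714)	阮星程. 基于文档的对话系统构建方法研究[D],2023.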

Files in this item
File name/size	Document type	Version type	Access type	License
基于文档的对话系统构建方法研究-6.2-（3650KB）	Thesis		Restricted access	CC BY-NC-SA
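
Illustrative note on contribution (1). The abstract describes replacing a single annotated reference-knowledge label with a soft label obtained from the model itself and then retraining on it. The record does not give the thesis's actual formulation, so the following is only a minimal sketch of the general soft-label self-distillation idea. The averaging of ensemble members, the mixing weight `alpha`, the `temperature` value, and every function name here are assumptions introduced for illustration, not details taken from the thesis.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of candidate scores."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_label(teacher_logits, gold_index, alpha=0.5, temperature=2.0):
    """Mix the one-hot annotated label with an averaged teacher distribution.

    teacher_logits: one list of candidate scores per ensemble member
    gold_index:     index of the single annotated reference-knowledge candidate
    alpha:          weight kept on the annotated label (hypothetical hyperparameter)
    """
    n = len(teacher_logits[0])
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    avg = [sum(p[i] for p in probs) / len(probs) for i in range(n)]
    return [
        alpha * (1.0 if i == gold_index else 0.0) + (1.0 - alpha) * avg[i]
        for i in range(n)
    ]


def soft_cross_entropy(student_logits, target):
    """Cross-entropy of the student's distribution against a soft target."""
    student = softmax(student_logits)
    return -sum(t * math.log(q) for t, q in zip(target, student) if t > 0)


# Two hypothetical ensemble members scoring three candidate knowledge spans.
teachers = [[2.0, 1.5, -1.0], [1.0, 1.8, -0.5]]
target = ensemble_soft_label(teachers, gold_index=0)
loss = soft_cross_entropy([1.2, 0.9, -0.7], target)
```

In this sketch the second candidate keeps non-zero probability mass in `target` even though only the first one is annotated, which is one way a training signal can reflect that the reference knowledge is not uniquely determined.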