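Research on Generalization Techniques for Semantic Parsing over Tabular Data (面向表格数据的语义解析泛化技术研究)
Sun Runxin (孙润鑫)
2024-05-16
Pages: 68
Degree type: Master's
Chinese Abstract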

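Tables store and locate data in a grid of rows and columns, and they are one of the main forms in which data is stored and managed in the real world. With the spread of electronic devices and the emergence and growth of the Internet, large amounts of tabular data are produced every day. Furthermore, to better meet the needs of all industries for storing and managing large-scale data, relational databases and their management systems came into being; such databases organize and store data in units of tables. However, most existing data query interfaces (that is, the ways users interact with data in tables or databases) are based on the Structured Query Language (SQL), so ordinary users who are unfamiliar with SQL syntax or unaware of a database's contents cannot access large-scale databases this way. Semantic parsing over tabular data (tabular semantic parsing for short) has therefore attracted wide attention. Its goal is to automatically generate a formal-language query (typically SQL) from a user's natural language question, and then to execute that query on the given table or database to obtain the answer.

After a period of development, performance on the tabular semantic parsing task has improved considerably. Traditional methods already achieve satisfactory performance in the standard independent and identically distributed (i.i.d.) setting. In real applications, however, the intent, style, and complexity of user questions can vary widely, and the assumption that training and prediction share one distribution no longer holds. Moreover, because real-world scenarios are complex and changeable, it is natural to want a model that can adaptively generalize to unseen questions and databases without retraining. Improving the generalization performance of tabular semantic parsing models has therefore been a main concern of researchers in recent years, and it is the subject and goal of this thesis.

This thesis focuses on generalization techniques for semantic parsing over tabular data and aims to improve the SQL generation performance of tabular semantic parsing models in two settings: cross-domain generalization and semantic compositional generalization. To this end, it studies two questions in depth: "how to help the model better capture the alignment between substructures of the question and of the SQL query" and "how to decompose the question and help the model better generate query fragments". The main research contents and contributions are as follows:

1. A tabular semantic parsing method enhanced by phrase-level "lexical-logical" alignment

Existing work on tabular semantic parsing shows that exploiting "lexical-logical" alignment (the correspondence between phrases in the question and fragments in the SQL query) can effectively improve SQL generation. However, existing ways of exploiting this alignment rely mainly on the attention mechanism, which can model alignment only at the token level and does not favor generalization. To better obtain and use this alignment information, this research proposes a novel neural parsing framework. The framework follows a two-stage "predict first, then utilize" design, which models "lexical-logical" alignment at the phrase level and uses the alignment explicitly during parsing. In addition, to mitigate the negative effect of noisy (erroneous) alignments on the parsing model, this research proposes a data augmentation method that adds noisy alignments during training to make the model more robust. Experiments on the benchmark dataset show that the proposed model significantly outperforms the previous best baseline. Compared with attention-based models, it also generalizes better and makes better use of "lexical-logical" alignment information.

2. A few-shot tabular semantic parsing method based on question decomposition and sub-segment retrieval

In tabular semantic parsing, generating a fully correct SQL query in a single pass is very difficult, especially for more complex queries. To address this, this research proposes a few-shot tabular semantic parsing method that combines question decomposition with sub-segment retrieval. The method decomposes the question and generates the corresponding query fragment for each resulting question segment in turn, which significantly reduces the difficulty of generating the complete SQL query and improves the model's compositional generalization. Furthermore, the method retrieves demonstration examples at the level of (question) sub-segments rather than at the commonly used sentence level, which improves retrieval precision and prompts the model more effectively when it generates query fragments. Experiments on benchmark datasets show that the proposed method significantly outperforms all baseline methods in query generation, with stronger compositional generalization and better performance on more complex queries.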

English Abstract

Tables, which store and locate data in a grid of rows and columns, are one of the primary forms of storing and managing data in the real world. With the popularization of electronic devices and the emergence and development of the Internet, a large amount of data stored in tabular form is generated every day. Furthermore, to meet the growing needs of various industries for storing and managing large-scale data, relational databases and their management systems have emerged, and such databases organize and store data in units of tables. However, most existing data query interfaces (i.e., the ways users interact with data in tables or databases) are based on the Structured Query Language (SQL), which makes it impossible for ordinary users who are unfamiliar with SQL syntax or unaware of the database's content to access large-scale databases in this way. Therefore, there is growing interest in tabular semantic parsing (more commonly referred to as text-to-SQL), which aims to automatically generate the corresponding formal-language (typically SQL) query from a natural language question posed by a user. The query can then be executed on the given table or database to obtain the answer.
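As a rough illustration of the task just described, the sketch below pairs a natural language question with the SQL query a parser would be expected to produce, then executes that query to obtain the answer. The table, its rows, the question, and the query are all invented for this example and do not come from the thesis or its benchmark datasets.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [("Ana", "France", 31), ("Bo", "China", 45), ("Cy", "France", 52)],
)

# Input to the parser: a natural language question.
question = "How many singers from France are older than 40?"

# Output expected from the parser: a SQL query (written by hand here).
predicted_sql = (
    "SELECT COUNT(*) FROM singer WHERE country = 'France' AND age > 40"
)

# Executing the query on the database yields the answer to the question.
answer = conn.execute(predicted_sql).fetchone()[0]
print(question, "->", answer)
```

Only the mapping from `question` to `predicted_sql` is the parsing problem; execution is handled by the database engine.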

After a period of development, the performance of tabular semantic parsing has improved significantly. Traditional methods have achieved satisfactory performance in the standard independent and identically distributed (i.i.d.) scenario. However, in real-world applications, where the intent, style, and complexity of user questions may be diverse, the assumption of identical distribution (between training and prediction) no longer holds. Moreover, considering the complex and dynamic nature of real-world scenarios, it is natural to want a model that can adaptively generalize to unseen questions and databases without retraining. Therefore, improving the generalization performance of tabular semantic parsing models has been a primary focus of researchers in recent years, and it is also the subject and goal of this thesis.

This thesis focuses on semantic parsing generalization techniques for tabular data. It aims to improve the performance of tabular semantic parsing models in generating SQL queries under two scenarios: cross-domain generalization and semantic compositional generalization. To achieve this goal, the thesis investigates two questions in depth: 1) how to better capture the alignments between the substructures of questions and SQL queries, and 2) how to decompose the question and enable the model to generate query fragments more effectively. The main research contents and contributions of this thesis are as follows:

1. A tabular semantic parsing method enhanced by phrase-level lexical-logical alignments

Previous research on the tabular semantic parsing task has shown that leveraging lexical-logical alignments, which refer to the correspondence between phrases in questions and fragments in SQL queries, can effectively improve the performance of parsing models in generating SQL queries. However, existing methods of leveraging such alignments are mainly based on the attention mechanism. Such methods can only model alignments at the token level and are not conducive to the model's generalization. To better capture and utilize these lexical-logical alignments, this research proposes a novel neural parsing framework. The framework adopts a two-stage "predict first, then utilize" approach to model lexical-logical alignments at the phrase level and to leverage them explicitly during parsing. Moreover, to alleviate the negative effects of noisy (erroneous) alignments on the tabular semantic parsing model, this research proposes a data augmentation method that adds noisy alignments during training to enhance the model's robustness. Experimental results on the benchmark dataset show that the proposed model significantly outperforms the previous state-of-the-art baseline. Moreover, compared with models based on the attention mechanism, the model demonstrates stronger generalization capability and better utilization of lexical-logical alignment information.
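The abstract does not say how the noisy alignments are constructed, so the following is only a minimal sketch of one possible scheme: phrase-level alignments are represented as (question phrase, SQL fragment) pairs, and some pairs are corrupted by swapping in a fragment that belongs to a different phrase. The function name `add_alignment_noise`, the `noise_rate` parameter, the data layout, and the example pairs are all assumptions made for illustration.

```python
import random


def add_alignment_noise(alignments, noise_rate, rng):
    """Return a copy of `alignments` with some pairs deliberately wrong.

    `alignments` is a list of (question_phrase, sql_fragment) pairs.
    Each pair is corrupted with probability `noise_rate` by replacing
    its SQL fragment with a fragment taken from another pair.
    This corruption scheme is an assumption, not the thesis's method.
    """
    fragments = [frag for _, frag in alignments]
    noisy = []
    for phrase, frag in alignments:
        wrong = [f for f in fragments if f != frag]
        if wrong and rng.random() < noise_rate:
            frag = rng.choice(wrong)  # inject an erroneous alignment
        noisy.append((phrase, frag))
    return noisy


# Hypothetical phrase-level alignments for one training example.
gold = [
    ("how many singers", "SELECT COUNT(*) FROM singer"),
    ("from France", "country = 'France'"),
    ("older than 40", "age > 40"),
]

rng = random.Random(0)  # seeded so that runs are repeatable
for phrase, frag in add_alignment_noise(gold, noise_rate=0.3, rng=rng):
    print(f"{phrase!r} -> {frag!r}")
```

Training on a mix of correct and corrupted pairs is meant to stop the parser from trusting the predicted alignments blindly, since the first-stage predictions used at inference time will contain errors.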

2. A few-shot tabular semantic parsing method based on question decomposition and segment retrieval

For the tabular semantic parsing task, it is quite challenging to generate an entirely correct SQL query in one pass, especially for complex SQL queries. Therefore, this research proposes a few-shot tabular semantic parsing method that combines question decomposition and segment retrieval. The method decomposes the question and generates the corresponding query fragment for each question segment individually, which significantly reduces the difficulty of generating complete SQL queries and improves the model's compositional generalization capability. Furthermore, instead of retrieving demonstration examples at the commonly used question (sentence) level, the method conducts retrieval at the (question) segment level to improve the precision of demonstration retrieval and to better prompt the model to generate query fragments. Experimental results on benchmark datasets show that the proposed method significantly outperforms all baseline methods in generating SQL queries, and that it demonstrates stronger compositional generalization capability and better performance in generating complex SQL queries.
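The idea of retrieving demonstrations per question segment can be sketched as follows. This is not the thesis's implementation: the decomposition is hard-coded rather than predicted, the retriever is a simple token-overlap (Jaccard) ranking swapped in for whatever retriever the thesis uses, and the demonstration pool, the function names `jaccard`, `retrieve`, and `build_prompt`, and the prompt layout are all hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity between two short strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(segment, pool, k=1):
    """Return the k pool entries whose segment is most similar to `segment`."""
    return sorted(pool, key=lambda d: jaccard(segment, d[0]), reverse=True)[:k]


def build_prompt(segment, demos):
    """Format retrieved demonstrations followed by the segment to parse."""
    blocks = [f"Segment: {q}\nSQL fragment: {s}" for q, s in demos]
    blocks.append(f"Segment: {segment}\nSQL fragment:")
    return "\n\n".join(blocks)


# Hypothetical pool of (question segment, SQL fragment) demonstrations.
pool = [
    ("how many students", "SELECT COUNT(*) FROM student"),
    ("older than 20", "age > 20"),
    ("from Canada", "country = 'Canada'"),
    ("sorted by name", "ORDER BY name"),
]

# Assumed decomposition of "How many singers from France are older than 40?"
segments = ["how many singers", "from France", "older than 40"]

# One prompt per segment; each would be sent to a language model that
# completes the final "SQL fragment:" line (no model is called here).
prompts = [build_prompt(seg, retrieve(seg, pool, k=1)) for seg in segments]
print(prompts[0])
```

Matching at the segment level lets a demonstration such as "older than 20" support "older than 40" even when no whole question in the pool resembles the full input question.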

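Keywords: natural language processing; semantic parsing; text-to-SQL; generalization capability
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/56597
Collection: Graduates_Master's Theses
Recommended citation
GB/T 7714
Sun Runxin (孙润鑫). Research on Generalization Techniques for Semantic Parsing over Tabular Data (面向表格数据的语义解析泛化技术研究) [D], 2024.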
Files in this item
File name/size: 学位论文_V2.pdf (2868KB)
Document type: Thesis
Access type: Restricted
License: CC BY-NC-SA