Research on Robust Methods of Semantic Role Labeling (鲁棒的语义角色标注方法研究)

	鲁棒的语义角色标注方法研究
Alternative title	Research on Robust Methods of Semantic Role Labeling
	Zhuang Tao (庄涛)
	2012-05-28
Degree type	Doctor of Engineering
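Abstract (translated from Chinese)	Semantic role labeling is a shallow semantic analysis technique. It works sentence by sentence and does not analyze the semantic information in a sentence in depth; it analyzes only the sentence's predicate-argument structure. Semantic role labeling can provide useful semantic analysis results for tasks such as information extraction, question answering, and machine translation. In practical applications, however, the robustness of semantic role labeling is rather poor: good results are obtained only on narrow, specific corpora, and results on general corpora are very poor. The main causes are as follows. First, because semantic role labeling uses syntactic parsing, it depends heavily on the parsing results, and current syntactic parsing performance is not fully satisfactory. Second, the performance of semantic role labeling drops too much on out-of-domain test data. Most of the PropBank corpus, the one most commonly used in research, comes from economic news in the Wall Street Journal (WSJ); on test data that is not economic news, labeling accuracy drops sharply. In addition, because training data for semantic role labeling is very limited, bringing in more linguistic knowledge to help semantic role labeling becomes very important. How to use more linguistic knowledge to improve the robustness of semantic role labeling is therefore also a problem that needs study. With the goal of improving the robustness of semantic role labeling, and addressing the problems above, this thesis carries out research in three areas: 1. It proposes a minimum error weighting combination strategy to reduce the dependence of semantic role labeling on syntactic parsing. System combination can reduce the effect of parsing errors on semantic role labeling. Traditional combination methods treat the outputs of all combined systems equally, whereas in fact systems with better overall results are more trustworthy. This thesis therefore proposes a minimum error weighting combination strategy, which assigns different weights to the outputs of different systems and trains those weights by minimizing an error function on a development set. The thesis gives a method for training the minimum error weights that applies to many forms of error function. Using this strategy, the thesis obtains the best semantic role labeling results to date on the Chinese PropBank data set. 2. It proposes a latent feature representation model based on a Deep Belief Network (DBN) to improve the performance of semantic role labeling on out-of-domain test data. Because semantic role labeling depends on syntactic parsing, improving its out-of-domain performance requires improving the out-of-domain performance of syntactic parsing as well. Many current parsing and semantic role labeling methods make decisions with discriminative models, in which every data sample is represented as a feature vector. The drop in out-of-domain performance is caused mainly by feature sparsity: many features that occur in the target-domain test data rarely occur in the source-domain training data. The goal of the DBN model is to learn automatically a feature representation common to the source and target domains, under which data from the two domains look more similar. The DBN model is a graphical model with two layers of hidden variables. It represents each data sample as a set of latent features, which the thesis uses to train and test dependency parsing and semantic role labeling systems. Experimental results show that the resulting dependency parsing and semantic role labeling systems adapt better to the target domain. The DBN model provides a unified domain adaptation method for dependency parsing and semantic role labeling. 3. It studies how to use bilingual information to help semantic role labeling. Bilingual semantic role labeling has important applications in machine translation. For this problem, the traditional approach is, on each of the two language sides separately, to ...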
Abstract (English)	Semantic Role Labeling (SRL) is a shallow semantic analysis technique. Given a sentence, it does not perform deep semantic analysis but only labels the arguments related to the predicates in the sentence. SRL can provide useful semantic analysis for applications such as information retrieval, question answering, and machine translation. In practice, however, the robustness of SRL is weak: good results are obtained only on a narrow, specific domain of text, and results on general text are usually poor. The main reasons are as follows. First, because SRL uses syntactic parsing results, it relies heavily on syntactic parsing. Second, SRL performs very badly on out-of-domain test data. The corpus most commonly used for SRL is PropBank, which consists mostly of economic news text from the Wall Street Journal (WSJ). On text from other genres, SRL performance drops significantly. Moreover, because the training data available for SRL is very limited, using more linguistic knowledge to help SRL is very important. How to use more linguistic knowledge to improve the robustness of SRL is therefore also an important research problem. The work in this thesis aims to improve the robustness of SRL. Focusing on the problems above, this thesis presents research on three aspects: 1. This thesis proposes a Minimum Error Weighting (MEW) combination strategy that reduces SRL’s reliance on a single syntactic parsing result. System combination is an effective way to reduce that reliance. Traditional combination methods trust all of the SRL results being combined equally. However, different systems have different properties, and it is reasonable to trust systems with better overall results more than the others. This thesis therefore proposes a strategy that assigns different weights to the results of different systems. These weights are trained by minimizing an error function on a development set. This thesis introduces an algorithm for MEW training that places no requirement on the form of the error function, so the error function can be defined freely as needed. Using the proposed method, this thesis achieves the best SRL result to date on the commonly used Chinese PropBank data set. 2. This thesis proposes a model based on a Deep Belief Network (DBN) that learns a Latent Feature Representation (LFR) for domain adaptation of SRL. Because of SRL’s reliance on syntactic parsing, it is di...
Keywords	Semantic Role Labeling; System Combination; Joint Inference for Bilingual Semantic Role Labeling; Argument Alignment; Domain Adaptation; Latent Feature Representation
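Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6430
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Zhuang Tao. Research on Robust Methods of Semantic Role Labeling [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2012.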

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20081801462808 (1124KB)			Not yet open	CC BY-NC-SA
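
The minimum error weighting idea described in the abstract (weight each system's output, then train the weights by minimizing an error function on a development set) can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the thesis's algorithm: arguments are (start, end, label) tuples, a candidate is kept when its supporting systems hold more than half of the total weight, overlap constraints between arguments are ignored, and the weights are trained by a simple derivative-free coordinate search over a fixed grid. The names `combine`, `f1_error`, `dev_error`, `train_weights`, and `DEV_SET` are hypothetical.

```python
def combine(outputs, weights, threshold=0.5):
    """Keep candidates backed by more than `threshold` of the total weight."""
    total = sum(weights)
    score = {}
    for system_args, w in zip(outputs, weights):
        for arg in system_args:
            score[arg] = score.get(arg, 0.0) + w
    return {arg for arg, s in score.items() if s > threshold * total}


def f1_error(predicted, gold):
    """One possible error function: 1 - F1 over labeled arguments."""
    if not predicted and not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 1.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 1.0 - 2 * precision * recall / (precision + recall)


def dev_error(dev_set, weights, error_fn):
    """Mean error of the weighted combination over the development set."""
    errors = [error_fn(combine(outs, weights), gold) for outs, gold in dev_set]
    return sum(errors) / len(errors)


def train_weights(dev_set, n_systems, error_fn,
                  grid=(0.0, 0.25, 0.5, 0.75, 1.0), sweeps=3):
    """Coordinate search: only calls error_fn, so it needs no gradient."""
    weights = [1.0] * n_systems          # start from equal weighting
    best = dev_error(dev_set, weights, error_fn)
    for _ in range(sweeps):
        for i in range(n_systems):
            for value in grid:
                trial = weights[:i] + [value] + weights[i + 1:]
                if sum(trial) == 0:      # skip the all-zero weight vector
                    continue
                err = dev_error(dev_set, trial, error_fn)
                if err < best:
                    best, weights = err, trial
    return weights, best


# Toy development set (invented data): each entry is
# ([output of system A, system B, system C], gold arguments).
DEV_SET = [
    ([{(0, 1, "A0"), (3, 5, "A1")},
      {(0, 1, "A0"), (3, 4, "A1")},
      {(0, 1, "A0"), (3, 4, "A1")}],
     {(0, 1, "A0"), (3, 5, "A1")}),
    ([{(2, 2, "A0"), (4, 6, "AM-TMP")},
      {(2, 2, "A0")},
      {(2, 2, "A0"), (4, 5, "AM-TMP")}],
     {(2, 2, "A0"), (4, 6, "AM-TMP")}),
]

equal_err = dev_error(DEV_SET, [1.0, 1.0, 1.0], f1_error)
trained_weights, trained_err = train_weights(DEV_SET, 3, f1_error)
print("equal weighting error:", equal_err)
print("trained weights:", trained_weights, "error:", trained_err)
```

Because the search only evaluates `error_fn` and never differentiates it, any error function with the same signature can be substituted, which mirrors the abstract's remark that the training method places no requirement on the form of the error function.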