多信息融合的自动摘要方法研究 (Research on Automatic Summarization Methods Based on Multi-Information Fusion)
刘天赏
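Subtype: Master's
Thesis Advisor: 张家俊
2019-05-24
Degree Grantor: University of Chinese Academy of Sciences (中国科学院大学)
Place of Conferral: University of Chinese Academy of Sciences (中国科学院大学)
Degree Discipline: Computer Technology
Keyword: automatic summarization, user information, part-of-speech information, sequence-to-sequence model, attention mechanism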
Abstract

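Automatic summarization is an information compression technique that removes redundant information from the input, retains the core information, and presents that core information to the user as fluent natural language. As information on the Internet grows richer and more varied, the information people can retrieve becomes increasingly vast and heterogeneous. Automatic summarization can help people automatically find important or interesting content among this diverse online information, reducing the burden of acquiring information. High-quality summaries can be presented directly to users as short messages such as news flashes and product descriptions, and can also serve downstream tasks, for example retrieval over webpage summaries in information retrieval. This research therefore has significant theoretical and practical value.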

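Almost all current automatic summarization models take text as their object of study, yet information beyond the text, such as personal information about the author and linguistic information, can serve as auxiliary information to help generate better summaries. Which non-textual information to use, and how to incorporate it into a summarization model, is therefore a challenge for automatic text summarization. This thesis focuses on summarization methods that fuse two kinds of information: user information in the e-commerce user review summarization task, and linguistic information in the single-document news summarization task.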

The main work and contributions of the thesis are summarized as follows:

1. An automatic summarization method that fuses user information

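E-commerce user review summarization generates a short title for each user review on an e-commerce platform. Because the titles that users write for their reviews vary from person to person, a review summarization model must both extract the core information of the review and attend to the personal preferences of the user who wrote it, which makes the task harder than general single-document summarization. Existing review summarization models consider only the review text and ignore user preferences, so the generated titles cannot reflect the preferences of a specific user. We argue that user information should be incorporated when summarizing e-commerce user reviews, and we therefore propose a review summarization method that fuses user information. We represent and learn user preferences in a distributed vector space and integrate the user preference representation into an attention-based sequence-to-sequence abstractive summarization model. Experimental results show that the review summarization model with user information significantly outperforms the model that uses review text alone. This indicates that user information plays a key role in the automatic summarization of e-commerce user reviews and that the proposed model is effective. Through sample analysis, we further find that our model captures users' personal preferences better than the baseline model.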

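2. An automatic summarization method that fuses part-of-speech information

When people read a document, linguistic information (such as part of speech and syntax) helps them analyze and understand it. Using linguistic information may therefore improve the performance of summarization models. Throughout the development of natural language processing, applying linguistic information to natural language processing tasks has been a common approach. However, with the rise of deep learning based on distributed representations, automatic summarization models built from recurrent neural networks, convolutional neural networks, and attention mechanisms are considered able to represent and learn the syntactic and semantic information of text implicitly. This thesis proposes an automatic summarization model that fuses explicit part-of-speech information in order to examine how part-of-speech information affects summary quality. We integrate part-of-speech information into a sequence-to-sequence abstractive summarization model and determine the importance of text content in a bottom-up manner. Experimental results show that explicitly incorporating part-of-speech information can improve summary quality, but the improvement is small. This also indirectly suggests that the deep learning baseline system can implicitly represent and use some part-of-speech information.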

Other Abstract

Automatic summarization is an information compression technique that aims to eliminate redundant parts of the input while retaining the valuable content, and then to express the core information fluently as a summary for the user. With the development of the Internet, people can retrieve more and more information. Automatic summarization can effectively help people find important or interesting content in diverse and complex information. High-quality summaries can be presented directly to users in the form of short messages such as news headlines and product descriptions. Automatic summarization can also be used for downstream tasks; for example, webpage summaries can be used for information retrieval. Therefore, research on automatic summarization has significant theoretical and practical value.

At present, almost all automatic summarization models are based on text alone, but other information, such as user information and linguistic information, can serve as auxiliary information to improve the performance of summarization models. Deciding which non-textual information to use, and how to incorporate it into a summarization model, is therefore a challenge for automatic text summarization. This thesis focuses on incorporating two kinds of information into summarization models: user information in the e-commerce user review summarization task, and linguistic information in the single-document news summarization task.

The main contributions of the thesis are summarized as follows:

1. Proposing an Automatic Summarization Method based on User Information Fusion

The summarization of user reviews on an e-commerce platform is a task that generates a short headline for each review. It is more difficult than general single-document summarization: because user-written titles vary from user to user, the summarization model needs not only to extract the core information of each review but also to attend to the personal preferences of the user who wrote it. Existing review summarization models focus only on the review text and do not take user preferences into account, so the generated headlines cannot reflect the preferences of specific users. We argue that user information should be integrated into the summarization model when summarizing user reviews on an e-commerce platform, and we therefore propose an e-commerce user review summarization model that takes user information into account. We represent and learn user preferences in a distributed vector space and incorporate them into an attention-based sequence-to-sequence abstractive summarization model. The experimental results show that the review summarization model with user information significantly outperforms a model that uses the review text alone. This shows that user information plays a key role in the automatic summarization of user reviews, and it also indicates that the proposed model is effective. Through sample analysis, we further find that our model captures users' preferences better than the baseline model.
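The abstract does not specify how the user preference vector enters the model, so the following is only a minimal sketch of one plausible fusion, not the thesis's actual architecture. It assumes dot-product attention and a hypothetical choice of adding the user embedding to the decoder query; the names `user_aware_attention` and `user_embeddings` and all toy values are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def user_aware_attention(encoder_states, decoder_state, user_vec):
    """Attend over encoder states with a query shifted by the user vector."""
    # Assumed fusion (hypothetical): add the user embedding to the decoder query.
    query = [d + u for d, u in zip(decoder_state, user_vec)]
    weights = softmax([dot(h, query) for h in encoder_states])
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy data: two encoded review tokens and two users with different preferences.
encoder_states = [[1.0, 0.0], [0.0, 1.0]]
decoder_state = [0.0, 0.0]
user_embeddings = {"user_a": [2.0, 0.0], "user_b": [0.0, 2.0]}

weights_a, context_a = user_aware_attention(
    encoder_states, decoder_state, user_embeddings["user_a"])
weights_b, context_b = user_aware_attention(
    encoder_states, decoder_state, user_embeddings["user_b"])
```

With the same review and decoder state, the two toy users place most of their attention weight on different tokens, which is the kind of user-dependent behavior the paragraph describes. In a trained model the user embeddings would be learned jointly with the summarizer rather than fixed by hand.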

2. Proposing an Automatic Summarization Method based on Part-of-Speech Information Fusion

It is generally known that linguistic information (such as part of speech and syntax) can help people understand documents. Therefore, incorporating linguistic information into a document summarization model may improve its performance. In the development of natural language processing (NLP), applying linguistic features to NLP tasks has been a common approach. With the development of deep learning based on distributed representations, however, automatic summarization models built from recurrent neural networks, convolutional neural networks, and attention mechanisms are considered able to implicitly represent and learn the syntactic and semantic information of the input text. This thesis proposes an automatic summarization model that integrates explicit part-of-speech information in order to examine the impact of part-of-speech information on summary quality. We incorporate part-of-speech information into a sequence-to-sequence abstractive summarization model that determines the importance of text content in a bottom-up manner. Experimental results show that explicitly integrating part-of-speech information can improve the quality of the summaries, but the improvement is small. This also indirectly suggests that the deep learning baseline system can implicitly represent and use some part-of-speech information.
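The abstract does not say how part-of-speech information is injected, so the sketch below shows only one common option, assumed here for illustration: concatenating a part-of-speech tag embedding with each word embedding before the encoder. The lookup tables, tag set, and the function name `embed_with_pos` are hypothetical, and the vectors are toy values rather than learned parameters.

```python
# Hypothetical toy lookup tables; a real model would learn these during training.
word_embeddings = {
    "the": [0.1, 0.2],
    "market": [0.7, 0.3],
    "rose": [0.4, 0.9],
}
pos_embeddings = {
    "DET": [1.0, 0.0],
    "NOUN": [0.0, 1.0],
    "VERB": [0.5, 0.5],
}

def embed_with_pos(tokens, tags):
    """Concatenate each word vector with the vector of its part-of-speech tag."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    # List concatenation yields one fused input vector per token.
    return [word_embeddings[w] + pos_embeddings[t] for w, t in zip(tokens, tags)]

encoder_inputs = embed_with_pos(["the", "market", "rose"], ["DET", "NOUN", "VERB"])
```

Each encoder input now carries the tag signal explicitly instead of leaving the model to infer it from context, which is the contrast between explicit and implicit linguistic information that the paragraph draws.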

Pages: 70
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/23926
Collection: 模式识别国家重点实验室_自然语言处理 (National Laboratory of Pattern Recognition, Natural Language Processing)
Corresponding Author: 刘天赏
Recommended Citation
GB/T 7714
刘天赏. 多信息融合的自动摘要方法研究[D]. 中国科学院大学. 中国科学院大学,2019.
Files in This Item:
File Name/Size | DocType | Version | Access | License
Thesis.pdf (1789KB) | Thesis | | Open Access | CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.