Research on Automatic Summarization Methods Based on Multi-Source Information

	Research on Automatic Summarization Methods Based on Multi-Source Information
	Li Haoran (李浩然)
	2018
Pages	118
Degree type	Doctoral
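Chinese abstract	Automatic summarization aims to distill valuable content from input information, effectively helping people find what is important or interesting to them in large volumes of information and thereby improving the efficiency of information access. In recent years the technology has developed rapidly, and many new directions and methods have emerged. The input to traditional automatic summarization systems is mostly text in a single language. With today's rapid development of the Internet, people encounter information from many kinds of sources: besides text, there are images, video and speech. Multi-source information often carries a larger amount of information; if a summarization system can analyze and exploit the multi-source information related to an event, it can understand the event more comprehensively and thus improve summary quality. Automatic summarization based on multi-source information therefore has significant research value. This dissertation studies in depth how to use multi-source information, including multilingual and multimodal information, to generate text summaries. It proposes a series of novel methods for multilingual multi-document summarization, summarization of multilingual asynchronous multimodal data, and multimodal sentence summarization. The main contributions are as follows:
(1) A multilingual multi-document summarization method based on a guided ranking graph model. The input to a multilingual multi-document summarization system is multiple documents in several languages about an event, and the output is a text summary in a single language. For a given event, media in different countries report from different angles and often express different views, so jointly analyzing texts in different languages makes it easier to obtain more complete information about the event. Existing methods automatically convert non-target-language documents into the target language by machine translation and then apply a monolingual multi-document summarization method. Because of the limits of machine translation quality, however, the resulting summaries are usually poorly readable. To address this, the dissertation proposes a graph model based on guided ranking. By detecting the relatedness between sentences in different languages, the model incorporates readability into summarization: it selectively uses sentences from target-language documents to express the important information shared across languages, while retaining important translated sentences from non-target-language documents that the target-language documents cannot cover, thereby balancing readability and informativeness. Three strategies are proposed for detecting relatedness between sentences in different languages, based respectively on cosine similarity, entailment and a translation model. Experiments show that the guided ranking model using translation-model-based cross-lingual semantic relatedness achieves the best results, with performance gains of more than 10% when producing either Chinese or English summaries.
(2) A summarization method for asynchronous multimodal data. With the rapid development of the Internet, massive amounts of multimedia information have enriched daily life and changed how people obtain information. How to obtain text summaries of such multimedia data in a short time is a pressing problem. The dissertation proposes an extractive multimodal summarization method for asynchronous text, images, audio and video. The method combines natural language processing, speech processing and computer vision to explore the rich information in multimodal data and improve the quality of multimodal summaries. The key to asynchronous multimodal summarization is bridging the semantic gap between modalities. Audio and images are the main forms of information in video. For audio, the dissertation uses a guided ranking strategy to selectively use speech transcriptions with good readability, and uses the audio signal to infer the importance of the transcribed text. For images, it uses a neural network to learn joint representations of text and images, and then measures how well the generated summary covers important visual information through text-image matching or a multimodal topic model. Finally, all modalities are considered together, and a high-quality text summary is generated by maximizing textual salience, non-redundancy, readability and coverage of visual information. The dissertation is the first to propose the task of summarizing asynchronous multimodal data, and it collects, annotates and releases a Chinese and English dataset for this task. Several methods for processing audio, video and image data are explored. Experiments show that the guided ranking strategy is necessary for speech data, and that models based on image-text matching and on multimodal topic distributions are effective for visual data. Overall, the proposed multimodal model achieves a significant improvement over text-only models.
(3) A multimodal sentence summarization method based on a modality attention mechanism. The input to the traditional sentence summarization task is a piece of text (which can be viewed as a long sentence), and the output is a short summary. Compared with text, which narrates in a complex way, a picture can often express the important information in the text more directly. The dissertation therefore proposes a multimodal sentence summarization task that generates a text summary from a matched pair of sentence and image. This task is more challenging than sentence summarization because visual information must be integrated effectively into the traditional text-based framework while avoiding the introduction of visual noise. To solve this problem, we propose a modality-based attention mechanism that assigns different weights not only to different image regions and different text units of the sentence, but also to the textual and visual modalities themselves. In addition, because images express some abstract semantics poorly, we use image filters to selectively apply visual information to enhance the semantic representation of the input sentence. Building on a public text-only sentence summarization dataset, the dissertation collects, annotates and releases a multimodal sentence summarization dataset. Extensive comparative experiments on this dataset show that the proposed method improves on text-only methods by more than 7%. Further analysis shows that visual information helps initialize the summary decoder, helps generate more abstractive summaries, and leads to fewer repeated words.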
English abstract	Automatic summarization aims to analyze input information and compress it into a more concise form that still covers the valuable content of the original. It can effectively help people find important or interesting content in complex information, thus improving the efficiency of information access. In recent years, automatic summarization has developed rapidly within natural language processing, and many new directions and methods have emerged. The input to traditional automatic summarization systems is mostly monolingual text. Nowadays, with the rapid development of the Internet, information sources are diverse, including text, images, video and audio. Multi-source information often contains a larger amount of information, so by analyzing it a summarization system can understand the focus of an event from different views and improve the quality of the generated summary. Automatic summarization of multi-source information therefore has great research value. This dissertation focuses on how to use multi-source information (including multilingual and multimodal information) to generate text summaries. First, it proposes an extractive multilingual multi-document summarization method. Next, it develops an asynchronous multimodal summarization system supporting multiple languages. Finally, it makes a preliminary exploration of abstractive summarization for multimodal data. The main contributions of the dissertation are summarized as follows:
(1) A guided ranking graph model for multilingual multi-document summarization. Multilingual multi-document summarization is the task of generating a summary in a target language from a collection of documents in multiple source languages. For a specific event, texts in different languages (such as news reports from different countries) describe the event from different views, so a comprehensive analysis of texts in different languages yields more valuable information about the event. The existing approach to this task automatically translates the non-target-language documents into the target language and then applies monolingual summarization methods, but the summaries it generates are often poorly readable because of the low quality of machine translation. To solve this problem, we propose a novel graph model based on a guided edge weighting method, in which both the informativeness and the readability of summaries are taken into consideration. Our model chooses from the target-language documents the sentences that contain important information shared across languages, and also retains salient translated sentences from non-target-language documents whose content the target-language documents do not cover, thereby balancing readability and informativeness. This dissertation proposes three strategies to detect the correlation between sentences in different languages, based on cosine similarity, entailment detection and a translation model, respectively. The experimental results show that the guided ranking model that uses the translation model for cross-lingual semantic correlation measurement achieves the best results, with over 10% performance improvement on both the Chinese and the English summarization tasks.
(2) A multimodal summarization method for asynchronous text, image, audio and video. The rapid increase in multimedia data transmission over the Internet necessitates multimodal summarization from asynchronous collections of text, image, audio and video. In this work, we propose an extractive multimodal summarization method that unites techniques from natural language processing, speech processing and computer vision to explore the rich information contained in multimodal data and to improve the quality of multimedia news summarization. The key idea is to bridge the semantic gaps between multimodal content. Audio and visual content are the main modalities in video. For audio information, we design an approach that selectively uses its transcription and infers the salience of the transcription from the audio signal. For visual information, we learn joint representations of text and images using a neural network. Then, we capture how well the generated summary covers important visual information through text-image matching or multimodal topic modeling. Finally, all the multimodal aspects are considered to generate a textual summary by maximizing salience, non-redundancy, readability and coverage through the budgeted optimization of submodular functions. This dissertation is the first to propose an automatic summarization task for asynchronous multimodal data, and we collect, annotate and release a dataset for this task in both Chinese and English. The dissertation explores a variety of methods for processing audio, video and images. Experimental results show that the guided ranking strategy is necessary for audio data. For visual data, our models based on image-text matching and on multimodal topic distribution are suitable for the task. Overall, our proposed model based on multimodal data achieves a significant performance improvement over the model based on text alone.
(3) A multimodal sentence summarization method with modality attention. The input of the traditional sentence summarization task is a piece of text (which can be regarded as a long sentence), and the output is a short summary. Compared with complex text, images tend to express the important information in the text more directly. In this dissertation, we introduce a multimodal sentence summarization task that produces a short summary from a sentence-image pair. This task is more challenging than sentence summarization: it must not only incorporate visual features effectively into the standard text summarization framework, but also avoid introducing noise from the image. To this end, we propose a modality-based attention mechanism that pays different attention to image patches, text units and modalities. We also design image filters that selectively use visual information to enhance the semantics of the input sentence. Building on a public sentence summarization dataset, we collect, annotate and release a multimodal sentence summarization dataset. We perform various comparative experiments, and the results show that the proposed multimodal sentence summarization method based on the modality attention mechanism achieves an improvement of over 7% over the text-only model. Further experimental analysis shows that visual information helps initialize the decoder, helps generate more abstractive summaries, and reduces duplicated words.
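Keywords	automatic summarization; multilingual; multi-document; multimodal
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/23110
Collection	Graduates_Doctoral Dissertations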
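Recommended citation (GB/T 7714)	Li Haoran. Research on Automatic Summarization Methods Based on Multi-Source Information[D]. Institute of Automation, Chinese Academy of Sciences, 2018.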
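
The English abstract mentions generating the summary "through the budgeted optimization of submodular functions". As a rough illustration of that general technique only, and not the dissertation's actual objective or algorithm, the sketch below greedily selects sentences under a length budget using a weighted concept-coverage function, which is monotone submodular. The function names, the concept-set representation of sentences, and the choice of coverage as the objective are all assumptions made for this example.

```python
from typing import Dict, List, Set


def coverage(selected: List[Set[str]], weights: Dict[str, float]) -> float:
    """Weighted concept coverage: a monotone submodular set function.

    Hypothetical stand-in objective; the thesis combines salience,
    non-redundancy, readability and visual coverage instead.
    """
    covered = set().union(*selected)
    return sum(weights.get(concept, 0.0) for concept in covered)


def greedy_budgeted_summary(
    sentences: List[Set[str]],
    costs: List[int],
    weights: Dict[str, float],
    budget: int,
) -> List[int]:
    """Pick sentence indices greedily by marginal gain per unit cost.

    Assumes every cost is a positive integer (e.g. sentence length).
    """
    chosen: List[int] = []
    spent = 0
    remaining = set(range(len(sentences)))
    while remaining:
        current = coverage([sentences[i] for i in chosen], weights)
        best, best_ratio = None, 0.0
        for i in sorted(remaining):
            if spent + costs[i] > budget:
                continue  # sentence does not fit in the remaining budget
            candidate = [sentences[j] for j in chosen] + [sentences[i]]
            ratio = (coverage(candidate, weights) - current) / costs[i]
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break  # nothing fits, or nothing adds positive gain
        chosen.append(best)
        spent += costs[best]
        remaining.remove(best)
    return chosen
```

Dividing the marginal gain by the sentence cost keeps long sentences from crowding out several short ones that together cover more. The plain greedy rule shown here is a heuristic; variants with approximation guarantees typically also compare the result against the best single affordable item.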

Files in this item
File name/size	Document type	Version type	Access type	License
博士毕业论文_李浩然.pdf (2164KB)	Thesis		Restricted access	CC BY-NC-SA