Automatic summarization aims to analyze input information and compress it into a more concise expression while guaranteeing that the valuable content of the original input is covered. Automatic summarization can effectively help people find important or interesting content in complex information, thus improving the efficiency with which people access information. In recent years, automatic summarization has developed rapidly in the field of natural language processing, and many new directions and methods have emerged. The input of traditional automatic summarization systems is mostly monolingual text. Nowadays, with the rapid development of the Internet, information sources are often diverse, including text, images, videos, and audio. Multi-source information often carries a larger amount of information; by analyzing it, a summarization system can understand the focus of an event from different views and improve the quality of the generated summary. Therefore, automatic summarization of multi-source information has great research value.
This dissertation focuses on how to use multi-source information (including multilingual and multimodal information) to generate text summaries. First, this dissertation proposes an extractive multilingual multi-document automatic summarization method. Next, this dissertation develops an asynchronous multimodal summarization system supporting multiple languages. Finally, this dissertation makes a preliminary exploration of abstractive summarization for multimodal data. The main contributions of the dissertation are summarized as follows:
(1) Proposing a Guided Ranking Graph Model for Multilingual Multi-document Summarization
Multilingual multi-document summarization is the task of generating a summary in a target language from a collection of documents in multiple source languages. For a specific event, texts in different languages (such as news reports from different countries) report the event from different views, so a comprehensive analysis of texts in different languages can yield more valuable information about the event. The existing approach to this task automatically translates the non-target-language documents into the target language and then applies monolingual summarization methods, but the summaries generated this way are often poorly readable due to the low quality of machine translation. To solve this problem, we propose a novel graph model based on a guided edge-weighting method in which both the informativeness and the readability of summaries are fully taken into consideration. Methodologically, our model attempts to choose from the target-language documents those sentences that contain important information shared across languages, while also retaining salient sentences that are not covered by documents in other languages, thereby balancing readability and informativeness. This dissertation proposes three strategies to detect the correlation between sentences in different languages, based on cosine similarity, entailment detection, and a translation model, respectively. The experimental results show that the guided ranking model using the translation model for cross-lingual semantic correlation measurement achieves the best results, with over 10\% performance improvement on both the Chinese and English summarization tasks.
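The guided edge weighting can be illustrated as a PageRank-style recursion over target-language sentences, in which edges pointing to sentences supported by other languages receive higher weight. The following is a minimal sketch, assuming per-sentence cross-lingual correlation scores (e.g., cosine similarities over translated sentence vectors) have already been computed; all names and the exact weighting form are illustrative, not the dissertation's implementation.
\begin{verbatim}
import numpy as np

def guided_rank(sim, guidance, damping=0.85, iters=100, tol=1e-6):
    # sim      : (n, n) sentence-sentence similarity (target language)
    # guidance : (n,)   per-sentence cross-lingual salience in [0, 1]
    n = sim.shape[0]
    # Guided edge weighting: edges into cross-lingually supported
    # sentences carry more weight (illustrative choice).
    w = sim * guidance[np.newaxis, :]
    np.fill_diagonal(w, 0.0)
    row_sums = w.sum(axis=1, keepdims=True)
    w = np.divide(w, row_sums, out=np.zeros_like(w),
                  where=row_sums > 0)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = (1 - damping) / n + damping * (w.T @ scores)
        if np.abs(new - scores).sum() < tol:
            return new
        scores = new
    return scores
\end{verbatim}
The highest-scoring sentences would then be extracted, subject to a length budget, to form the target-language summary.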
(2) Proposing a Multi-modal Summarization Method for Asynchronous Text, Image, Audio and Video
The rapid increase in multimedia data transmitted over the Internet necessitates multimodal summarization from asynchronous collections of text, images, audio, and video. In this work, we propose an extractive multimodal summarization method that unites techniques from natural language processing, speech processing, and computer vision to exploit the rich information contained in multimodal data and to improve the quality of multimedia news summarization. The key idea is to bridge the semantic gaps between multimodal content. Audio and visual signals are the main modalities in video. For audio information, we design an approach that selectively uses its transcription and infers the salience of the transcription from the audio signal. For visual information, we learn joint representations of text and images using a neural network. Then, we measure the coverage of the generated summary over the important visual information through text-image matching or multimodal topic modeling. Finally, all the multimodal aspects are considered to generate a textual summary that maximizes salience, non-redundancy, readability, and coverage through budgeted optimization of submodular functions.
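The final selection step can be read as the classic greedy algorithm for budgeted submodular maximization with cost-scaled marginal gains. Below is a minimal sketch under that reading; the objective \texttt{f} and the sentence costs are placeholders for illustration, whereas the dissertation's actual objective combines salience, non-redundancy, readability, and visual coverage terms.
\begin{verbatim}
def budgeted_greedy(sentences, costs, f, budget, r=1.0):
    # Greedily build a summary S maximizing a submodular f(S)
    # subject to sum of costs <= budget, using scaled gains.
    selected, spent = [], 0.0
    remaining = set(range(len(sentences)))
    while remaining:
        base = f(selected)
        best, best_gain = None, 0.0
        for i in remaining:
            if spent + costs[i] > budget:
                continue
            # Scaled marginal gain: (f(S+i) - f(S)) / cost^r
            gain = (f(selected + [i]) - base) / (costs[i] ** r)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break
        selected.append(best)
        spent += costs[best]
        remaining.discard(best)
    return [sentences[i] for i in selected]
\end{verbatim}
Here \texttt{f} is any set function scoring a candidate summary (a list of sentence indices), and \texttt{costs} would typically be sentence lengths.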
In this dissertation, we first propose an automatic summarization task for asynchronous multimodal data, and we collect, annotate, and release a dataset for this task in both Chinese and English. This dissertation explores a variety of methods for processing audio, video, and images. Experimental results show that the guided ranking strategy is necessary for audio data, and that for visual data, our models based on image-text matching and multimodal topic distributions are well suited to the task. Finally, our proposed model based on multimodal data achieves a significant performance improvement over the text-only model.
(3) Proposing a Multi-modal Sentence Summarization Method with Modality Attention
The input of the traditional sentence summarization task is a piece of text (which can be regarded as a long sentence), and the output is a short summary. Compared with complex text, images tend to express the important information in the text more directly. In this work, we introduce a multimodal sentence summarization task that produces a short summary from a sentence-image pair. Multimodal sentence summarization is more challenging than sentence summarization: it not only needs to effectively incorporate visual features into a standard text summarization framework, but also needs to avoid the noise carried by images. To this end, we propose a modality-based attention mechanism that pays different degrees of attention to image patches, text units, and modalities, and we design image filters that selectively use visual information to enhance the semantics of the input sentence.
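The modality attention idea can be sketched as two intra-modality attentions (over text units and over image patches) followed by a modality-level attention over the resulting context vectors, with a scalar image-filter gate suppressing noisy visual input. The shapes, gate form, and parameter names below are assumptions for illustration only, not the dissertation's exact architecture.
\begin{verbatim}
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def modality_attention(dec_state, text_h, img_h, W_t, W_i, W_m, W_g):
    # dec_state: (d,) decoder state
    # text_h: (n_t, d) text unit states; img_h: (n_i, d) patch features
    # W_*:    (d, d) projection matrices (illustrative parameters)
    # Intra-modality attention over text units and image patches.
    a_text = softmax(text_h @ (W_t @ dec_state))   # (n_t,)
    c_text = a_text @ text_h                       # (d,)
    a_img = softmax(img_h @ (W_i @ dec_state))     # (n_i,)
    c_img = a_img @ img_h                          # (d,)
    # Image filter gate: scalar in (0, 1) damping noisy visuals.
    gate = 1.0 / (1.0 + np.exp(-dec_state @ (W_g @ c_img)))
    c_img = gate * c_img
    # Modality-level attention over the two context vectors.
    b = softmax(np.array([dec_state @ (W_m @ c_text),
                          dec_state @ (W_m @ c_img)]))
    return b[0] * c_text + b[1] * c_img
\end{verbatim}
The returned context vector would feed the decoder at each step, letting the model shift attention between the textual and visual modalities as it generates the summary.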
Based on a public sentence summarization dataset, we collect, annotate, and release a multimodal sentence summarization dataset. In this dissertation, we perform various comparative experiments, and the results show that the proposed multimodal sentence summarization method based on the modality attention mechanism achieves an improvement of over 7\% over the text-based model. Further analysis shows that visual information helps initialize the decoder, generate more abstractive summaries, and produce fewer duplicated words.