English Abstract

With the acceleration of global information exchange and the rapid development of multimedia technology, multi-source information on the Internet is growing day by day. Compared with single-source information, multi-source information provides richer content: scenes that combine pictures and text deliver more visual impact than plain text, and news reports on the same topic in different languages offer more perspectives. However, existing research on automatic summarization usually focuses on monolingual text, ignoring the correlations among multimodal and multilingual information. This causes a certain degree of information loss and prevents existing methods from producing high-quality summaries in multi-source scenarios. Motivated by this, this paper studies automatic summarization methods based on multi-source information fusion, aiming to integrate multimodal and multilingual information to improve summarization.
The main contributions of this paper are as follows.
1. Incorporating both intra-modal saliency and inter-modal relevance into an automatic evaluation method for pictorial summaries
To address the problem that existing automatic summarization evaluation methods cannot handle pictorial summaries, this paper proposes, for the first time, a pictorial summary evaluation method that takes both intra-modal saliency and inter-modal relevance into account. The method divides the evaluation into three aspects: text saliency, image saliency, and text-image relevance. The scores of these three aspects are combined by linear regression into a final overall score for the pictorial summary. Experiments show that, compared with existing automatic summarization evaluation methods, the proposed method correlates more strongly with human evaluation, indicating that it is better suited to evaluating pictorial summaries.
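The linear-regression combination described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names, the toy data, and the choice of an ordinary least-squares fit (with a bias term) are all assumptions.

```python
# Sketch: fit linear weights that map three per-summary sub-scores
# (text saliency, image saliency, text-image relevance) to a human
# overall score, then use them to score new pictorial summaries.
# All names and data are illustrative, not from the thesis.
import numpy as np

def fit_weights(sub_scores, human_scores):
    """Least-squares weights (plus bias) from sub-scores to human scores."""
    X = np.hstack([sub_scores, np.ones((len(sub_scores), 1))])  # bias column
    w, *_ = np.linalg.lstsq(X, human_scores, rcond=None)
    return w

def overall_score(text_sal, img_sal, relevance, w):
    """Weighted combination of the three sub-scores into one score."""
    return float(np.dot([text_sal, img_sal, relevance, 1.0], w))

# Toy data: rows = summaries, columns = (text_sal, img_sal, relevance).
subs = np.array([[0.8, 0.6, 0.7],
                 [0.4, 0.5, 0.3],
                 [0.9, 0.8, 0.9],
                 [0.2, 0.3, 0.1]])
human = np.array([0.75, 0.40, 0.90, 0.15])  # hypothetical human scores
w = fit_weights(subs, human)
```

With weights fitted on human judgments, `overall_score` then ranks candidate pictorial summaries without further human involvement.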
2. Incorporating a multimodal attention mechanism into the generation of pictorial summaries
To address the problem that current automatic summarization methods can only generate single-modal summaries from textual information, this paper proposes a pictorial summary generation method that incorporates a multimodal attention mechanism. The method encodes the semantic information of the image and the text simultaneously and better captures the alignment between them, enhancing key information through the semantic representations of the different modalities and generating more accurate pictorial summaries. Experiments show that, on an automatically constructed dataset, the proposed method outperforms traditional text summarization methods under both automatic and human evaluation. The method is general: it can be applied to scenes with no explicit alignment between images and texts, and can also be transferred to other application scenarios and tasks.
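One common way to realize such a multimodal attention is hierarchical: attend within each modality, then weight the resulting modality contexts by a second attention. The sketch below assumes this two-level design and invents all names and dimensions for illustration; it is not the thesis's architecture.

```python
# Sketch of hierarchical multimodal attention: a decoder query attends
# over text token encodings and image region encodings separately,
# then a modality-level attention fuses the two contexts.
# The design and all names here are illustrative assumptions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, feats):
    """Scaled dot-product attention of one query over one modality."""
    weights = softmax(feats @ query / np.sqrt(len(query)))
    return weights @ feats

def multimodal_context(query, text_feats, img_feats):
    """Within-modality attention, then across-modality fusion."""
    contexts = np.stack([attend(query, text_feats),
                         attend(query, img_feats)])
    modality_w = softmax(contexts @ query / np.sqrt(len(query)))
    return modality_w @ contexts

rng = np.random.default_rng(0)
q = rng.standard_normal(4)           # decoder hidden state
text = rng.standard_normal((6, 4))   # 6 text token encodings
img = rng.standard_normal((3, 4))    # 3 image region encodings
ctx = multimodal_context(q, text, img)
```

Because the fusion weights are computed from the data rather than fixed, no explicit alignment between images and text is required, matching the generality claimed above.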
3. Proposing an abstractive end-to-end multilingual summarization method
To address the problem of error propagation in pipeline-based multilingual summarization methods, this paper proposes an abstractive end-to-end multilingual summarization method. The method integrates the "translating" mode of cross-lingual summarization with the "copying" mode of monolingual summarization, enabling the construction of training data and the semantic fusion of mixed multilingual input in the zero-shot scenario. Experiments show that, compared with traditional pipeline-based multilingual summarization methods, the proposed method significantly improves summary quality.
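The blend of a "translating" mode and a "copying" mode at each decoding step can be sketched in the pointer-generator style: a gate mixes a distribution over the target vocabulary with a copy distribution scattered from the attention over source tokens. The gating scheme and all names below are assumptions for illustration, not the thesis's exact formulation.

```python
# Sketch of one decoding step that mixes "translating" (generate from
# the target vocabulary) with "copying" (point at source tokens),
# pointer-generator style. Names and numbers are illustrative.
import numpy as np

def mixed_output_dist(p_vocab, copy_attn, src_ids, vocab_size, p_translate):
    """Blend the vocabulary distribution with a copy distribution
    obtained by scattering copy attention onto source token ids."""
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, src_ids, copy_attn)  # sums mass for repeated ids
    return p_translate * p_vocab + (1.0 - p_translate) * p_copy

vocab_size = 10
p_vocab = np.full(vocab_size, 1.0 / vocab_size)  # uniform, for illustration
copy_attn = np.array([0.5, 0.3, 0.2])            # attention over 3 source tokens
src_ids = np.array([2, 7, 2])                    # token id 2 appears twice
dist = mixed_output_dist(p_vocab, copy_attn, src_ids, vocab_size, 0.6)
```

Training the gate end-to-end is what removes the separate translation stage of a pipeline, and with it the pipeline's error propagation.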