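Alternative Title: Automatic Summarization of Short Texts Generated by Social Media
Thesis Advisor: 刘成林
Degree Grantor: University of Chinese Academy of Sciences (中国科学院大学)
Place of Conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Degree Discipline: Pattern Recognition and Intelligent Systems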
Keyword: Social Media Short Texts Automatic Summarization; Sentence Scoring; Key-bigram Extraction; Similarity Measuring; Submodular Functions Optimization; Deep Learning
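Abstract: Automatic text summarization has great practical value. The central task of extractive summarization is to pick the most relevant information from the text, so extractive summarization can be cast as a sentence selection problem whose core components are sentence similarity measurement and sentence scoring. However, social media short texts are brief, noisy, poorly normalized, and severely sparse. As a result, sentence scoring methods from traditional document summarization cannot be applied to them directly, and single-granularity similarity measures based on the bag-of-words model or on deep sentence representations do not measure sentence similarity well. To address these problems, this thesis carries out the following work.

We propose an unsupervised microblog summarization method based on key-bigram extraction, which exploits the repetition of text fragments in microblogs. First, key-bigrams are extracted separately with hybrid TF-IDF, TextRank, and a topic model to characterize the fine-grained core subtopics under a microblog topic. Then, based on the extracted key-bigram set, sentence scoring (ranking) algorithms based on overlap similarity and on mutual information are proposed. Finally, top-ranked sentences that satisfy a redundancy condition are extracted in a greedy, iterative manner to form a summary of a specified length. Experiments on both the Sina Weibo and Twitter datasets show that the method effectively improves the ROUGE-1 score of the summaries, especially precision.

We study fusion-based techniques for improving key-bigram extraction and sentence ranking. To examine the semantic relations between bigrams more fully, we propose a key-bigram extraction algorithm based on local density, with TextRank cascaded before it to generate the candidate key-bigram set. When the summary is extracted, multiple ranking results are fused based on average rank and rank stability. Experimental results show that fusing multiple ranking results further improves summary quality.

We propose a short text summarization method that combines deep-learning-based multi-granularity similarity measurement with submodular function optimization. Extractive summarization is modeled as submodular function maximization under a knapsack constraint, jointly optimizing the coverage and diversity of the summary, and the objective function is improved with the deep-learning-based multi-granularity similarity. Experiments on the Opinosis dataset show that the proposed multi-granularity similarity measure is more robust than similarity computed from the bag-of-words model or from single-granularity deep sentence representations, and under the ROUGE-SU4 metric it exceeds the best result currently reported on that dataset.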
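The key-bigram pipeline described in the abstract (extract key-bigrams, score sentences by overlap, then greedily pick top-ranked sentences under a redundancy condition) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's formulas, so everything below is an assumption made for illustration: raw document frequency stands in for the hybrid TF-IDF, TextRank, and topic-model extractors, the overlap score is the fraction of a post's bigrams that are key-bigrams, redundancy is bigram-set Jaccard similarity, and the function names and the `max_redundancy` parameter are hypothetical.

```python
from collections import Counter


def bigrams(tokens):
    """Return the adjacent token pairs of a tokenized post."""
    return list(zip(tokens, tokens[1:]))


def extract_key_bigrams(posts, top_k):
    """Pick the top_k bigrams by document frequency.

    Assumption: a simple stand-in for the extractors named in the abstract.
    """
    counts = Counter(bg for post in posts for bg in set(bigrams(post)))
    return {bg for bg, _ in counts.most_common(top_k)}


def overlap_score(post, key_bigrams):
    """Fraction of the post's distinct bigrams that are key-bigrams (assumed form)."""
    bgs = set(bigrams(post))
    return len(bgs & key_bigrams) / len(bgs) if bgs else 0.0


def bigram_jaccard(a, b):
    """Jaccard similarity of two posts' bigram sets, used as the redundancy measure."""
    sa, sb = set(bigrams(a)), set(bigrams(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def summarize(posts, top_k=5, max_sentences=2, max_redundancy=0.5):
    """Greedily take top-ranked posts that are not too similar to those already chosen."""
    key = extract_key_bigrams(posts, top_k)
    ranked = sorted(posts, key=lambda p: overlap_score(p, key), reverse=True)
    summary = []
    for post in ranked:
        if len(summary) >= max_sentences:
            break
        if all(bigram_jaccard(post, s) <= max_redundancy for s in summary):
            summary.append(post)
    return summary
```

With the posts `[["a", "b", "c"], ["a", "b", "c"], ["x", "y"]]`, the duplicate second post is skipped by the redundancy check and the third post fills the remaining slot, which shows why ranking alone is not enough for highly repetitive microblog streams.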
Other Abstract: In recent years, social media platforms such as Twitter and Sina Weibo have generated massive volumes of short texts, which creates a problem of information overload. Automatic summarization techniques can extract the valuable information from this volume of text and compress it into a summary. An effective approach to extractive summarization is to select relevant and salient information from the texts, which can be formulated as a sentence selection problem. The performance of sentence selection relies on sentence similarity measurement and sentence scoring. However, short texts generated by social media suffer from noise, non-standard grammar, and severe sparsity. Therefore, applying traditional sentence scoring and Bag-of-Words (BoW) based representation methods directly to short texts does not work satisfactorily. We address the problem as follows.

First, we propose an unsupervised microblog summarization method based on key-bigram extraction. We extract key-bigrams with hybrid TF-IDF, TextRank, and a topic model separately to discover the salient subtopics of a set of topic-related posts. Then, we score sentences against the key-bigram set by considering either their overlap similarity or their mutual information. Top-ranked sentences are iteratively selected, with redundancy removal, to form the summary. In experiments on Sina Weibo and Twitter datasets, our key-bigram-based summarizer improves the ROUGE-1 score, especially precision.

To improve key-bigram extraction and sentence ranking, we propose extracting key-bigrams based on local density. The distance between bigrams is measured from their topic distributions as estimated by a topic model, and a quick search algorithm is applied to calculate the local density of each bigram. A TextRank-based extractor is cascaded before it to generate the candidate key-bigram set. For sentence extraction, two ranking results are merged by considering the average and the stability of each sentence's rank. Experimental results show that merging two ranking results improves summary quality compared with a single ranking and performs better than the single TextRank-based method.

Further, we propose a short text summarization method that combines deep-learning-based multi-granularity similarity with submodular function optimization. The extractive summarization task is modeled as budgeted maximization of submodular functions, optimizing the coverag...
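The budgeted submodular formulation mentioned above can be sketched in Python as follows. This is not the thesis's objective, which the truncated abstract does not spell out. As assumptions, the sketch uses a facility-location coverage function as the submodular objective, token-set Jaccard similarity as a placeholder for the deep multi-granularity similarity, token count as the sentence cost, and a common cost-scaled greedy heuristic with a best-singleton check. All function names and the exponent parameter `r` are hypothetical.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity; a placeholder for a learned similarity."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def coverage(selected, sentences, sim):
    """Facility-location objective: each sentence is credited with its best match in the selection."""
    if not selected:
        return 0.0
    return sum(max(sim(s, sentences[j]) for j in selected) for s in sentences)


def budgeted_greedy(sentences, budget, sim=jaccard, r=1.0):
    """Select sentence indices by gain per (cost ** r) without exceeding the budget."""
    costs = [len(s) for s in sentences]  # assumed cost: number of tokens
    selected, used, current = [], 0, 0.0
    remaining = set(range(len(sentences)))
    while remaining:
        best, best_ratio, best_value = None, 0.0, current
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            if costs[i] == 0 or used + costs[i] > budget:
                continue
            value = coverage(selected + [i], sentences, sim)
            ratio = (value - current) / (costs[i] ** r)
            if ratio > best_ratio:
                best, best_ratio, best_value = i, ratio, value
        if best is None:  # nothing affordable improves the objective
            break
        selected.append(best)
        used += costs[best]
        current = best_value
        remaining.discard(best)
    # Fall back to the best affordable single sentence if it beats the greedy set.
    singles = [i for i in range(len(sentences)) if 0 < costs[i] <= budget]
    if singles:
        top = max(singles, key=lambda i: coverage([i], sentences, sim))
        if coverage([top], sentences, sim) > current:
            return [top]
    return selected
```

For three sentences where the second duplicates the first, a budget of six tokens selects the first and third: adding the duplicate yields zero marginal gain, so the objective itself discourages redundancy without a separate filtering step.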
Other Identifier: 201228014628057
Document Type: Thesis
Recommended Citation
GB/T 7714
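吴玉芳 (Wu Yufang). 社交媒体短文本自动摘要 (Automatic Summarization of Short Texts Generated by Social Media) [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2015.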
Files in This Item:
File Name/Size: CASIA_20122801462805 (2767KB)
Access: Not yet open
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.