Automatic Summarization of Short Texts Generated by Social Media (社交媒体短文本自动摘要)

Title	社交媒体短文本自动摘要
Other title	Automatic Summarization of Short Texts Generated by Social Media
Author	Wu Yufang (吴玉芳)
Date	2015-05-21
Degree type	Master of Engineering
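Chinese abstract	Automatic text summarization has significant practical value. Selecting the most relevant information from text is the core of extractive summarization, so extractive summarization can be cast as a sentence selection problem whose key components are sentence similarity measurement and sentence scoring. However, social media short texts are brief, noisy, poorly normalized, and severely sparse, so the sentence scoring methods of traditional document summarization cannot be applied to them directly, and single-granularity similarity measures based on the bag-of-words model or on deep sentence representations do not measure sentence similarity well. To address these problems, this thesis carries out the following work. First, it proposes an unsupervised microblog summarization method based on key-bigram extraction, which exploits the repetition of text fragments in microblogs. Key-bigrams are extracted separately with hybrid TF-IDF, TextRank, and a topic model to characterize the fine-grained core subtopics under a microblog topic. Then, based on the extracted key-bigram set, sentence scoring (ranking) algorithms based on overlap similarity and on mutual information are proposed. Finally, top-ranked sentences that satisfy a redundancy constraint are selected in a greedy, iterative manner to form a summary of a specified length. Experiments on both Sina Weibo and Twitter datasets show that the method effectively improves the ROUGE-1 score of the summaries, especially precision. Second, it studies fusion-based techniques for improving key-bigram extraction and sentence ranking. To examine the semantic relations between bigrams more fully, a key-bigram extraction algorithm based on local density is proposed, and TextRank is cascaded before it to generate the candidate key-bigram set. During summary extraction, multiple ranking results are fused based on average rank and rank stability. Experimental results show that fusing multiple ranking results further improves summary quality. Third, it proposes a short-text summarization method that combines deep-learning-based multi-granularity similarity with submodular function optimization. Extractive summarization is modeled as submodular function maximization under a knapsack constraint, jointly optimizing the coverage and diversity of the summary, and the objective function is improved with the deep-learning-based multi-granularity similarity. Experiments on the Opinosis dataset show that the proposed multi-granularity similarity measure is more robust than similarity computed from the bag-of-words model or from single-granularity deep sentence representations, and that it exceeds the best result reported to date on this dataset under the ROUGE-SU4 metric.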
English abstract	In recent years, social media platforms such as Twitter and Sina Weibo have generated massive volumes of short texts, which causes information overload. Automatic summarization techniques can extract the valuable information from this vast volume of text and compress it into a summary. An effective approach to extractive summarization is to select relevant and salient information from texts, which can be formulated as a sentence selection problem. The performance of sentence selection relies on sentence similarity measurement and sentence scoring. However, short texts generated by social media suffer from noise, non-standard grammar, and severe sparsity. Therefore, applying traditional sentence scoring and Bag-of-Words (BoW) based representation methods directly to short texts does not work satisfactorily. We address the problem as follows. First, we propose an unsupervised microblog summarization method based on key-bigram extraction. We extract key-bigrams with hybrid TF-IDF, TextRank, and a topic model separately to discover the salient subtopics of a set of topic-related posts. Then, we score sentences against the key-bigram set by considering either the overlap similarity or the mutual information between them. Top-ranked sentences are iteratively selected for the summary, with redundancy removal. In experiments on Sina Weibo and Twitter datasets, our key-bigram-based summarizer is shown to achieve superior ROUGE-1 scores, especially in precision. To improve key-bigram extraction and sentence ranking results, we propose to extract key-bigrams based on local density. The distance between bigrams is measured by their topic distributions, estimated by a topic model, and a quick search algorithm is applied to calculate the local density of each bigram. A TextRank-based extractor is cascaded before it to generate the candidate key-bigram set. As for sentence extraction, two ranking results are merged by considering the average value and the stability of each sentence's rank. Experimental results show that merging two ranking results improves summary quality compared with a single ranking and performs better than the single TextRank-based method. Further, we propose a short text summarization method that combines deep learning-based multi-granularity similarity with submodular function optimization. The extractive summarization task is modeled as budgeted maximization of submodular functions, optimizing the coverag...
Keywords	Social Media Short Texts; Automatic Summarization; Sentence Scoring; Key-bigram Extraction; Similarity Measuring; Submodular Functions Optimization; Deep Learning
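Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/7762
Collection	Graduates_Master's Theses
Recommended citation (GB/T 7714)	Wu Yufang. Automatic Summarization of Short Texts Generated by Social Media [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2015.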
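
The first contribution described in the abstract (key-bigram extraction, overlap-based sentence scoring, greedy selection with a redundancy check) might look roughly like the sketch below. This is not the thesis implementation. The weighting in `extract_key_bigrams` is a simple TF-IDF-style stand-in for the hybrid TF-IDF, TextRank, and topic-model extractors, the redundancy test uses token-level Jaccard similarity, and all function names, parameters, and thresholds (`top_k`, `max_posts`, `max_redundancy`) are assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return the adjacent token pairs of one tokenized post."""
    return list(zip(tokens, tokens[1:]))


def extract_key_bigrams(posts, top_k=3):
    """Keep the top_k bigrams under an assumed TF-IDF-style weight."""
    tf = Counter(bg for post in posts for bg in bigrams(post))
    df = Counter(bg for post in posts for bg in set(bigrams(post)))
    n = len(posts)
    # Assumed weighting: collection frequency times a smoothed
    # inverse post frequency (a stand-in, not the thesis formula).
    score = {bg: tf[bg] * math.log(1 + n / df[bg]) for bg in tf}
    ranked = sorted(score, key=lambda bg: (-score[bg], bg))
    return set(ranked[:top_k])


def overlap_score(post, key_bigrams):
    """Fraction of the post's distinct bigrams that are key-bigrams."""
    bgs = set(bigrams(post))
    return len(bgs & key_bigrams) / len(bgs) if bgs else 0.0


def jaccard(a, b):
    """Token-level Jaccard similarity, used here as the redundancy test."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def summarize(posts, max_posts=2, top_k=3, max_redundancy=0.5):
    """Greedily pick top-scored posts that are not too similar to earlier picks."""
    keys = extract_key_bigrams(posts, top_k)
    order = sorted(range(len(posts)),
                   key=lambda i: (-overlap_score(posts[i], keys), i))
    chosen = []
    for i in order:
        if len(chosen) == max_posts:
            break
        if all(jaccard(posts[i], posts[j]) <= max_redundancy for j in chosen):
            chosen.append(i)
    return chosen  # indices into posts, in selection order
```

Posts are passed in as already-tokenized lists of strings. The summary length is capped by post count here only to keep the sketch short; a word or character budget would be closer to the "specified length" the abstract mentions.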

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20122801462805 (2767KB)			Not yet open	CC BY-NC-SA
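
The abstract's third contribution models extractive summarization as budgeted (knapsack-constrained) submodular maximization over coverage and diversity. The sketch below shows one common way such an objective and a cost-scaled greedy selection can be written. It is an illustration under stated assumptions, not the thesis method: token-level Jaccard similarity stands in for the deep-learning-based multi-granularity similarity, sentence cost is taken as token count, the clusters are supplied by the caller, and the names and parameters (`alpha`, `lam`, `r`, `greedy_budgeted`) are hypothetical.

```python
import math


def jaccard(a, b):
    """Token-level Jaccard similarity (stand-in for the thesis similarity)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def objective(selected, sim, clusters, alpha=0.5, lam=1.0):
    """Coverage plus lam times diversity; both terms are monotone submodular."""
    n = len(sim)
    # Coverage: how well the selection covers each sentence, truncated
    # at a fraction alpha of that sentence's total similarity mass.
    coverage = sum(
        min(sum(sim[i][j] for j in selected), alpha * sum(sim[i]))
        for i in range(n)
    )
    # Diversity: the square root gives diminishing returns for picking
    # several sentences from the same cluster.
    diversity = sum(
        math.sqrt(sum(sum(sim[j]) / n for j in selected if j in cluster))
        for cluster in clusters
    )
    return coverage + lam * diversity


def greedy_budgeted(sentences, clusters, budget, r=1.0):
    """Greedy selection by objective gain divided by cost**r, under a budget."""
    sim = [[jaccard(a, b) for b in sentences] for a in sentences]
    cost = [max(1, len(s)) for s in sentences]  # assumed cost: token count
    selected, used = [], 0
    remaining = set(range(len(sentences)))
    while remaining:
        base = objective(selected, sim, clusters)

        def scaled_gain(k):
            return (objective(selected + [k], sim, clusters) - base) / cost[k] ** r

        best = max(sorted(remaining), key=scaled_gain)
        remaining.remove(best)
        # Keep the candidate only if it fits the budget and helps.
        if used + cost[best] <= budget and scaled_gain(best) > 0:
            selected.append(best)
            used += cost[best]
    return selected  # sentence indices, in selection order
```

With two clusters of near-duplicate sentences and a budget of two sentences, the diversity term steers the second pick toward the cluster that is not yet represented. A full budgeted greedy procedure would usually also compare the result against the best single sentence that fits the budget; that step is omitted here for brevity.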