Topic Analysis and Evolution Modeling for Social Media Short Texts

	Topic Analysis and Evolution Modeling for Social Media Short Texts
	张育浩 (Zhang Yuhao)1,2
	2017-05-28
Degree type	Doctor of Engineering
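Abstract (Chinese, translated)	In recent years social media has developed rapidly and become an important channel through which people communicate, express opinions, and find and share information. Much social media data takes the form of short texts, such as microblog posts and status updates. Topic analysis and evolution modeling of these short texts can help people analyze and understand social media data, follow trending issues, and gauge public opinion, which gives it research and practical value in areas such as government decision-making, public security, and business intelligence. This thesis takes social media short texts as its data and studies topic analysis and evolution modeling, an important problem in text mining. It approaches the problem from three aspects: topic detection, topic hierarchy construction, and topic evolution modeling. The proposed methods are validated on real social media datasets, including a Twitter dataset and a Weibo dataset. The main contributions are:
1. Topic detection. Previous methods require the number of topics to be set manually and are strongly affected by the data sparsity of short texts. Combining nonparametric Bayesian methods with word co-occurrence modeling, the thesis proposes npCTM, a nonparametric topic detection model suited to social media short texts that determines the number of topics automatically. To improve the quality of the detected topics, the model incorporates the word co-occurrence information of the document collection together with a distribution over word types, which lets it separate the background topic from ordinary topics. Experiments on real social media datasets verify the effectiveness of the method for short-text topic detection.
2. Topic hierarchy construction. Previous methods have difficulty expressing complex semantics, setting the parameters of the hierarchy, and handling short texts. The thesis proposes TCM, a method suited to social media short texts that builds the topic hierarchy automatically. It exploits the fact that tree structures express hierarchical semantics clearly and builds the hierarchy by growing trees from the bottom up. Starting from the detected topics, it defines inter-tree and intra-tree similarity measures for topic trees, designs merge modes for topic trees based on them, and obtains the complete topic hierarchy by iterating. Experiments on real social media datasets verify the effectiveness of the method for short-text topic hierarchy construction.
3. Topic evolution. Previous methods require the number of topics to be set manually, use the same number of topics in every time period, cannot adjust that number to the text content, and are strongly affected by the data sparsity of short texts. The thesis proposes sdTEM, a nonparametric topic evolution model suited to social media short texts that determines the number of topics in each time period automatically. Combining word embeddings with nonparametric Bayesian methods, it introduces a new nonparametric prior for topic evolution, the recurrent semantic dependent Chinese restaurant process (rsdCRP), which helps the model determine the number of topics in each period. sdTEM is then built by combining rsdCRP with the word co-occurrence modeling approach suited to short texts. Experiments on real social media datasets verify the effectiveness of the method for short-text topic evolution modeling.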
Abstract (English)	With its rapid growth in recent years, social media has become an important platform for communication, expressing opinions, and seeking and sharing information. Social media texts are usually short, such as microblog posts and status messages. Topic analysis and evolution modeling of social media short texts can help people analyze and understand social media data, grasp essential information, and follow public opinion. It is of research significance and application value in domains such as government decision-making, public security, and business intelligence. This thesis focuses on topic analysis and evolution modeling of social media short texts, an important theme in text mining. The research covers three main aspects: topic detection, topic hierarchy construction, and topic evolution modeling. We use social media datasets, including a Twitter dataset and a Sina Weibo dataset, to verify the effectiveness of the proposed methods. The main contributions of this thesis are as follows:
1. In topic detection, existing methods either need the number of topics to be set manually or suffer from the data sparsity of short texts. To address these problems, we propose npCTM, a nonparametric topic detection model that is suited to social media short texts and determines the number of topics automatically. We construct the model by combining nonparametric Bayesian methods with word co-occurrence modeling. To improve the quality of the detected topics, the model distinguishes the background topic from other topics by jointly considering the distribution of word types for each word and the word co-occurrence information from the short texts. Experiments on a real-world social media dataset verify the effectiveness of the method for topic detection on short texts.
2. In topic hierarchy construction, existing methods have problems expressing complex content, setting the parameters of the hierarchy, and handling short texts. To address these problems, we propose TCM, a topic hierarchy construction method that is suited to social media short texts and constructs the topic hierarchy automatically. The method takes advantage of the tree structure, which is well suited to representing a semantic hierarchy, and builds the hierarchy by combining topic trees in a bottom-up way. We use the detected topics as basic topic trees and design tree combination modes based on the inter-tree and intra-tree similarity measures we propose. We then construct the topic hierarchy by combining topic trees iteratively. Experiments on a real-world social media dataset verify the effectiveness of the method for topic hierarchy construction on short texts.
3. In topic evolution modeling, existing methods either suffer from the data sparsity of short texts or need the number of topics to be set manually, in which case it stays the same across time epochs and cannot be adjusted to the content. To address these problems, we propose sdTEM, a topic evolution model that is suited to social media short texts and determines the number of topics in each time epoch automatically. We propose the recurrent semantic dependent Chinese restaurant process (rsdCRP) by combining word embeddings with nonparametric Bayesian methods. We use rsdCRP as the prior for topic evolution, which helps determine the number of topics in each time epoch automatically. We then construct sdTEM by combining rsdCRP with word co-occurrence modeling, which is suited to short texts. Experiments on a real-world social media dataset verify the effectiveness of the method for topic evolution modeling on short texts.
Keywords	Social media analysis; text mining; topic analysis and evolution modeling
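Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/14706
Collection	Graduates_Doctoral Dissertations
Author affiliations	1. Institute of Automation, Chinese Academy of Sciences 2. University of Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	张育浩 (Zhang Yuhao). Topic Analysis and Evolution Modeling for Social Media Short Texts[D]. Beijing. University of Chinese Academy of Sciences, 2017.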
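
The abstract says that the nonparametric models determine the number of topics automatically and names a Chinese restaurant process (CRP) variant, rsdCRP, as the prior for topic evolution. As a rough illustration of why a CRP prior leaves the number of topics open, the sketch below samples from the standard, textbook CRP. It is not the thesis's npCTM, sdTEM, or rsdCRP: it has no word embeddings, no dependence between time epochs, and no text likelihood. The function name `crp_assignments` and the parameter `alpha` are assumptions made for this example.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample table (topic) labels for n_items from a standard CRP.

    alpha > 0 is the concentration parameter (hypothetical name):
    larger values make opening a new table more likely.
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of items already seated at table k
    assignments = []  # assignments[i] = table chosen by item i
    for _ in range(n_items):
        # An existing table k is chosen with weight counts[k];
        # a new table is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new table (a new topic)
        else:
            counts[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    labels, sizes = crp_assignments(n_items=200, alpha=1.0, seed=1)
    print("number of tables:", len(sizes))
    print("table sizes:", sizes)
```

The number of tables is an outcome of the sampling rather than an input, which is the property the abstract relies on. In a full topic model the seating weights would also be multiplied by how well each topic explains the text being assigned; that part is omitted here.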

Files in this item
File name/size	Document type	Version type	Access type	License
面向社会媒体短文本的话题分析与演化建模.（1664KB）	Thesis		Restricted access	CC BY-NC-SA