User-oriented Harmful Information Detection Technology on the Internet

Title	以用户为中心的互联网不良信息检测技术
Alternative Title	User-oriented Harmful Information Detection Technology on the Internet
Author	朱明亮
Date	2010-06-03
Degree Type	Doctor of Engineering
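Chinese Abstract	Computer networks have penetrated every field of human activity and have had a profound impact on social development. However, the many kinds of harmful information that are widespread on the Internet can cause physical and psychological harm to Internet users, especially minors, and disrupt the normal order of the network. Research on Internet harmful information detection technology is therefore urgently needed. At the same time, the Internet is becoming increasingly user-centered. Broad user participation promotes information sharing but also raises a variety of personalized demands, which requires the relevant technologies to adapt to this user-centered environment. This thesis studies user-oriented Internet harmful information detection technology. The user-centered characteristic is reflected in every stage of the detection process, including the indexing and high-level modeling of user data, users' personalized demands for detecting various kinds of harmful information, and the user-perception properties of web page structure. The main work and contributions of this thesis are: 1. A topic detection based method for modeling user data in discussion forums, together with a topic detection framework that incorporates user behavior. The framework applies informativeness checking and position weighting to the discussion text, models user behavior with the UF-ITUF model, and finally fuses the results of text analysis and user behavior analysis to obtain more accurate topic classification. 2. A topic detection method for discussion forums that incorporates external knowledge sources. Introducing an external knowledge source makes it possible to learn prior relationships among topic terms effectively, which addresses the problem of implicit or ambiguous topic terms. Two methods for integrating external knowledge sources are proposed: LDA and Concept Mapping. 3. An instance-driven, customizable harmful information recognition algorithm. Users express their personalized demands through web page instances, but the number of such instances is often insufficient. With the help of an automatically crawled training set of unlabeled web pages, a semi-supervised learning strategy extends the user's instances onto the larger unlabeled training set, and a stable, efficient Bayes classifier is then trained on the extended training set to recognize target samples. 4. The VS-LFA algorithm for hyperlink function analysis. Hyperlink function analysis can overcome the shortcomings of current approaches based on the hyperlink homogeneity assumption. The VS-LFA algorithm extracts visual features and structural features of hyperlinks and whole-page statistical features, and classifies target samples with SVM and random forest classifiers, which provide soft-label and hard-label outputs respectively. 5. An active search algorithm for harmful information based on hyperlink function analysis. The algorithm uses hyperlink function analysis to predict the relevance of target web pages, so that hyperlinks more likely to lead to harmful information are crawled first. Experiments show that the algorithm clearly improves on existing topical crawling algorithms and enables fast search for harmful information. 6. A prototype system for Internet harmful information recognition. The system implements network data acquisition and preprocessing, web page pre-classification and data routing, extensible fusion of multimodal information recognition results, and blocking of harmful information. It can effectively recognize harmful information in real time in a real network environment.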
English Abstract	The Internet keeps growing, and its impact is becoming more powerful. However, all kinds of harmful information also spread through the Internet, harming users, especially minors. Research on harmful information detection technology is therefore urgently needed. The Internet has also shown a user-oriented trend. Users have become not only receivers but also collaborative publishers, which requires applications to adapt to such a user-oriented environment. In this thesis, we study several key problems in user-oriented Internet harmful information detection, focusing on user data adaptation, user demand customizability, and user perception of the Internet structure. The main contributions of this thesis are as follows: 1. A topic detection based method for modeling online discussion data and a framework for topic detection that combines content and user analysis. Because discussion data is loose and noisy, several extensions are introduced, including content informativeness detection, term position weighting, UF-ITUF modeling of user activities, and a two-level fusion strategy for content and user analysis. 2. A method for integrating an external knowledge base into topic detection of discussion data. Semantic relationships among terms can be learned from the knowledge base to handle implicit or ambiguous terms. Two approaches for knowledge integration are introduced: the LDA approach and the Concept Mapping approach. 3. A customizable instance-driven harmful information detection algorithm. Users can express their customization demands through web page instances. The number of user instances is usually very small, so semi-supervised learning is introduced to extend the user instances onto a large unlabeled training set, and finally a Bayes classifier is trained on the extended training set. 4. The VS-LFA algorithm for link function analysis. Link function analysis addresses the limitations of traditional modeling that assumes a single link type. The VS-LFA algorithm extracts link visual features, structural features, and whole-page statistics to simulate user perception of links. SVM and random forest are used to handle soft and hard classification respectively. 5. An active search algorithm for harmful information. The algorithm predicts the harmfulness of the link target by link function analysis, and downloads first the targets that are more likely to be harmful. Experiments show that our algorithm improves the topical search, so that onli...
Keywords	Internet Harmful Information Detection; Topic Detection; External Knowledge Source; Customization; Link Function Analysis; Active Search of Information; Prototype System
Language	Chinese
Document Type	Dissertation
Item Identifier	http://ir.ia.ac.cn/handle/173211/6288
Collection	Graduates_Doctoral Dissertations
Recommended Citation (GB/T 7714)	朱明亮. 以用户为中心的互联网不良信息检测技术[D]. 中国科学院自动化研究所. 中国科学院研究生院,2010.

Files in This Item
File Name/Size	Document Type	Version Type	Access Type	License
CASIA_20071801462910 (6122KB)			Restricted access	CC BY-NC-SA