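Alternative Title: Content-based Recognition and Filtering of Web Sensitive Information
Thesis Advisor: Hu Weiming (胡卫明)
Degree Grantor: Graduate University of the Chinese Academy of Sciences (中国科学院研究生院)
Place of Conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Degree Discipline: Pattern Recognition and Intelligent Systems
Keywords: Web Information Filtering; Information Fusion; Filtering System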
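Abstract: The Internet is one of the largest information repositories today. The timeliness of its information publishing and its global connectivity give it an enormous influence on the development of society as a whole. With the rapid development of Internet-related technologies, it has reached every aspect of daily life and has had a revolutionary effect on society. While the network offers people unprecedented convenience, it also makes it easy for harmful information to spread widely. The effect of such information, and of sensitive information in particular, on society and especially on minors is attracting growing concern. Cleaning up the network environment by effectively recognizing and filtering harmful information has become an urgent problem.
Because sensitive information filtering must build on efficient recognition of sensitive information, this thesis approaches the problem from three directions: first, understanding and recognizing sensitive text on the Internet; second, fusing text and images to recognize sensitive web pages; third, designing and building practical filtering systems. The main contributions and work are:
(1) A sensitive text recognition algorithm based on semantics and statistics is proposed. Semantic analysis of the keywords divides the keyword set into three categories, each of which is given a descriptive definition. Drawing on cellular neural network theory, a cellular-neural-network-like structure is built to describe the relationships among the three keyword categories and to extract the sensitive semantic features of a text. Finally, a classifier is built using statistical machine learning theory.
(2) An algorithm that uses web structure information to fuse images and text is proposed. Image and text information sit on the same web page in an ordered arrangement, and this ordering carries rich semantic information. On this basis, web pages are divided into three types. Observation and analysis show that only image-dominated pages need information fusion. After web mining techniques are used to preprocess the web information, the problem becomes a decision problem: given known class priors, judge whether a set is sensitive. A decision formula is derived from Bayes' theorem. The formula reflects the characteristics of web pages and performs well in practice.
(3) A reasonable web information filtering framework is proposed. Based on the division of web pages into three types, the framework filters all three forms of page well, overcoming the limitation of existing methods, which can generally filter only one specific type of page.
(4) A sensitive information filtering plug-in is designed and implemented.
(5) An active search system for sensitive information is designed and implemented.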
Other Abstract: The Internet is currently one of the largest information resources. Thanks to its global connectivity, people all over the world can freely publish and obtain the information they want. Internet-related technology is developing very quickly; it has greatly changed our lives and brought a revolution to society as a whole. People now enjoy the convenience the Internet provides. However, it also brings harmful content such as pornography, violence, and other illegal messages. Such content has a serious influence on society, especially on young people. Sensitive information recognition and filtering is therefore of great importance and has recently become one of the most active research topics.
This thesis focuses on three aspects: first, web sensitive text recognition; second, text and image fusion for web sensitive information recognition; third, the implementation of filtering systems. The main contributions of this thesis are:
(1) We propose a web sensitive text recognition algorithm that combines semantics and statistics. Based on an analysis of the semantics of sensitive text, we divide the keyword set into three subsets and give a descriptive definition of each. We then construct a CNN-like (cellular-neural-network-like) word net to extract text features and use an SVM as the classifier.
(2) We propose a novel text and image fusion algorithm that uses web structure data. Text and images are arranged in a web page in an order governed by certain rules, and this order conveys a great deal of semantic information. Based on this analysis, we divide web pages into three classes and find that only image-dominated pages need fusion. The fusion problem is then transformed into a set recognition problem, and we use Bayes' theorem to derive a fusion formula.
(3) We propose a reasonable framework for web sensitive information filtering. Existing methods can filter only certain kinds of sensitive web pages. Our framework is based on the classification of web pages and can filter all three classes.
(4) We implemented a browser plug-in that blocks sensitive information in a timely manner.
(5) We implemented an active search engine for web pages that contain sensitive information.
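The abstract does not give the Bayes fusion formula from contribution (2), so the following is only a minimal sketch of how a set-level Bayesian decision of that general kind could look, not the thesis's own formula. It assumes, hypothetically, that each image on an image-dominated page has already been given a binary flag by some image classifier, that the flags are conditionally independent given the page class, and that the class prior and the two flag rates are known. The function name and every parameter are illustrative.

```python
import math


def page_posterior(image_flags, prior=0.5,
                   p_flag_given_sensitive=0.9,
                   p_flag_given_normal=0.1):
    """Posterior probability that a page is sensitive, given per-image flags.

    Hypothetical model (not taken from the thesis): each image flag is an
    independent Bernoulli draw whose rate depends only on the page class.
    """
    for p in (prior, p_flag_given_sensitive, p_flag_given_normal):
        if not 0.0 < p < 1.0:
            raise ValueError("probabilities must lie strictly between 0 and 1")

    # Start from the prior log-odds and add one log-likelihood ratio per image.
    log_odds = math.log(prior / (1.0 - prior))
    for flagged in image_flags:
        if flagged:
            log_odds += math.log(p_flag_given_sensitive / p_flag_given_normal)
        else:
            log_odds += math.log((1.0 - p_flag_given_sensitive)
                                 / (1.0 - p_flag_given_normal))

    # Numerically stable logistic function, safe for very long image lists.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def is_sensitive(image_flags, threshold=0.5, **model):
    """Decide by comparing the posterior against an illustrative threshold."""
    return page_posterior(image_flags, **model) > threshold
```

Under these assumptions a page with no images simply returns the prior, and each flagged image moves the log-odds by a fixed amount, so the decision depends on the whole set of images rather than on any single one.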
Other Identifier: 200328014604151
Document Type: Thesis
Recommended Citation
GB/T 7714
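Wu Ou (吴偶). Content-based Recognition and Filtering of Web Sensitive Information (基于内容的Web敏感信息识别与过滤) [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of the Chinese Academy of Sciences, 2006.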
Files in This Item:
File Name/Size: CASIA_20032801460415 (14750KB)
Access: Not currently open
License: CC BY-NC-SA