互联网毒品类信息过滤研究
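Alternative title: Research on Filtering Drug Information on the Internet
Author: He Zhu (贺主)
Degree type: Master of Engineering
Advisor: Hu Weiming (胡卫明)
2010-06-03
Degree-granting institution: Graduate University of the Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Major: Pattern Recognition and Intelligent Systems
Keywords: Webpage Filtering; Text Classification; Machine Learning; Supervised Learning; Semi-supervised Learning; Adaboost; SVM; One-class SVM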
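Abstract: This thesis addresses the problem of identifying drug-related information on the Internet. It studies several machine learning methods in current international use, covering supervised learning, semi-supervised learning, and other areas, and applies them to real-world problems. The main contributions are as follows.
First, we propose an algorithm for identifying prohibited-drug web pages in massive data. The number of pages in a search engine's index is enormous, and labeling enough training data by hand is infeasible, so traditional machine learning and text classification algorithms are difficult to apply directly. We combine indexing with a divide-and-conquer strategy to design a multi-layer identification framework. Experimental results show that the algorithm strikes a good balance between identification accuracy and running efficiency on large data sets, and can effectively monitor web pages that sell stimulants online.
Second, we design and implement an algorithm for identifying and filtering drug information on the Internet based on Adaboost and latent semantic indexing. We first introduce the basic idea and important properties of Boosting, then apply real-valued Adaboost to the task, proposing improvements in particular to decision-stump construction and information fusion. As a result, the system achieves a very low false alarm rate while maintaining a relatively high detection rate. In addition, Adaboost-based identification of harmful web pages has quite low computational complexity, so it can be retrained frequently to adapt to the complex Internet environment, which makes it promising for practical use.
Third, we propose a semi-supervised learning framework based on real-valued Adaboost for identifying and filtering drug web pages. In this classification problem, the available labeled sample set is usually relatively small, and labeling samples by hand is very expensive in manpower and resources. The Internet, however, offers a large number of unlabeled samples, so making use of them is an important and effective approach. Our study shows that semi-supervised learning with unlabeled samples is highly effective, and that a hierarchical classification framework clearly improves classification results. Semi-supervised learning greatly reduces the burden of manual labeling and, compared with supervised learning, effectively improves identification results. Because our semi-supervised active learning framework has many desirable properties, it merits further study.
In summary, this thesis makes some useful explorations of machine learning methods themselves and of their application to identifying drug information on the Internet.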
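To make the real-valued Adaboost with decision stumps mentioned in the abstract more concrete, here is a minimal sketch of the textbook formulation, not the thesis implementation. It assumes numeric feature vectors (for example keyword counts, which is our assumption) and labels in {-1, +1}. The function names `fit_real_adaboost`, `score` and `predict` are hypothetical, and the thesis's own improvements to stump construction and information fusion are not reproduced.

```python
import math

def fit_real_adaboost(X, y, rounds=5, eps=1e-6):
    """Textbook real-valued AdaBoost with threshold stumps.
    X: list of numeric feature vectors; y: labels in {-1, +1}."""
    n = len(X)
    weights = [1.0 / n] * n
    stumps = []
    for _ in range(rounds):
        best = None
        for f in range(len(X[0])):
            values = sorted(set(row[f] for row in X))
            # Candidate thresholds: midpoints between consecutive distinct values.
            for low, high in zip(values, values[1:]):
                t = (low + high) / 2.0
                # mass[bin][cls]: weight of class cls (0 = negative, 1 = positive)
                # falling in bin 0 (value <= t) or bin 1 (value > t).
                mass = [[0.0, 0.0], [0.0, 0.0]]
                for row, label, w in zip(X, y, weights):
                    mass[int(row[f] > t)][int(label > 0)] += w
                # Normalisation factor Z; a smaller Z means a better stump.
                z = 2.0 * sum(math.sqrt(neg * pos) for neg, pos in mass)
                if best is None or z < best[0]:
                    # Real-valued output per bin: half the smoothed log-odds.
                    outputs = [0.5 * math.log((pos + eps) / (neg + eps))
                               for neg, pos in mass]
                    best = (z, f, t, outputs)
        if best is None:  # no feature has two distinct values
            break
        _, f, t, outputs = best
        stumps.append((f, t, outputs))
        # Re-weight: samples the stump scores wrongly gain weight.
        weights = [w * math.exp(-label * outputs[int(row[f] > t)])
                   for row, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return stumps

def score(stumps, x):
    """Sum of the real-valued stump outputs for one sample."""
    return sum(outputs[int(x[f] > t)] for f, t, outputs in stumps)

def predict(stumps, x):
    return 1 if score(stumps, x) > 0 else -1
```

Each round scans only thresholds on single features, so the cost is low. This fits the abstract's point that such a classifier can be retrained frequently.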
Alternative abstract: With the rapid growth of the World Wide Web, it plays an increasingly important role in everyday life. The World Wide Web makes it very convenient for users to obtain information, and it is growing extremely fast in China. However, much harmful information exists on the Internet, such as pornographic content, so filtering harmful web pages is an important issue. In general, the problem of harmful web page filtering is converted into one of web page classification, in which machine learning plays a very important role. Filtering harmful content on the Internet has become an important issue for researchers. The filtering demand mainly concerns information such as pornography, gambling, violence and murder, and research on these kinds of web information has made great progress. Websites carrying information about prohibited drugs, however, have not attracted much attention. Some of these sites sell drugs on the Internet, while others provide information about growing or using drugs. Many such drug sites exist, and their traffic is growing rapidly. In this thesis, starting from the problem of filtering harmful web content, we study several prevailing machine learning methods, including supervised and semi-supervised learning methods, and apply some of them to real-world problems. The main contributions of this thesis are as follows. We design and implement an algorithm for filtering web sites that sell stimulants on the Internet. The algorithm uses data extracted from the Sogou search engine. The database is so large that traditional machine learning methods may be ineffective, so we use indexing and a divide-and-conquer strategy to solve the problem. First, we use a rough filter and a refined filter, both based on keywords, to extract data from the database. Then we build an index over all the extracted data to speed up access.
After that, we use combined rules and a one-class SVM to classify the web pages in the database. The results show that our method has satisfactory performance and reasonable speed. We design and implement a web page filtering algorithm based on Adaboost. With an improved setting, our Adaboost-based web page filtering algorithm achieves a very low false positive rate while keeping a relatively high detection rate. Meanwhile, the algorithm has low computational complexity, which makes it possible for ...
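The rough filter, refined filter and index described in the abstract might look roughly like the sketch below. It only illustrates the idea and is not the thesis code. The keyword lists, the whitespace tokenizer, the `min_hits` parameter and all function names are assumptions made for this example. The later stage with combined rules and a one-class SVM is not sketched.

```python
from collections import defaultdict

# Hypothetical keyword lists, chosen only for illustration.
COARSE_KEYWORDS = {"steroid", "stimulant"}
REFINED_KEYWORDS = {"buy", "order", "price"}

def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()

def coarse_filter(pages):
    """Rough pass: keep pages that mention at least one coarse keyword."""
    return {pid: text for pid, text in pages.items()
            if COARSE_KEYWORDS & set(tokenize(text))}

def refined_filter(pages, min_hits=1):
    """Refined pass: keep pages with at least min_hits distinct refined keywords."""
    return {pid: text for pid, text in pages.items()
            if len(REFINED_KEYWORDS & set(tokenize(text))) >= min_hits}

def build_index(pages):
    """Inverted index: token -> set of ids of the pages that contain it."""
    index = defaultdict(set)
    for pid, text in pages.items():
        for token in tokenize(text):
            index[token].add(pid)
    return dict(index)

# Toy corpus standing in for the search-engine database.
pages = {
    1: "buy steroid online best price",
    2: "steroid use in medicine history",
    3: "buy shoes online",
}
candidates = refined_filter(coarse_filter(pages))
index = build_index(candidates)
print(sorted(index.get("steroid", set())))
```

Running the cheap keyword passes first keeps the index, and any later classifier, limited to a small candidate set. That is the divide-and-conquer idea the abstract refers to.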
Holding number: XWLW1459
Other identifier: 200628014628032
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/7543
Collection: Graduates / Master's Theses
Recommended citation
GB/T 7714
贺主. 互联网毒品类信息过滤研究[D]. 中国科学院自动化研究所. 中国科学院研究生院,2010.
Files in this item
File name/size: CASIA_20062801462803 (921KB); Access type: not yet open; License: CC BY-NC-SA