The Internet continues to grow, and its influence is becoming more powerful. However, harmful information of all kinds also spreads through the Internet and adversely affects users, especially minors. Research on harmful information detection technology is therefore urgently needed. At the same time, the Internet has become increasingly user-oriented: users are no longer only receivers of content but also collaborative publishers, and applications must adapt to this user-oriented environment. In this thesis, we study several key problems in user-oriented Internet harmful information detection, focusing on adaptation to user data, customizability to user demands, and user perception of the Internet's structure. The main contributions of this thesis are as follows:

1. A topic-detection-based method for modeling online discussion data, and a framework for topic detection that combines content and user analysis. Because discussion data is loose and noisy, several extensions are introduced: content informativeness detection, term pos-weighting, UF-ITUF modeling of user activities, and a two-level fusion strategy for content and user analysis.

2. A method for integrating an external knowledge base into topic detection for discussion data. Semantic relationships among terms are learned from the knowledge base to address implicit or ambiguous terms. Two approaches to knowledge integration are introduced: the LDA approach and the Concept Mapping approach.

3. A customizable, instance-driven harmful information detection algorithm. Users express their customization demands through web page instances. Because users usually supply very few instances, semi-supervised learning is used to extend the user instances onto a large unlabeled training set, and a Bayes classifier is then trained on the extended training set.

4. The VS-LFA algorithm for link function analysis. Link function analysis addresses the limitation of traditional modeling, which treats all links as a single type. The VS-LFA algorithm extracts visual features of links, structural features, and whole-page statistics to simulate how users perceive links. SVM and random forest are used to handle soft and hard classification, respectively.

5. An active search algorithm for harmful information. The algorithm uses link function analysis to predict the harmfulness of each link target and downloads the targets with a higher probability of being harmful. Experiments show that our algorithm improves the topical search, so that onli...
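The active search idea in contribution 5 can be illustrated with a minimal best-first crawl sketch. This is not the thesis's implementation: the function names `active_search`, `get_links`, and `score_link`, the toy link graph, and the scores are all hypothetical, and `score_link` merely stands in for whatever harmfulness probability a link function classifier would produce.

```python
import heapq
from typing import Callable, List, Tuple


def active_search(
    seeds: List[str],
    get_links: Callable[[str], List[str]],
    score_link: Callable[[str, str], float],
    budget: int,
) -> List[str]:
    """Best-first crawl: always fetch the unvisited link with the highest score."""
    # heapq is a min-heap, so scores are negated; the counter breaks ties in
    # insertion order and keeps tuple comparison away from the URL strings.
    frontier: List[Tuple[float, int, str]] = []
    counter = 0
    for url in seeds:
        heapq.heappush(frontier, (-1.0, counter, url))
        counter += 1

    visited = set()
    order: List[str] = []
    while frontier and len(order) < budget:
        _, _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)  # stands in for downloading the page
        for target in get_links(url):
            if target not in visited:
                # Hypothetical predicted probability that the target is harmful.
                priority = -score_link(url, target)
                heapq.heappush(frontier, (priority, counter, target))
                counter += 1
    return order


# Toy link graph and made-up scores, for illustration only.
TOY_WEB = {
    "seed": ["news", "hub"],
    "hub": ["bad1", "bad2"],
    "news": ["sports"],
    "bad1": [],
    "bad2": [],
    "sports": [],
}
TOY_SCORES = {"news": 0.1, "hub": 0.8, "bad1": 0.9, "bad2": 0.7, "sports": 0.05}


def toy_get_links(url: str) -> List[str]:
    return TOY_WEB.get(url, [])


def toy_score_link(source: str, target: str) -> float:
    return TOY_SCORES[target]


crawl_order = active_search(["seed"], toy_get_links, toy_score_link, budget=4)
print(crawl_order)
```

With a download budget of four pages, the sketch spends the budget on the highest-scoring branch and never fetches the low-scoring `news` and `sports` pages, which is the intended effect of prioritizing links by predicted harmfulness.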