Because the internet lacks effective supervision mechanisms, harmful information is increasingly common online, and removing it automatically has become very important. A growing share of harmful text is disseminated inside images, which makes "text image" filtering a new challenge in information filtering. Our research focuses on filtering harmful information carried by text images on the internet. The work covers two aspects: character recognition in text images, and text filtering. We propose several methods for the key technologies in both. Text region locating and text extraction methods are proposed to improve the precision of character recognition in images with complex backgrounds. Text topic identification and semantic orientation analysis are presented for text filtering. Concretely, this thesis makes the following contributions:
(1) A connected-component-based approach is provided for text locating. The algorithm exploits the geometric shape features of individual characters and the collective features of characters within text regions, and integrates these features into a classification process. A cascade of threshold classifiers is combined with a support vector machine to identify characters. Experimental results demonstrate that the proposed algorithm achieves high precision in text locating.
(2) For text extraction from images with complex backgrounds, a color segmentation algorithm based on the HSL color space is proposed to reduce the influence of varying character colors and complex backgrounds. The algorithm categorizes text regions into three color types and then segments each type using a different HSL component, so that the strength of each HSL component is used effectively. Experimental results demonstrate the effectiveness of the proposed algorithm.
(3) In text filtering, the text topic is used to construct the user's profile, which determines whether an input text should be filtered. A method for text topic identification based on a concept knowledge tree is presented. The method uses the semantic relations among concepts to identify the key concepts of the input text, and constructs compound concepts to express the text topic. Experimental results show that the topic identification method performs encouragingly, and it has been applied to text filtering.
(4) A method for semantic orientation analysis based on the context of topic words is proposed to distinguish negative from positive attitudes in texts on the same topic. The method assumes that the semantic orientation of a text depends on its topic, and that this orientation can be expressed by the relation between a topic word and its context. Analyzing orientation in this way removes the effect of topic variation. Experiments show that the influence of topic variation is effectively suppressed by our algorithm.
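The idea behind contribution (4), that orientation is read from the context around topic words rather than from the whole text, can be illustrated with a minimal sketch. This is not the thesis's algorithm: the toy lexicons `POSITIVE` and `NEGATIVE`, the fixed `window` size, the whitespace tokenization, and the function names are all assumptions made for illustration only.

```python
# Hypothetical toy lexicons; the thesis does not specify its sentiment resources.
POSITIVE = {"support", "good", "benefit", "safe"}
NEGATIVE = {"oppose", "bad", "harm", "dangerous"}


def orientation_score(tokens, topic_words, window=3):
    """Sum the polarity of words lying within `window` tokens of a topic word."""
    topic_words = set(topic_words)
    counted = set()  # positions already scored, so overlapping windows do not double count
    score = 0
    for i, tok in enumerate(tokens):
        if tok not in topic_words:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i or j in counted:
                continue
            counted.add(j)
            if tokens[j] in POSITIVE:
                score += 1
            elif tokens[j] in NEGATIVE:
                score -= 1
    return score


def classify(text, topic_words, window=3):
    """Label a text by the sign of its topic-context polarity score."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    score = orientation_score(tokens, topic_words, window)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

In this sketch, polarity words far from any topic word are ignored. For example, in "the film was good but critics oppose the vaccine as dangerous" with topic word "vaccine", the word "good" falls outside the window and does not offset the negative context around the topic word.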