With the rapid development of network technology, Internet Media has become one of the most important sources of information for people, and it greatly enriches their lives and ways of thinking. Internet Media resources include numerous images, Flash cartoons, and so on. How to effectively manage and use this large volume of resources, and how to quickly find and extract useful information from them, have drawn increasing attention. Text information in Internet Media is a direct carrier of high-level semantics. Therefore, research on how to effectively extract text information from Internet Media resources is of significant value for the search, retrieval, understanding, and monitoring of Internet Media. This thesis studies text extraction from Internet Media resources in depth. The main contributions of this thesis are as follows: Firstly, we present a novel edge-based color clustering method to separate a color image into homogeneous color layers. Classic color clustering methods cannot handle text with gradient color. Our method jointly considers two significant features of text characters: similar color and sharp edges. The experimental results demonstrate that our method can effectively handle not only text with uniform color but also text with gradient color. Secondly, we propose a new character localization method based on the component adjacency structure and CC-clustering of text in Internet Media resources. Generally, the aspect of Chinese characters is large and their geometrical characteristics are relatively stable, while noise varies irregularly. Based on this fact, the proposed method first assumes that all characters are Chinese characters, and then merges the connected components into potential characters according to the features of Chinese character structure, while ensuring that components belonging to different characters are never merged.
Next, the features of the characters’ aspect are extracted by CC-clustering, and the connected components are then merged into candidate characters according to these extracted features. Finally, a new noise removal method based on the stroke width histogram is employed to remove all non-character connected components, after which all characters are located. Thirdly, we present a novel method for italic character detection and rectification based on Markov random fields. According to a large number of statistical results, the centroid angle of a ...
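The stroke-width-histogram noise removal mentioned above can be illustrated with a minimal sketch. The abstract does not say how stroke widths are measured or how the histogram is thresholded, so everything here is an assumption made for illustration: stroke width is approximated by horizontal run lengths of foreground pixels, a component counts as character-like when most of its runs fall within one pixel of the dominant width, and the names `stroke_width_histogram`, `is_character_like`, and `min_share` are hypothetical. It is not the method of the thesis.

```python
from collections import Counter


def stroke_width_histogram(mask):
    """Histogram of horizontal foreground run lengths in a binary mask.

    `mask` is a list of equal-length strings in which '#' marks a
    foreground pixel. Horizontal run length is only a crude stand-in
    for stroke width (an assumption of this sketch).
    """
    hist = Counter()
    for row in mask:
        run = 0
        for px in row + ".":  # sentinel closes a run that reaches the row end
            if px == "#":
                run += 1
            elif run:
                hist[run] += 1
                run = 0
    return hist


def is_character_like(mask, min_share=0.6):
    """Return True when stroke widths concentrate around one value.

    `min_share` is a hypothetical threshold: the fraction of runs that
    must lie within one pixel of the most frequent width.
    """
    hist = stroke_width_histogram(mask)
    total = sum(hist.values())
    if total == 0:
        return False  # an empty component carries no strokes
    mode, _ = hist.most_common(1)[0]
    near_mode = hist[mode - 1] + hist[mode] + hist[mode + 1]
    return near_mode / total >= min_share


# A component with consistent two-pixel strokes and one crossbar.
stroke = [
    "##....##",
    "##....##",
    "########",
    "##....##",
    "##....##",
]

# An irregular blob whose run lengths are all different.
noise = [
    "#.......",
    "###.....",
    "######..",
    "####....",
    "########",
]

components = [stroke, noise]
kept = [c for c in components if is_character_like(c)]
```

In this toy input only the first component is kept, because its run lengths cluster at a single width while the blob's run lengths are spread evenly across the histogram.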