Research on Key Technologies of Text Image Filtering
Original title: 文本图像过滤关键技术研究
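Author: Yao Jinliang (姚金良)
Degree type: Doctor of Engineering
Advisor: Yang Yiping (杨一平)
Date: 2009-01-05
Degree-granting institution: Graduate University of the Chinese Academy of Sciences
Place of degree conferral: Institute of Automation, Chinese Academy of Sciences
Major: Computer Application Technology
Keywords: Information Filtering; Text Region Detection; Text Extraction; Topic Identification; Semantic Orientation Analysis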
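Abstract: Because information published on the internet lacks effective oversight, more and more harmful information is appearing online. Automatic information filtering is therefore of significant research value for building a healthy network environment. To evade existing filtering systems, publishers of harmful information increasingly distribute harmful text in the form of images, which poses a new challenge for harmful-information filtering research. This thesis studies the filtering of harmful information carried by text images on the internet, covering two areas: character recognition in text images and filtering of the recognized text.
To improve filtering accuracy, the thesis proposes methods for several key technologies of text image filtering. Text locating and text extraction are studied to improve character recognition in images with complex backgrounds, and text topic identification is combined with semantic orientation analysis to improve the accuracy of text filtering. The main contributions are as follows:
1. A connected-component-based text locating method is proposed. The method uses the geometric shape features of individual characters and the collective features of the characters within a text region, and fuses the two kinds of features in the classification process. It also combines a cascade of weak classifiers with a support vector machine to verify characters. Experimental results show that the method achieves high locating accuracy.
2. For text extraction from images with complex backgrounds, an extraction method based on the HSL color space is proposed to eliminate the effects of inconsistent character color and complex backgrounds. The method divides text regions into three color types and segments each type using the HSL color component suited to it, making full use of the strengths of each HSL component. Experimental results demonstrate the effectiveness of the method.
3. For text filtering, topics are used to represent filtering templates, and whether a text should be filtered is decided by identifying its topic. A topic identification method based on a concept knowledge tree is proposed. The method uses the hierarchical relations and node attribute information of the concept knowledge tree to determine the core concepts of a text's topic, and uses the semantic relations among concepts to build a compound concept that represents the topic. Experiments show that the method performs well and can be applied effectively in a text filtering system.
4. To accurately distinguish positive from negative texts on the same topic, the sentiment orientation of a text is used in filtering. A semantic orientation analysis method based on the context words of topic words is proposed. The method assumes that a text's orientation is related to its topic and can be expressed through the interaction between topic words and their context words. Topic-word-based orientation analysis effectively removes the difficulty caused by variation in text content. Experimental results demonstrate the effectiveness of the method.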
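As a rough illustration of the HSL-based extraction idea in the abstract, the sketch below converts pixels to HSL, picks one component per region, and thresholds on it. The abstract does not say what the three color types are or how each component is chosen, so the selection rule, the thresholds, and the names `to_hls`, `pick_component`, and `segment` are all assumptions made for this example, not the thesis's method.

```python
import colorsys


def to_hls(pixels):
    """Convert (r, g, b) tuples in 0..255 to (h, l, s) tuples in 0..1."""
    return [colorsys.rgb_to_hls(r / 255, g / 255, b / 255) for r, g, b in pixels]


def pick_component(hls_pixels, sat_threshold=0.25, light_spread=0.5):
    """Choose the HSL component to segment on.

    Hypothetical rule (not from the thesis):
      - nearly gray region          -> lightness
      - colored, wide lightness gap -> saturation
      - colored, similar lightness  -> hue
    """
    mean_s = sum(s for _, _, s in hls_pixels) / len(hls_pixels)
    lights = [l for _, l, _ in hls_pixels]
    if mean_s < sat_threshold:
        return "lightness"
    if max(lights) - min(lights) > light_spread:
        return "saturation"
    return "hue"


def segment(pixels):
    """Return the chosen component and a boolean mask per pixel.

    The mask is True where the chosen component exceeds its mean over
    the region, a deliberately simple stand-in for a real segmentation step.
    """
    hls = to_hls(pixels)
    component = pick_component(hls)
    index = {"hue": 0, "lightness": 1, "saturation": 2}[component]
    values = [p[index] for p in hls]
    threshold = sum(values) / len(values)
    return component, [v > threshold for v in values]
```

Calling `segment` on a flat list of RGB pixels from one text region returns the component name and a mask that separates the two pixel groups. A black-and-white region is split on lightness, while red-on-blue text of equal lightness is split on hue.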
Other abstract: Because online publishing lacks an effective supervision mechanism, more and more harmful information appears on the internet, and removing it automatically is important. Harmful text is increasingly published as images, which makes “text image” filtering a new challenge in information filtering. Our research focuses on filtering harmful information carried by text images on the internet, and covers two aspects: character recognition in text images and text filtering. We propose several constructive methods for the key technologies in both aspects. Text region locating and text extraction methods are proposed to improve the precision of character recognition in images with complex backgrounds. Text topic identification and semantic orientation analysis are presented for text filtering. Concretely, this thesis makes the following contributions:
(1) A connected-component-based approach is provided for text locating. The algorithm takes advantage of the geometric shape features of individual characters and the collective features of characters in text regions, and integrates these features into a classification process. A cascade of threshold classifiers is combined with a support vector machine to recognize characters. Experimental results demonstrate that the proposed algorithm locates text with high precision.
(2) For the text extraction problem in images with complex backgrounds, a color segmentation algorithm based on the HSL color space is proposed to reduce the influence of varying character color and complex backgrounds. The algorithm categorizes text regions into three color types, then segments each type of text region with a different HSL component, so that the strength of each HSL component is used effectively. Experimental results demonstrate the effectiveness of the proposed algorithm.
(3) In text filtering, the text topic is used to construct the user's profile, so as to decide whether the input text should be filtered. A method based on a concept knowledge tree is presented for text topic identification. The method uses the semantic relations of concepts to identify the key concept of the input text, and constructs a compound concept to express the text topic. Experimental results show that the presented topic identification method performs encouragingly and can be applied to text filtering.
(4) A semantic orientation analysis method based on the context of topic words is proposed to identify the negative and positive attitudes that appear in texts on the same topic. In this method, we suppose that text semantic orientation is related to the text topic and can be expressed by the relation between a topic word and its context. Experiments show that the proposed method effectively suppresses the influence of text topic variation.
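The idea of scoring orientation from the words around a topic word can be sketched as follows. This is only a minimal illustration, assuming a hand-made polarity lexicon and a fixed-size window. The abstract does not describe how the thesis models the relation between a topic word and its context, so the function `topic_orientation`, the `window` parameter, and the lexicon are hypothetical.

```python
def topic_orientation(tokens, topic_words, lexicon, window=2):
    """Sum lexicon polarities of the words near each topic-word occurrence.

    tokens:      list of word strings for one text
    topic_words: set of words that name the topic
    lexicon:     dict mapping a word to +1 or -1 (assumed, hand-made)
    window:      number of tokens inspected on each side of a topic word
    """
    score = 0
    for i, token in enumerate(tokens):
        if token not in topic_words:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the topic word itself
                score += lexicon.get(tokens[j], 0)
    return score


def global_orientation(tokens, lexicon):
    """Baseline for comparison: sum polarities over the whole text."""
    return sum(lexicon.get(token, 0) for token in tokens)
```

For a text such as "the weather is great but this drug is dangerous" with the topic word "drug", the window-based score only sees the words next to "drug", whereas the whole-text baseline also counts the unrelated "great". That difference is the motivation for tying orientation to the topic word rather than to the text as a whole.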
Call number: XWLW1337
Other identifier: 200518014629107
Language: Chinese
Document type: Dissertation
Item identifier: http://ir.ia.ac.cn/handle/173211/6139
Collection: Graduates_Doctoral Dissertations
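Recommended citation
GB/T 7714
Yao Jinliang. Research on Key Technologies of Text Image Filtering[D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of the Chinese Academy of Sciences, 2009.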
Files in this item
File name/size: CASIA_20051801462910 (21571KB)
Access type: Not yet open
License: CC BY-NC-SA
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.