Research on Text Extraction Techniques for Web Media

	Research on Text Extraction Techniques for Web Media
Alternative title	Text Extraction in Web Media
	刘杰 (Liu Jie)
	2011-05-16
Degree type	Doctor of Engineering
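Chinese abstract (translated)	With the rapid growth of the Internet, web media has gradually become an important source of information. Web media includes large quantities of images, Flash web animations, and other media resources; how to manage these resources effectively, and how to search and mine them quickly, is attracting increasing attention. Text in web media directly carries high-level semantic information, so studying how to extract it effectively is of great significance for web media retrieval, content understanding, and monitoring. This thesis studies text extraction in web media in depth from several angles: color clustering, text localization, italic detection and rectification, and Flash text extraction and its applications. The main contributions are as follows: 1) An edge-based color clustering algorithm is proposed. In the Chinese web environment, text with gradient color and text with degraded edge color cause traditional color clustering algorithms to split a piece of text across different color layers, so the text cannot be located correctly. Considering two salient properties of text, relatively stable color and strong edges, the algorithm combines edge and color information to find equivalent colors in the image and merges the equivalent color layers, yielding a color clustering that is effective for text. Experimental results show that the algorithm handles not only gradient-color text and text with degraded edge color, but also text with uniform color. 2) The characteristics of text in web media in the Chinese web environment are analyzed, and a text localization algorithm based on component adjacency structure and connected-component clustering is proposed. In Chinese web media, Chinese characters make up the vast majority of the text, and the geometric features of text are usually fairly stable, whereas those of noise connected components vary widely. Based on these facts, the algorithm first assumes that all characters are Chinese; it then uses the adjacency structure of Chinese character components to merge connected components as fully as possible into potential text connected components, while guaranteeing that components belonging to different characters are never merged by mistake; next it applies connected-component clustering to extract the geometric features of each class of text in the image and uses these features to merge connected components further into candidate text; finally it filters noise with a feature based on the connected-component span histogram, so that the text in the image is located accurately. 3) The concept of the centroid angle is introduced and, based on its statistical properties, a Markov random field based algorithm for estimating the slant angle of italic text is proposed. Extensive statistics show that the centroid angle of Chinese characters approximately follows a Gaussian distribution whose mean is the true slant angle. Taking into account the correlation between the slant angles of adjacent characters, the slant estimation problem is modeled as a Markov random field, and the optimum of the model is approximated with the iterated conditional modes algorithm; the estimated angle is then used to detect and rectify italic text. 4) A text extraction method for Flash is proposed. Flash web animation is an important form of web media. Based on a thorough analysis of the core Flash standard, a Flash parsing tool was developed, and a text extraction algorithm for Flash was implemented according to the characteristics of Flash.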
English abstract	With the rapid development of network technology, Internet media has become one of the most important resources from which people acquire information, and it greatly enriches their lives and ways of thinking. Internet media resources include numerous images, Flash cartoons, and so on. How to manage and use this large volume of resources effectively, and how to find and extract useful information from them quickly, have drawn more and more attention. Text in Internet media is a direct carrier of high-level semantics. Therefore, research on how to extract text effectively from Internet media resources is valuable for the search, retrieval, understanding, and monitoring of Internet media. This thesis studies text extraction from Internet media resources in depth. Its main contributions are as follows. Firstly, we present a novel edge-based color clustering method that separates a color image into homogeneous color layers. Classic color clustering methods cannot handle text with gradient color. Our method jointly considers two significant features of text characters: similar color and sharp edges. The experimental results demonstrate that our method effectively handles not only text with uniform color but also text with gradient color. Secondly, we propose a new character localization method based on the component adjacency structure and connected-component (CC) clustering of text in Internet media resources. Generally, Chinese characters make up the large majority of the text, and their geometric characteristics are relatively stable, while those of noise vary irregularly. Based on this fact, the proposed method first assumes that all characters are Chinese characters and then merges the connected components into potential characters according to the structure of Chinese characters, on the premise that two components belonging to different characters are never merged. Next, the geometric features of the characters are extracted by CC clustering, and the connected components are merged into candidate characters according to these features. Finally, a new noise removal method based on the stroke width histogram is employed to remove all non-character connected components, and all characters are thereby located. Thirdly, we present a novel method for italic character detection and rectification based on Markov random fields. According to a large number of statistical results, the centroid angle of a ...
Keywords	Color Clustering; Text Localization; Italic Detection and Rectification; Text Extraction in Flash Cartoon
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6323
Collection	Graduates_Doctoral Dissertations
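Recommended citation (GB/T 7714)	Liu Jie. Research on Text Extraction Techniques for Web Media [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2011.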
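
Illustrative sketch (not taken from the thesis): the third contribution in the abstract models per-character slant angles with a Markov random field and approximates the optimum with iterated conditional modes (ICM). The Python below is a minimal sketch of that idea on a single text line treated as a chain. Only the Gaussian data term and the neighbor correlation come from the abstract. The function name icm_slant_angles, the discrete candidate angle set, the quadratic smoothness penalty, and the default values of sigma and beta are all assumptions made for illustration; the thesis may define its energy differently.

```python
from typing import List, Sequence


def icm_slant_angles(
    observed: Sequence[float],
    candidates: Sequence[float],
    sigma: float = 5.0,
    beta: float = 0.05,
    max_sweeps: int = 20,
) -> List[float]:
    """Approximate the MAP slant angle of each character on a chain MRF via ICM.

    observed   : measured centroid angle of each character, in reading order
    candidates : discrete slant angles each character may take (assumption)
    sigma      : std. dev. of the Gaussian data term (assumed value)
    beta       : weight of the neighbor smoothness term (assumed value)
    """

    def local_energy(i: int, angle: float, labels: List[float]) -> float:
        # Gaussian data term: the centroid angle scatters around the true slant.
        energy = (observed[i] - angle) ** 2 / (2.0 * sigma ** 2)
        # Smoothness term: adjacent characters tend to share a slant angle.
        if i > 0:
            energy += beta * (angle - labels[i - 1]) ** 2
        if i + 1 < len(labels):
            energy += beta * (angle - labels[i + 1]) ** 2
        return energy

    # Start from the candidate closest to each individual observation.
    labels = [min(candidates, key=lambda c: abs(c - y)) for y in observed]

    for _ in range(max_sweeps):
        changed = False
        for i in range(len(labels)):
            # Greedy local update with all other labels held fixed.
            best = min(candidates, key=lambda c: local_energy(i, c, labels))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break  # a full sweep without changes means a local optimum
    return labels
```

For example, icm_slant_angles([14.0, 16.0, 2.0, 15.0, 13.0], [0.0, 15.0]) pulls the outlying third character toward its neighbors and labels all five characters with 15.0. With beta=0.0 the smoothness term vanishes and that character keeps the label 0.0. ICM only finds a local optimum, so the result depends on the initialization and on the assumed weights.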

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20081801462804 (3026 KB)			Not yet open	CC BY-NC-SA