English Abstract

The amount of digital video grows explosively nowadays, so it is valuable research to make computers comprehend these multimedia documents, extract semantic information, and support applications such as video management, information retrieval, and data mining. Caption texts embedded in videos are closely related to the video content and easier to extract than other semantic features, so they serve as an important clue for video content comprehension. We aim to design a prototype system that extracts video text information. In processing order, the system comprises pretreatment, text detection, text extraction, and recognition; our research focuses on the first three parts.

1. Pretreatment includes video decoding, image quality assessment, and system initialization. In real applications the system must process videos of varying image quality, and the processing method should adapt accordingly. In this paper we propose a no-reference image quality assessment method. First, following natural scene statistics, extract features from amplitude fall-off curves and positional similarity on the image, and build the feature vector accordingly. Second, train a general regression neural network (GRNN) to predict image quality.

2. Text detection locates text boxes in frame images. We propose a fast and effective method. First, compute the edge image of a frame and repair it to cope with broken and conglutinated edges. Second, label all connected components (CCs) in the edge image, partly filter out those belonging to the background, sort the remaining CCs by position, and search for corresponding CCs to build text boxes under geometric constraints. Finally, merge text boxes to eliminate duplicate detection results, and verify them to eliminate false alarms.

3. Text extraction extracts text strokes from text boxes.
Within a text box, the color of the characters cannot be determined in advance and many disturbances from the background exist, so a text extraction procedure is needed. We propose a robust extraction method. First, binarize the candidate text box and perform polarity estimation to determine on which polarity the text occurs. Second, perform multi-frame verification and enhancement on the text boxes, exploiting their temporal redundancy in the video. Finally, binarize the candidate text boxes, apply CC filtering to remove background disturbances, and generate a clean binary image for recognition.

We describe the experimental data set and results, which confirm the effectiveness and efficiency of our methods.
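The GRNN used in the pretreatment stage is, at its core, Gaussian-kernel regression: a predicted quality score is a distance-weighted average of training scores. A minimal sketch of that prediction step follows; the feature vectors here are illustrative stand-ins for the amplitude fall-off and positional-similarity features described above, and the smoothing parameter `sigma` is an assumption.

```python
import math

def grnn_predict(train_x, train_y, x, sigma=0.5):
    """GRNN (Nadaraya-Watson kernel regression) prediction:
    a Gaussian-weighted average of the training quality scores.
    train_x: list of feature vectors; train_y: quality scores."""
    weights = []
    for xi in train_x:
        d2 = sum((a - b) ** 2 for a, b in zip(xi, x))
        weights.append(math.exp(-d2 / (2 * sigma ** 2)))
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)
```

With a small `sigma` the prediction collapses onto the nearest training sample; larger values smooth across samples, which is the only "training" a GRNN needs beyond storing the data.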
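The CC filtering and geometric grouping in the detection stage (step 2) can be sketched in pure Python. The helper names and thresholds below are hypothetical, and the edge image and CC bounding boxes themselves would come from an image library; the sketch only shows the grouping logic on `(x, y, w, h)` boxes.

```python
def filter_components(boxes, min_h=8, max_aspect=5.0):
    """Drop CCs whose geometry is implausible for a character
    (illustrative thresholds: too short, or too wide and flat)."""
    return [(x, y, w, h) for (x, y, w, h) in boxes
            if h >= min_h and w / h <= max_aspect]

def group_into_lines(boxes, y_tol=5, gap_tol=20):
    """Sort CCs left-to-right and chain horizontally aligned
    neighbours into candidate text boxes (the geometric constraint)."""
    lines = []
    for box in sorted(boxes):                 # sort by x, then y
        x, y, w, h = box
        for line in lines:
            lx, ly, lw, lh = line[-1]
            if abs(y - ly) <= y_tol and x - (lx + lw) <= gap_tol:
                line.append(box)              # extend an existing chain
                break
        else:
            lines.append([box])               # start a new chain
    merged = []                               # one bounding box per chain
    for line in lines:
        x0 = min(b[0] for b in line)
        y0 = min(b[1] for b in line)
        x1 = max(b[0] + b[2] for b in line)
        y1 = max(b[1] + b[3] for b in line)
        merged.append((x0, y0, x1 - x0, y1 - y0))
    return merged
```

A subsequent amalgamation pass over the merged boxes, plus a verification classifier, would then remove duplicates and false alarms as described above.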
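The polarity estimation and binarization steps of the extraction stage can be illustrated with a simplified sketch. The border-versus-global intensity comparison and the fixed threshold are assumptions for illustration, not the exact method of the thesis; text boxes are modeled as small grayscale grids.

```python
def estimate_polarity(gray):
    """Guess whether the text is brighter or darker than its background
    by comparing border pixels (mostly background) with the global mean."""
    h, w = len(gray), len(gray[0])
    border = [gray[y][x] for y in range(h) for x in range(w)
              if y in (0, h - 1) or x in (0, w - 1)]
    global_mean = sum(sum(row) for row in gray) / (h * w)
    border_mean = sum(border) / len(border)
    return "light-on-dark" if global_mean > border_mean else "dark-on-light"

def binarize(gray, polarity, thresh=128):
    """Fixed-threshold binarization; 1 marks candidate stroke pixels,
    on whichever side of the threshold the estimated polarity indicates."""
    if polarity == "light-on-dark":
        return [[1 if v > thresh else 0 for v in row] for row in gray]
    return [[1 if v < thresh else 0 for v in row] for row in gray]
```

In the full pipeline, the resulting binary map would then pass through multi-frame verification and CC filtering before being handed to the recognizer.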