Recognition and Retrieval of Mixed Chinese/English Document


Title	中英文混排文档的识别与检索
Alternative title	Recognition and Retrieval of Mixed Chinese/English Document
Author	夏勇
Date	2007-06-06
Degree type	Doctor of Engineering
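Abstract (translated from Chinese)	OCR-based retrieval of document image collections has broad application prospects. Paper documents are stored as images, while retrieval operates on the text recognized from those images, which makes both storing and searching paper documents convenient. However, OCR output is not fully correct; for documents with poor image quality or with several languages mixed together, recognition errors are numerous and significantly degrade retrieval. To improve retrieval performance for document images, we work along two lines: first, further improving the OCR system to reduce segmentation and recognition errors; second, analyzing the characteristics of OCR and document retrieval in depth, building rich indexing information, and providing flexible retrieval strategies, so as to raise recall while still suppressing noise. The main research contents are as follows: 1. Segmentation of mixed Chinese/English documents. An integrated segmentation and recognition method based on multiple recognition engines is proposed. The full character set is divided into six mutually overlapping subsets, and the recognition engine for each subset is invoked according to the situation encountered during segmentation. In addition, a model based on adaptive features and multi-level feedback is proposed. With this model, segmentation proceeds from easy to hard and from coarse to fine: earlier segmentation and recognition results are fed back into the later segmentation and recognition of characters that are harder to decide, which greatly reduces the character segmentation error rate. 2. Detection and recognition of italic characters. A simple and practical method for detecting Chinese italic characters is proposed, which works well for scattered italic characters. A text line is first segmented into string blocks; each block is then assumed to be italic and corrected by a fixed angle; the hypothesis is then verified using vertical projection histogram features. For characters judged to be italic, the slant angle is estimated. 3. Recognition confidence evaluation and character rejection. For confidence evaluation, we mainly discuss methods based on empirical rules, on Bayesian posterior probability estimation, and on logistic regression. For character rejection, we mainly consider two classes of methods: one based on recognition confidence evaluation and the other using a ONE-CLASS SVM. Comparative experiments are conducted on the two classes of methods. 4. Retrieval strategies and methods for document images. Making full use of the characteristics of OCR-based retrieval and content-based image retrieval, the two are combined, and an adaptive document image indexing method is proposed that adapts strongly to recognition errors and reduces the impact of OCR errors on retrieval performance. The index documents take the form of XML documents, which makes storing and retrieving documents convenient.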
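The hypothesize-and-verify italic test described in item 2 can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the 15-degree correction angle, the sum-of-squares sharpness score, the 1.2 acceptance ratio, and all function names are assumptions chosen for the example. The abstract only states that a fixed-angle correction is verified with vertical projection histogram features.

```python
import numpy as np

def shear_left(img, angle_deg):
    """Undo a rightward slant of angle_deg in a binary image (row 0 is the top)."""
    h, w = img.shape
    t = np.tan(np.radians(angle_deg))
    pad = int(np.ceil(t * (h - 1)))
    out = np.zeros((h, w + pad), dtype=img.dtype)
    for y in range(h):
        # In right-slanted text the top rows sit further right than the
        # bottom rows, so lower rows are shifted right to catch up.
        shift = int(round(t * y))
        out[y, shift:shift + w] = img[y]
    return out

def projection_sharpness(img):
    """Sum of squared column sums: large when strokes line up vertically."""
    proj = img.sum(axis=0).astype(float)
    return float((proj ** 2).sum())

def looks_italic(block, angle_deg=15.0, ratio=1.2):
    """Assume the block is italic, correct it, and check the assumption.

    angle_deg and ratio are hypothetical values picked for this sketch.
    """
    before = projection_sharpness(block)
    after = projection_sharpness(shear_left(block, angle_deg))
    return after > ratio * before
```

The sharpness score is only one possible projection feature; counting empty columns in the histogram would serve the same purpose. Slant-angle estimation for blocks judged italic is not covered by this sketch.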
English abstract	Retrieval of image-text databases based on OCR is a promising technique. Paper documents are saved as images and retrieved through their OCR text, which makes both saving and retrieving paper documents convenient. However, OCR text is not completely correct; many errors occur when image quality is low or the document is multilingual, which degrades retrieval performance. To improve document retrieval, two measures are adopted. First, the OCR algorithm is improved to reduce segmentation and recognition errors. Second, the characteristics of OCR and of document retrieval are analyzed and combined to construct rich indexes and provide flexible retrieval. The goal is not only to improve recall but also to suppress noise. The content of this thesis is as follows. 1. Segmentation of mixed Chinese/English documents. A method of integrated segmentation and recognition based on multiple OCR engines is proposed. The whole class space is divided into six overlapping sub-spaces, which are used according to the actual situation. In addition, a model of adaptive features and multi-phase feedback is given. With this model, segmentation becomes a process from simple to difficult and from coarse to fine, which greatly reduces segmentation errors. 2. Detection and recognition of italic characters. A simple and practical method for detecting Chinese italic characters is presented; it performs well even when the italic characters are scattered through the document. First, the text line is segmented into string blocks. Second, each block is assumed to be an italic string and is corrected by a shear transform. The assumption is then verified using the vertical projection histogram of the block image. If the string block is italic, its slant angle is estimated. 3. Evaluation of recognition confidence and outlier rejection. For confidence evaluation, three methods are considered, based on empirical rules, Bayes theory, and logistic regression. For outlier rejection, two methods are considered, based on confidence evaluation and on ONE-CLASS SVM. Comparative experiments are conducted on the two methods. 4. Retrieval of document images. The features of OCR-based retrieval and content-based retrieval are combined to give an adaptive index for document images, which adapts to recognition errors and limits their negative influence. The indexed text is saved as XML, which is convenient for saving and retrieving documents.
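The logistic-regression route to recognition confidence in item 3, combined with threshold-based rejection, might look like the sketch below. The abstract does not say which classifier outputs feed the regression, so the two features used here (top-candidate distance and the gap to the second candidate), the learning rate, the step count, and the 0.5 threshold are all assumptions made for illustration.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Fit logistic-regression weights by plain batch gradient descent.

    X holds one row of recognizer features per character sample;
    y is 1 where the recognizer's top candidate was correct, else 0.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def confidence(w, x):
    """Estimated probability that the top candidate is correct."""
    xb = np.append(np.asarray(x, dtype=float), 1.0)
    return float(1.0 / (1.0 + np.exp(-xb @ w)))

def accept(w, x, threshold=0.5):
    """Reject the character when its confidence is below the threshold."""
    return confidence(w, x) >= threshold

# Hypothetical training data: [top-1 distance, gap to second candidate].
X_train = [[0.10, 0.80], [0.20, 0.70], [0.15, 0.90],
           [0.80, 0.10], [0.90, 0.05], [0.70, 0.20]]
y_train = [1, 1, 1, 0, 0, 0]
weights = fit_logistic(X_train, y_train)
```

Raising the threshold trades more rejected characters for fewer accepted errors, which is the usual operating-point choice for a rejection rule of this kind.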
Keywords	OCR; Document Image Retrieval; Integrated Segmentation and Recognition; Detection of Italic Characters; Outlier Rejection
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6003
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	夏勇. 中英文混排文档的识别与检索[D]. 中国科学院自动化研究所. 中国科学院研究生院,2007.

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20041801462800 (1746 KB)			Not yet open	CC BY-NC-SA