English abstract | OCR-based retrieval over image-text databases is a promising technique. Paper documents are stored as images and retrieved through their OCR text, which makes both storage and retrieval convenient. However, OCR text is never completely correct; errors become especially frequent when image quality is low or the document is multilingual, and they degrade retrieval performance. Two measures are adopted to improve document retrieval. First, the OCR algorithm is improved to reduce segmentation and recognition errors. Second, the characteristics of OCR and of document retrieval are analyzed and combined to construct rich indexes and provide flexible retrieval. The goal is not only to improve recall but also to suppress noise. The content of this paper is as follows. 1. Segmentation of mixed Chinese/English documents. A method of integrated segmentation and recognition based on multiple OCR engines is proposed. The whole class space is divided into six overlapping subspaces, which are selected according to the actual situation. In addition, a model with adaptive features and multi-phase feedback is given. Under this model, segmentation proceeds from simple to difficult and from coarse to fine, which greatly reduces segmentation errors. 2. Detection and recognition of italic characters. A simple and practical method for detecting Chinese italic characters is presented; it performs well even when the italic characters are scattered through the document. First, the text line is segmented into string blocks. Second, each block is assumed to be an italic string and is corrected by a shear transform. The assumption is then verified using the vertical projection histogram of the block image. If the string block is italic, its slant angle is estimated. 3. Evaluation of recognition confidence and outlier rejection. 
For confidence evaluation, three methods are considered: one based on empirical rules, one based on Bayesian theory, and one based on logistic regression. For outlier rejection, two methods are considered: one based on confidence evaluation and one based on one-class SVM. Comparative experiments are conducted on the two outlier-rejection methods. 4. Retrieval of document images. The characteristics of OCR-based retrieval and content-based retrieval are combined to give an adaptive index for document images, which adapts to recognition errors and limits their negative influence. The indexed text is stored as XML, which makes documents convenient to store and retrieve. |
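
The italic detection steps in item 2 of the abstract (shear correction, then verification on the vertical projection histogram) can be illustrated with a minimal sketch. It is not the author's implementation. The abstract gives neither a scoring rule nor a threshold, so the sum-of-squares sharpness score, the candidate angle range, the `min_angle` threshold, and all function names below are assumptions made for illustration. The sketch also merges verification and slant estimation into one search over candidate angles, and it represents a string block as a list of rows of 0/1 pixels.

```python
import math


def shear(img, angle_deg):
    """Shear a binary image (list of rows of 0/1) horizontally.

    Row 0 is the top row. Row y is moved left by (h - 1 - y) * tan(angle)
    pixels, so a positive angle undoes a rightward (italic) slant.
    """
    h, w = len(img), len(img[0])
    t = math.tan(math.radians(angle_deg))
    pad = int(math.ceil(abs(t) * (h - 1)))  # room for the largest shift
    out = [[0] * (w + 2 * pad) for _ in range(h)]
    for y, row in enumerate(img):
        shift = int(round((h - 1 - y) * t))
        for x, v in enumerate(row):
            if v:
                out[y][x + pad - shift] = 1
    return out


def vertical_projection(img):
    """Number of foreground pixels in each column."""
    return [sum(col) for col in zip(*img)]


def sharpness(img):
    """Assumed score: upright strokes concentrate ink in few columns,
    which makes the sum of squared column counts large."""
    return sum(c * c for c in vertical_projection(img))


def estimate_slant(img, angles=range(-30, 31)):
    """Return the corrective shear angle (degrees) with the sharpest
    vertical projection. Ties go to the angle closest to zero."""
    candidates = sorted(angles, key=abs)
    return max(candidates, key=lambda a: sharpness(shear(img, a)))


def is_italic(img, min_angle=6):
    """Treat the block as italic if the best corrective angle is large
    enough. The threshold is a hypothetical value."""
    return abs(estimate_slant(img)) >= min_angle
```

As a usage example, a synthetic block with three vertical strokes can be slanted with `shear(block, -14)` and passed to `is_italic` and `estimate_slant`. Real document images would need binarization and string-block segmentation first, which this sketch does not cover.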