English Abstract | With the widespread use of smart electronic devices equipped with cameras, large numbers of natural scene images containing text are captured, stored, and used to convey information. Accurately extracting high-level text information from scene images can effectively assist image content understanding, and it plays an increasingly significant role in image retrieval, intelligent transportation, augmented reality, and other fields. However, compared with text in scanned document images, irregular text in natural scene images varies far more in shape, which poses a challenge for end-to-end scene text recognition. This thesis studies the end-to-end recognition of irregular text in natural scenes. Its main contributions are summarized as follows:
(1) For general irregular scene text, this thesis proposes an end-to-end text recognition model assisted by corner and character information. Combining the advantages of end-to-end methods based on coordinate regression and those based on instance segmentation, the model learns a text corner heatmap and a character position heatmap at a small computational cost. The text corner heatmap is used to rectify the inaccurate text corner coordinates produced by the regression-based detection branch. The character position heatmap is used to enhance character center features and assist text recognition. Detection and recognition results on two benchmark datasets validate the effectiveness of this model.
(2) Arched text is difficult to handle because its contour is hard to describe and its curved features are hard to transform into rectangular ones. This thesis proposes an end-to-end arched text recognition model based on arc-align. The detection module of this model locates the control points of the text boundary, namely the starting point, midpoint, and end point. The arc-align structure transforms the arched text feature into a rectangular feature that serves as the input of the recognition module, so the detection and recognition modules can be trained end to end. Experiments on the proposed English coin dataset show that this model preserves the spatial information of arched text and achieves state-of-the-art results in both detection and recognition metrics.
(3) To address the inconsistency between arched text detection and recognition, this thesis proposes an end-to-end arched text recognition model with automatic correction. To allow the recognition loss to be backpropagated to the detection branch, the model replaces the original arc-align sampler with a differentiable feature sampler, so that the recognition result can automatically correct the detection result. To alleviate the mismatch between the inputs of the recognition branch in the training and testing phases, ground-truth and predicted text coordinates are selected in equal proportion for feature sampling. Experiments show that the automatic correction method improves both detection and recognition metrics, and that the model's overall performance far exceeds that of other state-of-the-art methods. |
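The arc-to-rectangle resampling described in (2) can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the three control points (start, midpoint, end) lie on a circular arc along the text center line, and the names `circle_from_three_points`, `bilinear`, `arc_align`, and the parameter `text_height` are hypothetical. Bilinear interpolation is used because its output varies smoothly with the sampling coordinates, which is the property a differentiable sampler as in (3) relies on.

```python
import math


def circle_from_three_points(p0, p1, p2):
    """Return (cx, cy, r) of the circle through three non-collinear points."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    d = 2.0 * (x0 * (y1 - y2) + x1 * (y2 - y0) + x2 * (y0 - y1))
    if abs(d) < 1e-9:
        raise ValueError("control points are collinear")
    s0, s1, s2 = x0 * x0 + y0 * y0, x1 * x1 + y1 * y1, x2 * x2 + y2 * y2
    cx = (s0 * (y1 - y2) + s1 * (y2 - y0) + s2 * (y0 - y1)) / d
    cy = (s0 * (x2 - x1) + s1 * (x0 - x2) + s2 * (x1 - x0)) / d
    return cx, cy, math.hypot(x0 - cx, y0 - cy)


def bilinear(feature, x, y):
    """Interpolate feature[row][col] at (x, y); positions outside count as 0."""
    h, w = len(feature), len(feature[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xi < w and 0 <= yi < h:
                value += wx * wy * feature[yi][xi]
    return value


def arc_align(feature, start, mid, end, text_height, out_h, out_w):
    """Resample an arc-shaped band of `feature` into an out_h x out_w grid.

    Assumption (hypothetical): rows run from the outer edge of the band
    (radius r + text_height / 2) to the inner edge (r - text_height / 2).
    """
    cx, cy, r = circle_from_three_points(start, mid, end)
    a0 = math.atan2(start[1] - cy, start[0] - cx)
    a1 = math.atan2(mid[1] - cy, mid[0] - cx)
    a2 = math.atan2(end[1] - cy, end[0] - cx)
    two_pi = 2.0 * math.pi
    # Pick the sweep direction whose arc from start to end passes through mid.
    to_mid = (a1 - a0) % two_pi
    to_end = (a2 - a0) % two_pi
    sweep = to_end if to_mid <= to_end else to_end - two_pi
    out = []
    for i in range(out_h):
        rho = r + text_height * (0.5 - (i + 0.5) / out_h)
        row = []
        for j in range(out_w):
            t = j / (out_w - 1) if out_w > 1 else 0.5
            theta = a0 + sweep * t
            x = cx + rho * math.cos(theta)
            y = cy + rho * math.sin(theta)
            row.append(bilinear(feature, x, y))
        out.append(row)
    return out


if __name__ == "__main__":
    # Toy 11 x 11 feature map whose value encodes its own position.
    fmap = [[x + 10 * y for x in range(11)] for y in range(11)]
    rect = arc_align(fmap, (2, 5), (5, 2), (8, 5), text_height=2.0, out_h=1, out_w=3)
    print(rect)
```

With a single output row, the sketch samples along the center arc itself, so the three output columns correspond to the start, midpoint, and end control points.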