English Abstract

The amount of digital video grows explosively nowadays, so it is valuable research to make computers comprehend these multimedia documents, extract semantic information, and support applications such as video management, information retrieval, and data mining. Caption texts embedded in videos are closely related to the video content and easier to extract than other semantic features, so they serve as an important clue for video content comprehension. We aim to design a prototype system that extracts video text information. In processing order, the system comprises pretreatment, text detection, text extraction, and recognition; our research focuses on the first three parts.

1. Pretreatment includes video decoding, image quality assessment, and system initialization. In real applications the system must process videos of varying image quality, and the processing method should adapt accordingly. In this paper we propose a no-reference image quality assessment method. First, following natural scene statistics, extract features from amplitude fall-off curves and positional similarity on the image, and build the feature vector accordingly. Second, train a general regression neural network (GRNN) to predict image quality.

2. Text detection locates text boxes in frame images. We propose a fast and effective method. First, compute the edge image of a frame and repair it to cope with broken and conglutinated edges. Second, label all connected components (CCs) in the edge image, partly filter out those belonging to the background, sort the remaining CCs by position, and search for corresponding CCs to build text boxes under geometric constraints. Finally, merge text boxes to eliminate duplicate detection results, and verify them to eliminate false alarms.

3. Text extraction extracts text strokes from text boxes.
Within a text box, the color of the characters cannot be determined in advance and many disturbances from the background exist, so a text extraction procedure is needed. We propose a robust extraction method. First, binarize the candidate text box and perform polarity estimation to determine on which polarity the text occurs. Second, perform multi-frame verification and enhancement on the text boxes, exploiting their temporal redundancy in the video. Finally, binarize the candidate text boxes, apply CC filtering to remove background disturbances, and generate a clean binary image for recognition.

We describe the experimental data set and results, which confirm the effectiveness and efficiency of our methods.
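The GRNN used in the pretreatment stage is, at its core, Gaussian-kernel regression: a predicted quality score is a distance-weighted average of training scores. A minimal sketch of that prediction step follows; the feature vectors here are illustrative stand-ins for the amplitude fall-off and positional-similarity features described above, and the smoothing parameter `sigma` is an assumption.

```python
import math

def grnn_predict(train_x, train_y, x, sigma=0.5):
    """GRNN (Nadaraya-Watson kernel regression) prediction:
    a Gaussian-weighted average of the training quality scores.
    train_x: list of feature vectors; train_y: quality scores."""
    weights = []
    for xi in train_x:
        d2 = sum((a - b) ** 2 for a, b in zip(xi, x))
        weights.append(math.exp(-d2 / (2 * sigma ** 2)))
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)
```

With a small `sigma` the prediction collapses onto the nearest training sample; larger values smooth across samples, which is the only "training" a GRNN needs beyond storing the data.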
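The CC filtering and geometric grouping in the detection stage (step 2) can be sketched in pure Python. The helper names and thresholds below are hypothetical, and the edge image and CC bounding boxes themselves would come from an image library; the sketch only shows the grouping logic on `(x, y, w, h)` boxes.

```python
def filter_components(boxes, min_h=8, max_aspect=5.0):
    """Drop CCs whose geometry is implausible for a character
    (illustrative thresholds: too short, or too wide and flat)."""
    return [(x, y, w, h) for (x, y, w, h) in boxes
            if h >= min_h and w / h <= max_aspect]

def group_into_lines(boxes, y_tol=5, gap_tol=20):
    """Sort CCs left-to-right and chain horizontally aligned
    neighbours into candidate text boxes (the geometric constraint)."""
    lines = []
    for box in sorted(boxes):                 # sort by x, then y
        x, y, w, h = box
        for line in lines:
            lx, ly, lw, lh = line[-1]
            if abs(y - ly) <= y_tol and x - (lx + lw) <= gap_tol:
                line.append(box)              # extend an existing chain
                break
        else:
            lines.append([box])               # start a new chain
    merged = []                               # one bounding box per chain
    for line in lines:
        x0 = min(b[0] for b in line)
        y0 = min(b[1] for b in line)
        x1 = max(b[0] + b[2] for b in line)
        y1 = max(b[1] + b[3] for b in line)
        merged.append((x0, y0, x1 - x0, y1 - y0))
    return merged
```

A subsequent amalgamation pass over the merged boxes, plus a verification classifier, would then remove duplicates and false alarms as described above.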
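The polarity estimation and binarization steps of the extraction stage can be illustrated with a simplified sketch. The border-versus-global intensity comparison and the fixed threshold are assumptions for illustration, not the exact method of the thesis; text boxes are modeled as small grayscale grids.

```python
def estimate_polarity(gray):
    """Guess whether the text is brighter or darker than its background
    by comparing border pixels (mostly background) with the global mean."""
    h, w = len(gray), len(gray[0])
    border = [gray[y][x] for y in range(h) for x in range(w)
              if y in (0, h - 1) or x in (0, w - 1)]
    global_mean = sum(sum(row) for row in gray) / (h * w)
    border_mean = sum(border) / len(border)
    return "light-on-dark" if global_mean > border_mean else "dark-on-light"

def binarize(gray, polarity, thresh=128):
    """Fixed-threshold binarization; 1 marks candidate stroke pixels,
    on whichever side of the threshold the estimated polarity indicates."""
    if polarity == "light-on-dark":
        return [[1 if v > thresh else 0 for v in row] for row in gray]
    return [[1 if v < thresh else 0 for v in row] for row in gray]
```

In the full pipeline, the resulting binary map would then pass through multi-frame verification and CC filtering before being handed to the recognizer.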