Research on Text Detection and Tracking Methods in Video

Title	Research on Text Detection and Tracking Methods in Video
Author	张峻博 (Zhang Junbo)
Date	2023-05-27
Pages	90
Degree type	Master's
Abstract	Extracting text from scene images and videos is an important research topic. Compared with still images, videos carry rich temporal information, but their scenes are more complex and often suffer from motion blur, illumination changes, and viewpoint jitter, which makes text detection and recognition more challenging. This thesis studies text sequence detection in video, that is, the video text detection and tracking tasks. The main contributions are as follows:
1. A large-scale bilingual road-scene video text dataset, BiRViT-1K, is constructed and released with accurate annotations; it can be used for video text detection, tracking, and recognition. Benchmark experiments on text detection and text tracking are conducted on the dataset to promote research in video text processing. The dataset can be downloaded from http://www.nlpr.ia.ac.cn/databases/CASIA-BiRViT1K/.
2. A video text detection method based on robust feature representation is proposed. In addition to the features of each text instance itself, the method uses the stable relative positions of spatially adjacent text instances to construct topological features, and an adaptive feature fusion network dynamically fuses the different kinds of text features into a robust representation, which improves both detection and tracking performance. Experiments on multiple video text datasets show that the method detects and tracks text instances more accurately and more stably.
3. An end-to-end video text detection model based on a sequence Transformer is proposed. The model treats video text detection and tracking as a sequence decoding problem: it models the long-range temporal context dependencies of text instances and decodes the detection and tracking tasks in parallel through sequence prediction. It requires no anchors, non-maximum suppression, or tracking-matching branch, which greatly simplifies the framework. Experiments on multiple video text datasets show that introducing long-range temporal information from the video improves text detection and tracking performance. The model can also be applied directly to scene text detection, unifying scene text detection with video text detection and tracking for the first time, and it achieves advanced text detection performance on multiple scene text datasets.
Keywords	video text detection; text tracking; BiRViT-1K; robust feature representation; Transformer
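Subject area	Artificial intelligence; pattern recognition
Discipline category	Engineering; Engineering::Computer Science and Technology (degrees in engineering or science may be conferred)
Language	Chinese
Seven major directions: sub-direction classification	Image and video processing and analysis
State Key Laboratory planned direction classification	Visual information processing
Associated dataset requiring deposit	No
Document type	Thesis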
Item identifier	http://ir.ia.ac.cn/handle/173211/52113
Collection	Graduates_Master's Theses; State Key Laboratory of Multimodal Artificial Intelligence Systems_Pattern Analysis and Learning
Corresponding author	张峻博 (Zhang Junbo)
Recommended citation (GB/T 7714)	张峻博. 视频中的文本检测与跟踪方法研究[D],2023.

Files in this item
File name/size	Document type	Version type	Access type	License
视频中的文本检测与跟踪方法研究_签名版.（24487KB）	Thesis		Restricted access	CC BY-NC-SA