From Video to Language: A Study of Video Captioning and Title Generation Methods
Author: 张子琦
Date: 2022-05
Pages: 150
Degree: Doctoral
Chinese Abstract

    With the rapid development of deep learning in computer vision and natural language processing, the two fields have gradually converged and come to reinforce each other. As an important way of bridging the visual and linguistic modalities, the video captioning task has attracted growing attention from researchers. Research on video captioning extends the scope of traditional video semantic representation and realizes the fusion and semantic alignment of the two modalities, giving it significant scientific value. Video captioning also has promising prospects in application scenarios such as autonomous driving, assistance for visually impaired people, and surveillance event summarization. In e-commerce and short-video content creation, its extension, video title generation, can automatically produce customized, personalized, and attractive video titles, thereby increasing the consumption of products and video content, and thus carries direct commercial value.

    Starting from the practical problem of how to represent video in language, this thesis completes the following three innovative works at three progressive levels: solving inherent problems of the traditional video captioning task with a new captioning method, expanding the knowledge scope of video captioning models with a new open-book captioning paradigm, and extending the practical application scenarios of video captioning with the new task of video title generation:

(1) Video captioning based on object relational graphs and teacher-recommended learning

    To address the lack of object-relation modeling in existing video captioning methods, this thesis proposes a visual encoder based on object relational graphs. By constructing a partial object relational graph and a complete object relational graph, the encoder learns object relations through relational reasoning and enhances fine-grained object representations. The thesis further finds that the long-tailed vocabulary distribution of video captioning corpora leaves words related to the video content insufficiently trained. It therefore proposes a teacher-recommended learning strategy, which takes a large-scale pre-trained external language model as a teacher and transfers its linguistic knowledge to the captioning model through knowledge distillation. Being a pure training strategy, the method adds no computation at inference time. Experiments show that the approach enriches fine-grained object representations, reduces the difficulty of training the model, and achieved the best results of its time on several public datasets.
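As an illustration of the relational-reasoning idea, here is a minimal sketch in which detected-object features exchange messages over a similarity-based graph; the layer, shapes, and names below are illustrative assumptions, not the thesis's actual encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectRelationLayer(nn.Module):
    """One graph-reasoning step: each object feature is updated by
    aggregating messages from the objects it is related to."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (num_objects, dim), e.g. region features from a detector.
        # Build a soft relation graph from pairwise feature similarity.
        adj = torch.softmax(obj_feats @ obj_feats.t(), dim=-1)  # (N, N)
        # Message passing with a residual connection keeps the original
        # object identity while mixing in relational context.
        return obj_feats + F.relu(self.proj(adj @ obj_feats))

feats = torch.randn(10, 512)                  # 10 objects, 512-d features
enhanced = ObjectRelationLayer(512)(feats)    # relation-aware features
```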

(2) Open-book video captioning based on a retrieve-copy-generate network

    In traditional video captioning methods, all knowledge is stored in the model parameters and is hard to extend once training is finished. Inspired by how humans answer questions, looking up materials based on experience and consulting relevant knowledge, this thesis proposes a new paradigm of open-book video captioning. It converts the conventional process, input a video and output a caption, into an open-book process: input a video, consult related documents, refer to expressions relevant to the video content, and output a caption. To realize this process, the thesis proposes a novel retrieve-copy-generate network. A pluggable video-text retriever extracts retrieved sentences from an external knowledge base, and a caption generator with a copy mechanism picks expressions relevant to the video content out of the multiple retrieved sentences. The model's knowledge scope can be expanded simply by swapping in retrievers and retrieval databases from different domains. Extensive experiments demonstrate the effectiveness of the paradigm; cross-dataset experiments, in particular, verify that the model transfers knowledge across domains.
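A minimal sketch of the retrieval step, assuming videos and sentences are already embedded in a shared space; the encoders and the external corpus are placeholders. Swapping corpus_embs and corpus_texts for another domain's corpus is precisely what makes the retriever "pluggable".

```python
import torch
import torch.nn.functional as F

def retrieve_topk(video_emb: torch.Tensor,
                  corpus_embs: torch.Tensor,
                  corpus_texts: list,
                  k: int = 5) -> list:
    """Return the k corpus sentences most similar to the video."""
    video_emb = F.normalize(video_emb, dim=-1)      # (dim,)
    corpus_embs = F.normalize(corpus_embs, dim=-1)  # (num_sentences, dim)
    scores = corpus_embs @ video_emb                # cosine similarities
    topk = scores.topk(min(k, len(corpus_texts))).indices
    return [corpus_texts[i] for i in topk]
```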

(3) A Chinese short-video retrieval and title generation benchmark

    Most previous video captioning work aims at objective descriptions of video content; the results lack subjectivity and appeal, which limits practical applications. This thesis proposes extending the video captioning task to video title generation. Unlike a caption, a video title must both be semantically faithful to the video and contain engaging content, so that it has both "findability" and "clickability". To support this task, the thesis builds the first Chinese short-video retrieval and title generation benchmark, which contains a high-quality, finely annotated dataset of 210K short videos and two large-scale weakly annotated pre-training datasets of 3M/10M videos, covering 51 categories, 50K+ video tags, 537K manually annotated titles and captions, and 10M+ Chinese short videos. In addition, it proposes a new video-text pre-training model with a tag-driven video-text alignment module and a GPT-2 based text generation module, supporting video-text alignment and generation simultaneously. This work supports research on and applications of Chinese short-video title generation and cross-modal video retrieval.
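The alignment side of such a pre-training model is commonly trained with a contrastive objective. The sketch below shows a symmetric InfoNCE-style loss as one plausible form of video-text alignment; it is not the thesis's exact tag-driven module.

```python
import torch
import torch.nn.functional as F

def alignment_loss(video_embs, text_embs, temperature: float = 0.07):
    # video_embs, text_embs: (batch, dim); pair i is a matched video-text pair.
    v = F.normalize(video_embs, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    logits = v @ t.t() / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(len(v))      # diagonal entries are the positives
    # Symmetric cross-entropy: video-to-text and text-to-video retrieval.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```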

English Abstract

With the development of deep learning technology in computer vision and natural language processing, the two fields have gradually converged and come to promote each other. Video captioning has attracted researchers' attention as an important way to bridge vision and language. The study of video captioning expands the scope of traditional video semantic representation and realizes the fusion and semantic alignment of vision and language, giving it essential scientific research value. Meanwhile, video captioning has promising prospects in application scenarios such as autonomous driving, assisting visually impaired people, and summarizing surveillance events. In e-commerce and short-video content creation, the extension of video captioning, video titling, can automatically generate customized, personalized, and attractive video titles that increase the consumption of goods and video content, and thus has practical commercial value.

 

Starting from the practical problem of "how to represent video in language", this thesis completes three innovative works at three progressive levels: 1) a new method that solves inherent problems of the traditional video captioning task, 2) a new open-book paradigm that expands the knowledge scope of video captioning models, and 3) a new video titling task that extends the practical application scenarios of video captioning. The three works are as follows:

 

(1) Video captioning with object relational graphs and teacher-recommended learning.

To address the failure of existing video captioning methods to model the relationships between objects in a video, this thesis proposes a visual encoder based on an object relational graph. By constructing a partial object relational graph and a complete object relational graph, the encoder learns object relationships through relational reasoning and enhances fine-grained object representations. In addition, this thesis finds that the long-tailed vocabulary distribution of video captioning corpora leads to insufficient training of words related to the video content. It therefore proposes a teacher-recommended learning strategy that uses a large-scale pre-trained external language model as a guide and imparts its linguistic knowledge to the video captioning model through knowledge distillation. As a training strategy, the method does not increase the amount of computation in the inference phase. Experiments show that the approach enhances the fine-grained object representation of the video, improves the model's generalization, and obtained state-of-the-art results of its time on multiple public datasets.
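The distillation step can be pictured as mixing the usual ground-truth loss with a soft loss against the external language model's predictions. The sketch below is a standard temperature-scaled distillation objective, assuming per-step logits from both models; the weight and temperature are illustrative hyperparameters, not values from the thesis.

```python
import torch.nn.functional as F

def teacher_recommended_loss(student_logits, teacher_logits, target_ids,
                             alpha: float = 0.5, T: float = 2.0):
    # student_logits, teacher_logits: (seq_len, vocab); target_ids: (seq_len,)
    hard = F.cross_entropy(student_logits, target_ids)   # ground-truth words
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T       # teacher's soft targets
    # The teacher is used only during training, so inference cost is unchanged.
    return (1 - alpha) * hard + alpha * soft
```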

 

(2) Open-book video captioning with a retrieve-copy-generate network.

In a traditional video captioning method, knowledge is stored entirely in the model parameters and cannot easily be extended after training. Inspired by the way humans answer questions, finding information based on experience and referring to relevant knowledge, this thesis proposes a new paradigm of open-book video captioning. This paradigm converts the original video captioning process of inputting a video and outputting a caption into an open-book process: input a video, consult related documents, refer to expressions related to the video content, and output a caption. To achieve this process, this thesis proposes a novel retrieve-copy-generate network. A pluggable, pre-trained video-text retriever extracts retrieved sentences from an external knowledge base, and a caption generator with a copy mechanism extracts expressions related to the video content from the multiple retrieved sentences. This approach can expand the knowledge scope of the model by changing the retrievers and retrieval databases across domains. Extensive experiments on several benchmarks show the validity of the open-book paradigm; cross-dataset experiments, in particular, verify the transfer effect of the model on knowledge from different fields.
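One common way to realize such a copy mechanism is a pointer-generator style mixture: a learned gate interpolates between the decoder's vocabulary distribution and an attention-weighted copy distribution over the retrieved sentences' tokens. The sketch below assumes these quantities are already computed and only shows the mixing step; the names are illustrative.

```python
import torch

def copy_step(vocab_dist, attn, retrieved_ids, p_gen):
    # vocab_dist: (vocab_size,) decoder's generation distribution
    # attn: (num_retrieved_tokens,) attention over retrieved-sentence tokens
    # retrieved_ids: (num_retrieved_tokens,) vocabulary ids of those tokens
    # p_gen: scalar gate in (0, 1) choosing between generating and copying
    final = p_gen * vocab_dist
    # Route the remaining probability mass onto the retrieved words' ids.
    final = final.scatter_add(0, retrieved_ids, (1 - p_gen) * attn)
    return final   # mixture distribution over the next output word
```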

    

(3) A new Chinese short-video benchmark for video retrieval and video title generation.

Most previous video captioning work mainly aims to produce objective descriptions of video content, which lack subjectivity and appeal and thus limit the practical application scenarios. This thesis therefore extends the video captioning task to video titling. Unlike a caption, a video title must both express semantics relevant to the video and contain engaging content, so as to have both "findability" and "clickability". To achieve this goal, this thesis constructs the first Chinese short-video retrieval and title generation benchmark, which contains a high-quality, finely labeled dataset of 210K videos and two large-scale weakly labeled pre-training datasets of 3M/10M videos, covering 51 categories, 50K+ video tags, 537K manually labeled titles and captions, and 10M+ Chinese short videos. In addition, a new video-text pre-training model is proposed, which provides a tag-driven video-text alignment module and a GPT-2 based text generation module, and is capable of performing multimodal alignment and generation simultaneously. This work provides a basis for research on and applications of Chinese short-video title generation and cross-modal retrieval.
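A GPT-2 based generation module typically consumes visual information as a prefix of projected embeddings prepended to the token embeddings. The sketch below shows this prefix-conditioning recipe as one plausible reading; the checkpoint name, feature dimension, and wiring are assumptions, not the thesis's exact module.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")  # placeholder; a Chinese
                                                # checkpoint would be used here
proj = nn.Linear(512, gpt2.config.n_embd)       # video features -> GPT-2 width

def title_logits(video_feats: torch.Tensor, input_ids: torch.Tensor):
    # video_feats: (batch, n_frames, 512); input_ids: (batch, seq_len)
    prefix = proj(video_feats)                   # visual prefix embeddings
    tokens = gpt2.transformer.wte(input_ids)     # text token embeddings
    embeds = torch.cat([prefix, tokens], dim=1)  # condition text on the video
    return gpt2(inputs_embeds=embeds).logits     # next-token logits
```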

 

Keywords: vision and language; video captioning; video title generation; external language model; open-book video captioning; Chinese short video-text benchmark; large-scale multimodal pre-training
Language: Chinese
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/48747
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems_Video Content Security
Graduates_Doctoral dissertations
Corresponding author: 张子琦
Recommended citation (GB/T 7714):
张子琦. 从视频到语言:视频描述和标题生成方法研究[D]. 中国科学院自动化研究所, 2022.
Files in this item:
从视觉到语言_视频描述和标题生成方法研究 (19170 KB), dissertation, open access, license: CC BY-NC-SA