Video Annotation Based on Multimodal Information Fusion and Its Application (基于多模态信息融合的视频标注及应用)

	Video Annotation Based on Multimodal Information Fusion and Its Application
Alternative title	Multimodality based video annotation and its application
	张一帆
	2009-11-20
Degree type	Doctor of Engineering
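Chinese abstract	Semantic video annotation is an important research direction in information retrieval today. It labels videos with keywords for different semantic concepts according to their content, and it is the necessary foundation for building video indexes and, in turn, for efficient video retrieval. Traditional video annotation methods mainly start from a single modality of the video: they rely on features extracted from the video data to describe its content and establish correspondences between features and semantic concepts. Because of the “semantic gap”, however, analyzing and understanding video content remains quite difficult. In particular, for content-rich, domain-specific videos such as sports and movies, viewers usually care less about generic semantic concepts than about specific people and events. Low-level features alone are not enough to annotate, index, and retrieve this kind of higher-level semantic content. We argue that text related to the video can be obtained from sources outside the video, and that descriptions of the video content and high-level semantic knowledge can be extracted from it to assist in understanding and annotating the video. Taking sports videos and movies as the objects of study, this dissertation mainly investigates how to use external text, through multimodal information fusion, to annotate videos semantically, with the aim of obtaining results as close as possible to fully manual annotation, either automatically or with as little human involvement as possible. Text obtained from external sources generally falls into two cases: text that contains time information, such as the web-cast text of a sports game, which can be aligned with the video in time; and text that contains no time information, such as a film script, which cannot be aligned with the video in time. This dissertation discusses the two cases separately and, building on the annotation results, designs novel video retrieval and browsing methods. The main work and contributions are as follows: 1. For semantic annotation of sports video, we introduce web-cast text as a source of textual information and propose an event detection and semantic annotation method based on multimodal information fusion. In the text analysis, event keywords are generated automatically through text clustering and within-cluster ranking, which enables event detection and semantic extraction in the text. Because the event records in web-cast text contain time information, we use that information to align the text with the video and determine the approximate position of each event in the video. We propose a method based on the conditional random field model to locate event boundaries precisely, generate the video segment for each event, and annotate it with high-level semantic concepts extracted from the text, such as players, teams, and event types, thereby indexing the sports video content. 2. Building on the sports video annotation, we propose a personalized sports video retrieval method. The method takes a text query as the starting point of retrieval. While satisfying the user's explicit query intent, it also captures the user's implicit query intent during interaction with the system by analyzing which videos the user selects and watches, and it builds a user preference model from both high-level semantics and low-level visual features. Based on this model, the initial retrieval results can be re-ranked so that more videos matching the user's personal requirements are returned. 3. In the study of face identification in movies, we extract character names from the film script to identify and label the faces in the movie. Because the script contains no time information and cannot be aligned with the video in time, we propose a global matching method. Compared with traditional local matching methods, this method removes the requirement for time information. Instead, it computes statistics of faces and of names globally in the video and text modalities respectively, and builds a face relationship network and a name relationship network. Correspondences between the vertices of the two networks are then established through graph matching and hypergraph matching to identify the faces. In addition, we further mine the relationships between characters, generate character relationship summaries, and design a novel character-based way of browsing movies.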
English abstract	Semantic video annotation is an important research direction in the field of information retrieval. It is a technique that attempts to detect semantic concepts in videos according to their content, and it is a preliminary step for video indexing and retrieval. Traditional video annotation methods rely mainly on the video itself: they extract low-level features to describe the video content and build relationships between the features and the semantics. Because of the semantic gap, they have difficulty with high-level semantic analysis and understanding. Especially for domain-specific videos such as sports videos and movies, the audience focuses mainly on high-level semantics such as who, what, when, and how. To bridge the gap between low-level features and high-level semantics, we propose to employ external knowledge. In this dissertation, we study multimodality based semantic annotation methods for sports videos and movies. Our methods incorporate external knowledge extracted from the text counterparts of the video, such as the web-cast text of a sports game and the film script of a movie. We show that with this external knowledge we can automatically generate domain-specific semantic concepts and thus obtain annotation results that are comparable to manually labeled ground truth. Based on the video annotation, we also present novel methods for video retrieval and browsing. The main contributions of the dissertation are as follows: 1. We present an approach for sports video event detection and semantic annotation based on analysis and alignment of web-cast text and broadcast video. We first analyze the web-cast text to cluster and detect text events in an unsupervised way. Based on the detected text events and video structure analysis, we employ a conditional random field model to align text events with video events by detecting the event moment and event boundary in the video. Incorporating web-cast text into sports video analysis significantly facilitates sports video event detection and semantic annotation. 2. Based on the sports video annotation, a personalized sports video retrieval method is presented. Video clips are initially retrieved by text query on different semantic attributes. To acquire user preferences, we use clickthrough data as feedback from the user. Relevance feedback is applied on both the text annotation and the visual features to ...
Keywords	Video Annotation; Video Retrieval; Multimodality; Sports Video Analysis; Face Identification
Language	Chinese
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/6228
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	张一帆. 基于多模态信息融合的视频标注及应用[D]. 中国科学院自动化研究所. 中国科学院研究生院,2009.
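
Illustrative note: the abstract describes identifying faces by matching a face relationship network against a name relationship network, without any time alignment. The sketch below is only a toy illustration of that idea and is not the dissertation's algorithm: it swaps in an exhaustive search over all assignments in place of the graph and hypergraph matching the abstract mentions, and the co-occurrence counts and the function name `match_faces_to_names` are hypothetical.

```python
from itertools import permutations


def match_faces_to_names(face_graph, name_graph):
    """Assign each face cluster to one name by comparing relationship graphs.

    face_graph[i][j]: how often face clusters i and j appear together (video).
    name_graph[a][b]: how often names a and b appear together (script).
    Exhaustive search, so it is only practical for a handful of characters.
    This is a stand-in for the thesis's matching method, not a copy of it.
    """
    n = len(face_graph)
    if n != len(name_graph):
        raise ValueError("this toy version needs equally sized graphs")

    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Cost: squared disagreement between every face-face edge weight
        # and the name-name edge weight it would be mapped onto.
        cost = sum(
            (face_graph[i][j] - name_graph[perm[i]][perm[j]]) ** 2
            for i in range(n)
            for j in range(n)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost


# Hypothetical co-occurrence counts for three characters.
name_graph = [
    [0, 5, 1],
    [5, 0, 2],
    [1, 2, 0],
]
# The same relationships seen in the video, with clusters in another order.
face_graph = [
    [0, 1, 2],
    [1, 0, 5],
    [2, 5, 0],
]

assignment, cost = match_faces_to_names(face_graph, name_graph)
print(assignment, cost)
```

In this toy input the three edge weights are all different, so only one assignment reproduces the name graph exactly. Real data would be noisy and would have unequal numbers of face clusters and names, which is where a proper graph matching formulation becomes necessary.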

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20061801462806 (4988KB)			Not yet open	CC BY-NC-SA