Research on Models and Algorithms for Video Human Action Understanding with Limited Annotations
李定
2023-05-21
Pages: 130
Subtype: Doctoral
Abstract

Digital video, as an important carrier of visual information, has been widely applied in every aspect of daily life. Most of the content in the massive volume of video data is related to human actions and faithfully reflects people's everyday production and living conditions. Understanding and analyzing human actions from video data is therefore of great significance for promoting social progress and building a better life. In recent years, the number of human action videos has surged and their content has become increasingly complex; earlier methods that process human action data with rules designed from prior knowledge can no longer meet the demands of current applications, and how to automatically mine and understand the latent information in human action videos is an urgent open problem. Against this background, computer-vision-based pattern recognition techniques and machine learning algorithms have been widely studied and applied, and human action understanding is becoming a hot topic in computer vision research. However, existing human action understanding methods consume enormous human and material resources on fine-grained manual annotation of training data; the overall annotation process is tedious and costly. This gives rise to an important research question: how to train models for human action understanding under limited-annotation conditions, achieving satisfactory action classification and localization performance with as little annotation cost as possible.
This thesis focuses on video human action understanding with limited annotations. Starting from the sources of annotation cost, it divides the limited-annotation condition into three specific scenarios: annotation shortage, weak annotation, and no annotation. Building on three concrete tasks, namely temporal action localization, action moment retrieval, and skeleton-based action recognition, it studies how to fully exploit the supervisory information contained in annotations and raw data, and attempts to build action classification and localization models suited to the above scenarios. The main contributions of this thesis are as follows:
1. An active semi-supervised temporal action localization model, AL-STAL, is proposed. To address the problem that random labeling ignores the differences in annotation value among samples, an active temporal localization framework for action instances is constructed: relying on only a small amount of labeled data, the model actively discovers samples of high annotation value and completes sample annotation and model training progressively. Based on the entropy of the class distribution over candidate proposals, a sample selection criterion, TPE, is designed to assign reasonable labeling priorities to different samples according to their classification uncertainty. Exploiting the temporal nature of action video clips, a sample evaluation method based on temporal context inconsistency, TCI, is proposed to assess the annotation value of a sample from the contextual relations among its action proposals. Experimental results on three temporal action localization benchmarks (THUMOS'14, ActivityNet 1.3, and ActivityNet 1.2) show that the proposed method achieves better localization performance under the same annotation budget and reaches the same localization performance with fewer annotations.
2. A multi-scale 2D representation learning model for video moments, MS-2D, is proposed. To address the problem that the quality evaluation of action proposals becomes unreliable when temporal boundary annotations are missing, a weakly supervised proposal evaluation network is proposed that fully exploits the contextual relations among candidate segments; relying only on a textual description matched to the whole video, it drives the model to retrieve the action moment that fits the described semantics. For video samples that are semantically similar but differ greatly in temporal scale, a multi-scale 2D temporal feature map is proposed, which uses temporal sampling at different scales to cover action segments of highly variable duration. To ensure stable model training, a text-reconstruction-guided cross-entropy loss, RG-BCE Loss, is proposed, which generates pseudo labels according to the quality of the reconstructed text and thereby supervises proposal evaluation. Experimental results on two moment retrieval benchmarks (Charades-STA and ActivityNet-Captions) show that the proposed method effectively improves weakly supervised moment retrieval performance.
3. A cross-stream self-supervised skeleton-based action recognition model, CSCLR, is proposed. To address the problem that positive pairs within a single data stream are too similar, making the contrastive pretext task too easy to accomplish, a cross-stream contrastive learning method is proposed that exploits the information differences among data streams to introduce harder positive pairs into the pretext task, effectively improving the model's ability to understand complex motion patterns. Beyond data augmentation, a feature transformation strategy, PFT, is proposed to synthesize new positive pairs, effectively enlarging the differences between positive pairs at the feature level and further strengthening the effect of contrastive learning. CSCLR is evaluated under the linear, fine-tuning, and semi-supervised protocols on three action recognition benchmarks (NTU-60, NTU-120, and PKU-MMD); the results show that the proposed cross-stream self-supervised model significantly enhances the discriminability of skeleton action features and effectively improves recognition performance on downstream tasks.

Other Abstract

Digital videos, as an important carrier of visual information, have been widely used in our daily lives. Most of the content in this tremendous volume of video data is related to human actions and faithfully reflects the state of everyday production and life. Understanding and analyzing human actions from video data will significantly promote social progress and lead to a better life. In recent years, human action video content has grown rapidly and become increasingly complex, so previous rule-based methods can no longer meet the demands of current applications, and how to automatically mine and understand the latent information in human action data remains unsolved. In this context, computer-vision-based pattern recognition techniques and machine learning algorithms have been extensively studied and applied, and human action understanding is becoming a hot topic in the field of computer vision. However, most existing methods require immense resources for large-scale, precise annotation; the annotation process is tedious and costly. This gives rise to an important research question: how to understand human actions with limited labels, and how to achieve satisfactory performance in action classification and localization with as little labeling cost as possible.
This thesis focuses on video human action understanding with limited labels and divides the limited-label condition into three specific scenarios: shortage of labels, weak labels, and no labels. Building on temporal action localization, video action moment retrieval, and skeleton-based action recognition, it studies how to fully utilize the information in limited labels and raw data to construct classification and localization models applicable to these scenarios. The main contributions of this thesis are summarized below:
1. A semi-supervised temporal action localization model, AL-STAL, is proposed to actively localize action instances. Passive learning with random sampling overlooks the differences in value among the samples to be labeled. To address this issue, an active learning framework tailored to temporal action localization is constructed: with only a few annotations, AL-STAL actively selects samples of high annotation value and annotates samples and trains the localizer progressively. Two scoring functions, TPE and TCI, are then presented to evaluate the informativeness of unlabeled video samples with different quantitative metrics. Based on the entropy of the predicted category distribution, TPE serves as a measure of model uncertainty and is used to prioritize labeling highly informative samples (an illustrative entropy-scoring sketch follows this list). Exploiting the characteristics of video clips, TCI scores samples by the inconsistency of relationships within the temporal context. Experiments are conducted on popular temporal action localization benchmarks, i.e. THUMOS'14, ActivityNet 1.3, and ActivityNet 1.2, with both localization performance and label-cost saving used for evaluation. The results demonstrate that the proposed method achieves better localization performance under the same label budget and saves a large amount of labeling cost while reaching the same localization performance.
2. A novel multi-scale 2D representation model, MS-2D, is proposed for weakly supervised video moment retrieval. When temporal boundary annotations are missing, the evaluation of video segment proposals becomes inaccurate. To cope with this issue, a weakly supervised segment evaluation network is proposed to mine the temporal relations in the context: with only video-level annotation (a text description), MS-2D can retrieve the action segment that matches the query semantics. For samples that are semantically similar but vary greatly in temporal length, a multi-scale 2D temporal map is proposed that uses temporal sampling at multiple scales to cover video segments of widely varying duration (an illustrative construction sketch follows this list). For model training, the RG-BCE loss is proposed to generate pseudo labels according to the quality of the reconstructed text queries, which serve as supervision for proposal evaluation. Experimental results on two benchmark datasets (Charades-STA and ActivityNet Captions) show that the proposed method effectively boosts retrieval performance.
3. A novel cross-stream contrastive learning model, CSCLR, is proposed for self-supervised skeleton-based action recognition; it contrasts pairwise features extracted from different data streams. Positive pairs within a single stream are quite similar, so the contrastive pretext task is too easily accomplished. To remedy this, CSCLR introduces hard positives by exploiting the discrepancy in latent information across data streams, thereby improving its ability to capture complicated movement patterns (an illustrative cross-stream contrastive-loss sketch follows this list). In addition to data augmentation, a positive feature transformation strategy is proposed to synthesize new positives, which further increases the discrepancy between positive pairs. Following the linear, fine-tuned, and semi-supervised evaluation protocols, experiments are conducted on three popular benchmark datasets (NTU-RGBD-60, NTU-RGBD-120, and PKU-MMD). The results verify that the proposed method extracts more discriminative features and improves performance on downstream tasks.
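To make the entropy-based selection criterion of contribution 1 concrete, the following is a minimal illustrative sketch, not the thesis implementation: the function names, the per-proposal probability input, and the mean-pooling over proposals are assumptions. It scores an unlabeled video by the mean class-distribution entropy of its candidate proposals and picks the most uncertain videos for the next annotation round.

```python
# Illustrative TPE-style scoring sketch; not the AL-STAL code.
import numpy as np

def proposal_entropy_score(proposal_probs: np.ndarray, eps: float = 1e-8) -> float:
    """Mean class-distribution entropy over a video's candidate proposals.

    proposal_probs: (num_proposals, num_classes), each row a softmax distribution.
    Higher entropy means the localizer is less certain, i.e. the video is
    assumed to be more informative to annotate.
    """
    entropy = -np.sum(proposal_probs * np.log(proposal_probs + eps), axis=1)
    return float(entropy.mean())

def select_videos_to_label(per_video_probs: list, budget: int) -> list:
    """Return indices of the `budget` most uncertain unlabeled videos."""
    scores = [proposal_entropy_score(p) for p in per_video_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]

# Toy usage: two videos, three proposals each, four action classes.
rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(4), size=3) for _ in range(2)]
print(select_videos_to_label(probs, budget=1))
```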
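The multi-scale 2D temporal map of contribution 2 can be pictured with the following sketch, assuming mean-pooled clip features and a small set of sampling strides; the function names, the pooling choice, and the stride values are illustrative, not the MS-2D implementation. Entry (i, j) of each map represents the segment from clip i to clip j at that stride, so coarser strides cover longer segments with the same map size.

```python
# Illustrative multi-scale 2D temporal map construction; not the MS-2D code.
import torch

def build_2d_proposal_map(clip_feats: torch.Tensor):
    """clip_feats: (T, D) per-clip features.

    Returns a (T, T, D) map whose entry (i, j), i <= j, is the mean-pooled
    feature of the segment spanning clips i..j, plus a boolean validity mask
    for the upper triangle (start <= end)."""
    T, D = clip_feats.shape
    prefix = torch.cat([torch.zeros(1, D), clip_feats.cumsum(dim=0)], dim=0)
    feat_map = torch.zeros(T, T, D)
    mask = torch.zeros(T, T, dtype=torch.bool)
    for i in range(T):
        for j in range(i, T):
            feat_map[i, j] = (prefix[j + 1] - prefix[i]) / (j - i + 1)
            mask[i, j] = True
    return feat_map, mask

def multi_scale_maps(clip_feats: torch.Tensor, strides=(1, 2, 4)):
    """Build one 2D map per temporal stride so that both short and very long
    segments are represented at a manageable map resolution."""
    return [build_2d_proposal_map(clip_feats[::s]) for s in strides]

# Toy usage: 16 clips with 32-dim features; map sizes 16x16, 8x8, 4x4.
maps = multi_scale_maps(torch.randn(16, 32))
print([m[0].shape for m in maps])
```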
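A minimal sketch of the cross-stream contrastive idea in contribution 3, assuming two skeleton streams (e.g. joint and bone) encoded into the same embedding space; this is an illustrative reconstruction, not the released CSCLR code. The positive for a joint-stream embedding is the other stream's embedding of the same clip, and all other clips in the batch act as negatives.

```python
# Illustrative cross-stream InfoNCE loss; encoder outputs are assumed given.
import torch
import torch.nn.functional as F

def cross_stream_info_nce(z_joint: torch.Tensor,
                          z_bone: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """z_joint, z_bone: (N, D) embeddings of the same N clips produced by
    encoders of two different skeleton data streams. Diagonal pairs are
    positives; off-diagonal pairs serve as negatives."""
    z_joint = F.normalize(z_joint, dim=1)
    z_bone = F.normalize(z_bone, dim=1)
    logits = z_joint @ z_bone.t() / temperature        # (N, N) cosine similarities
    targets = torch.arange(z_joint.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random features standing in for encoder outputs.
loss = cross_stream_info_nce(torch.randn(32, 128), torch.randn(32, 128))
print(float(loss))
```

A PFT-style positive could additionally be synthesized at the feature level (for example by transforming or mixing embeddings of the same clip) before computing this loss, further hardening the positives; that step is omitted from the sketch.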

Keyword: Limited Annotation; Human Action Understanding; Active Learning; Video Moment Retrieval; Self-supervised Learning
Language: Chinese
Sub-direction classification: Image and Video Processing and Analysis
Planning direction of the national key laboratory: Visual Information Processing
Paper associated data
Document Type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/52217
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems, Artificial Intelligence and Machine Learning (杨雪冰) Technical Team
Recommended Citation
GB/T 7714
李定. 标注受限视频人体行为理解模型与算法研究[D],2023.
Files in This Item:
File Name/Size  DocType  Version  Access  License
Thesis.pdf (8391KB)  Dissertation  Open Access  CC BY-NC-SA