CASIA OpenIR > Graduates > Doctoral Dissertations
Research on Temporal Action Detection Techniques for Untrimmed Videos
陈云泽
2023-08
Pages: 112
Degree Type: Doctoral
Chinese Abstract

The temporal action detection task is an important foundation of action analysis: it extracts action segments of interest to users or downstream tasks from untrimmed raw videos. The task has broad applications in intelligent surveillance, video editing, autonomous driving, and other fields. With the rapid development of deep learning, temporal action detection has achieved remarkable results in recent years. However, in both research and application, the task still faces a series of challenges. First, existing feature extractors for temporal action detection can only heuristically select information from a single feature layer, ignoring the information in other layers and thereby losing feature information. In addition, the accuracy of temporal action detection models depends heavily on the accuracy of the temporal boundary annotations in the training data; the complexity and diversity of real-world actions can make these annotations ambiguous, misleading the model and degrading its accuracy. Finally, the positive/negative sample assignment strategies of temporal action detection frameworks differ between the training and testing stages, so the trained model struggles to meet the requirements of testing. To overcome these difficulties and comprehensively improve the accuracy of temporal action detection models, this thesis conducts research from three aspects: fusing multi-scale feature information, reducing the impact of ambiguous annotations, and balancing the distribution of positive and negative samples. Building on the three proposed temporal action detection algorithms, this thesis further constructs an action analysis system for wide-area scenes, enabling fast action capture and analysis at long range and over a wide field of view. The main research contents and contributions of this thesis are as follows:

1. To address the loss of multi-scale feature information in temporal action detection, this thesis proposes a feature-fusion-based one-stage temporal action detection method that significantly improves the performance of one-stage networks. The method uses a deep network to predict the weights for fusing features at different scales, so as to better associate them. In addition, a loss function based on the Gaussian distribution is designed to give a preliminary estimate of the confidence of predicted temporal boundaries. Based on these two modules, a high-accuracy one-stage temporal action detection framework, Tb-SSAD, is constructed. Tb-SSAD raises mAP@0.5 on the THUMOS14 dataset from 37.9% to 41.6%.

2. To address the ambiguity of temporal boundary annotations, this thesis proposes an uncertainty-based temporal action detection method that can be applied to several mainstream frameworks (one-stage, two-stage, and anchor-free) and yields significant gains in each. The method uses a Gaussian distribution to capture the ambiguity of temporal boundaries and obtain an uncertainty variance. A class-uncertainty indicator is also introduced to avoid over-correcting classes with high uncertainty. At test time, based on the uncertainty variance and the class uncertainty, a reliability scoring module and a variance voting module are constructed to select more reliable proposals and to refine the boundary coordinates of each proposal, respectively. On the THUMOS14 dataset, the proposed uncertainty-based method improves mAP@0.5 by 1.7% for a two-stage method and by 2.0% for an anchor-free method.

3. To address the discrepancy between the positive/negative sample assignment strategies used in training and testing, this thesis proposes a temporal action detection method that assigns sample weights via curriculum learning; it significantly improves performance on multiple datasets while adding only a few parameters and little inference time. The method builds an auxiliary branch and trains it with a curriculum-learning weight assignment paradigm. During training, the weight of each sample is a combination of its localization and classification scores. Early in training, the network emphasizes the more reliable classification branch; later, it focuses on the localization branch, which has a greater impact on model performance. On the THUMOS14 dataset, the method improves the baseline from 55.5% to 57.6%; on the ActivityNet dataset, from 34.4% to 35.4%. Compared with the baseline, the proposed algorithm increases the number of parameters by only 1% and the inference time by just 1.6 milliseconds.

4. To address the low target resolution, slow search speed, and poor localization accuracy found in wide-area scenes, this thesis proposes a wide-area action analysis system that provides high-definition imaging, efficient search, accurate localization, and action analysis of targets. The system consists of a wide-angle camera and a high-speed camera: the wide-angle camera generates prior information from panoramic images to guide the high-speed camera in capturing high-resolution images of the target. To detect and localize targets of interest quickly and accurately, the system introduces a wide-area region probability map and an uncertainty-based search module. The probability map estimates high-probability regions where targets appear and roughly determines the positions of some targets, while the search module uses the uncertainty variance provided by the object detector to dynamically adjust the sampling range and refine the detected target coordinates. By deploying the three proposed temporal action detection algorithms in this system, its effectiveness is validated on multiple tasks and scenes, including indoor object detection, outdoor action recognition, and temporal action detection at a skating rink. In the skating scene, the system tracks a skating student at 300 frames per second and classifies and localizes the start and end times of key technical actions such as footwork, spins, and jumps throughout the skating routine.

English Abstract

The temporal action detection task is an important foundation of action analysis: it extracts action segments of interest to users or downstream tasks from untrimmed raw videos. The task has widespread applications in intelligent surveillance, video editing, autonomous driving, and other fields. With the rapid development of deep learning, temporal action detection has made significant progress in recent years. However, it still faces several challenges. First, existing feature extractors for temporal action detection can only heuristically select information from a single feature layer, ignoring the information in other layers and thus losing feature information. Second, the accuracy of temporal action detection models relies heavily on the accuracy of the temporal boundary annotations in the training data, yet the complexity and diversity of real-world actions can make these annotations ambiguous, misleading the model and degrading its accuracy. Third, temporal action detection frameworks assign positive and negative samples differently during training and testing, which makes it hard for the trained model to meet the requirements of testing. To overcome these challenges and comprehensively improve the accuracy of temporal action detection models, this thesis conducts research from three aspects: fusing multi-scale feature information, reducing the impact of ambiguous annotations, and balancing the distribution of positive and negative samples. Building on the three proposed temporal action detection algorithms, an action analysis system for wide-area scenes is constructed, enabling fast action capture and analysis in long-range, wide-field-of-view scenarios.
The main research contents and contributions of this thesis are as follows:
1. To address the loss of multi-scale feature information in temporal action detection, this thesis proposes a feature-fusion-based one-stage temporal action detection method that significantly improves the performance of one-stage networks. The method uses a deep network to predict the weights for fusing features at different scales, so as to better associate them. In addition, a loss function based on the Gaussian distribution is designed to give a preliminary estimate of the confidence of predicted temporal boundaries. Based on these two modules, a high-accuracy one-stage temporal action detection framework, named Tb-SSAD, is constructed. Tb-SSAD raises mAP@0.5 on the THUMOS14 dataset from 37.9% to 41.6%.
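The abstract does not spell out the fusion formula. As a minimal illustrative sketch (all names hypothetical, not the thesis's actual architecture), the predicted per-scale weights can be realized as a softmax over scale logits produced by a small network, with the fused feature computed as a weighted sum of same-length features from each pyramid level:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_multiscale(features, scale_logits):
    """Weighted fusion of multi-scale features (illustrative only).

    features: list of equal-length feature vectors, one per scale,
              assumed already resized to a common temporal length.
    scale_logits: raw scores from a (hypothetical) weight-prediction branch.
    """
    weights = softmax(scale_logits)
    fused = [0.0] * len(features[0])
    for w, feat in zip(weights, features):
        for i, v in enumerate(feat):
            fused[i] += w * v
    return fused, weights

coarse = [1.0, 2.0]  # coarse-scale feature
fine = [3.0, 4.0]    # fine-scale feature
fused, w = fuse_multiscale([coarse, fine], [0.0, 0.0])  # equal logits -> equal weights
```

Because the weights come from a learned branch rather than a fixed heuristic, the network can decide per input how much each scale contributes, instead of discarding all but one feature layer.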

2. To address the ambiguity of temporal boundary annotations, this thesis proposes an uncertainty-based temporal action detection method. The method has been successfully applied to several mainstream frameworks, including one-stage, two-stage, and anchor-free methods, yielding significant performance gains in each. It uses a Gaussian distribution to capture the ambiguity of temporal boundaries and obtain an uncertainty variance. Furthermore, a class-uncertainty indicator is introduced to avoid over-correcting classes with high uncertainty. At test time, based on the uncertainty variance and the class uncertainty, a reliability scoring module and a variance voting module are constructed to select more reliable proposals and to refine the boundary coordinates of each proposal, respectively. On the THUMOS14 dataset, the proposed uncertainty-based method improves mAP@0.5 by 1.7% for two-stage methods and by 2.0% for anchor-free methods.
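The abstract names the ingredients but not the equations. A common realization of this idea, shown here only as a hedged sketch (function names and the exact voting rule are assumptions, not quoted from the thesis), is to train boundary regression with a Gaussian negative log-likelihood so the network emits a variance per boundary, and then to refine boundaries by averaging nearby predictions with weights inversely proportional to their variance:

```python
import math

def gaussian_nll(pred, target, log_var):
    """Negative log-likelihood of target under N(pred, exp(log_var)).

    Minimizing this lets the learned variance act as a per-boundary
    uncertainty estimate: ambiguous boundaries push the variance up.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi * var) + (target - pred) ** 2 / var)

def variance_vote(boundaries, variances):
    """Refine a boundary by inverse-variance weighted averaging:
    more certain (low-variance) predictions carry more weight."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    return sum(w * b for w, b in zip(weights, boundaries)) / total

refined = variance_vote([10.0, 12.0], [1.0, 1.0])  # equal certainty -> plain mean
```

Under this sketch, a proposal whose boundaries carry large variances would also receive a lower reliability score, which is the role the abstract assigns to the reliability scoring module.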

3. To address the discrepancy between the positive/negative sample assignment strategies used during training and testing, this thesis proposes a curriculum-learning-based weight assignment method for temporal action detection. The method significantly improves performance on multiple datasets while adding only a few parameters and little inference time. It establishes an auxiliary branch and trains it with a curriculum-learning weight assignment paradigm: during training, the weight of each sample is a combination of its localization and classification scores. In the early stages of training, the network emphasizes the more reliable classification branch; in the later stages, it focuses on the localization branch, which has a greater impact on model performance. On the THUMOS14 dataset, the method improves the baseline from 55.5% to 57.6%; on the ActivityNet dataset, from 34.4% to 35.4%. Compared with the baseline, the proposed algorithm increases the number of parameters by only 1% and the inference time by just 1.6 milliseconds.
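The described schedule, where the sample weight starts from the classification score and drifts toward the localization score as training progresses, can be sketched with a simple geometric interpolation. This is an illustrative guess at one possible combination rule, not the thesis's actual formula:

```python
def sample_weight(cls_score, loc_score, progress):
    """Curriculum-style sample weight (illustrative sketch).

    progress in [0, 1] is the fraction of training completed.
    At progress=0 the weight equals the classification score
    (the reliable signal early on); at progress=1 it equals the
    localization score (the signal that matters most at test time).
    """
    return (cls_score ** (1.0 - progress)) * (loc_score ** progress)

early = sample_weight(0.9, 0.4, 0.0)  # classification dominates
late = sample_weight(0.9, 0.4, 1.0)   # localization dominates
```

Because the weighting is applied only through a small auxiliary branch, the scheme is consistent with the reported overhead of roughly 1% extra parameters and 1.6 ms extra inference time.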
4. To address the low target resolution, slow search speed, and poor localization accuracy encountered in wide-area scenes, this thesis proposes an action analysis system for wide-area scenes. The system enables high-definition imaging, efficient search, accurate localization, and action analysis of targets. It consists of a wide-angle camera and a high-speed camera: the wide-angle camera generates prior information in the form of panoramic images to guide the high-speed camera in capturing high-resolution images of the target. To detect and localize targets of interest quickly and accurately, the system introduces a novel wide-area region probability map and an uncertainty-based search module. The probability map estimates high-probability regions where targets are likely to appear and roughly determines the positions of some targets, while the search module dynamically adjusts the sampling range according to the uncertainty variance provided by the object detector and refines the detected target coordinates. By deploying the three proposed temporal action detection algorithms in this system, its effectiveness has been validated on multiple tasks and scenarios, including indoor object detection, outdoor action recognition, and temporal action detection at a skating rink. In the skating scenario, the system tracks a skating student at 300 frames per second, classifying and localizing the start and end times of key technical actions such as footwork, spins, and jumps throughout the skating routine.
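The abstract does not state how the sampling range depends on the detector's uncertainty. One plausible reading, sketched here with hypothetical parameter names (the multiplier `k` and the floor `min_half` are assumptions), is to widen the high-speed camera's search window in proportion to the predicted standard deviation, so uncertain detections trigger a broader scan while confident ones allow a tight, fast crop:

```python
def search_window(center, sigma, k=3.0, min_half=8.0):
    """Uncertainty-driven 1-D search window (illustrative sketch).

    center: coarse target coordinate from the wide-angle camera.
    sigma:  standard deviation reported by the object detector.
    Returns (low, high) sampling bounds, never narrower than min_half
    on each side so the window stays usable for tiny sigmas.
    """
    half = max(min_half, k * sigma)
    return (center - half, center + half)

low, high = search_window(100.0, 5.0)  # sigma=5 -> half-width 15
```

The same principle extends to two dimensions by applying it independently to each image axis; the wide-area region probability map would then decide which centers are worth scanning at all.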

Keywords: Deep Learning; Video Understanding; Action Analysis; Temporal Action Detection
Language: Chinese
Subject Area (sub-direction): Image and Video Processing and Analysis
State Key Laboratory Research Direction: Visual Information Processing
Associated Dataset to Deposit:
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/52387
Collection: Graduates / Doctoral Dissertations
Recommended Citation (GB/T 7714):
陈云泽. 面向未剪辑视频的时序行为检测技术研究[D],2023.
Files in This Item:
File Name / Size | Document Type | Version | Access | License
面向未剪辑视频的时序行为检测技术研究.p (9019KB) | Thesis | | Restricted Access | CC BY-NC-SA
