Attention Modeling Methods for Typical Video Analysis Tasks (面向典型视频分析任务的注意力建模方法)

	Attention Modeling Methods for Typical Video Analysis Tasks
	董文恺
	2022-05-22
Pages	140
Degree type	Doctoral
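Chinese abstract (translated)	With the rapid development of the Internet and multimedia technology and the rapid spread of portable mobile devices such as smartphones, video data has grown explosively. How to use video data to serve human production and daily life has become an increasingly important research topic. Analyzing video content is a basic and indispensable step in data intelligence applications, and it has broad application value and development prospects in fields such as intelligent video surveillance, autonomous driving, human-computer interaction, and liveness detection. Although deep learning methods have greatly advanced video analysis, they still face several problems and challenges when handling video analysis tasks. On the one hand, because of interference from task-irrelevant information in videos, existing methods struggle to use key information effectively to recognize targets or their actions accurately. On the other hand, because of factors such as motion blur and camera defocus, some frames in a video have low observation quality, which makes their content hard to analyze accurately; the key to solving this problem is how to use the rich, redundant spatio-temporal information in videos to strengthen the semantic representation of low-quality image features. As a biologically inspired method, the attention mechanism can help visual models selectively attend to and use task-relevant information in videos. Therefore, building on deep-learning-based video analysis methods, this dissertation addresses the above challenges by combining attention modeling with the study of person search, action recognition, and video object detection. The main contributions are as follows: 1. This dissertation proposes a spatial attention modeling method for two-stage person search. To address the problem that too many candidate pedestrians degrade person search, it proposes an instance-guided pedestrian detection network. By incorporating information about the target person into the detection network through a cross-correlation layer, the network can use spatial attention to focus on the target person in the scene and output the similarity between each candidate region and the target. To incorporate target information into the detection network more efficiently, an improved cross-correlation layer is proposed to resolve the imbalanced distribution of model parameters caused by the original cross-correlation layer. The network also uses a local relation module and a global relation branch to model, respectively, the local relations among different regions of the scene and the global relation between the target and the scene. Experimental results show that the method improves person search performance by reducing the number of candidate pedestrians, and achieves results that were competitive at the time on commonly used person search datasets. 2. This dissertation proposes a spatial attention modeling method for one-stage person search. To address the problem that distracting information in the scene lowers the discriminability of person identity features, it uses spatial attention to make the model focus more on the persons in the scene and proposes a bidirectional interaction network. The method adds an instance-aware branch, which takes person images as input, to an existing one-stage person search model. To keep the two branches' responses to the same person consistent, the method also introduces two interaction loss functions that enforce consistency at the feature level and the prediction level, respectively. Experimental results show that the method effectively makes the model attend to person information in the scene and learn more discriminative identity features, thereby significantly improving person search performance, with accuracy on commonly used datasets higher than that of other contemporaneous methods. 3. This dissertation proposes a temporal attention modeling method for action recognition. To address the problem that irrelevant information in videos causes misclassification, it proposes using a temporal attention mechanism to select key frames from the video. The approach classifies a video by selecting its key frames with a sampling method based on a hard attention mechanism while discarding the remaining irrelevant frames. The process of sampling key frames is formalized as a Markov decision process, and the sampling agent is trained with deep reinforcement learning. To train the agent more effectively, pseudo key-frame labels are generated from the video labels. Experimental results show that the method improves the accuracy of two-stream action recognition methods on commonly used datasets. 4. This dissertation proposes a spatio-temporal attention modeling method for video object detection. The method uses a spatio-temporal attention mechanism to select spatio-temporal information from the video in order to improve detection on low-quality images. It uses a class external memory module to strengthen the high-level semantic representation of object features in low-quality images, and further refines the detection results with a score propagation module. The memory module uses stored class-center features to supply spatio-temporal information to object features, which effectively resolves the sensitivity of contemporaneous feature aggregation methods to the support-frame sampling strategy. The score propagation module associates objects across frames through self-attention and integrates bounding-box association into network training, resolving the local-optimum problem of existing bounding-box association methods. Experimental results show that the method significantly improves detection on low-quality images and achieves very good results on a large-scale video object detection dataset.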
English abstract	With the rapid development of the Internet and multimedia technology and the popularization of portable mobile terminals, a large volume of video data is being generated. Exploiting video data to serve human production and daily life has become an important research topic. Analysis of video content is a basic and necessary step in data intelligence applications and has broad prospects in fields such as intelligent video surveillance, autonomous driving, human-machine interaction, and face anti-spoofing. Although deep learning methods have greatly advanced video analysis, it still faces many difficulties and challenges. On the one hand, because of interference from task-irrelevant information in videos, it is difficult for existing methods to use key information effectively to recognize objects or their actions accurately. On the other hand, because of motion blur and camera defocus, some frames are of degraded quality and are hard to analyze accurately. The key to addressing this problem is how to utilize the rich spatio-temporal information in videos to enhance the semantic representation of deteriorated images. As a bio-inspired method, attention mechanisms can help visual models selectively focus on and leverage task-relevant parts of videos. Therefore, building on the substantial progress of deep-learning-based video analysis methods, we address these challenges in video analysis tasks including person search, action recognition, and video object detection with different attention mechanisms. The main contributions of this dissertation are summarized as follows: 1. This dissertation proposes a spatial attention modeling method for two-stage person search. Because a large number of proposals degrades person search performance, we exploit the appearance information of the query to reduce the number of candidate pedestrians and design an instance-guided proposal network. By incorporating information about the query through a cross-correlation layer, the detection network can calculate the similarities between proposals in the scene image and the query. To exploit the query information more effectively, we address the imbalanced distribution of model parameters with an improved cross-correlation layer. Meanwhile, a local relation block and a global relation branch are used to model the proposal-proposal relations and characterize the query-scene relations, respectively. Experimental results demonstrate that person search performance is improved by reducing proposals and that our method achieves competitive results on widely used person search datasets. 2. This dissertation proposes a spatial attention modeling method for one-stage person search. Noise in the scene makes the identity features less discriminative. To alleviate this problem, a one-stage person search method based on a spatial attention mechanism is proposed, which uses person patches to guide the model to focus more on persons in the scene. In this method, an additional instance-aware branch, which takes person patches as input, is added to an existing one-stage person search model. Because the two branches should respond consistently to the same positive proposal, we introduce two interaction losses to achieve consistency at the feature level and the prediction level, respectively. Experimental results demonstrate that the proposed method makes the model pay more attention to person information within bounding boxes and learn more discriminative identity features, and that it improves person search performance and outperforms other contemporaneous methods on person search datasets. 3. This dissertation proposes a temporal attention modeling method for action recognition. To avoid the incorrect classification caused by noise in videos, we propose to select key frames from videos with a temporal attention mechanism. An attention-aware sampling method based on a hard attention mechanism is used to select key frames and discard irrelevant frames for video classification. We formulate the process of sampling key frames as a Markov decision process and train the sampling agent via deep reinforcement learning. To train the agent more effectively, pseudo key-frame labels are generated from the video labels. Experimental results verify that our method can benefit existing two-stream-based methods on two public action recognition datasets. 4. This dissertation proposes a spatio-temporal attention modeling method for video object detection. To improve detection performance on deteriorated images, we propose a video object detection method based on a spatio-temporal attention mechanism. The proposed method leverages a class external memory to enhance the semantic features of target frames and refines the detection results with a score propagation module. The memory exploits the stored class centers to provide spatio-temporal information for target features, which makes the method robust to the support-frame sampling strategy. The score propagation module associates objects in different frames by self-attention and includes this association process in end-to-end training, which avoids the sub-optimal solutions of existing box association methods. Experimental results demonstrate that our method significantly improves detection performance on deteriorated frames and achieves competitive results on a large-scale video object detection dataset.
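Illustrative note on contribution 4: the class external memory can be pictured as an attention read over stored class centers. The sketch below is not the dissertation's implementation. The function names, the scaled dot-product similarity, the residual addition, and the toy values are all assumptions chosen for illustration; the abstract only states that stored class-center features supply information to target features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_class_memory(feature, class_centers):
    """Hypothetical attention read from a class-center memory.

    Each stored class center is weighted by its similarity to the object
    feature, and the weighted sum is added back to the feature.
    """
    dim = len(feature)
    # Scaled dot-product similarity (an assumed choice of similarity).
    scores = [
        sum(f * c for f, c in zip(feature, center)) / math.sqrt(dim)
        for center in class_centers
    ]
    weights = softmax(scores)
    # Weighted sum of class centers: the information read from memory.
    read = [
        sum(w * center[i] for w, center in zip(weights, class_centers))
        for i in range(dim)
    ]
    # Residual connection (an assumption): keep the feature, add the read.
    enhanced = [f + r for f, r in zip(feature, read)]
    return enhanced, weights


# Toy memory with three made-up class centers in a 3-D feature space.
centers = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
blurry_feature = [1.0, 0.2, 0.1]  # weak feature, closest to class 0
enhanced, weights = read_class_memory(blurry_feature, centers)
print(weights.index(max(weights)), enhanced)
```

In this toy setting the weak feature draws most of its added information from the class center it already resembles. Such a read depends only on the memory contents, not on which support frames happen to be sampled.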
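Keywords	video analysis; attention mechanism; person search; action recognition; video object detection
Language	Chinese
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/48635
Collection	Pattern Recognition Laboratory
Recommended citation (GB/T 7714)	董文恺. 面向典型视频分析任务的注意力建模方法[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2022.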

Files in this item
File name/size	Document type	Version type	Access type	License
学位论文-董文恺.pdf (11492KB)	Dissertation		Open access	CC BY-NC-SA