Representation Learning for Person Re-identification


	Representation Learning for Person Re-identification
	Wu Jinlin (吴锦林)
	2022-05-25
Pages	146
Degree type	Doctoral
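Chinese abstract (translated)	As smart-city construction advances, surveillance devices numbering in the tens of thousands have been deployed across public places, forming large-scale distributed surveillance networks that generate massive amounts of video data. Person re-identification applies methods from computer vision, machine learning, and pattern recognition to extract appearance features from pedestrian images in surveillance footage and, by comparing feature similarity, associates the trajectory images of the same pedestrian across different cameras. It has broad application prospects in scenarios such as public-security criminal investigation, person retrieval, and human-computer interaction. Compared with handcrafted features, deep-learning-based person re-identification methods use deep neural networks to extract more discriminative pedestrian features and achieve outstanding re-identification results on public academic datasets. In practical applications, however, the pedestrian features extracted by deep neural networks suffer from the following problems. (1) View bias. Person re-identification data are captured by surveillance cameras with different views, and the data are unevenly distributed across camera views, so deep neural networks tend to learn pedestrian features from the views with ample data. Existing methods therefore perform well on views with ample training data and poorly on views with little training data. (2) Sensitivity to changes in pedestrian appearance. Pose changes, occlusion by obstacles, and motion blur cause large changes in pedestrian appearance, which undermines the stability of re-identification features. (3) Poor cross-scene adaptability of pedestrian representations. Backgrounds, lighting, seasons, and camera configurations differ between scenes, so the pedestrian data collected in different scenes exhibit domain gaps. When a deep neural network is applied across scenes, it cannot extract accurate human-body representations, and model performance drops sharply. This dissertation studies person re-identification in response to these challenges; its main work and contributions are as follows. A view-unbiased pedestrian representation learning method is proposed. To address view bias in pedestrian representations, this dissertation improves the classification loss and the pairwise loss, proposing a multi-center classification loss and a view-balanced hard sample mining method for learning view-unbiased representations. The multi-center classification loss sets one class template for each view so that the model fully learns from the data of all views, alleviating the representation shift that the imbalanced view distribution causes in the pedestrian classification loss. The view-balanced hard sample mining method designs a view-balanced memory bank of sample features, alleviating the representation shift that the extremely imbalanced distribution of positive and negative pairs across views causes in the pedestrian metric learning loss. Used together, the two improve model performance under imbalanced view distributions and achieve leading re-identification performance on multiple academic datasets. A pedestrian representation learning method based on temporal shift attention is proposed. To address the lack of robustness caused by appearance changes, this dissertation proposes a temporal shift attention mechanism that uses the spatio-temporal context in pedestrian video sequences to extract representations that are robust to appearance changes. It further proposes a temporal residual position encoding module that guides the network to learn information with salient temporal variation and weakens the interference of temporally redundant information, thereby extracting representations with richer semantic information. The method achieves leading video re-identification performance on public video person re-identification datasets. A cross-domain adaptation method for pedestrian representations based on dynamic sample selection is proposed. To address the severe performance drop that occurs when person re-identification models are tested across domains, this dissertation proposes a domain adaptation method for pedestrian representations based on dynamic sample selection. It first proposes joint training on pedestrian attributes and pedestrian identities, which improves the generalization of the source-domain model and provides a good starting point for unsupervised training on the target domain. It then proposes an unsupervised domain adaptation method based on dynamic pseudo-label selection, which balances the reliability and effectiveness of unlabeled target-domain samples and thereby improves the cross-domain performance of pedestrian representations. The method achieves competitive performance on multiple cross-domain person re-identification test protocols. An unsupervised pedestrian representation learning method based on graph association is proposed. Existing domain-adaptive person re-identification methods depend heavily on source-domain training and are inefficient to train. To address this, this dissertation proposes an unsupervised pedestrian representation learning method that requires no source-domain training. It divides pedestrian representation learning into intra-view representation learning and cross-view representation learning. First, spatio-temporal sparse sampling yields a large number of intra-view negative samples for intra-view representation learning; second, a cross-view association graph mines potential positive samples across views for cross-view representation learning. The dissertation further proposes an online update method for the cross-view association graph and an end-to-end unsupervised training method. The approach achieves leading performance on multiple public datasets while substantially improving the training efficiency of existing unsupervised person re-identification methods.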
English abstract	With the advancement of urbanization, large numbers of surveillance cameras have been deployed in public places, forming a large-scale distributed surveillance network that produces massive amounts of video surveillance data. Person re-identification draws on computer vision, machine learning, and pattern recognition to extract pedestrian appearance features and to associate tracks of the same pedestrian across different cameras by comparing feature similarity. It has broad application prospects in public-security criminal investigation, person retrieval, human-computer interaction, and other scenarios. Compared with handcrafted features, deep-learning-based person re-identification methods use deep neural networks to extract more discriminative pedestrian appearance features and achieve outstanding re-identification performance on public academic datasets. However, existing person re-identification methods still face three major challenges in practical applications: (1) View bias. Person re-identification data are captured by surveillance cameras with different views, and the data are unevenly distributed across those views, which biases the deep neural network toward learning pedestrian features from views with sufficient data. As a result, existing methods perform better on views with sufficient training data and worse on views with little training data. (2) Poor robustness to appearance perturbation.
Pose changes, occlusion of the human body by obstacles, and motion blur cause large changes in pedestrian appearance, which degrades the robustness of pedestrian features. (3) Poor cross-domain generalization. Backgrounds, lighting, seasons, and camera configurations differ between scenes, resulting in domain gaps between the pedestrian data collected in different scenes. These gaps cause a significant drop in cross-domain retrieval accuracy. This dissertation aims to address the above issues, and its major contributions are as follows. It proposes a view-unbiased representation learning method for person ReID. To address view-biased representation learning, this dissertation proposes a view-unbiased pedestrian representation learning approach that comprises a multi-center classification loss and a view-balanced hard sample mining method. The multi-center classification loss assigns each person one sub-proxy per view, making full use of the data from all views and alleviating the biased representation learning problem. The view-balanced hard sample mining method maintains a memory bank that stores samples from all views evenly, alleviating the representation bias caused by the extremely unbalanced distribution of positive and negative pairs among different views. By combining the multi-center classification loss with view-balanced hard sample mining, this dissertation improves ReID performance on view-unbalanced datasets and achieves state-of-the-art performance on common academic person ReID datasets. It proposes a temporal shift attention mechanism for video-based pedestrian representation learning.
To address the problem that pedestrian representations are not robust to perturbations in pedestrian appearance, this dissertation proposes a temporal shift attention mechanism that jointly models the spatio-temporal information in pedestrian tracklets, using spatio-temporal context to extract pedestrian representations that are robust to appearance perturbation. To avoid wasting attention on temporally redundant information, it further proposes a temporal residual position embedding module that guides the network to learn temporally salient cues, thereby extracting pedestrian representations with richer semantics. The method achieves state-of-the-art performance on common video person re-identification datasets. It proposes a dynamic-sampling-based unsupervised domain adaptation method for person ReID. There is a domain gap between different datasets, and the performance of supervised person re-identification methods degrades seriously when they are tested across domains. To improve target-domain performance, the method first analyzes the reliability and effectiveness of unlabeled data in the target domain, and then jointly trains the source model on pedestrian attributes and pedestrian identities to improve its generalization. Next, an unsupervised domain adaptation method based on dynamic sample selection is proposed, which achieves a trade-off between the reliability and effectiveness of unlabeled data. Benefiting from this, the domain adaptation method obtains a greater improvement on the target domain. The proposed method achieves state-of-the-art performance on multiple cross-domain person re-identification benchmarks. It proposes a graph association method for unsupervised person ReID. Existing domain adaptation methods for person re-identification depend strongly on source-domain training.
Domain adaptation methods perform poorly, or even fail, when the similarity between the source and target domains is low. To address this problem, this dissertation proposes a completely unsupervised pedestrian representation learning method. The method first adopts spatio-temporal sparse sampling to obtain a large number of same-view negative samples for pedestrian representation learning. Then a cross-view graph is designed, and potential cross-view positive samples are mined through graph association to improve the cross-view retrieval ability of the pedestrian representation. This method completely removes the source-domain dependence of existing domain adaptation methods for person re-identification, achieving fully unsupervised training on the target domain. The method achieves state-of-the-art performance on common video person re-identification datasets and significantly improves the training efficiency of existing methods.
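The multi-center classification loss described in the abstract (one sub-proxy per view for each person) can be illustrated with a minimal sketch. The abstract gives no formula, so everything below is an assumption made for illustration: the cosine similarity, the `scale` factor, the log-sum-exp pooling of a person's view sub-proxies into one identity logit, and the names `multi_center_loss` and `proxies` are hypothetical and are not taken from the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def multi_center_loss(feature, label, proxies, scale=10.0):
    """Cross-entropy over identities, each owning one sub-proxy per view.

    proxies[i][v] is the (hypothetical) sub-proxy of identity i under view v.
    Pooling the view sub-proxies with log-sum-exp is an assumption of this
    sketch, not the thesis's stated formulation.
    """
    logits = [
        logsumexp([scale * cosine(feature, proxy) for proxy in view_proxies])
        for view_proxies in proxies
    ]
    # Softmax cross-entropy: -log(exp(logit_label) / sum_j exp(logit_j)).
    return logsumexp(logits) - logits[label]

# Two identities, two camera views each, 2-D features.
proxies = [
    [[1.0, 0.0], [0.9, 0.1]],  # identity 0: view 0, view 1
    [[0.0, 1.0], [0.1, 0.9]],  # identity 1: view 0, view 1
]
loss = multi_center_loss([1.0, 0.0], label=0, proxies=proxies)
```

Because every view keeps its own sub-proxy, a sample from a sparsely represented view can still match a nearby template instead of being pulled toward a single class center dominated by the data-rich views.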
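Keywords	person re-identification; view-unbiased; temporal shift attention; domain adaptation; unsupervised
Language	Chinese
Document type	Dissertation
Item identifier	http://ir.ia.ac.cn/handle/173211/48969
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems_Biometrics and Security Technology; Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	Wu Jinlin. Representation Learning for Person Re-identification[D]. Institute of Automation. Institute of Automation, Chinese Academy of Sciences, 2022.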

Files in this item
File name/size	Document type	Version type	Access type	License
thesis~吴锦林.pdf (6908KB)	Dissertation		Open access	CC BY-NC-SA