Research on Multi-View Machine Learning Models and Algorithms for Person Re-identification (面向行人重识别的多视角机器学习模型与算法研究)
张志忠
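Subtype: Doctoral
Thesis Advisor: 张文生
2020-05-23
Degree Grantor: University of Chinese Academy of Sciences
Place of Conferral: Institute of Automation, Chinese Academy of Sciences
Degree Name: Doctor of Engineering
Degree Discipline: Pattern Recognition and Intelligent Systems
Keyword: person re-identification; multi-view machine learning; feature fusion; metric learning; deep convolutional neural network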
Abstract

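Rapid economic development has brought large-scale movement of people between regions and cities, and with it a major challenge for public safety. In particular, as security surveillance systems become widespread, understanding and analyzing massive volumes of surveillance data is becoming the core of intelligent security. Against this background, person re-identification has received wide attention and application in recent years. Given an image of a pedestrian, a person re-identification method can quickly retrieve images of the same pedestrian captured from other views, which addresses pedestrian identification and retrieval in large-scale surveillance networks. It has important application prospects in person tracking, finding people in shopping malls, and counter-terrorism security, and has great research and application value for building smart cities and improving security response capabilities. However, constrained by the mounting height and density of surveillance cameras, and affected by illumination changes, pose changes, occlusion, and low-resolution surveillance data, locking onto and finding a target in multi-view scenarios remains very difficult. This gives rise to an important machine learning problem: how to use multi-view data effectively to resolve the ambiguity in the quantitative relationships among target objects and data.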

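This thesis focuses on the application of multi-view machine learning models and algorithms to person re-identification, and studies how to use the consistent and differing information contained in multi-view data to build similarity measurement models suited to security scenarios. From the perspectives of multi-view feature fusion, multi-view asymmetric metrics, and multi-view deep loss functions, it studies the association of multi-source information and attempts to solve the problem of finding and matching pedestrians in real security scenarios. The main innovative results of this thesis are: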

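1. A multi-linear multi-view feature fusion algorithm (Multi-linear multi-view feature fusion, MMF) is proposed. To address the difficulty of capturing complementary information in multi-view features, a similarity action matrix is proposed based on the inherent properties of the features, to mine and propagate the consistent and complementary information among multiple features. Through sample-dependence and view-dependence assumptions, the relationship between multi-linear structure and the consistent information contained in multi-view data is explored, and a multi-linear multi-view fusion algorithm is proposed that achieves feature fusion at the index level, significantly improving matching accuracy while reducing memory overhead. For the optimization objective, an efficient iterative solver is proposed, which has low computational complexity and a theoretical convergence guarantee. Experiments on the Market1501 person re-identification dataset and on image retrieval datasets such as Holidays and UKbench show that the multi-view feature fusion algorithm effectively improves the discriminability of the original features while reducing the computation and memory overhead of online matching.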

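2. A tensor multi-view asymmetric metric learning model (Tensor multi-task learning, t-TML) is proposed. To address the inconsistency of data distributions caused by view differences in person re-identification, a tensor multi-view asymmetric metric learning framework is proposed, which learns asymmetric metrics through inter-view and intra-view correlation structures and aligns the data distributions of different views. An unsupervised tensor multi-view metric learning model is also proposed, which effectively improves cross-view matching accuracy without using sample labels, and which uses a multi-feature tensor to flexibly fuse multiple visual features and effectively mine the complementary information among them. Experiments on public person re-identification datasets including VIPeR, CUHK01, CUHK03, and Market1501 show that the recognition performance of the proposed method is significantly better than that of the compared methods, and that the proposed unsupervised multi-view model and multi-feature tensor model further improve person re-identification accuracy.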

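3. A multi-view deep alignment metric learning model (Wasserstein triplet loss, W-Triplet) is proposed. To address cases in person re-identification where targets are offset and samples from different views are misaligned, a triplet loss function based on the Earth Mover's Distance is proposed. It converts the original cross-view alignment problem into an optimal transport problem and, by aligning the spatial probability distributions over local features with a regularized Earth Mover's Distance, resolves the sample misalignment problem. A new attention mechanism is proposed that learns the target's regions of interest and generates discrete probabilities of region importance, providing supervisory guidance for the optimal transport problem. A multi-branch deep network model is proposed that fuses global and local information and improves recognition accuracy. Experiments on several public person re-identification datasets, including CUHK03, Market1501, DukeMTMC-reID, and MSMT17, show that the Earth Mover's Distance based triplet loss helps the model learn the target's regions of interest and, relying on them, align and eliminate cross-view sample offsets, effectively improving deep network performance.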

Other Abstract

Rapid economic development has brought large-scale movement of people between cities and has created a major challenge for public security. In particular, with the spread of surveillance systems, understanding and analyzing the resulting monitoring data has gradually become a core issue in intelligent security. In this context, person re-identification has attracted extensive attention in recent years. Given a pedestrian image, person re-identification can retrieve images of the same person across time and scenes, and thereby addresses pedestrian identification and retrieval in large-scale surveillance networks. It has shown great potential for person tracking, finding people in shopping malls, and counter-terrorism security, and it has considerable research and application value for building smart cities and strengthening security response capabilities. However, because of constraints on the height and density of surveillance cameras, together with varying illumination, human pose, occlusion, and the low resolution of surveillance data, matching individuals across views remains very difficult. This raises an important machine learning problem: how to use multi-view data effectively to estimate the quantitative relationships among data and objects.

This thesis focuses on multi-view machine learning models and algorithms applied to person re-identification. It studies how to use the consistent and differing information contained in multi-view data to build similarity measurement models suited to security scenarios. From the perspectives of multi-view feature fusion, multi-view asymmetric metric learning, and multi-view deep loss functions, the thesis studies multi-source information association and attempts to solve the problem of person search and matching in real security scenarios. The main contributions are summarized below:

1. This thesis proposes a multi-linear multi-view feature fusion model (Multi-linear multi-view feature fusion, MMF). To mine complementary information from multi-view features, the model learns a functional matrix according to the properties of the features and propagates similarities among multiple features. On this basis, sample-dependence and view-dependence assumptions are used to explore the relationship between multi-linear structure and the consistent information contained in the various feature representations, and this consistent information is captured to achieve feature fusion at the index level. The method reduces memory cost while significantly improving matching accuracy. In addition, the proposed method comes with an efficient iterative solver that has low computational complexity and a theoretical convergence guarantee. Experiments on person re-identification (e.g., Market1501) and image retrieval (e.g., UKbench and Holidays) show that the multi-view feature fusion algorithm effectively improves the discriminability of the original features while reducing computation and memory overhead in the online stage.

2. This thesis proposes a tensor multi-view asymmetric metric learning model (Tensor multi-task learning, t-TML). To reduce the distribution discrepancy between views, t-TML introduces a tensor multi-view framework that exploits correlations both across views and within each view. It learns asymmetric metrics that allow the model to align the data distributions of different views. On this basis, an unsupervised tensor multi-view learning model is proposed, which improves identification accuracy without using any labels. The proposed model can also readily incorporate multiple visual features and exploit their complementary information. Extensive evaluations on the VIPeR, CUHK01, CUHK03, and Market1501 Re-ID benchmark datasets confirm the effectiveness of the proposed t-TML model.

3. This thesis proposes a multi-view deep alignment metric learning method (Wasserstein triplet loss, W-Triplet). To address the misalignment issue, W-Triplet introduces a triplet loss based on the Earth Mover's Distance. By aligning the probability distributions supported on the local features, it transforms the cross-view alignment problem into an optimal transport problem and uses a regularized Earth Mover's Distance to mitigate misalignment. In addition, a new attention mechanism is proposed that learns the regions of interest of the target, generates a discrete probability over the local features, and provides supervision for the optimal transport problem. A multi-branch deep network is also used to fuse global and local information, which improves recognition accuracy. Experiments on the CUHK03, Market1501, DukeMTMC-reID, and MSMT17 public Re-ID datasets show that the Earth Mover's Distance based triplet loss helps the model identify the regions of interest, so that it can align samples and eliminate cross-view sample bias according to these salient regions, which effectively improves the performance of the deep network.
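
To make the idea in contribution 3 concrete, the following is a minimal sketch of a triplet loss built on an entropy-regularized transport distance. It is not the thesis's implementation: the abstract does not specify the regularizer, the cost function, or the margin, so the Sinkhorn iterations, the squared Euclidean cost, and the names `cost_matrix`, `sinkhorn_distance`, and `w_triplet_loss` are all assumptions chosen for illustration. Each image is represented as a list of local feature vectors plus a discrete probability over those regions, standing in for the attention weights the abstract describes.

```python
import math

def cost_matrix(xs, ys):
    """Squared Euclidean cost between two lists of local feature vectors (assumed cost)."""
    return [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]

def sinkhorn_distance(cost, mu, nu, eps=0.1, n_iters=200):
    """Transport cost <P, C> of the entropy-regularized plan, via Sinkhorn iterations.

    mu, nu are discrete probability vectors over the local regions of each image.
    This plain (non-log-domain) version can underflow when cost / eps is very large.
    """
    n, m = len(mu), len(nu)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns so the plan matches both marginals.
        u = [mu[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [nu[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # The transport plan is P[i][j] = u[i] * kernel[i][j] * v[j].
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

def w_triplet_loss(anchor, positive, negative, margin=0.3, eps=0.1):
    """Hinge triplet loss on transport distances; each sample is (features, weights)."""
    d_ap = sinkhorn_distance(cost_matrix(anchor[0], positive[0]), anchor[1], positive[1], eps)
    d_an = sinkhorn_distance(cost_matrix(anchor[0], negative[0]), anchor[1], negative[1], eps)
    return max(0.0, margin + d_ap - d_an)

# Toy data (hypothetical): the positive has the same two parts as the anchor,
# but in swapped spatial order, i.e. it is misaligned.
anchor = ([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
positive = ([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5])
negative = ([[3.0, 0.0], [0.0, 3.0]], [0.5, 0.5])

print(w_triplet_loss(anchor, positive, negative))
```

In this toy case a position-by-position comparison would penalize the anchor and positive for the swapped part order, whereas the transport plan matches each part to its counterpart, so the anchor-positive distance is close to zero. Replacing the uniform weights with non-uniform ones shifts mass toward the regions treated as more important.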

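Subject Area: Computer Science and Technology
MOST Discipline Catalogue: Engineering
Pages: 113
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/39109
Collection: Research Center of Precision Sensing and Control_Artificial Intelligence and Machine Learning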
Recommended Citation
GB/T 7714
张志忠. 面向行人重识别的多视角机器学习模型与算法研究[D]. 中国科学院自动化研究所. 中国科学院大学,2020.
Files in This Item:
File Name/Size: 张志忠学位论文最终版.pdf (7131KB)
DocType: Thesis
Access: Open Access
License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.