Research on Visual Recognition Models and Algorithms Based on Biological Cognitive Mechanisms

	Research on Visual Recognition Models and Algorithms Based on Biological Cognitive Mechanisms
	席铉洋
	2017-05-25
Degree type	Doctor of Engineering
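Abstract (translated from Chinese)	Vision is the most important channel through which humans perceive the external world, and needs at every level of human life rely on visual perception. Researchers in computer vision have long tried to reproduce, or even surpass, the human visual system, and visual recognition is among the most challenging visual tasks. Visual recognition must not only overcome interference from environmental factors to accurately understand the basic visual elements in an image, but also draw on prior knowledge to understand the deeper semantics those elements carry. It plays a key role in solving many scientific and engineering problems. After years of development, research on visual recognition has produced a series of important results and found wide application; compared with the human visual system, however, a large gap remains in versatility, generalization, and real-time performance. In recent years, as the ways of acquiring, transmitting, and sharing images have diversified and become widespread, large numbers of images and videos have come to pervade every aspect of daily life, and this broad demand places higher requirements on visual recognition technology. Meanwhile, as experimental techniques and analytical methods have improved, biologists have made new discoveries about, and offered new explanations of, biological cognitive mechanisms, which provides new ideas for modeling visual cognitive functions. Against this background, this thesis draws on research results from neurocognitive science, neurophysiology, and psychophysics to study visual recognition models and algorithms from a biologically inspired perspective. The research is carried out at three levels: framework model design, specific algorithm design, and algorithm performance improvement. The main contributions are:
1. To address the low encoding efficiency and the lack of occlusion robustness in encoded features caused by the way matching templates are extracted and used in the HMAX (hierarchical max-pooling) model, a general visual cognition model is proposed, inspired by relevant biological evidence. It simulates the functions from the primary visual cortex (V1) to the anterior inferior temporal cortex (AIT) and introduces mechanisms including the three stages of generalized memory, preliminary cognition, active modulation, and neural population coding. The encoding stage of the model improves on HMAX, raising encoding efficiency and feature discriminability through two-level encoding. The recall stage proposes a recognition framework based on similarity-probability fusion, which combines prior information and fuses multiple features probabilistically. Validation experiments on different types of visual recognition tasks demonstrate the model's effectiveness and generality; in particular, its robust recognition under occlusion suggests that the model captures and simulates the human visual cognitive process more deeply.
2. To address the inability of the general visual cognition model above to incorporate the particularities of face perception, a face "perception-memory" mechanism is added and a two-pathway computational model of face perception is proposed. The model consists of three perceptual components. The facial structure perception component uses a cascaded convolutional neural network to estimate the center positions of key facial components. The facial expression perception component proposes a novel facial expression recognition method that uses the self-learning ability of a convolutional deep belief network to perform feature learning and feature selection simultaneously. The facial identity perception component adds an expression modulation step and an active learning function to the recognition framework based on similarity-probability fusion. Experimental results show that the model is robust for face recognition under different expressions; in particular, comparison with deep-learning-based methods shows that it achieves comparable performance and is convenient to use, making it suitable for small-sample face recognition problems.
3. To address the failure of the spatially enhanced local binary pattern histogram algorithm to account for expression, the algorithm is improved using the two-pathway face perception framework above, yielding an improved spatially enhanced local binary pattern histogram algorithm that incorporates the expression factor and raises the original algorithm's identity recognition performance under different expressions. In addition, to address the limited learning ability of the convolutional deep belief network used in the facial expression recognition method above, a more concise expression recognition method is proposed that exploits the ability of the intermediate layers of deep convolutional networks to extract basic features; it does not need a large number of samples to learn how to extract expression features. Experimental results demonstrate the effectiveness of both improvements.
4. Because the three models and algorithms above do not account for interference from complex backgrounds, a visual salient object detection model based on saliency value regression is proposed as a preprocessing step that filters out background interference and thereby improves their performance in real environments. The model performs whole-image saliency value regression with a single-stream fully convolutional neural network to detect salient objects. A smooth, robust loss function is designed specifically for the network structure, guiding convergence from both the global and the local perspective, that is, from the whole saliency map and from the salient object region. The model is a genuinely end-to-end network with no additional preprocessing or postprocessing outside the network, which both improves its perceptual ability and greatly simplifies detection. Compared with several recent state-of-the-art models, it has fewer parameters, achieves comparable or better detection accuracy, and is substantially faster.
This thesis effectively combines relevant biological cognitive mechanisms with information computation models. The proposed models offer new ideas and a reference for the structured modeling of visual cognitive tasks, and the proposed algorithms can serve as basic units for building more complex deep cognitive models or robot vision systems.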
Abstract (English)	Vision is the most important way for human beings to perceive the outside world, and needs at every level of human life depend on visual perception. Researchers in computer vision have long tried to reproduce, or even surpass, the human visual system. Visual recognition is one of the most challenging visual tasks: it must not only overcome interference from various environmental factors to accurately recognize the basic visual elements in an image, but also combine a variety of prior knowledge to understand the semantic meaning of those elements. Visual recognition plays a key role in solving many scientific and engineering problems. After many years of development, visual recognition technology has achieved a series of important results and is widely used in the real world. Compared with the human visual system, however, current technology still has a long way to go, especially in versatility, generalization, and processing speed. With the diversification and spread of image acquisition, transmission, and sharing methods in recent years, many aspects of human life are now filled with images and videos, and this wide range of applications places higher demands on visual recognition technology. At the same time, as experimental techniques and analytical methods in biology have improved, biologists have made new discoveries about, and offered new explanations of, biological cognitive mechanisms, which provide new ideas for modeling cognitive functions of the human visual system. In this context, this thesis, inspired by biological cognitive mechanisms, studies visual recognition models and algorithms on the basis of research results from neurocognitive science, neurophysiology, and psychophysics. The research is carried out at three levels: framework model design, specific algorithm design, and algorithm performance improvement. The main contributions are as follows:
1. The way C1 matching units are extracted and used in the hierarchical max-pooling (HMAX) model makes HMAX encoding inefficient and its features not robust to occlusion. To address these problems, a general visual recognition model is proposed, inspired by relevant biological evidence. It mimics the functions from the primary visual cortex (V1) to the anterior inferior temporal cortex (AIT) in the human cortex and introduces the mechanisms of general memory, preliminary cognition, active attention adjustment, and neural population coding in object recognition. The encoding stage of the proposed model improves on HMAX through two-phase encoding, which enhances coding efficiency and feature discrimination. The recall stage proposes a recognition framework based on the fusion of similarity probabilities, which implements multi-feature fusion by combining multi-probability fusion with prior information. Experiments on different types of visual recognition tasks verify the effectiveness and versatility of the model. In particular, its robust recognition performance under occlusion suggests that the model understands and simulates human visual cognition in greater depth.
2. Because the general visual recognition model above cannot integrate the specific characteristics of face perception, a two-pathway computational model of face perception is proposed that introduces the brain mechanisms of perception and memory for faces. The model consists of three perception parts. In the facial structure perception part, a cascaded convolutional neural network estimates the center locations of key facial components. In the facial expression perception part, a novel facial expression recognition method is proposed that uses the self-learning ability of convolutional deep belief networks to perform feature learning and feature selection simultaneously. In the facial identity perception part, an expression modulation process and an active learning function are introduced into the recognition framework described above. The experimental results show that the model is robust for facial identity recognition under different expressions. In particular, comparison with a deep-learning-based method shows that the proposed model achieves comparable performance while being more convenient to use. Overall, the proposed model is suitable for facial identity recognition with small training sets.
3. Because the spatially enhanced local binary pattern histogram (eLBPH) algorithm does not consider the effect of expression, we improve it under the guidance of the face perception framework above and propose the expression-specific weighted local binary pattern histogram algorithm, which raises performance under different facial expressions. Moreover, because the learning ability of the convolutional deep belief networks used in the facial expression recognition method above is limited, we propose a more concise method that exploits the ability of the middle layers of deep convolutional neural networks to extract basic visual features. The new method does not rely on large numbers of training samples to train a network for extracting facial expression features. Comparison experiments demonstrate the validity of both improvements.
4. Because the models and algorithms above do not consider interference from complex backgrounds, a visual salient object detection model based on saliency score regression is proposed. It can be used as a pre-processing step to filter out complex backgrounds and improve the performance of those models and algorithms in real environments. The model detects salient objects by regressing a saliency score for each pixel in an image with a one-stream fully convolutional network. A smooth and robust loss function is designed according to the characteristics of the network structure; it drives the network to converge from both the global and the local scope at the same time. The proposed model is a truly end-to-end network, with no additional pre-processing or post-processing outside the network. This characteristic not only improves the representation ability of fully convolutional networks for salient object detection, but also greatly simplifies the detection procedure. Compared with recent state-of-the-art models, the proposed model requires less storage, achieves comparable or better precision, and is significantly faster at detection.
This thesis effectively combines relevant biological cognitive mechanisms with computational models. The proposed models can provide new ideas and references for the structured modeling of visual cognitive tasks. The proposed algorithms can be used as basic units to build more complex deep cognitive models or robot vision systems.
Keywords	biological cognitive mechanisms; visual recognition; HMAX model; hierarchical model; robot
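Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/14805
Collection	Graduates_Doctoral dissertations
Author affiliation	Institute of Automation, Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	席铉洋. 基于生物认知机制的视觉识别模型与算法研究[D]. 北京. 中国科学院研究生院,2017.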

Files in this item
File name/size	Document type	Version type	Access type	License
Thesis(2017.5.28).pd (21583KB)	Thesis		Restricted access	CC BY-NC-SA