Research on Visual Neural Information Encoding and Decoding Methods Based on Multimodal Learning


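	Research on Visual Neural Information Encoding and Decoding Methods Based on Multimodal Learning
	Zhou Qiongyi (周琼怡)
	2023-05-20
Pages	150
Degree type	Doctoral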
Abstract	Research on visual neural information encoding and decoding models the relationship between visual stimuli and neural activity, and is significant for both computer science and cognitive neuroscience. Encoding research helps to probe the visual processing mechanisms of the human brain, to evaluate the brain-like properties of artificial neural networks, and to improve brain-inspired vision models. Decoding research supports the design of brain-computer interface systems and builds an information pathway between the human brain and the external world. The rapid development of artificial intelligence has opened new directions for this field. Drawing on the representational and computational power of deep neural networks (DNNs), visual neural encoding and decoding has made substantial progress in tasks such as predicting neural responses and decoding the semantics of stimulus images. Three challenges remain: (1) visual stimuli and neural responses are strongly heterogeneous in data format and distribution; (2) large individual differences cause encoding and decoding models to generalize poorly across subjects; (3) encoding and decoding models built on DNN representations are weakly interpretable. To address these challenges, this thesis treats visual stimuli and neural activity as different modalities and studies visual neural information encoding and decoding methods based on multimodal learning. The main contributions are as follows.
1. To address the strong heterogeneity between the visual stimulus and neural response modalities, the thesis proposes a cross-modal generation method based on invertible normalizing flows. First, the method maps visual stimuli and neural responses into a modality-shared latent space and imposes local and global constraints on that space to align the representations of the two modalities. Second, it builds a cross-modal generative model for encoding and decoding on normalizing flows, whose invertibility keeps information transfer between modalities complete during cross-modal generation, so that reconstructed images retain richer detail. In addition, a single training run covers both dual tasks, encoding and decoding, which greatly reduces training cost. Comparisons on two kinds of neural data, retinal ganglion cell electrophysiology and functional magnetic resonance imaging (fMRI) of the visual cortex, show that the method outperforms the baseline methods in overall performance on both encoding and decoding, and that its reconstructions retain richer detail.
2. The work above shows that a modality-shared representation space plays an important role in overcoming strong modality heterogeneity, so finding a better representation space should help further. To that end, the thesis proposes a neural-encoding-based method for evaluating the brain-like properties of deep vision models. It designs voxel-wise encoding models and ROI-wise layer-weighted encoding models, and uses the degree to which a DNN explains brain neural responses as a quantitative index of its brain-like properties. On two visual cortex fMRI datasets, the thesis evaluates 30 deep vision models, including convolutional neural networks (CNNs) and Vision Transformers (ViTs), and analyzes how model paradigm, parameter count, multimodal information integration, and temporal modeling affect brain-like properties. The findings are: (1) CNNs and ViTs have complementary strengths, with CNNs performing better on the primary visual cortex and ViTs on the higher visual cortex, and both show a hierarchical correspondence with the ventral visual pathway; (2) a larger parameter count is not a sufficient condition for improving brain-like properties; (3) multimodal information integration and temporal modeling improve brain-like properties. These conclusions identify better representation spaces for neural encoding and decoding models and offer design guidelines for brain-inspired vision models.
3. To address the poor cross-subject generalization caused by large individual differences, the thesis proposes a multi-subject semantic decoding method guided by vision-language pretrained models. First, it designs a Transformer-based fMRI feature extractor that captures global features of neural responses through self-attention. Second, it introduces tokens that encode how each subject processes visual information. This encoding makes the model applicable to data from different subjects and enlarges the training set, which lets the Transformer's representational capacity be used fully. Finally, given the strong brain-like properties of vision-language multimodal models, the method uses representational similarity analysis to compute the topological relationships among visual stimuli in the representation space of a vision-language pretrained model, and uses these relationships to guide the learning of the token representations used for semantic decoding, thereby characterizing the relationships among the neural responses of different subjects to different stimuli. Comparisons on two fMRI datasets from visual tasks show that the method supports semantic decoding for multiple subjects, with decoding accuracy higher than that of single-subject models, and that it also outperforms existing multi-subject decoding methods.
4. To address the weak interpretability of existing decoding algorithms built on DNN representations, the thesis proposes an interpretable decoding method based on unsupervised semantic disentanglement. First, the method constructs a semantically disentangled representation space in an unsupervised manner, in which each dimension encodes a different semantic concept. Second, it embeds neural representations into this space, which establishes interpretable linear relationships between brain responses and individual semantic concepts. Finally, it applies multi-level constraints to image reconstruction, namely pixel alignment, feature alignment, and semantic alignment, to secure both the quality of the reconstructed images and their semantic consistency with the original images. Experiments on multiple fMRI datasets from visual tasks show that the method reconstructs images better than state-of-the-art methods, and that virtually manipulating neural activity edits specific attributes of the reconstructed image, which demonstrates the method's interpretability.
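The invertibility that the first contribution relies on can be illustrated with a minimal sketch. The abstract does not specify the flow architecture, so the affine coupling layer below (in the style of RealNVP) is an assumption chosen for illustration. The class name `AffineCoupling`, the helpers `flow_forward` and `flow_inverse`, the layer sizes, and the random weights standing in for trained parameters are all hypothetical. The sketch only shows that such a mapping can be undone exactly, which is the property the abstract credits with keeping cross-modal information transfer complete.

```python
import numpy as np

rng = np.random.default_rng(0)


class AffineCoupling:
    """One affine coupling layer (assumed architecture, not the thesis's model).

    The first half of each vector passes through unchanged and parameterizes
    an elementwise scale and shift applied to the second half.
    """

    def __init__(self, dim, hidden=16):
        assert dim % 2 == 0, "dim must be even so the vector splits in half"
        self.half = dim // 2
        # Hypothetical conditioner: fixed random weights stand in for training.
        self.w1 = rng.normal(scale=0.5, size=(self.half, hidden))
        self.w_s = rng.normal(scale=0.5, size=(hidden, self.half))
        self.w_t = rng.normal(scale=0.5, size=(hidden, self.half))

    def _scale_shift(self, x_a):
        h = np.tanh(x_a @ self.w1)
        log_s = np.tanh(h @ self.w_s)  # bounded log-scale keeps exp() stable
        t = h @ self.w_t
        return log_s, t

    def forward(self, x):
        x_a, x_b = x[:, :self.half], x[:, self.half:]
        log_s, t = self._scale_shift(x_a)
        y_b = x_b * np.exp(log_s) + t
        log_det = log_s.sum(axis=1)  # log |det Jacobian| of this layer
        return np.concatenate([x_a, y_b], axis=1), log_det

    def inverse(self, y):
        y_a, y_b = y[:, :self.half], y[:, self.half:]
        # y_a equals x_a, so the same scale and shift can be recomputed.
        log_s, t = self._scale_shift(y_a)
        x_b = (y_b - t) * np.exp(-log_s)
        return np.concatenate([y_a, x_b], axis=1)


def flip(x):
    """Reverse the feature order so both halves get transformed across layers."""
    return x[:, ::-1]


def flow_forward(layers, x):
    total_log_det = np.zeros(x.shape[0])
    for i, layer in enumerate(layers):
        x, log_det = layer.forward(x)
        total_log_det += log_det
        if i < len(layers) - 1:
            x = flip(x)  # a permutation has log-determinant zero
    return x, total_log_det


def flow_inverse(layers, z):
    for i, layer in enumerate(reversed(layers)):
        z = layer.inverse(z)
        if i < len(layers) - 1:
            z = flip(z)  # flip is its own inverse
    return z


layers = [AffineCoupling(dim=8), AffineCoupling(dim=8)]
x = rng.normal(size=(4, 8))          # stand-in for 4 feature vectors
z, log_det = flow_forward(layers, x)  # map to the latent space
x_rec = flow_inverse(layers, z)       # map back

print("max round-trip error:", np.abs(x_rec - x).max())
```

The round-trip error printed at the end should be at the level of floating-point rounding, because the inverse recomputes the same scale and shift from the untouched half instead of approximating them.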
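Keywords	visual neural information encoding and decoding; multimodal learning; normalizing flow; multi-subject semantic decoding; unsupervised disentangled representation learning
Language	Chinese
Seven major directions: sub-direction classification	Brain-computer interface
State Key Laboratory planned direction classification	Cognitive mechanisms and brain-inspired learning
Associated dataset requiring deposit	No
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/52097
Collection	Graduates_Doctoral theses
Recommended citation (GB/T 7714)	Zhou Qiongyi (周琼怡). Research on Visual Neural Information Encoding and Decoding Methods Based on Multimodal Learning (基于多模态学习的视觉神经信息编解码方法研究) [D], 2023.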

Files in this item
File name / size	Document type	Version type	Access type	License
thesis_明版_答辩后修改完整版_I (21688 KB)	Thesis		Restricted access	CC BY-NC-SA