基于脉冲神经网络的多模态视听分类 (Multimodal Audio-Visual Classification Based on Spiking Neural Networks)
Author: 郭凌月
Date: 2024-05
Pages: 84
Degree Type: Master's
Abstract (Chinese)

In today's era of rapid technological development, multimodal data, which combines images, audio, text, and other forms of information, has become increasingly common and greatly enriches the expressive power of systems and applications. Compared with a single modality, multimodal data integrates visual, auditory, and other perceptual dimensions, allowing the meaning and semantics of a scene to be mined more deeply and markedly improving the depth and precision of data processing. However, traditional deep learning methods typically incur high energy consumption when handling such data, so more efficient processing schemes are urgently needed. Against this background, Spiking Neural Networks (SNNs) have become a research focus because of the efficiency they inherit from emulating biological nervous systems.
Spiking Neural Networks process information by imitating the spiking behavior of biological neurons. Unlike the continuous activations of conventional artificial neural networks, SNNs compute in an event-driven fashion: a neuron fires only when its accumulated input reaches a threshold, which greatly reduces unnecessary computation and energy consumption. In addition, SNNs encode information in the timing of spikes, exploiting the temporal structure of input signals and lowering the energy cost of processing time-series data. These properties, together with the natural sparsity of spike trains, further reduce the energy spent on storage and data transmission, giving SNNs a pronounced low-power advantage for complex tasks in energy-constrained environments.
Although SNNs have demonstrated their ability to emulate the brain's temporal processing in unimodal tasks such as image classification, object detection, and speech recognition, their application to multimodal audio-visual classification still faces challenges. This thesis proposes a series of SNN-based algorithms and models that target the audio-visual classification problem in multimodal data. By designing efficient multimodal alignment and fusion strategies, it improves both the accuracy and the processing efficiency of the classification task. The main contributions and innovations are as follows:
1. An audio-visual alignment algorithm based on spiking neural networks
This thesis proposes an SNN-based multimodal audio-visual alignment algorithm for efficiently processing and understanding complex information scenes. In natural environments, visual and auditory information are usually complementary and together provide a comprehensive understanding of the surrounding world. Effective multimodal alignment not only ensures precise temporal synchronization of information from different senses and strengthens the internal consistency of the data, but also markedly increases the richness and expressiveness of the information. It further lays the necessary foundation for deep information fusion: only when data from different modalities are correctly aligned can more complex patterns and relationships be uncovered. The algorithm enhances intra-modal feature representations with a spiking self-attention mechanism and then uses spiking neural networks to align visual and auditory signals dynamically. This approach improves the efficiency and accuracy of multimodal data processing and, thanks to the low energy consumption of SNNs, also greatly reduces computational cost.
2. An audio-visual fusion algorithm based on spiking neural networks
This thesis proposes an SNN-based audio-visual fusion algorithm that combines the efficient information processing of spiking neural networks with the attention mechanism of the Transformer to fuse visual and auditory information deeply, significantly improving the relevance and expressiveness of the fused representation. The two modalities are encoded efficiently by spiking neural networks, and the Transformer's attention mechanism is used to mine the intrinsic connections between visual and auditory information. The resulting deep fusion yields a richer and more comprehensive representation than either modality alone, enabling more accurate understanding of and response to complex environments or tasks.
3. A multimodal audio-visual classification model based on spiking neural networks
To achieve both low energy consumption and high accuracy in multimodal audio-visual classification, an SNN-based audio-visual classification model is constructed. The model integrates the alignment and fusion algorithms above and shows superior performance on multimodal audio-visual classification tasks, achieving high classification accuracy while keeping energy consumption low. In addition, two non-digital multimodal audio-visual datasets, CIFAR10-AV and UrbanSound8K-AV, are constructed, providing paired real-world images and audio. Experiments show that the proposed model not only performs well on public event-based datasets but also maintains low computational overhead on the self-built real-world datasets.

Abstract (English)

In the era of rapid technological advancement, the application of multimodal data, encompassing images, audio, and text, has become increasingly prevalent, significantly enriching the expressive capability of systems and applications. Compared to unimodal data, multimodal data integrates information from multiple sensory dimensions such as vision and hearing, allowing for a deeper exploration of the context and semantics of scenes, thereby significantly enhancing the depth and precision of data processing. However, traditional deep learning methods often encounter high energy consumption issues when dealing with such data, necessitating more efficient processing approaches. In this context, Spiking Neural Networks (SNNs) have emerged as a research focus due to their efficiency in mimicking the biological neural system and their low-energy characteristics.

Spiking Neural Networks process information by emulating the spiking behavior of biological neurons; unlike the continuous activations of traditional artificial neural networks, they implement an event-driven computation scheme. This means that neurons are activated only when they accumulate enough input signal to reach a threshold, significantly reducing unnecessary computations and energy consumption. Moreover, SNNs encode information through the intervals between spikes, effectively utilizing the temporal structure of input signals and thereby reducing the energy cost of processing time-series data. These unique attributes of SNNs, including their inherent sparsity, further reduce the energy burden of storage and data transmission, showcasing their significant low-energy advantages in handling complex tasks in energy-constrained environments.
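To make the event-driven computation described above concrete, the following is a minimal sketch of a leaky integrate-and-fire (LIF) neuron in PyTorch: the membrane potential integrates the input, a binary spike is emitted only when the threshold is crossed, and the potential is then reset. The class name, the time constant `tau`, and the threshold value are illustrative assumptions, not the thesis's implementation.

```python
import torch

class LIFNeuron(torch.nn.Module):
    """Minimal leaky integrate-and-fire neuron: the membrane potential decays,
    accumulates the input current, and emits a binary spike only when it
    crosses the firing threshold, after which it is reset."""

    def __init__(self, tau: float = 2.0, v_threshold: float = 1.0):
        super().__init__()
        self.tau = tau                  # membrane time constant (decay factor)
        self.v_threshold = v_threshold  # firing threshold
        self.v = None                   # membrane potential state

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.v is None:
            self.v = torch.zeros_like(x)
        # Leaky integration of the input current.
        self.v = self.v + (x - self.v) / self.tau
        # Event-driven output: a spike (1) only where the threshold is reached.
        spike = (self.v >= self.v_threshold).float()
        # Hard reset of the neurons that fired.
        self.v = torch.where(spike.bool(), torch.zeros_like(self.v), self.v)
        return spike

# Driving the neuron over T discrete time steps: the output is a sparse
# binary spike train rather than a dense continuous activation.
T, batch, features = 4, 2, 8
inputs = torch.rand(T, batch, features)
neuron = LIFNeuron()
spikes = torch.stack([neuron(inputs[t]) for t in range(T)])
```

Training such a neuron end-to-end additionally requires a surrogate gradient for the non-differentiable threshold; SNN libraries such as SpikingJelly provide this, but it is omitted here for brevity.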

Although SNNs have demonstrated their powerful capabilities in mimicking the brain's temporal processing in unimodal tasks such as image classification, object detection, and speech recognition, their application in multimodal audio-visual classification still faces challenges. This study introduces a series of innovative algorithms and models based on Spiking Neural Networks, focusing on addressing the audio-visual classification problem in multimodal data. By designing efficient multimodal alignment and fusion strategies, this research aims to enhance the accuracy and efficiency of classification tasks. The main contributions and innovations of this study include:

1. An Audio-Visual Alignment Algorithm Based on Spiking Neural Networks is proposed.

This paper proposes an audio-visual alignment algorithm based on Spiking Neural Networks, aimed at efficiently processing and understanding complex information scenes. In natural environments, visual and auditory information complement each other, providing a comprehensive understanding of our surroundings. Effective multimodal alignment ensures precise synchronization and alignment of information from different senses, not only enhancing the internal consistency of data but also significantly increasing the richness and expressiveness of information. Moreover, it lays the foundation for deep information fusion, allowing for further exploration and revelation of more complex patterns and relationships only after different modal data are correctly aligned. This algorithm enhances the feature representation within each modality through a Spiking Self-Attention mechanism (SSA) and then dynamically aligns visual and auditory signals using Spiking Neural Networks. This method not only improves the efficiency and accuracy of multimodal data processing but also significantly reduces computational costs due to the low-energy characteristics of Spiking Neural Networks.
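As a sketch of how a spiking self-attention block can enhance intra-modal features, the code below follows the commonly used spike-form attention recipe: queries, keys, and values are binarized into spike tensors, and no softmax is needed because all attention scores are already non-negative. The class `SpikingSelfAttention`, the thresholding stand-in `spike`, and the scale factor are assumptions for illustration and are not claimed to match the thesis's exact SSA formulation.

```python
import torch
import torch.nn as nn

class SpikingSelfAttention(nn.Module):
    """Sketch of a spiking self-attention block: Q, K, and V are binarized
    into spike form, so the attention map is computed from sparse binary
    tensors without a softmax."""

    def __init__(self, dim: int, scale: float = 0.125):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.scale = scale

    @staticmethod
    def spike(x: torch.Tensor) -> torch.Tensor:
        # Stand-in for a spiking neuron layer: threshold at 0 -> binary output.
        # A real model would use an LIF neuron with a surrogate gradient here.
        return (x > 0).float()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., N, D) sequence of per-token features for one modality.
        q = self.spike(self.q_proj(x))
        k = self.spike(self.k_proj(x))
        v = self.spike(self.v_proj(x))
        attn = q @ k.transpose(-2, -1) * self.scale   # sparse, non-negative scores
        return self.out_proj(self.spike(attn @ v))    # spike-form attended features

# Example: enhance 16 tokens with 64-dimensional features.
tokens = torch.rand(16, 64)
attended = SpikingSelfAttention(64)(tokens)
```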

2. An Audio-Visual Fusion Algorithm Based on Spiking Neural Networks is proposed.

This paper presents an audio-visual fusion algorithm based on Spiking Neural Networks, combining the efficient information processing of Spiking Neural Networks with the attention mechanism of Transformers to achieve deep fusion of visual and auditory information, significantly enhancing the relevance and expressiveness of the fused representation. The two modalities are encoded efficiently by Spiking Neural Networks, and the Transformer attention mechanism is used to mine the intrinsic connections between visual and auditory information. By deeply fusing visual and auditory information, a richer and more comprehensive data representation is obtained than from either modality alone, enabling a more accurate understanding of and response to complex environments or tasks.
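To illustrate one plausible way of fusing SNN-encoded visual and audio features with Transformer-style attention, the sketch below lets each modality attend to the other using standard PyTorch multi-head attention and concatenates the two attended streams. The module `AudioVisualFusion` and its interface are assumptions made for exposition, not the architecture proposed in the thesis.

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Sketch of cross-modal fusion: visual and audio feature sequences attend
    to each other with multi-head attention, and the two attended streams are
    pooled and concatenated into a joint audio-visual representation."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.vis_to_aud = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # vision queries audio
        self.aud_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # audio queries vision
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, Nv, D) visual features; aud: (B, Na, D) audio features.
        vis_ctx, _ = self.vis_to_aud(query=vis, key=aud, value=aud)  # vision attends to audio
        aud_ctx, _ = self.aud_to_vis(query=aud, key=vis, value=vis)  # audio attends to vision
        # Pool over tokens and concatenate the two attended streams.
        joint = torch.cat([vis_ctx.mean(dim=1), aud_ctx.mean(dim=1)], dim=-1)
        return self.fuse(joint)  # fused audio-visual representation, shape (B, D)

# Example: batch of 2, with 49 visual tokens and 20 audio frames, 128-dim features.
fused = AudioVisualFusion(dim=128)(torch.rand(2, 49, 128), torch.rand(2, 20, 128))
```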

3. A Multimodal Audio-Visual Classification Model Based on Spiking Neural Networks is proposed.

To address the challenge of achieving both low energy consumption and high accuracy in multimodal audio-visual classification tasks, a classification model based on Spiking Neural Networks is constructed. This model integrates the aforementioned alignment and fusion algorithms and demonstrates superior performance in multimodal audio-visual classification, in particular achieving highly accurate classification while maintaining low energy consumption. Additionally, two non-digital multimodal audio-visual datasets, CIFAR10-AV and UrbanSound8K-AV, are constructed, providing paired real-world images and audio. Experiments show that the proposed model not only performs excellently on public event-based datasets but also maintains low computational overhead on the self-built real-world datasets.
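Putting the pieces together, a structural skeleton of such a classification model might look like the following. It assumes the `SpikingSelfAttention` and `AudioVisualFusion` sketches above are in scope; the encoders are placeholders (`nn.LazyLinear`) and the input shapes are invented for illustration, so this shows only how alignment, fusion, and classification could be chained, not the proposed model itself.

```python
import torch
import torch.nn as nn

class SpikingAVClassifier(nn.Module):
    """Structural sketch of the pipeline: per-modality spiking encoders,
    intra-modal spiking self-attention, cross-modal fusion, and a linear
    classification head. Encoder details are placeholders only."""

    def __init__(self, dim: int = 128, num_classes: int = 10):
        super().__init__()
        self.vis_encoder = nn.LazyLinear(dim)     # placeholder for a spiking visual encoder
        self.aud_encoder = nn.LazyLinear(dim)     # placeholder for a spiking audio encoder
        self.align_v = SpikingSelfAttention(dim)  # intra-modal enhancement, visual stream
        self.align_a = SpikingSelfAttention(dim)  # intra-modal enhancement, audio stream
        self.fusion = AudioVisualFusion(dim)      # cross-modal fusion
        self.head = nn.Linear(dim, num_classes)

    def forward(self, vis_tokens: torch.Tensor, aud_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, Nv, Dv) pre-extracted image patch features;
        # aud_tokens: (B, Na, Da) spectrogram frame features.
        vis = self.align_v(self.vis_encoder(vis_tokens))
        aud = self.align_a(self.aud_encoder(aud_tokens))
        return self.head(self.fusion(vis, aud))   # class logits, shape (B, num_classes)

# Example shapes for a 10-class CIFAR10-AV style pairing, purely illustrative.
model = SpikingAVClassifier(dim=128, num_classes=10)
logits = model(torch.rand(4, 49, 192), torch.rand(4, 20, 80))
```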

Keywords: Spiking Neural Networks; Multimodal Alignment; Multimodal Fusion; Audio-Visual Classification
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/57634
Collection: Graduates / Master's Theses
Recommended Citation (GB/T 7714):
郭凌月. 基于脉冲神经网络的多模态视听分类[D]. 2024.
Files in This Item:
毕业论文_无签字.pdf (3051 KB): Thesis, Restricted Access, License: CC BY-NC-SA