Multimodal Audio-Visual Classification Based on Spiking Neural Networks (基于脉冲神经网络的多模态视听分类)
郭凌月 (Guo Lingyue)
2024-05
Pages | 84
Degree type | Master's
Abstract (Chinese, translated) | In today's era of rapid technological development, the use of multimodal data has become increasingly widespread. By combining images, audio, text, and other forms of information, such data greatly enriches the expressive capability of systems and applications. Compared with a single modality, multimodal data integrates perceptual information from multiple dimensions such as vision and hearing, allowing the meaning and semantics of a scene to be mined more deeply and markedly improving both the depth and the precision of data processing. However, traditional deep learning methods handle such data at a high energy cost, and more efficient processing schemes are urgently needed. Against this background, Spiking Neural Networks (SNNs) have become a research hotspot owing to the efficiency with which they emulate biological nervous systems. |
Abstract (English) | In the era of rapid technological advancement, the application of multimodal data, encompassing images, audio, and text, has become increasingly prevalent, significantly enriching the expressive capability of systems and applications. Compared to unimodal data, multimodal data integrates information from multiple sensory dimensions such as vision and hearing, allowing a deeper exploration of the context and semantics of scenes and thereby significantly enhancing the depth and precision of data processing. However, traditional deep learning methods often incur high energy consumption when dealing with such data, necessitating more efficient processing approaches. In this context, Spiking Neural Networks (SNNs) have emerged as a research focus due to their low-energy characteristics and their efficiency in mimicking the biological nervous system. Unlike the continuous activations of traditional artificial neural networks, SNNs process information by emulating the spiking behavior of biological neurons, implementing an event-driven model of computation: a neuron is activated only when its accumulated input reaches a threshold, which eliminates much unnecessary computation and energy consumption. Moreover, SNNs encode information in the intervals between spikes, exploiting the temporal structure of input signals and thereby reducing the energy cost of processing time-series data. Together with their inherent sparsity, these attributes lower the burden of storage and data transmission, giving SNNs a significant low-energy advantage for complex tasks in energy-constrained environments.
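The threshold-and-fire behavior described above can be sketched with a minimal leaky integrate-and-fire (LIF) neuron. The decay and threshold constants below are illustrative assumptions, not values from the thesis.

```python
def lif_neuron(inputs, decay=0.9, threshold=1.0):
    """Minimal leaky integrate-and-fire neuron.

    Integrates input current over discrete time steps and emits a
    binary spike (1) only when the membrane potential crosses the
    threshold; the potential is then reset to zero.
    """
    v = 0.0            # membrane potential
    spikes = []
    for current in inputs:
        v = decay * v + current   # leaky integration
        if v >= threshold:
            spikes.append(1)      # event: neuron fires
            v = 0.0               # hard reset after the spike
        else:
            spikes.append(0)      # no event, no downstream work
    return spikes

# Sub-threshold inputs accumulate until a spike is emitted.
print(lif_neuron([0.4, 0.4, 0.4, 0.0, 0.2]))  # → [0, 0, 1, 0, 0]
```

Only the time step that crosses the threshold produces an event; the other steps cost a downstream layer nothing, which is the source of the energy savings the abstract describes.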
Although SNNs have demonstrated powerful capabilities in mimicking the brain's temporal processing in unimodal tasks such as image classification, object detection, and speech recognition, their application to multimodal audio-visual classification still faces challenges. This study introduces a series of algorithms and models based on Spiking Neural Networks that address the audio-visual classification problem in multimodal data. By designing efficient multimodal alignment and fusion strategies, it aims to improve both the accuracy and the efficiency of classification. The main contributions and innovations of this study are: 1. An audio-visual alignment algorithm based on Spiking Neural Networks, aimed at efficiently processing and understanding complex information scenes. In natural environments, visual and auditory information complement each other, providing a comprehensive understanding of our surroundings. Effective multimodal alignment ensures precise synchronization of information from different senses, which not only enhances the internal consistency of the data but also increases its richness and expressiveness. It also lays the foundation for deep information fusion: only after data from different modalities are correctly aligned can more complex patterns and relationships be explored and revealed. The algorithm strengthens the feature representation within each modality through a Spiking Self-Attention (SSA) mechanism and then dynamically aligns the visual and auditory signals using Spiking Neural Networks. This improves the efficiency and accuracy of multimodal data processing while, thanks to the low-energy characteristics of SNNs, significantly reducing computational cost.
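The Spiking Self-Attention mechanism named in contribution 1 can be illustrated with a rough NumPy sketch: attention computed over binary spike tensors, which needs no softmax because products of 0/1 spikes are already non-negative. The shapes, threshold, and scale factor here are assumptions for illustration, not the thesis's actual formulation.

```python
import numpy as np

def heaviside_spikes(x, threshold=0.5):
    """Binarize real-valued activations into spike tensors (0/1)."""
    return (x >= threshold).astype(np.float32)

def spiking_self_attention(q, k, v, scale=0.125):
    """Softmax-free self-attention over binary spike tensors.

    q, k, v: (T, N, D) spike tensors (time steps, tokens, features).
    Because every entry is 0 or 1, Q K^T V is non-negative, so no
    softmax normalization is required; a fixed scale keeps the
    magnitudes small before re-binarizing the output into spikes.
    """
    out = []
    for t in range(q.shape[0]):          # processed step by step in time
        attn = q[t] @ k[t].T             # (N, N) spike co-activation counts
        out.append(heaviside_spikes(attn @ v[t] * scale))
    return np.stack(out)

rng = np.random.default_rng(0)
q = heaviside_spikes(rng.random((4, 8, 16)))
k = heaviside_spikes(rng.random((4, 8, 16)))
v = heaviside_spikes(rng.random((4, 8, 16)))
out = spiking_self_attention(q, k, v)
print(out.shape)  # (4, 8, 16), with all entries binary
```

The key design point the sketch shows is that spike-form attention replaces the expensive softmax with cheap accumulate-and-threshold operations.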
2. An audio-visual fusion algorithm based on Spiking Neural Networks. The algorithm combines the attention mechanism of Spiking Neural Networks with Transformers to achieve deep fusion of visual and auditory information, significantly enhancing the relevance and expressiveness of the fused representation. Deeply fusing visual and auditory information yields a richer and more comprehensive data representation, enabling a more accurate understanding of, and response to, complex environments and tasks. 3. A multimodal audio-visual classification algorithm based on Spiking Neural Networks. To meet the twin demands of low energy consumption and high accuracy in multimodal audio-visual classification, a classification model based on Spiking Neural Networks is proposed. Integrating the alignment and fusion algorithms above, the model shows superior performance on multimodal audio-visual classification tasks, achieving high-accuracy classification while maintaining low energy consumption. In addition, two non-digital multimodal audio-visual datasets, CIFAR10-AV and UrbanSound8K-AV, are constructed, providing a series of real-world images and audio. Experiments show that the proposed model not only performs excellently on public datasets but also maintains low computational overhead on these real-world datasets. |
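The cross-modal fusion idea in contribution 2 can be illustrated with a toy sketch in which visual spike features attend to audio spike features and vice versa, and the two attended views are pooled into one joint vector. All names, shapes, the threshold, and the pooling choice are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

def spikes(x, threshold=0.5):
    """Binarize real-valued features into 0/1 spike tensors."""
    return (x >= threshold).astype(np.float32)

def cross_modal_fusion(visual, audio, scale=0.1):
    """Fuse spike features from two modalities via cross-attention.

    visual: (N, D) visual spike tokens; audio: (M, D) audio spike
    tokens. Each modality attends to the other with softmax-free
    spike attention, then token-averaged views are summed into a
    single joint representation for a downstream classifier head.
    """
    v2a = spikes((visual @ audio.T) @ audio * scale)   # visual attends to audio
    a2v = spikes((audio @ visual.T) @ visual * scale)  # audio attends to visual
    # Pool each attended view over its tokens and sum the modalities.
    return v2a.mean(axis=0) + a2v.mean(axis=0)

rng = np.random.default_rng(1)
fused = cross_modal_fusion(spikes(rng.random((8, 16))),
                           spikes(rng.random((6, 16))))
print(fused.shape)  # (16,)
```

A real model would feed `fused` into a spiking classifier layer; the sketch only shows how the two spike streams are combined into one representation.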
Keywords | Spiking Neural Networks; Multimodal Alignment; Multimodal Fusion; Audio-Visual Classification
Language | Chinese
Document type | Thesis (Master's)
Identifier | http://ir.ia.ac.cn/handle/173211/57634
Collection | Graduates — Master's theses
Recommended citation (GB/T 7714) | 郭凌月. 基于脉冲神经网络的多模态视听分类[D], 2024.
Files in this item |
File name / size | Document type | Version | Access | License
毕业论文_无签字.pdf (3051 KB) | Thesis | | Restricted access | CC BY-NC-SA