CASIA OpenIR > National Laboratory of Pattern Recognition
Title: 基于深度学习的视听多模态融合及生成方法研究 (Research on Audio-Visual Multimodal Fusion and Generation Methods Based on Deep Learning)
Author: 郝王丽
Subtype: Doctoral dissertation (博士)
Thesis Advisor: 张兆翔
Date: 2019-05-30
Degree Grantor: Institute of Automation, Chinese Academy of Sciences
Place of Conferral: Institute of Automation, Chinese Academy of Sciences
Degree Discipline: Computer Applied Technology (计算机应用技术)
Keyword: multimodal perception, modality completion, audio-visual fusion, audio-visual generation
Abstract

Video is an important information carrier; compared with images or audio alone, it offers a much richer expression of information. It has therefore been studied extensively in computer vision, for example in video captioning, video segmentation, video detection, and video action recognition. As is well known, video carries two modalities, visual and auditory, which contain both common and complementary information. If the information of the two modalities can be fully fused, the vast semantic content of the corresponding video can be understood far more deeply. However, most current video tasks exploit only one modality and ignore the valuable information carried by the other. Moreover, because of environmental interference or sensor failure, a video may suffer from missing modalities. Studying how to fully mine the audio-visual information in video is therefore a key problem of practical significance and application value. This thesis investigates two aspects, effective audio-visual fusion and audio-visual mutual generation, with the following specific contributions:

1. For multi-stream fusion within the single visual modality, this thesis proposes an action recognition model based on spatiotemporal stream fusion. The network is designed to explore interactions between the appearance and motion streams at different levels. Specifically, block-level dense connections between the two streams provide interaction at the feature representation level, while knowledge distillation between the two streams (each regarded as a student) and their final fusion (regarded as the teacher) provides interaction at the decision, i.e., high, level. The model progressively extracts hierarchical spatiotemporal features and can be trained end-to-end. Extensive ablation experiments on two benchmark datasets, UCF101 and HMDB51, verify the effectiveness and generalization ability of the proposed method.

2. For audio-visual multimodal fusion, this thesis proposes two multimodal deep-feature-fusion models for video captioning. The first fuses the modalities by building short-term audio-visual temporal dependencies through weight sharing; the second builds long-term temporal dependencies by sharing an external memory unit. In addition, a novel dynamic multimodal deep-feature-fusion model is proposed to handle the missing-audio case. Extensive experiments on two benchmark video captioning datasets, MSR-VTT and MSVD, verify that sharing weights between the visual and audio LSTMs effectively captures the resonance information between the two modalities and achieves state-of-the-art BLEU and METEOR scores. Moreover, even when the audio modality is missing, the dynamic multimodal feature-fusion model still achieves performance comparable to the complete-modality model.

3. For audio-visual multimodal generation, this thesis proposes a Cross-Modal Cycle Generative Adversarial Network (CMCGAN) for audio-visual mutual generation, addressing two specific tasks: instrument-class-oriented and pose-oriented cross-modal generation. Specifically, CMCGAN consists of four kinds of subnetworks (audio-to-visual, visual-to-audio, audio-to-audio, and visual-to-visual), organized in a cycle architecture. CMCGAN has several notable advantages. First, it unifies audio-visual mutual generation in a single framework through a joint corresponding adversarial loss. Second, by introducing a Gaussian latent variable, it effectively handles the dimension and structure asymmetry between the modalities. Third, it can be trained end-to-end. Given CMCGAN's strong cross-modal generation quality, a dynamic multimodal classification network is further proposed to address the missing-modality problem. Extensive quantitative and qualitative experiments verify that CMCGAN achieves the best performance on both instrument-class-oriented and pose-oriented audio-visual mutual generation. The experiments also show that the generated modality can play a role comparable to the original modality, validating the proposed model's advantage in handling missing modalities.

Other Abstract

Video is an important information carrier; compared with a single image or audio track, it contains richer information. It has therefore been studied extensively in computer vision, for example in video captioning, video segmentation, video detection, and video action recognition. As is well known, video contains two modalities, visual and audio, which carry both common and complementary information. If the information from both modalities is thoroughly utilized, the performance of the corresponding task can be improved. However, most current video tasks leverage information from only a single modality. In addition, due to environmental interference or sensor faults, one modality is sometimes damaged or missing. Exploring how to fully mine the information underlying video is thus a key problem with practical significance and application value. This thesis focuses on two aspects of research, with the following contributions:


1. Concerning fusion within the single visual modality, we propose a spatiotemporally fused model, STDDCN, for video action recognition. The network implements both knowledge distillation and dense connectivity. With the STDDCN architecture, we explore interaction strategies between the appearance and motion streams along different hierarchies. Specifically, block-level dense connections between the appearance and motion pathways enable spatiotemporal interaction at the feature representation layers, while knowledge distillation between the two streams (each treated as a student) and their final fusion (treated as the teacher) lets both streams interact at the high-level, decision layers. This architecture allows STDDCN to gradually obtain effective hierarchical spatiotemporal features, and it can be trained end-to-end. Extensive ablation studies on two benchmark datasets, UCF101 and HMDB51, validate the effectiveness and generalization of our model, which achieves promising performance.
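The abstract does not spell out the distillation objective, so the following is only a minimal pure-Python sketch of the standard temperature-scaled distillation loss, assuming the common KL-divergence formulation and using averaged stream logits as a stand-in for the fusion (teacher) head; the variable names are illustrative, not from the thesis.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T softens the distribution.
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in the standard distillation formulation.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T * T) * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Each stream (appearance, motion) is a "student"; their fused
# prediction acts as the "teacher" guiding both streams.
appearance_logits = [2.0, 0.5, -1.0]   # toy per-class scores
motion_logits     = [1.5, 1.0, -0.5]
fused_logits      = [(a + m) / 2 for a, m in zip(appearance_logits, motion_logits)]

loss = kd_loss(appearance_logits, fused_logits) + kd_loss(motion_logits, fused_logits)
```

In the real network this term is added to the usual classification losses, so the fused head's softened predictions steer both individual streams during training.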

2. Concerning audio-visual fusion, we propose two multimodal deep-feature-fusion models that integrate visual and audio cues into a high-order consistent representation at the encoding or decoding stage of a video captioning framework. The first establishes short-term visual-audio temporal dependencies by sharing weights, while the second builds long-term visual-audio temporal correlations by sharing an external cross-modal memory. In addition, we develop a novel dynamic multimodal feature-fusion model to handle audio tracks missing from video clips during caption generation. Extensive experiments on two benchmark datasets, MSR-VTT and MSVD, show that sharing weights between the visual and audio LSTM streams effectively captures resonance information across the two modalities and achieves state-of-the-art BLEU and METEOR scores. Even when the audio modality is absent, our dynamic multimodal feature-fusion model still obtains results comparable to the complete multimodal model.
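To make the weight-sharing idea concrete, here is a toy sketch in which a single recurrent cell (a plain tanh RNN standing in for the thesis's LSTMs) encodes both the visual and the audio stream with the same parameters, followed by a simple dynamic fallback when audio is absent. The cell, feature sizes, and averaging fusion are all illustrative assumptions, not the thesis's actual architecture.

```python
import math
import random

random.seed(0)

def rnn_step(h, x, W_h, W_x):
    # One step of a minimal recurrent cell (stand-in for an LSTM).
    # The SAME parameters W_h, W_x serve both modalities.
    return [math.tanh(sum(W_h[i][j] * h[j] for j in range(len(h)))
                      + sum(W_x[i][j] * x[j] for j in range(len(x))))
            for i in range(len(h))]

def encode(seq, W_h, W_x, hidden=4):
    # Run the shared cell over a sequence and return the final state.
    h = [0.0] * hidden
    for x in seq:
        h = rnn_step(h, x, W_h, W_x)
    return h

hidden, feat = 4, 3
W_h = [[random.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(hidden)]
W_x = [[random.uniform(-0.5, 0.5) for _ in range(feat)] for _ in range(hidden)]

visual = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]   # toy per-frame features
audio  = [[0.5, 0.1, -0.3], [-0.2, 0.4, 0.1]]   # toy per-frame features

h_v = encode(visual, W_h, W_x)   # both streams share W_h and W_x,
h_a = encode(audio,  W_h, W_x)   # coupling their temporal dynamics

# Dynamic fusion: average when both modalities exist, else fall back
# to the visual state alone (the missing-audio case).
fused = [(a + b) / 2 for a, b in zip(h_v, h_a)] if audio else h_v
```

Because the parameters are shared, gradients from both streams update the same weights, which is the mechanism the thesis credits with capturing cross-modal resonance.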

3. Concerning mutual generation of the audio-visual modalities, we propose a Cross-Modal Cycle Generative Adversarial Network (CMCGAN) to handle cross-modal visual-audio mutual generation, applied to two specific scenarios: instrument-oriented and pose-oriented generation. Specifically, CMCGAN is composed of four kinds of subnetworks (audio-to-visual, visual-to-audio, audio-to-audio, and visual-to-visual), organized in a cycle architecture. CMCGAN has several remarkable advantages. First, it unifies visual-audio mutual generation in a common framework through a joint corresponding adversarial loss. Second, by introducing a latent vector with a Gaussian distribution, it effectively handles the dimension and structure asymmetry between the visual and audio modalities. Third, it can be trained end-to-end, which is convenient. Building on CMCGAN, we also develop a dynamic multimodal classification network to handle the missing-modality problem. Numerous qualitative and quantitative experiments validate that CMCGAN obtains state-of-the-art cross-modal visual-audio generation results in both the instrument-oriented and pose-oriented cases. Furthermore, our experiments show that the generated modality achieves effects comparable to those of the original modality, demonstrating the effectiveness and advantages of the proposed method in handling missing modalities.
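The four-subnetwork cycle can be sketched with toy stand-ins: treat each subnetwork as a function, compose the within-modality paths through the opposite modality, and penalize the reconstruction error. The linear maps below are deliberately trivial placeholders (the real subnetworks are encoder-decoders with a Gaussian latent), chosen only to show how the cycle losses are wired; the joint adversarial loss is omitted.

```python
# Toy stand-ins for the four CMCGAN subnetworks. The real versions are
# encoder-decoder networks bridged by a Gaussian latent vector.
def a2v(a): return [2.0 * x for x in a]   # audio  -> visual
def v2a(v): return [0.5 * x for x in v]   # visual -> audio
def a2a(a): return v2a(a2v(a))            # audio  -> visual latent -> audio
def v2v(v): return a2v(v2a(v))            # visual -> audio latent -> visual

def l1(p, q):
    # L1 reconstruction error between two feature vectors.
    return sum(abs(x - y) for x, y in zip(p, q))

audio  = [0.3, -0.7, 0.2]   # toy audio features
visual = [1.0, 0.4, -0.5]   # toy visual features

# Cycle losses: a sample mapped into the other modality and back
# should reconstruct its source. With these inverse toy maps the
# reconstruction is exact, so the cycle losses vanish.
cycle_loss = l1(a2a(audio), audio) + l1(v2v(visual), visual)
```

Organizing the four subnetworks in a cycle is what lets one objective train both generation directions at once: each direction is supervised by how well the other can undo it.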

Pages: 100-130
Language: Chinese (中文)
Document Type: Dissertation (学位论文)
Identifier: http://ir.ia.ac.cn/handle/173211/23883
Collection: National Laboratory of Pattern Recognition; Center for Research on Intelligent Perception and Computing
Recommended Citation
GB/T 7714
郝王丽. 基于深度学习的视听多模态融合及生成方法研究[D]. 中国科学院自动化研究所, 2019.
Files in This Item:
File: Thesis1.pdf (10215 KB) | DocType: Dissertation (学位论文) | Access: Open Access | License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.