Integrating both Visual and Audio Cues for Enhanced Video Caption
Wangli Hao (1); Zhaoxiang Zhang (1,2,3); He Guan (1)
2018
Conference Name: AAAI
Conference Date: 2018.2.1
Conference Place: Hilton New Orleans Riverside, USA
Abstract

Video captioning refers to automatically generating a descriptive sentence for a specific short video clip, and it has achieved remarkable success recently. However, most existing methods focus on visual information while ignoring the synchronized audio cues. We propose three multimodal deep fusion strategies to maximize the benefits of visual-audio resonance information. The first explores the impact of cross-modal feature fusion from low to high order. The second establishes short-term visual-audio dependency by sharing the weights of the corresponding front-end networks. The third extends the temporal dependency to the long term by sharing multimodal memory across the visual and audio modalities. Extensive experiments validate the effectiveness of our three cross-modal fusion strategies on two benchmark datasets, Microsoft Research Video to Text (MSRVTT) and Microsoft Video Description (MSVD). Notably, weight sharing coordinates visual-audio feature fusion effectively and achieves state-of-the-art performance on both the BLEU and METEOR metrics. Furthermore, we are the first to propose a dynamic multimodal feature fusion framework to handle the case where some modalities are missing. Experimental results demonstrate that even when audio is absent, we still obtain comparable results with the aid of an additional audio modality inference module.
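As a rough illustration of the first two strategies, the sketch below fuses toy visual and audio feature vectors. It is not the authors' implementation: the abstract does not say which operators are used, so concatenation (low order), a flattened outer product (higher order), and a single shared projection matrix (weight sharing) are assumptions, and all function names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def fuse_concat(visual: Vector, audio: Vector) -> Vector:
    """Low-order fusion (assumed): concatenate the two feature vectors."""
    return visual + audio


def fuse_outer(visual: Vector, audio: Vector) -> Vector:
    """Higher-order fusion (assumed): flattened outer product, so every
    visual dimension interacts multiplicatively with every audio dimension."""
    return [v * a for v in visual for a in audio]


def project(weights: Matrix, features: Vector) -> Vector:
    """Linear projection: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def shared_weight_encode(
    weights: Matrix, visual: Vector, audio: Vector
) -> Tuple[Vector, Vector]:
    """Weight sharing (assumed form): the same matrix encodes both
    modalities, which requires equal-length feature vectors."""
    if len(visual) != len(audio):
        raise ValueError("shared weights need equal-length features")
    return project(weights, visual), project(weights, audio)


# Toy inputs, chosen only for illustration.
visual_feat = [1.0, 2.0]
audio_feat = [3.0, 4.0]
shared_w = [[1.0, 0.0], [1.0, 1.0]]

print(fuse_concat(visual_feat, audio_feat))
print(fuse_outer(visual_feat, audio_feat))
print(shared_weight_encode(shared_w, visual_feat, audio_feat))
```

Concatenation keeps the modalities side by side and leaves any interaction to later layers, whereas the outer product makes pairwise interactions explicit at the cost of a feature size that grows with the product of the input dimensions.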
 

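Language: English
Document Type: Conference paper
Identifier: http://ir.ia.ac.cn/handle/173211/23881
Collection: National Laboratory of Pattern Recognition
Center for Research on Intelligent Perception and Computing
Corresponding Author: Zhaoxiang Zhang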
Affiliation:
1. Center of Research on Intelligent Perception and Computing
2. Institute of Automation, University of Chinese Academy of Sciences
3. CAS Center for Excellence in Brain Science and Intelligence
4. Center for Excellence in Brain Science and Intelligence Technology (CEBSIT)
Recommended Citation
GB/T 7714
Wangli Hao, Zhaoxiang Zhang, He Guan. Integrating both Visual and Audio Cues for Enhanced Video Caption[C], 2018.
Files in This Item:
File Name/Size: 3--Integrating both (528KB)
DocType: Conference paper
Access: Open access
License: CC BY-NC-SA
File name: 3--Integrating both Visual and Audio Cues for Enhanced Video Caption.pdf
Format: Adobe PDF