Integrating both Visual and Audio Cues for Enhanced Video Caption
Wangli Hao1; Zhaoxiang Zhang1,2,3; He Guan1
2018
Conference Name: AAAI
Conference Date: 2018-02-01
Conference Location: Hilton New Orleans Riverside, USA
Abstract

Video captioning refers to automatically generating a descriptive sentence for a specific short video clip, a task that has achieved remarkable success recently. However, most existing methods focus on visual information while ignoring the synchronized audio cues. We propose three multimodal deep fusion strategies to maximize the benefits of visual-audio resonance information. The first explores the impact of cross-modal feature fusion from low to high order. The second establishes a short-term visual-audio dependency by sharing the weights of the corresponding front-end networks. The third extends the temporal dependency to the long term by sharing multimodal memory across the visual and audio modalities. Extensive experiments validate the effectiveness of our three cross-modal fusion strategies on two benchmark datasets, Microsoft Research Video to Text (MSRVTT) and Microsoft Video Description (MSVD). Notably, weight sharing coordinates visual-audio feature fusion effectively and achieves state-of-the-art performance on both the BLEU and METEOR metrics. Furthermore, we are the first to propose a dynamic multimodal feature fusion framework to handle the case where some modalities are missing. Experimental results demonstrate that even when audio is absent, we can still obtain comparable results with the aid of an additional audio modality inference module.
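To make the first two strategies in the abstract more concrete, here is a minimal sketch, not the authors' implementation. The abstract does not name the fusion operators or the front-end architecture, so everything below is an assumption for illustration: concatenation stands in for low-order fusion, an element-wise product and a flattened outer product stand in for higher-order fusion, and a single shared linear map stands in for front-end networks with shared weights. All function names and values are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Apply a linear map (one row of weights per output unit)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def concat_fusion(visual: Vector, audio: Vector) -> Vector:
    """Low-order fusion (assumed): concatenate the two feature vectors."""
    return visual + audio


def multiplicative_fusion(visual: Vector, audio: Vector) -> Vector:
    """Higher-order fusion (assumed): element-wise product of the features."""
    if len(visual) != len(audio):
        raise ValueError("feature vectors must have the same length")
    return [v * a for v, a in zip(visual, audio)]


def outer_product_fusion(visual: Vector, audio: Vector) -> Vector:
    """Second-order fusion (assumed): flattened outer product."""
    return [v * a for v in visual for a in audio]


# Weight sharing (assumed form): the SAME matrix processes both modalities,
# so both feature streams are projected by identical parameters.
shared_weights: Matrix = [[1.0, 0.0], [0.5, 0.5]]

visual_feature: Vector = [1.0, 2.0]  # toy visual feature for one time step
audio_feature: Vector = [3.0, 4.0]   # toy audio feature for the same step

visual_hidden = linear(shared_weights, visual_feature)
audio_hidden = linear(shared_weights, audio_feature)

fused_low = concat_fusion(visual_hidden, audio_hidden)
fused_mul = multiplicative_fusion(visual_hidden, audio_hidden)
fused_outer = outer_product_fusion(visual_hidden, audio_hidden)
```

In this toy setup the concatenated vector keeps the two modalities side by side, while the product-based variants let each fused dimension depend on both modalities at once, which is one common reading of "low to high order" fusion.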
 

Language: English
Document Type: Conference paper
Item Identifier: http://ir.ia.ac.cn/handle/173211/23881
Collection: National Laboratory of Pattern Recognition
Center for Research on Intelligent Perception and Computing
Corresponding Author: Zhaoxiang Zhang
Affiliations: 1. Center of Research on Intelligent Perception and Computing
2. Institute of Automation, University of Chinese Academy of Sciences
3. CAS Center for Excellence in Brain Science and Intelligence
4. Center for Excellence in Brain Science and Intelligence Technology (CEBSIT)
Recommended Citation
GB/T 7714
Wangli Hao,Zhaoxiang Zhang,He Guan. Integrating both Visual and Audio Cues for Enhanced Video Caption[C],2018.
Files in This Item
File Name/Size: 3--Integrating both (528KB)
Document Type: Conference paper
Access Type: Open access
License: CC BY-NC-SA
File Name: 3--Integrating both Visual and Audio Cues for Enhanced Video Caption.pdf
Format: Adobe PDF
 
