Recall What You See Continually Using GridLSTM in Image Captioning
Wu, Lingxiang [1]; Xu, Min [1]; Wang, Jinqiao [2]; Perry, Stuart [3]
Journal: IEEE TRANSACTIONS ON MULTIMEDIA
ISSN: 1520-9210
2020-03-01
Volume: 22, Issue: 3, Pages: 808-818
Corresponding author: Xu, Min (Min.Xu@uts.edu.au)
Abstract: The goal of image captioning is to automatically describe an image with a sentence, a task that has attracted research attention from both the computer vision and natural language processing communities. Existing encoder-decoder models and their variants, the most popular models for image captioning, use the image features in one of three ways. First, they inject the encoded image features into the decoder only once, at the initial step, which does not allow the rich image content to be explored sufficiently while the caption is generated word by word. Second, they concatenate the encoded image features with the text as extra input at every step, which introduces unnecessary noise. Third, they use an attention mechanism, which increases computational complexity because extra neural networks are introduced to identify the attention regions. Unlike these methods, in this paper we propose a novel network, the Recall Network, for generating captions that are consistent with the images. The Recall Network selectively involves the visual features by using a GridLSTM and is thus able to recall image content while generating each word. By importing the visual information as the latent memory along the depth-dimension LSTM, the decoder is able to admit the visual features dynamically through the inherent LSTM structure, without adding any extra neural networks or parameters. The Recall Network efficiently prevents the decoder from deviating from the original image content. To verify the efficiency of our model, we conducted exhaustive experiments on full and dense image captioning. The experimental results demonstrate that the Recall Network outperforms the conventional encoder-decoder model by a large margin and performs comparably to state-of-the-art methods.
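The idea in the abstract of feeding visual information as the memory along the depth dimension can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the two-dimensional (time and depth) block layout, the choice to feed the word embedding as the depth-dimension hidden input, the random weights, and all names (`lstm_transform`, `grid_step`, `decode`, `init_params`) are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_transform(H, m, W, b):
    """Standard LSTM update: gates are computed from H, memory m is updated."""
    d = m.shape[0]
    z = W @ H + b                      # shape (4 * d,)
    i = sigmoid(z[:d])                 # input gate
    f = sigmoid(z[d:2 * d])            # forget gate
    o = sigmoid(z[2 * d:3 * d])        # output gate
    g = np.tanh(z[3 * d:])             # candidate content
    m_new = f * m + i * g
    h_new = o * np.tanh(m_new)
    return h_new, m_new


def grid_step(h_time, m_time, h_depth, m_depth, params):
    """One 2-D grid block: both dimensions share the concatenated hidden input."""
    H = np.concatenate([h_time, h_depth])
    h_t, m_t = lstm_transform(H, m_time, *params["time"])
    h_d, m_d = lstm_transform(H, m_depth, *params["depth"])
    return h_t, m_t, h_d, m_d


def init_params(d, e, rng):
    """Random illustrative weights; d = hidden size, e = word embedding size."""
    def make():
        return 0.1 * rng.normal(size=(4 * d, d + e)), np.zeros(4 * d)
    return {"time": make(), "depth": make()}


def decode(word_embs, visual_memory, params):
    """Run the block over a word sequence.

    Assumption: the (projected) image feature is supplied as the depth
    memory at every step, so the ordinary LSTM gates decide how much
    visual content is admitted; no attention network is added.
    """
    d = visual_memory.shape[0]
    h_time, m_time = np.zeros(d), np.zeros(d)
    outputs = []
    for x in word_embs:
        h_time, m_time, h_depth, _ = grid_step(
            h_time, m_time, x, visual_memory, params
        )
        outputs.append(h_depth)        # would feed a word classifier
    return np.stack(outputs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, e, T = 8, 6, 4
    params = init_params(d, e, rng)
    words = rng.normal(size=(T, e))    # stand-in word embeddings
    image = rng.normal(size=d)         # stand-in projected image feature
    print(decode(words, image, params).shape)
```

In this sketch the only route by which the image reaches the output is the depth-dimension forget gate acting on `visual_memory`, which is one way to read the abstract's claim that no extra networks or parameters are needed; the paper itself should be consulted for the actual formulation.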
Keywords: Visualization; Decoding; Task analysis; Neural networks; Training; Computational modeling; Logic gates; Image captioning; GridLSTM; recurrent neural network
DOI: 10.1109/TMM.2019.2931815
Keywords [WOS]: CLASSIFICATION ; ATTENTION
Indexed by: SCI
Language: English
WOS research areas: Computer Science ; Telecommunications
WOS categories: Computer Science, Information Systems ; Computer Science, Software Engineering ; Telecommunications
WOS accession number: WOS:000519576700019
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Citation statistics
Times cited: 28 [WOS]
Document type: Journal article
Item identifier: http://ir.ia.ac.cn/handle/173211/38640
Collection: Zidong Taichu Large Model Research Center / Image and Video Analysis
Corresponding author: Xu, Min
Affiliations:
1. Univ Technol Sydney, Global Big Data Technol Ctr, Sch Elect & Data Engn, Ultimo, NSW 2007, Australia
2. Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
3. Univ Technol Sydney, Sch Elect & Data Engn, Perceptual Imaging Lab, Ultimo, NSW 2007, Australia
Recommended citation
GB/T 7714: Wu, Lingxiang, Xu, Min, Wang, Jinqiao, et al. Recall What You See Continually Using GridLSTM in Image Captioning[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22(3): 808-818.
APA: Wu, Lingxiang, Xu, Min, Wang, Jinqiao, & Perry, Stuart. (2020). Recall What You See Continually Using GridLSTM in Image Captioning. IEEE TRANSACTIONS ON MULTIMEDIA, 22(3), 808-818.
MLA: Wu, Lingxiang, et al. "Recall What You See Continually Using GridLSTM in Image Captioning." IEEE TRANSACTIONS ON MULTIMEDIA 22.3 (2020): 808-818.