Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training
Liu, Chong1; Zhang, Yuqi2; Wang, Hongsong3; Chen, Weihua2; Wang, Fan2; Huang, Yan4; Shen, Yi-Dong1; Wang, Liang4
Journal: IEEE TRANSACTIONS ON IMAGE PROCESSING
ISSN: 1057-7149
Publication year: 2023
Volume: 32; Pages: 3622-3633
Corresponding authors: Wang, Hongsong (hongsongwang@seu.edu.cn); Chen, Weihua (kugang.cwh@alibaba-inc.com)
Abstract: Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and serves as the basis for various visual and language tasks. Most previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words. However, the close relations between coarse- and fine-grained representations for each modality are important for image-text retrieval but almost neglected. As a result, such previous works inevitably suffer from low retrieval accuracy or heavy computational cost. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework. This framework is consistent with human cognition, as humans simultaneously pay attention to the entire sample and regional elements to understand the semantic content. To this end, a Token-Guided Dual Transformer (TGDT) architecture which consists of two homogeneous branches for image and text modalities, respectively, is proposed for image-text retrieval. The TGDT incorporates both coarse- and fine-grained retrievals into a unified framework and beneficially leverages the advantages of both retrieval approaches. A novel training objective called Consistent Multimodal Contrastive (CMC) loss is proposed accordingly to ensure the intra- and inter-modal semantic consistencies between images and texts in the common embedding space. Equipped with a two-stage inference method based on the mixed global and local cross-modal similarity, the proposed method achieves state-of-the-art retrieval performances with extremely low inference time when compared with representative recent approaches. Code is publicly available: github.com/LCFractal/TGDT.
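The two-stage inference mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It only assumes that a cheap global similarity shortlists candidates and that a mixed global-plus-local score reranks the shortlist. The function names, the max-then-mean token matching in `local_similarity`, the additive mixing rule, and the parameters `k` and `weight` are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def local_similarity(query_tokens, cand_tokens):
    """Assumed fine-grained score: match each query token to its most
    similar candidate token, then average over the query tokens."""
    best = [max(cosine(q, c) for c in cand_tokens) for q in query_tokens]
    return sum(best) / len(best)


def two_stage_retrieve(query, gallery, k=2, weight=0.5):
    """Return candidate indices, best first, after coarse-then-fine ranking."""
    # Stage 1: rank every candidate by the cheap global similarity.
    coarse = sorted(
        ((cosine(query["global"], item["global"]), idx)
         for idx, item in enumerate(gallery)),
        reverse=True,
    )
    shortlist = coarse[:k]
    # Stage 2: rerank only the shortlist with the mixed score
    # (global + weight * local; the mixing rule is an assumption).
    mixed = sorted(
        ((g + weight * local_similarity(query["tokens"], gallery[idx]["tokens"]), idx)
         for g, idx in shortlist),
        reverse=True,
    )
    return [idx for _, idx in mixed]


# Toy data: a text query and three image candidates with 2-D embeddings.
query = {"global": [1.0, 0.0], "tokens": [[1.0, 0.0], [0.0, 1.0]]}
gallery = [
    {"global": [1.0, 0.1], "tokens": [[1.0, 0.0]]},
    {"global": [1.0, 0.2], "tokens": [[1.0, 0.0], [0.0, 1.0]]},
    {"global": [0.0, 1.0], "tokens": [[1.0, 0.0], [0.0, 1.0]]},
]
print(two_stage_retrieve(query, gallery, k=2, weight=0.5))  # prints [1, 0]
```

In this toy example candidate 0 wins the coarse stage, but candidate 1 covers both query tokens and moves to the top after reranking. The expensive token-level comparison runs on only `k` candidates instead of the whole gallery, which is the general reason a coarse-then-fine scheme can keep inference time low.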
Keywords: Image-text retrieval; multimodal transformer; multimodal contrastive training
DOI: 10.1109/TIP.2023.3286710
Indexed in: SCI
Language: English
Funding projects: Southeast University Start-Up Grant for New Faculty [RF1028623063]; National Key Research and Development Program of China [2022ZD0117900]; National Natural Science Foundation of China [62236010]; National Natural Science Foundation of China [62276261]
Funders: Southeast University Start-Up Grant for New Faculty; National Key Research and Development Program of China; National Natural Science Foundation of China
WOS research areas: Computer Science; Engineering
WOS categories: Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic
WOS accession number: WOS:001024111100002
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Citation statistics
Times cited: 3 [WOS]
Document type: Journal article
Item identifier: http://ir.ia.ac.cn/handle/173211/53754
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems
Corresponding authors: Wang, Hongsong; Chen, Weihua
Author affiliations:
1. Chinese Acad Sci, Inst Software, State Key Lab Comp Sci, Beijing 100190, Peoples R China
2. Alibaba Grp, Beijing 100102, Peoples R China
3. Southeast Univ, Dept Comp Sci & Engn, Nanjing 210096, Peoples R China
4. Chinese Acad Sci CASIA, Inst Automat, Ctr Res Intelligent Percept & Comp CRIPAC, Natl Lab Pattern Recognit NLPR, Beijing 100190, Peoples R China
Recommended citation
GB/T 7714: Liu, Chong, Zhang, Yuqi, Wang, Hongsong, et al. Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training[J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32: 3622-3633.
APA: Liu, Chong, Zhang, Yuqi, Wang, Hongsong, Chen, Weihua, Wang, Fan, ... & Wang, Liang. (2023). Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training. IEEE TRANSACTIONS ON IMAGE PROCESSING, 32, 3622-3633.
MLA: Liu, Chong, et al. "Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training." IEEE TRANSACTIONS ON IMAGE PROCESSING 32 (2023): 3622-3633.