CLIP-VG: Self-Paced Curriculum Adapting of CLIP for Visual Grounding
Xiao, Linhui [1,2,3]; Yang, Xiaoshan [1,2,3]; Peng, Fang [1,2,3]; Yan, Ming [4]; Wang, Yaowei [5]; Xu, Changsheng [1,2,3]
Journal: IEEE TRANSACTIONS ON MULTIMEDIA
ISSN: 1520-9210
Year: 2024
Volume: 26, Pages: 4334-4347
Corresponding Author: Xu, Changsheng (csxu@nlpr.ia.ac.cn)
Abstract: Visual Grounding (VG) is a crucial topic in the field of vision and language; it involves locating a specific region described by an expression within an image. To reduce the reliance on manually labeled data, unsupervised methods have been developed that locate regions using pseudo-labels. However, the performance of existing unsupervised methods depends heavily on the quality of the pseudo-labels, and these methods also suffer from limited diversity. To leverage vision-and-language pre-trained models for the grounding problem while making sensible use of pseudo-labels, we propose CLIP-VG, a novel method that performs self-paced curriculum adapting of CLIP with pseudo-language labels. We propose a simple yet efficient end-to-end network architecture that transfers CLIP to the visual grounding task. Building on this CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which progressively select more reliable pseudo-labels to learn an optimal model, thereby balancing reliability and diversity in the pseudo-language labels. Our method outperforms the current state-of-the-art unsupervised method by a significant margin on the RefCOCO/+/g datasets in both single-source and multi-source scenarios, with improvements ranging from 6.78% to 10.67% and from 11.39% to 14.87%, respectively. Furthermore, our approach even outperforms existing weakly supervised methods.
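The self-paced loop the abstract describes, re-scoring pseudo-labels with the current model and progressively admitting less reliable ones, can be sketched roughly as follows. This is a minimal illustration of the idea only, not the paper's released implementation; the score_fn and train_fn callables and the linear widening schedule are hypothetical placeholders.

# Minimal sketch of self-paced curriculum selection over pseudo-labels,
# based only on the abstract's description. score_fn and train_fn are
# hypothetical stand-ins, not CLIP-VG's actual scoring or training code.
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (image_path, pseudo_expression) -- illustrative


def self_paced_adapt(
    samples: List[Sample],
    score_fn: Callable[[Sample], float],       # reliability of a pseudo-label in [0, 1]
    train_fn: Callable[[List[Sample]], None],  # one adaptation round on a subset
    rounds: int = 3,
    start_keep: float = 0.3,  # fraction of most-reliable samples admitted in round 1
) -> None:
    """Progressively admit less-reliable pseudo-labels across rounds,
    trading reliability against diversity as the abstract describes."""
    for r in range(rounds):
        # Re-rank with the current model so the curriculum adapts each round.
        ranked = sorted(samples, key=score_fn, reverse=True)
        # Widen the admitted fraction linearly from start_keep to 1.0.
        keep = start_keep + (1.0 - start_keep) * r / max(rounds - 1, 1)
        subset = ranked[: max(1, int(keep * len(ranked)))]
        train_fn(subset)


if __name__ == "__main__":
    # Toy usage with stub callables.
    data = [("img%d.jpg" % i, "expr%d" % i) for i in range(10)]
    self_paced_adapt(
        data,
        score_fn=lambda s: hash(s[1]) % 100 / 100.0,  # stub reliability score
        train_fn=lambda subset: print("training on %d samples" % len(subset)),
    )

Re-scoring every round is what makes the curriculum self-paced: the model itself decides which pseudo-labels currently look reliable, while the widening schedule gradually trades reliability for diversity.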
Keywords: Grounding; Reliability; Adaptation models; Task analysis; Visualization; Data models; Annotations; Visual grounding; curriculum learning; pseudo-language label; vision-language models
DOI: 10.1109/TMM.2023.3321501
Indexed by: SCI
Language: English
Funding: National Natural Science Foundation of China
WOS Research Areas: Computer Science; Telecommunications
WOS Categories: Computer Science, Information Systems; Computer Science, Software Engineering; Telecommunications
WOS Record ID: WOS:001181498100046
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Document Type: Journal Article
Identifier: http://ir.ia.ac.cn/handle/173211/56991
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems, Multimedia Computing
Affiliations:
1. Institute of Automation, Chinese Academy of Sciences (CASIA), State Key Laboratory of Multimodal Artificial Intelligence Systems, Beijing 100190, China
2. Peng Cheng Laboratory (PCL), Shenzhen 518066, China
3. School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China
4. DAMO Academy, Alibaba Group, Hangzhou 311121, China
5. Peng Cheng Laboratory, Shenzhen 518066, China
First Author's Institution: Institute of Automation, Chinese Academy of Sciences
Corresponding Author's Institution: Institute of Automation, Chinese Academy of Sciences
Recommended Citation:
GB/T 7714: Xiao, Linhui, Yang, Xiaoshan, Peng, Fang, et al. CLIP-VG: Self-Paced Curriculum Adapting of CLIP for Visual Grounding[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 4334-4347.
APA: Xiao, Linhui, Yang, Xiaoshan, Peng, Fang, Yan, Ming, Wang, Yaowei, & Xu, Changsheng. (2024). CLIP-VG: Self-Paced Curriculum Adapting of CLIP for Visual Grounding. IEEE TRANSACTIONS ON MULTIMEDIA, 26, 4334-4347.
MLA: Xiao, Linhui, et al. "CLIP-VG: Self-Paced Curriculum Adapting of CLIP for Visual Grounding". IEEE TRANSACTIONS ON MULTIMEDIA 26 (2024): 4334-4347.
Files in This Item: No files associated with this item.