So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering

doi:10.1109/TSMC.2023.3319964

	Zheng, Wenbo1,2; Yan, Lan3,4; Wang, Fei-Yue5
Journal	IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS
ISSN	2168-2216
	2023-10-17
Pages	12
Corresponding author	Zheng, Wenbo(zwb2022@whut.edu.cn)
Abstract	While texts related to images convey fundamental messages for scene understanding and reasoning, text-based visual question answering tasks concentrate on visual questions that require reading texts from images. However, most current methods add multimodal features that are independently extracted from a given image into a reasoning model without considering their inter- and intra-relationships according to three modalities (i.e., scene texts, questions, and images). To this end, we propose a novel text-based visual question answering model, multimodal graph reasoning. Our model first extracts intramodality relationships by taking the representations from identical modalities as semantic graphs. Then, we present graph multihead self-attention, which boosts each graph representation through graph-by-graph aggregation to capture the intermodality relationship. It is a case of "so many heads, so many wits" in the sense that as more semantic graphs are involved in this process, each graph representation becomes more effective. Finally, these representations are reprojected, and we perform answer prediction with their outputs. The experimental results demonstrate that our approach realizes substantially better performance compared with other state-of-the-art models.
Keywords	Graph attention graph reasoning multimodal graph self-attention text-based visual question answering
DOI	10.1109/TSMC.2023.3319964
Keywords [WOS]	ATTENTIONS ; LANGUAGE ; VISION
Indexed in	SCI
Language	English
Funding projects	Natural Science Foundation of China[62303361] ; Natural Science Foundation of China[62302161] ; Natural Science Foundation of China[U1811463] ; Hainan Provincial Natural Science Foundation of China[623QN266] ; Fundamental Research Funds for the Central Universities[233110002] ; China National Postdoctoral Program for Innovative Talents[BX20230114] ; National Key Research and Development Program of China[2018AAA0101502]
Funders	Natural Science Foundation of China ; Hainan Provincial Natural Science Foundation of China ; Fundamental Research Funds for the Central Universities ; China National Postdoctoral Program for Innovative Talents ; National Key Research and Development Program of China
WOS research areas	Automation & Control Systems ; Computer Science
WOS categories	Automation & Control Systems ; Computer Science, Cybernetics
WOS accession number	WOS:001090709300001
Publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Document type	Journal article
Item identifier	http://ir.ia.ac.cn/handle/173211/54317
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems / Parallel Intelligence Technology and Systems Team
Author affiliations	1. Wuhan Univ Technol, Sch Comp Sci & Artificial Intelligence, Wuhan 430070, Peoples R China ; 2. Wuhan Univ Technol, Sanya Sci & Educ Innovat Pk, Sanya 572000, Peoples R China ; 3. Hunan Univ, Coll Comp Sci & Engn, Changsha 410082, Hunan, Peoples R China ; 4. Natl Supercomp Ctr, Changsha 410082, Hunan, Peoples R China ; 5. Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China
Recommended citation formats
GB/T 7714	Zheng, Wenbo, Yan, Lan, Wang, Fei-Yue. So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering[J]. IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2023: 12.
APA	Zheng, Wenbo, Yan, Lan, & Wang, Fei-Yue. (2023). So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering. IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 12.
MLA	Zheng, Wenbo, et al. "So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering". IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS (2023): 12.