Modal Contrastive Learning Based End-to-End Text Image Machine Translation

doi:10.1109/TASLP.2023.3324540


Title	Modal Contrastive Learning Based End-to-End Text Image Machine Translation
Authors	Ma, Cong1,2; Han, Xu1,2; Wu, Linghui1,2; Zhang, Yaping1,2; Zhao, Yang1,2; Zhou, Yu2,3; Zong, Chengqing1,2
Journal	IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM TASLP)
ISSN	2329-9290
Publication date	2023-10
Volume	32  Issue: 32  Pages: 2153-2165
Corresponding author	Zong, Chengqing (cqzong@nlpr.ia.ac.cn)
Abstract	Text image machine translation (TIMT) aims at directly translating text in the source language embedded in images into the target language. Most existing systems follow the cascaded pipeline diagram from recognition to translation, which suffers from the problem of error propagation, parameter redundancy, and information reduction. The end-to-end model has the potential to alleviate these issues via bridging the recognition and translation models. However, the challenge is the data limitation and modality gap between text and image. In this paper, we propose a novel end-to-end model, namely Modal contrastive learning based End-to-end Text Image Machine Translation (METIMT), which alleviates these issues through end-to-end text image machine translation architecture and modal contrastive learning. Specifically, an image encoder is designed to encode images into the same feature space of corresponding text sentences, with the guidance of an intra-modal and inter-modal contrastive learning module. To further promote the research of text image machine translation, we have constructed one synthetic and two real-world datasets. Extensive experiments show that our lighter, faster model outperforms not only existing pipeline methods but also state-of-the-art end-to-end models on both synthetic and real-world evaluation sets. Our code and dataset will be released to the public.
Keywords	Transformers ; Machine translation ; Decoding ; Semantics ; Pipelines ; Text recognition ; Task analysis ; Text image machine translation ; contrastive learning ; text image recognition ; machine translation
DOI	10.1109/TASLP.2023.3324540
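Keywords [WOS]	RECOGNITION
Indexed by	SCI
Language	English
Funding project	National Natural Science Foundation of China
Funding organization	National Natural Science Foundation of China
WOS research areas	Acoustics ; Engineering
WOS categories	Acoustics ; Engineering, Electrical & Electronic
WOS accession number	WOS:001197778500003
Publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Seven major directions: sub-direction classification	Natural Language Processing
State Key Laboratory planned research direction	Speech and Language Processing
Paper-associated dataset requiring deposit	No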
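Document type	Journal article
Item identifier	http://ir.ia.ac.cn/handle/173211/57613
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems_Natural Language Processing
Corresponding author	Zhang, Yaping
Author affiliations	1. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, P.R. China; 2. State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing, China; 3. Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing 100190, P.R. China
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Corresponding author affiliation	Institute of Automation, Chinese Academy of Sciences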
Recommended citation
GB/T 7714	Ma, Cong, Han, Xu, Wu, Linghui, et al. Modal Contrastive Learning Based End-to-End Text Image Machine Translation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM TASLP), 2023, 32(32): 2153-2165.
APA	Ma, Cong, Han, Xu, Wu, Linghui, Zhang, Yaping, Zhao, Yang, ... & Zong, Chengqing. (2023). Modal Contrastive Learning Based End-to-End Text Image Machine Translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM TASLP), 32(32), 2153-2165.
MLA	Ma, Cong, et al. "Modal Contrastive Learning Based End-to-End Text Image Machine Translation." IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM TASLP) 32.32 (2023): 2153-2165.

Files in this item
File name / size	Document type	Version	Access type	License
3-TASLP-Modal_Contra (6551KB)	Journal article	Author accepted manuscript	Open access	CC BY-NC-SA