Masked Vision-language Transformer in Fashion

doi:10.1007/s11633-022-1394-4


Title	Masked Vision-language Transformer in Fashion
Authors	Ge-Peng Ji 1; Mingchen Zhuge 1; Dehong Gao 1; Deng-Ping Fan 2; Christos Sakaridis 2; Luc Van Gool 2
Journal	Machine Intelligence Research
ISSN	2731-538X
Year	2023
Volume	20 Issue: 3 Pages: 421-434
Abstract	We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture for replacing the bidirectional encoder representations from Transformers (BERT) in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. Besides, we designed masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
Keywords	Vision-language, masked image reconstruction, transformer, fashion, e-commercial
DOI	10.1007/s11633-022-1394-4
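Seven major directions: sub-direction classification	Other
State Key Laboratory planned direction classification	Other
Paper-associated dataset requiring deposit	No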
Chinese-language overview	https://mp.weixin.qq.com/s/t0kWjLyCgz0dagVC4cAeuA
Citation statistics	Times cited: 5 [WOS]
Document type	Journal article
Item identifier	http://ir.ia.ac.cn/handle/173211/55988
Collection	Academic Journals_Machine Intelligence Research
Author affiliations	1.International Core Business Unit, Alibaba Group, Hangzhou 310051, China 2.Computer Vision Lab, ETH Zürich, Zürich 8092, Switzerland
Recommended citation GB/T 7714	Ge-Peng Ji,Mingchen Zhuge,Dehong Gao,et al. Masked Vision-language Transformer in Fashion[J]. Machine Intelligence Research,2023,20(3):421-434.
APA	Ge-Peng Ji,Mingchen Zhuge,Dehong Gao,Deng-Ping Fan,Christos Sakaridis,&Luc Van Gool.(2023).Masked Vision-language Transformer in Fashion.Machine Intelligence Research,20(3),421-434.
MLA	Ge-Peng Ji,et al."Masked Vision-language Transformer in Fashion".Machine Intelligence Research 20.3(2023):421-434.

Files in this item
File name/size	Document type	Version type	Access type	License
MIR-2022-05-168Sprin (2779KB)	Journal article	Published version	Open access	CC BY-NC-SA