Knowledge Commons of Institute of Automation, CAS
Masked Vision-language Transformer in Fashion | |
Ge-Peng Ji1; Mingchen Zhuge1; Dehong Gao1; Deng-Ping Fan2; Christos Sakaridis2; Luc Van Gool2 | |
Journal | Machine Intelligence Research
ISSN | 2731-538X |
Publication year | 2023 |
Volume | 20 | Issue | 3 | Pages | 421-434 |
Abstract | We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture for replacing the bidirectional encoder representations from Transformers (BERT) in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. Besides, we designed masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT. |
Keywords | Vision-language, masked image reconstruction, transformer, fashion, e-commercial |
DOI | 10.1007/s11633-022-1394-4 |
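Seven major directions: sub-direction classification | Other |
State Key Laboratory planned direction classification | Other |
Paper-associated dataset requiring deposit | No |
Chinese-language introduction | https://mp.weixin.qq.com/s/t0kWjLyCgz0dagVC4cAeuA |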
Document type | Journal article |
Item identifier | http://ir.ia.ac.cn/handle/173211/55988 |
Collection | Academic Journals_Machine Intelligence Research |
Author affiliations | 1. International Core Business Unit, Alibaba Group, Hangzhou 310051, China; 2. Computer Vision Lab, ETH Zürich, Zürich 8092, Switzerland |
Recommended citation (GB/T 7714) | Ge-Peng Ji, Mingchen Zhuge, Dehong Gao, et al. Masked Vision-language Transformer in Fashion[J]. Machine Intelligence Research, 2023, 20(3): 421-434. |
APA | Ge-Peng Ji, Mingchen Zhuge, Dehong Gao, Deng-Ping Fan, Christos Sakaridis, & Luc Van Gool. (2023). Masked Vision-language Transformer in Fashion. Machine Intelligence Research, 20(3), 421-434. |
MLA | Ge-Peng Ji, et al. "Masked Vision-language Transformer in Fashion". Machine Intelligence Research 20.3 (2023): 421-434. |
Files in this item |
File name/size | Document type | Version type | Access type | License |
MIR-2022-05-168Sprin(2779KB) | Journal article | Published version | Open access | CC BY-NC-SA |