Masked Vision-language Transformer in Fashion
Ge-Peng Ji (1); Mingchen Zhuge (1); Dehong Gao (1); Deng-Ping Fan (2); Christos Sakaridis (2); Luc Van Gool (2)
Source Publication: Machine Intelligence Research
ISSN: 2731-538X
Year: 2023
Volume: 20, Issue: 3, Pages: 421-434
Abstract

We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model with the vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
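
The retrieval result in the abstract is reported as rank@5. As a minimal sketch, assuming the common definition of rank@K (the fraction of queries whose ground-truth candidate appears among the K highest-scored candidates), the metric could be computed as follows. The function name `rank_at_k` and the toy scores are illustrative assumptions, not taken from the MVLT code base, and the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def rank_at_k(
    scores: Sequence[Sequence[float]],
    targets: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose ground-truth candidate is in the top k.

    scores[q][c] is the (hypothetical) matching score between query q and
    candidate c; targets[q] is the index of the correct candidate for q.
    """
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    if not scores:
        raise ValueError("at least one query is required")

    hits = 0
    for row, target in zip(scores, targets):
        # Candidate indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy example: three queries, three candidates each (made-up numbers).
toy_scores = [
    [0.9, 0.1, 0.3],  # correct candidate 0 is ranked 1st
    [0.2, 0.8, 0.5],  # correct candidate 2 is ranked 2nd
    [0.4, 0.6, 0.7],  # correct candidate 0 is ranked 3rd
]
toy_targets = [0, 2, 0]

for k in (1, 2, 3):
    print(f"rank@{k} = {rank_at_k(toy_scores, toy_targets, k):.3f}")
```

In this toy setup the metric rises from one third at K=1 to 1.0 at K=3, because each larger K admits one more query whose correct candidate falls inside the cut-off.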

Keywords: Vision-language, masked image reconstruction, transformer, fashion, e-commercial
DOI: 10.1007/s11633-022-1394-4
Sub direction classification: Other
Planning direction of the State Key Laboratory: Other
Chinese guide: https://mp.weixin.qq.com/s/t0kWjLyCgz0dagVC4cAeuA
Citation statistics
Cited Times: 5 (WOS)
Document Type: Journal article
Identifier: http://ir.ia.ac.cn/handle/173211/55988
Collection: Academic Journals_Machine Intelligence Research
Affiliation: 1. International Core Business Unit, Alibaba Group, Hangzhou 310051, China
2. Computer Vision Lab, ETH Zürich, Zürich 8092, Switzerland
Recommended Citation
GB/T 7714: Ge-Peng Ji, Mingchen Zhuge, Dehong Gao, et al. Masked Vision-language Transformer in Fashion[J]. Machine Intelligence Research, 2023, 20(3): 421-434.
APA: Ge-Peng Ji, Mingchen Zhuge, Dehong Gao, Deng-Ping Fan, Christos Sakaridis, & Luc Van Gool. (2023). Masked Vision-language Transformer in Fashion. Machine Intelligence Research, 20(3), 421-434.
MLA: Ge-Peng Ji, et al. "Masked Vision-language Transformer in Fashion". Machine Intelligence Research 20.3 (2023): 421-434.
Files in This Item:
File Name/Size | DocType | Version | Access | License
MIR-2022-05-168Sprin (2779KB) | Journal article | Published version | Open access | CC BY-NC-SA
File name: MIR-2022-05-168Springer.pdf
Format: Adobe PDF
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.