Vision Transformers with Hierarchical Attention
Yun Liu ¹; Yu-Huan Wu ²; Guolei Sun ³; Le Zhang ⁴; Ajad Chhatkuli ³; Luc Van Gool ³
Journal: Machine Intelligence Research
ISSN: 2731-538X
Year: 2024
Volume: 21, Issue: 4, Pages: 670-683
Abstract: This paper tackles the high computational/space complexity associated with multi-head self-attention (MHSA) in vanilla vision transformers. To this end, we propose hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches as commonly done, and each patch is viewed as a token. The proposed H-MHSA then learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies among the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of hierarchical-attention-based transformer networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.
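The following is a minimal PyTorch sketch of the hierarchical attention idea summarized in the abstract: local attention inside small windows, attention over a reduced set of merged tokens, and aggregation of the two. It is not the authors' implementation (see https://github.com/yun-liu/HAT-Net for the official code); the window size, merge factor, shared attention module, and sum-based aggregation are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalSelfAttention(nn.Module):
    """Toy H-MHSA-style block: local window attention plus attention over merged tokens."""

    def __init__(self, dim, num_heads=4, window=4, merge=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window  # side length of a local window (assumed value)
        self.merge = merge    # pooling factor used to merge tokens (assumed value)

    def forward(self, x, H, W):
        # x: (B, N, C) tokens arranged on an H x W grid, with N = H * W
        B, N, C = x.shape
        w, m = self.window, self.merge

        # Step 1: local relationship modeling -- attention inside each w x w window.
        loc = x.reshape(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        loc = loc.reshape(B * (H // w) * (W // w), w * w, C)
        loc, _ = self.attn(loc, loc, loc)
        loc = loc.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        loc = loc.reshape(B, N, C)

        # Step 2: global relationship modeling -- merge tokens by average pooling,
        # attend among the much smaller set of merged tokens, then upsample back.
        g = x.transpose(1, 2).reshape(B, C, H, W)
        g = F.avg_pool2d(g, m)                        # (B, C, H/m, W/m)
        Hm, Wm = g.shape[2], g.shape[3]
        g = g.flatten(2).transpose(1, 2)              # (B, Hm*Wm, C)
        g, _ = self.attn(g, g, g)
        g = g.transpose(1, 2).reshape(B, C, Hm, Wm)
        g = F.interpolate(g, size=(H, W), mode="nearest")
        g = g.flatten(2).transpose(1, 2)              # (B, N, C)

        # Step 3: aggregate local and global attentive features (a simple sum here;
        # the paper's aggregation step may differ).
        return loc + g

# Quick shape check: a 16 x 16 token grid with 64-dimensional tokens.
x = torch.randn(2, 16 * 16, 64)
y = HierarchicalSelfAttention(dim=64)(x, H=16, W=16)
print(y.shape)  # torch.Size([2, 256, 64])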
Keywords: Vision transformer; hierarchical attention; global attention; local attention; scene understanding
DOI: 10.1007/s11633-024-1393-8
Document type: Journal article
Item identifier: http://ir.ia.ac.cn/handle/173211/58566
Collection: Academic Journals_Machine Intelligence Research
Author affiliations:
1. Institute for Infocomm Research (I2R), A*STAR, Singapore 138632, Singapore
2. Institute of High Performance Computing (IHPC), A*STAR, Singapore 138632, Singapore
3. Computer Vision Lab, ETH Zürich, Zürich 8092, Switzerland
4. School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu 611731, China
Recommended citation:
GB/T 7714: Yun Liu, Yu-Huan Wu, Guolei Sun, et al. Vision Transformers with Hierarchical Attention[J]. Machine Intelligence Research, 2024, 21(4): 670-683.
APA: Yun Liu, Yu-Huan Wu, Guolei Sun, Le Zhang, Ajad Chhatkuli, & Luc Van Gool. (2024). Vision Transformers with Hierarchical Attention. Machine Intelligence Research, 21(4), 670-683.
MLA: Yun Liu, et al. "Vision Transformers with Hierarchical Attention." Machine Intelligence Research 21.4 (2024): 670-683.
Files in this item:
File name/size | Document type | Version | Access | License
MIR-2023-09-178.pdf (1358 KB) | Journal article | Published version | Open access | CC BY-NC-SA