Poster
Efficient Vision Transformers with Partial Attention
Xuan-Thuy Vo · Duy-Linh Nguyen · Adri Priadana · Kang-Hyun Jo
# 174
Strong Double Blind
As the core of the Vision Transformer (ViT), self-attention is highly flexible in modeling long-range dependencies because every query attends to all spatial locations. Although ViT achieves promising performance on visual tasks, the complexity of self-attention is quadratic in the number of tokens. This makes it difficult to transfer ViT models to dense prediction tasks that require high input resolutions. Prior work has tried to address this problem by introducing sparse attention, such as spatial reduction attention and window attention. What these methods have in common is that all image or window tokens take part in computing the attention weights. In this paper, we find that there are high similarities among attention weights, which incur redundant computation. To address this issue, we propose a novel attention mechanism, called partial attention, that learns spatial interactions more efficiently by reducing redundant information in attention maps. Each query in our attention interacts only with a small set of relevant tokens. Based on partial attention, we design an efficient and general vision Transformer, named PartialFormer, that attains good trade-offs between accuracy and computational cost across vision tasks. For example, on ImageNet-1K, PartialFormer-B3 outperforms Swin-T by 1.7% Top-1 accuracy while saving 25% GFLOPs, and Focal-T by 0.8% while saving 30% GFLOPs.
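The quadratic cost mentioned in the abstract can be made concrete with a small counting sketch. This is an illustration only, not the paper's partial attention: the function name `attention_pairs`, the patch size of 16, and the per-query budget of 49 tokens are all assumptions chosen for the example, and the abstract does not say how relevant tokens are selected.

```python
def attention_pairs(height, width, patch=16, keys_per_query=None):
    """Count query-key score entries for a single attention head.

    keys_per_query=None models full self-attention, where every query
    attends to all tokens. An integer models a query restricted to that
    many tokens (a hypothetical budget, not a value from the paper).
    """
    # Number of tokens after splitting the image into non-overlapping patches.
    n_tokens = (height // patch) * (width // patch)
    keys = n_tokens if keys_per_query is None else min(keys_per_query, n_tokens)
    return n_tokens * keys


# Full attention: the score matrix has n_tokens * n_tokens entries.
full_224 = attention_pairs(224, 224)   # 196 tokens
full_896 = attention_pairs(896, 896)   # 3136 tokens

# Each query limited to an assumed budget of 49 tokens.
sparse_896 = attention_pairs(896, 896, keys_per_query=49)

print(full_224, full_896, sparse_896)
```

Going from 224×224 to 896×896 multiplies the token count by 16 and the number of full-attention score entries by 256, whereas with a fixed per-query budget the count grows only linearly with the number of tokens.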