In this paper, we present AttnZero, the first framework for automatically discovering efficient attention modules tailored for Vision Transformers (ViTs). While traditional self-attention in ViTs suffers from quadratic computation complexity, linear attention offers a more efficient alternative with linear complexity approximation. However, existing hand-crafted linear attention suffers from performance degradation. To address these issues, our AttnZero constructs search spaces and employs evolutionary algorithms to discover potential linear attention formulations. Specifically, our search space consists of six kinds of computation graphs and advanced activation, normalize, and binary operators. To enhance generality, we derive results of candidate attention applied to multiple advanced ViTs as the multi-objective for the evolutionary search. To expedite the search process, we utilize program checking and rejection protocols to filter out unpromising candidates swiftly. Additionally, we develop Attn-Bench-101, which provides precomputed performance of 2,000 attentions in the search spaces, enabling us to summarize attention design insights. Experimental results demonstrate that the discovered AttnZero module generalizes well to different tasks and consistently achieves improved performance across various ViTs. For instance, the tiny model of DeiT|PVT|Swin|CSwin trained with AttnZero on ImageNet reaches 74.9\%|78.1\%|82.1\%|82.9\% top-1 accuracy. Codes are available in the Appendix.
Live content is unavailable. Log in and register to view live content