Poster
SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
Ankit Vani · Bac Nguyen · Samuel Lavoie · Ranjay Krishna · Aaron Courville
# 44
Selective attention is an intrinsic property of human cognition, allowing us to focus on relevant stimuli while filtering out distractions. The information bottleneck principle suggests a similar adaptation for generalizing in machine learning, but representation learning frameworks typically do not have prior knowledge of downstream tasks. In this work, we present SPARO, a read-out mechanism for transformers that provides an inductive bias for learning representations composed of different ways of performing selective attention. Concretely, it structures representations as a concatenation of outputs from separate single-head attention operations with embedded queries. SPARO improves generalization of CLIP on zero-shot recognition, robustness, retrieval, and compositionality benchmarks, and improves linear probe accuracy on ImageNet. It also improves the ImageNet linear probe and k-nearest neighbor accuracies of DINO. We showcase the ability to intervene post hoc and select concepts for downstream tasks from the SPARO representation, which can offer further improvements. We provide insights behind the design of SPARO, including ablation experiments, analysis of its representational robustness, and visualization of the attended concepts.
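The read-out described above (a concatenation of outputs from separate single-head attention operations, each with its own embedded query) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sparo_readout`, the per-slot key and value projections, the scaled dot-product scoring, and all dimensions are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparo_readout(tokens, queries, w_key, w_value):
    """Hypothetical read-out: one single-head attention per slot, then concatenate.

    tokens:  (n_tokens, d_model) encoder outputs
    queries: (n_slots, d_key)    one learned (embedded) query per slot
    w_key:   (n_slots, d_model, d_key)   assumed per-slot key projection
    w_value: (n_slots, d_model, d_slot)  assumed per-slot value projection
    returns: (n_slots * d_slot,) concatenated slot outputs
    """
    slots = []
    for q, wk, wv in zip(queries, w_key, w_value):
        keys = tokens @ wk                        # (n_tokens, d_key)
        values = tokens @ wv                      # (n_tokens, d_slot)
        scores = keys @ q / np.sqrt(q.shape[0])   # (n_tokens,)
        attn = softmax(scores)                    # attention weights over tokens
        slots.append(attn @ values)               # (d_slot,)
    return np.concatenate(slots)

# Toy example with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
n_tokens, d_model, n_slots, d_key, d_slot = 5, 8, 4, 6, 3
tokens = rng.normal(size=(n_tokens, d_model))
queries = rng.normal(size=(n_slots, d_key))
w_key = rng.normal(size=(n_slots, d_model, d_key))
w_value = rng.normal(size=(n_slots, d_model, d_slot))

rep = sparo_readout(tokens, queries, w_key, w_value)

# Post-hoc selection: keep only a chosen subset of slots for a downstream task.
keep = [0, 2]  # hypothetical choice of slots
selected = rep.reshape(n_slots, d_slot)[keep].reshape(-1)
```

Because each slot occupies a fixed block of the concatenated vector, selecting concepts after training reduces, in this sketch, to indexing the corresponding blocks. How slots are actually chosen for a downstream task is not specified in the abstract.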