

Oral

Denoising Vision Transformers

Jiawei Yang · Katie Luo · Jiefeng Li · Congyue Deng · Leonidas Guibas · Dilip Krishnan · Kilian Weinberger · Yonglong Tian · Yue Wang

Oral 5C: Representation Learning
Thu 3 Oct 1 a.m. — 1:10 a.m. PDT

Abstract:

We delve into a crucial yet often overlooked challenge inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream dense prediction tasks such as segmentation, depth prediction, and object discovery. We trace this fundamental issue down to the positional embeddings at the input stage. To address it, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight Transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision. Our DVT does not require re-training the existing pre-trained ViTs, and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that our DVT consistently and significantly improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and models will be released.
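
The cross-view consistency idea behind the first stage can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a 1D "image", scalar features, and a purely additive, zero-mean artifact that depends only on grid position, and it replaces the neural-field optimization with a closed-form differencing trick. All names (`make_view`, `estimate_artifact`, `scene`, `artifact`) are hypothetical.

```python
def make_view(scene, artifact, offset):
    """Raw features of one crop: scene content starting at `offset`,
    plus an artifact that is fixed to the grid position (toy assumption)."""
    return [scene[offset + p] + artifact[p] for p in range(len(artifact))]


def estimate_artifact(view_a, view_b):
    """Estimate the artifact from two crops whose offsets differ by one patch.

    view_b[p] and view_a[p + 1] observe the same scene location, so their
    clean features must agree and the raw difference isolates the artifact:
        view_b[p] - view_a[p + 1] = artifact[p] - artifact[p + 1]
    """
    n = len(view_a)
    diffs = [view_b[p] - view_a[p + 1] for p in range(n - 1)]
    # Integrate the differences; this fixes the artifact up to a constant.
    est = [0.0]
    for d in diffs:
        est.append(est[-1] - d)
    # Resolve the constant with the zero-mean assumption stated above.
    mean = sum(est) / n
    return [e - mean for e in est]


# Toy data: clean scene features and a grid-like, alternating artifact.
scene = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0]
artifact = [0.5, -0.5, 0.5, -0.5]

view_a = make_view(scene, artifact, offset=0)
view_b = make_view(scene, artifact, offset=1)

estimated = estimate_artifact(view_a, view_b)
# Subtracting the estimate from a raw view gives the denoised features.
clean = [raw - est for raw, est in zip(view_a, estimated)]
```

In this toy setting, `estimated` matches `artifact` and `clean` matches the first four entries of `scene`. Real ViT features are high-dimensional and the artifact need not be purely additive, which is presumably why the abstract describes a per-image optimization rather than a closed-form solve.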
