

Oral

SINDER: Repairing the Singular Defects of DINOv2

Haoqi Wang · Tong Zhang · Mathieu Salzmann

Oral 5C: Representation Learning
Thu 3 Oct 12:50 a.m. — 1 a.m. PDT

Abstract:

Vision Transformer models trained on large-scale datasets, although effective, often exhibit artifacts in the patch tokens they extract. While such defects can be alleviated by re-training the entire model with additional classification tokens, the underlying reasons for the presence of these artifacts remain unclear. In this paper, we conduct a thorough investigation of this phenomenon, combining theoretical analysis with empirical observations. Our findings reveal that these artifacts originate from the pre-trained network itself, specifically stemming from the leading left singular vector of the network's weights. Furthermore, to mitigate these defects, we propose a novel smooth regularization for fine-tuning that rectifies structural deficiencies using only a small dataset, thereby avoiding the need for complete re-training. We validate our method on various downstream tasks, including unsupervised segmentation, classification, and supervised segmentation, demonstrating its effectiveness in improving model performance. Our code and checkpoints will be released.
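
To make the phrase "leading left singular vector" concrete, here is a minimal sketch that estimates it for a small matrix by power iteration and then measures how strongly a feature vector aligns with it. This is not the authors' code or their regularizer. The function names `leading_left_singular_vector` and `cosine_alignment`, the toy matrix, and the use of cosine alignment as a diagnostic are all assumptions made for illustration.

```python
import math


def leading_left_singular_vector(W, iters=200):
    """Estimate the leading left singular vector u and singular value sigma
    of a matrix W (list of rows) by power iteration on W @ W.T.

    Illustrative helper only; assumes W is non-zero and that the start
    vector is not orthogonal to the leading direction.
    """
    rows, cols = len(W), len(W[0])
    u = [1.0 / math.sqrt(rows)] * rows  # arbitrary unit-norm start
    for _ in range(iters):
        # v = W^T u
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        # w = W v, so overall w = (W W^T) u
        w = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    # sigma = ||W^T u|| once u has converged
    v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
    sigma = math.sqrt(sum(x * x for x in v))
    return u, sigma


def cosine_alignment(x, u):
    """Absolute cosine between a feature vector x and a unit vector u."""
    dot = sum(a * b for a, b in zip(x, u))
    return abs(dot) / math.sqrt(sum(a * a for a in x))


# Hypothetical toy weight matrix whose dominant output direction is the first axis.
W = [[3.0, 0.0],
     [0.0, 1.0]]
u, sigma = leading_left_singular_vector(W)

# A feature lying along u has alignment close to 1; an orthogonal one is close to 0.
aligned = cosine_alignment([2.0, 0.0], u)
orthogonal = cosine_alignment([0.0, 5.0], u)
```

In this toy setting, `sigma` comes out at approximately 3.0 and `u` points along the first axis. Applying such a diagnostic to a real model would require extracting its weights and patch tokens, which is outside the scope of this sketch.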
