Skip to yearly menu bar Skip to main content


Poster

Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images

David Junhao Zhang · Mutian Xu · Jay Zhangjie Wu · Chuhui Xue · Wenqing Zhang · Xiaoguang Han · Song Bai · Mike Zheng Shou

[ ]
Fri 4 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

This paper studies visual representation learning with diffusion-generated synthetic images. We start by uncovering that diffusion models' cross-attention layers inherently provide annotation-free attention masks aligned with corresponding text inputs on generated images. We then investigate the problems of three prevalent representation learning methods i.e., contrastive learning, masked modeling, and vision-language pretraining) on diffusion-generated synthetic data and introduce customized solutions by fully exploiting the aforementioned free attention masks, namely Free-ATM. Comprehensive experiments demonstrate Free-ATM's ability to enhance the performance of various representation learning frameworks when utilizing synthetic data. This improvement is consistent across diverse downstream tasks including image classification, detection, segmentation and image-text retrieval. Meanwhile, by utilizing Free-ATM, we can accelerate the pretraining on synthetic images significantly and close the performance gap between representation learning on synthetic data and real-world scenarios.

Live content is unavailable. Log in and register to view live content