

Oral

Parrot Captions Teach CLIP to Spot Text

Yiqi Lin · Conghui He · Alex Jinpeng Wang · Bin Wang · Li Weijia · Mike Zheng Shou

Oral 3A: Datasets And Benchmarking
Wed 2 Oct 12:30 a.m. — 12:40 a.m. PDT

Abstract:

Despite CLIP being the foundation model in numerous vision-language applications, it suffers from a severe text spotting bias. This bias causes CLIP models to 'parrot' the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in LAION-2B, the most popular image-text dataset, the captions also densely parrot (spell out) the text embedded in images. Our analysis shows that around 50% of the images contain embedded visual text and around 30% of caption words also appear as visual text in the corresponding images. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that visual text is a dominant factor in measuring LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models on LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions readily induces this bias but harms the expected vision-language representation learning in CLIP models across various downstream tasks. This suggests it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP-score filtering.
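
To make the two quantities in the abstract concrete, the sketch below estimates how strongly a caption "parrots" the visual text in its image and computes a LAION-style CLIP similarity score. This is a minimal illustration, not the authors' released pipeline: it assumes pytesseract for OCR and an OpenCLIP ViT-B/32 checkpoint pretrained on LAION-2B, and the function names, tokenization regex, threshold-free overlap measure, and sample inputs are all hypothetical.

```python
# Illustrative sketch (not the paper's exact pipeline): measure the fraction of
# caption words that also appear as visual text in the image, and score the
# image-text pair with an OpenCLIP model trained on LAION-2B.
import re

import torch
import open_clip
import pytesseract
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


def parrot_fraction(image: Image.Image, caption: str) -> float:
    """Fraction of caption words that are also spotted as visual text by OCR."""
    spotted = set(re.findall(r"[a-z0-9]+", pytesseract.image_to_string(image).lower()))
    caption_words = re.findall(r"[a-z0-9]+", caption.lower())
    if not caption_words:
        return 0.0
    return sum(word in spotted for word in caption_words) / len(caption_words)


@torch.no_grad()
def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    image_input = preprocess(image).unsqueeze(0)
    text_input = tokenizer([caption])
    image_feat = model.encode_image(image_input)
    text_feat = model.encode_text(text_input)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (image_feat @ text_feat.T).item()


# Hypothetical usage: pairs with a high parrot fraction would be selected or
# excluded when curating a subset by a parrot-caption-oriented criterion.
image = Image.open("sample.jpg").convert("RGB")
caption = "fresh organic coffee - best price"
print(parrot_fraction(image, caption), clip_score(image, caption))
```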
