

Poster

CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Yassine Ouali · Adrian Bulat · Brais Martinez · Georgios Tzimiropoulos

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Thu 3 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

We present CLIP-DPO, a preference optimization method that leverages pretrained vision-language (V-L) embedding models, such as CLIP, for DPO-based optimization of Vision LLMs. Starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, which are then ranked based on their CLIP image-text similarities to obtain a set of positive and negative pairs for DPO-based training. We show that this simple approach offers notable performance gains across a diverse set of benchmarks and vision-language tasks.
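As a rough illustration of the pair-construction step described in the abstract (the checkpoint name, helper function, and scoring details are assumptions, not the authors' exact pipeline), the sketch below scores a set of candidate LVLM responses against the input image with CLIP and keeps the highest- and lowest-similarity texts as the chosen/rejected pair:

```python
# Hypothetical sketch of CLIP-based preference-pair construction.
# Model checkpoint and the build_dpo_pair helper are illustrative assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def build_dpo_pair(image, candidate_texts):
    """Rank candidate LVLM outputs by CLIP image-text similarity and return
    (chosen, rejected): the best- and worst-scoring candidates."""
    inputs = processor(text=candidate_texts, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, num_texts): scaled image-text similarities
    scores = outputs.logits_per_image.squeeze(0)
    order = scores.argsort(descending=True)
    return candidate_texts[int(order[0])], candidate_texts[int(order[-1])]
```

Each resulting (image, chosen, rejected) triple can then serve as a preference pair for a standard DPO training objective.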
